DBA Blogs

Migrating Your Enterprise Applications To Amazon Web Services (AWS)

Pythian Group - Thu, 2016-05-12 15:34


Many of the enterprise clients we work with are looking at Amazon Web Services (AWS) to support their cloud strategies and reap the benefits of the public cloud: lower costs, higher scalability, greater availability of computing resources.


AWS is well known for its cutting edge service offerings, which are always growing and evolving—making them an easy choice to recommend to our enterprise clients. When used to provide infrastructure as a service (IaaS), it simplifies hardware provisioning and management and makes it easy to allocate compute, memory and storage capacity. In a platform as a service (PaaS) situation, it can streamline operations through automated backups and patching, and can offer the convenience of pay-as-you-go billing.


The challenge for many organizations migrating to the cloud comes when they start to move their enterprise applications. That’s because these applications often rely on highly customized infrastructure with tightly coupled components. They also tend to exist in silos. Some careful planning is needed to make sure the new cloud environment outperforms the legacy on-premises one in terms of scalability, reliability and performance. Whenever desired by the customer, we will recommend and apply a number of optimizations during the migration process, such as decoupling of components, stability and performance tune-ups, and improved operational visibility and platform automation.


When we migrated Harvard Business Publishing to AWS, for example, we tailored the public cloud architecture to meet their specific needs. (Harvard Business Publishing is a subsidiary of Harvard University — based in Boston, with offices in New York City, India, Singapore, Qatar and the United Kingdom. They produce and distribute content across multiple platforms.)


Harvard Business Publishing was running Oracle applications in a data center on three Oracle Database Appliances. With the data center lease coming up for renewal, they asked us to help migrate their Oracle-based production, quality assurance and disaster recovery systems to an AWS cloud. Obviously, these systems were mission critical.


We did thorough capacity planning, developed a reliable, highly automated and predictable migration approach up front, customized the AWS architecture, and migrated a large number of virtual machines and Oracle schemas.


In another instance, we migrated CityRealty’s core applications to the cloud from on-premise hardware running Oracle databases. CityRealty is the oldest continuously operating real estate website in New York City — the world’s largest real estate market — and remains the city’s leading site today. We proposed moving three production databases to Amazon Elastic Compute Cloud (EC2), which provides highly scalable compute capacity, as well as upgrading the Oracle databases. We also built a highly efficient standby and recovery solution using Amazon Elastic Block Storage (EBS) snapshots.


In both of these cases, we did the planning up front to ensure the new cloud environment would be ready to receive the clients’ applications and run them without any compromise in performance. We’ve done similar migrations for clients operating e-commerce businesses with revenues of millions and even billions per year. Every situation is a little different: in some cases clients needed to simplify and standardize their mission-critical systems; in others, boost reliability; in others, increase automation or eliminate transaction-hampering latency.


We have the benefit of small, dedicated cloud teams around the world with multiple, advanced AWS certifications to address even the most complex requirements. Our goal always is to ensure the public cloud architecture is ideally suited to the enterprise, and to provide detailed implementation plans and data management solutions for optimal performance. That allows us to implement and manage public, private, and hybrid environments with lower risk and cost for organizations that want to seize the benefits of the cloud.

Find out how Pythian can help you with cloud solutions for AWS today.

Categories: DBA Blogs

Understanding query slowness after platform change

Bobby Durrett's DBA Blog - Thu, 2016-05-12 14:54

We are moving a production database from 10.2 Oracle on HP-UX 64 bit Itanium to 11.2 Oracle on Linux on 64 bit Intel x86. So, we are upgrading the database software from 10.2 to 11.2. We are also changing endianness from Itanium’s byte order to that of Intel’s x86-64 processors. Also, my tests have shown that the new processors are about twice as fast as the older Itanium CPUs.

Two SQL queries stand out as being a lot slower on the new system although other queries are fine. So, I tried to understand why these particular queries were slower. I will just talk about one query since we saw similar behavior for both. This query has sql_id = aktyyckj710a3.

First I looked at the way the query executed on both systems using a query like this:

select ss.sql_id,
       ss.plan_hash_value,
       sn.end_interval_time,
       ss.executions_delta,
       elapsed_time_delta/(executions_delta*1000) "Elapsed Average ms",
       cpu_time_delta/(executions_delta*1000) "CPU Average ms",
       iowait_delta/(executions_delta*1000) "IO Average ms",
       clwait_delta/(executions_delta*1000) "Cluster Average ms",
       apwait_delta/(executions_delta*1000) "Application Average ms",
       ccwait_delta/(executions_delta*1000) "Concurrency Average ms",
       buffer_gets_delta/executions_delta "Average buffer gets",
       disk_reads_delta/executions_delta "Average disk reads",
       rows_processed_delta/executions_delta "Average rows processed"
from dba_hist_sqlstat ss, dba_hist_snapshot sn
where ss.sql_id = 'aktyyckj710a3'
and ss.snap_id=sn.snap_id
and ss.instance_number=sn.instance_number
and executions_delta > 0
order by ss.snap_id,ss.sql_id;

It had a single plan on production and averaged a few seconds per execution:

PLAN_HASH_VALUE END_INTERVAL_TIME         EXECUTIONS_DELTA Elapsed Average ms CPU Average ms IO Average ms Cluster Average ms Application Average ms Concurrency Average ms Average buffer gets Average disk reads Average rows processed
--------------- ------------------------- ---------------- ------------------ -------------- ------------- ------------------ ---------------------- ---------------------- ------------------- ------------------ ----------------------
      918231698 11-MAY-16 PM              195         1364.80228     609.183405    831.563728                  0                      0                      0          35211.9487             1622.4             6974.40513
      918231698 11-MAY-16 PM              129         555.981481     144.348698    441.670271                  0                      0                      0          8682.84496         646.984496             1810.51938
      918231698 11-MAY-16 PM               39         91.5794872     39.6675128    54.4575897                  0                      0                      0          3055.17949          63.025641             669.153846
      918231698 12-MAY-16 AM               35         178.688971     28.0369429    159.676629                  0                      0                      0          1464.28571              190.8             311.485714
      918231698 12-MAY-16 AM              124         649.370258     194.895944    486.875758                  0                      0                      0           13447.871         652.806452             2930.23387
      918231698 12-MAY-16 AM              168         2174.35909     622.905935    1659.14223                  0                      0             .001303571          38313.1548         2403.28571             8894.42857
      918231698 12-MAY-16 AM              213         3712.60403     1100.01973    2781.68793                  0                      0             .000690141          63878.1362               3951             15026.2066
      918231698 12-MAY-16 PM              221         2374.74486      741.20133    1741.28251                  0                      0             .000045249          44243.8914         2804.66063               10294.81

On the new Linux system the query was taking 10 times as long to run as on the HP system:

PLAN_HASH_VALUE END_INTERVAL_TIME         EXECUTIONS_DELTA Elapsed Average ms CPU Average ms IO Average ms Cluster Average ms Application Average ms Concurrency Average ms Average buffer gets Average disk reads Average rows processed
--------------- ------------------------- ---------------- ------------------ -------------- ------------- ------------------ ---------------------- ---------------------- ------------------- ------------------ ----------------------
     2834425987 10-MAY-16 PM               41         39998.8871     1750.66015    38598.1108                  0                      0                      0          50694.1463         11518.0244             49379.4634
     2834425987 10-MAY-16 PM               33         44664.4329     1680.59361    43319.9765                  0                      0                      0          47090.4848         10999.1818             48132.4242
     2834425987 11-MAY-16 AM                8          169.75075      60.615125      111.1715                  0                      0                      0             417.375                 92                2763.25
     2834425987 11-MAY-16 PM               11         14730.9611     314.497455    14507.0803                  0                      0                      0          8456.63636         2175.63636             4914.90909
     2834425987 11-MAY-16 PM                2           1302.774       1301.794             0                  0                      0                      0               78040                  0                  49013
     2834425987 11-MAY-16 PM                1           1185.321       1187.813             0                  0                      0                      0               78040                  0                  49013
     2834425987 11-MAY-16 PM               14         69612.6197     2409.27829     67697.353                  0                      0                      0          45156.8571         11889.1429             45596.7143
     2834425987 11-MAY-16 PM               16         65485.9254     2232.40963    63739.7442                  0                      0                      0          38397.4375         12151.9375             52222.1875
     2834425987 12-MAY-16 AM               61         24361.6303     1445.50141    23088.6067                  0                      0                      0          47224.4426         5331.06557              47581.918
     2834425987 12-MAY-16 AM               86         38596.7262     1790.56574    37139.4262                  0                      0                      0          46023.0349         9762.01163             48870.0465

The query plans were not the same, but they were similar. Also, the number of rows in our test cases was larger than the average number of rows per execution in production, but that still didn’t account for all of the difference.

We decided to use an outline hint and SQL Profile to force the HP system’s plan on the queries in the Linux system to see if the same plan would run faster.

It was a pain to run the query with bind variables that are dates for my test so I kind of cheated and replaced the bind variables with literals. First I extracted some example values for the variables from the original system:

select * from
(select distinct
        sb.name,
        sb.value_string
 from dba_hist_sqlbind sb
 where sb.sql_id='aktyyckj710a3' and
       sb.was_captured='YES')
order by name;

Then I got the plan of the query with the bind variables filled in with the literals from the original HP system. Here is how I got the plan without the SQL query itself:

truncate table plan_table;

explain plan into plan_table for 
-- problem query here with bind variables replaced

set markup html preformat on

select * from table(dbms_xplan.display('PLAN_TABLE',NULL,'ADVANCED'));

The plan output includes an outline hint similar to this:

      NO_ACCESS(@"SEL$5DA710D3" "VW_NSO_1"@"SEL$5DA710D3")
      OPT_PARAM('query_rewrite_enabled' 'false')

Now, to force aktyyckj710a3 to run on the new system with the same plan as on the original system I had to run the query on the new system with the outline hint and get the plan hash value for the plan that the query uses.

explain plan into plan_table for
-- problem query here with the outline hint added

Plan hash value: 1022624069

So, I compared the two plans and they were the same, but the plan hash values were different: plan 1022624069 on Linux was the same plan as 918231698 on HP-UX. I think that endianness differences caused the plan_hash_value differences for the same plan.

Then we forced the original HP system plan on to the real sql_id using coe_xfr_sql_profile.sql.

-- build script to load profile

@coe_xfr_sql_profile.sql aktyyckj710a3 1022624069

-- run generated script


Sadly, even after forcing the original system’s plan on the new system, the query still ran just as slow. But, at least we were able to remove the plan difference as the source of the problem.

We did notice a high I/O time on the Linux executions. Running AWR reports showed about a 5 millisecond single block read time on Linux and about 1 millisecond on HP. I also graphed this over time using my Python scripts:

Linux db file sequential read (single block read) graph:


HP-UX db file sequential read graph:


So, in general our source HP system was seeing sub-millisecond single block reads but our new Linux system was seeing multi-millisecond reads. This led us to look at differences in the storage systems. It turned out that the original system was on flash or solid state disk and the new one was not. So, we are going to move the new system to SSD and see how that affects the query performance.

Even though this led to a possible hardware issue I thought it was worth sharing the process I took to get there including eliminating differences in the query plan by matching the plan on the original platform.



Our Linux and storage teams moved the new Linux VM to solid state disk and resolved these issues. The query ran about 10 times faster than it did on the original system after moving Linux to SSD.

HP Version:

END_INTERVAL_TIME         EXECUTIONS_DELTA Elapsed Average ms
------------------------- ---------------- ------------------
PM                                     245         5341.99923
PM                                     250         1280.99632
PM                                     341         3976.65855
PM                                     125         2619.58894


Linux Version:

END_INTERVAL_TIME         EXECUTIONS_DELTA Elapsed Average ms
------------------------- ---------------- ------------------
16-MAY-16 AM              162         191.314809
16-MAY-16 AM              342         746.313994
16-MAY-16 AM              258         461.641705
16-MAY-16 PM              280         478.601618

The single block read time is well under 1 millisecond now that the Linux database is on SSD.

END_INTERVAL_TIME          number of waits ave microseconds 
-------------------------- --------------- ---------------- 
15-MAY-16 PM           544681       515.978687
16-MAY-16 AM           828539       502.911935
16-MAY-16 AM           518322       1356.92377
16-MAY-16 AM            10698       637.953543
16-MAY-16 AM              193       628.170984
16-MAY-16 AM              112        1799.3125
16-MAY-16 AM             1680       318.792262
16-MAY-16 AM              140       688.914286
16-MAY-16 AM             4837       529.759768
16-MAY-16 AM            16082       591.632508
16-MAY-16 AM           280927       387.293204
16-MAY-16 AM           737846        519.94157
16-MAY-16 AM          1113762       428.772997
16-MAY-16 PM           562258       510.357372


Categories: DBA Blogs

Database Migration and Integration using AWS DMS

Kubilay Çilkara - Thu, 2016-05-12 14:12

Amazon Web Services (AWS) recently released a product called AWS Database Migration Service (DMS) to migrate data between databases.

The experiment

I have used AWS DMS to try a migration from a source MySQL database to a target MySQL database, a homogeneous database migration.

The DMS service lets you use a resource in the middle, a Replication Instance (an automatically created EC2 instance), plus source and target Endpoints. Then you move data from the source database to the target database. Simple as that. DMS is also capable of heterogeneous database migrations, such as MySQL to Oracle, and even ongoing synchronization. In addition, AWS gives you a client tool called the AWS Schema Conversion Tool, which helps you convert source database objects such as stored procedures to the target database format. All the things a cloud data integration project needs!

In my experiment and POC, I was particularly interested in the ability of the tool to move a simple data model with a 1-n relationship between tables t0 (parent) and t1 (child), as below.

(Pseudo code to quickly create the two tables t0 and t1 with a 1-n relationship, so you can try it yourself. Create the tables on both the source and target databases.)

t0 -> t1 Table DDL (Pseudo code)

CREATE TABLE t0 (
  `id` int(11) NOT NULL,
  `txt` varchar(100) CHARACTER SET ucs2 DEFAULT NULL,
  PRIMARY KEY (`id`)
);

CREATE TABLE t1 (
  `id` mediumint(9) NOT NULL AUTO_INCREMENT,
  `t0id` int(9) DEFAULT NULL,
  `txt` char(100) DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `t0id` (`t0id`),
  -- The original fragment ended after the KEY line; a foreign key with
  -- ON DELETE CASCADE is assumed here, since the post relies on CASCADE.
  CONSTRAINT `t1_ibfk_1` FOREIGN KEY (`t0id`) REFERENCES `t0` (`id`) ON DELETE CASCADE
);
In this experiment, I didn't want to see just a migration, a copy, of a table from the source database to a target database. I was more interested in seeing how easy it is to migrate a data model, with primary key and foreign key relationships in place, from the source database to the target database with zero downtime, using the CDC (Change Data Capture) or ongoing-replication option of AWS DMS. That is, zero-downtime database migration.

Here are the results of the experiment.

AWS DMS is easy to get going with: you can quickly set up an agent (Replication Instance), define source and target endpoints, and start mapping the tables to be migrated from the source database to the target database, all conveniently from the AWS console.

Once you have set up your replication instance and endpoints, create a Migration Task (say Alpha) and do an initial full migration (load) from the source database to the target database. Do this with foreign keys (FKs) disabled on the target. The AWS DMS Guide recommends this, at least for MySQL targets, so that the data can be dumped as fast as possible with parallel threads.

Then you can create a second Migration Task (say Beta) using a different target endpoint, this time with the foreign keys enabled on the target. You can do this even before the full load with Alpha, to avoid waiting. Configure the Beta task to run continuously and let it integrate and sync the delta that occurred during the initial load; you can even start the Beta task from a cut-off timestamp. It uses the source MySQL database's binlogs to propagate the changes. If you don't create the Beta task, that is, if you don't use a different target endpoint with the parameter that enables the FKs, then DELETE statements that occur on the source during the migration will not propagate to the target correctly, and the CASCADEs to the child tables will not work on the target. CASCADE is a property of the foreign key.

To reconcile, and find out whether everything had been migrated, I counted the rows in each table on the source and target databases. To do that I used Pentaho Spoon CE to quickly create a job that counts the rows on both source and target databases and validates the migration/integration interfaces.
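The reconciliation step itself is trivial once you have the counts. A minimal sketch in Python, with hypothetical table names and counts standing in for the results of `SELECT COUNT(*)` against each database:

```python
# Sketch: compare per-table row counts gathered from the source and target
# databases. In practice each dict would be filled from SELECT COUNT(*)
# queries; the values below are placeholders for illustration.

def reconcile(source_counts, target_counts):
    """Return a list of (table, source_rows, target_rows) mismatches."""
    mismatches = []
    for table in sorted(set(source_counts) | set(target_counts)):
        s = source_counts.get(table, 0)
        t = target_counts.get(table, 0)
        if s != t:
            mismatches.append((table, s, t))
    return mismatches

source = {"t0": 1000, "t1": 5000}
target = {"t0": 1000, "t1": 4998}   # two child rows missing on the target
print(reconcile(source, target))    # -> [('t1', 5000, 4998)]
```

An empty result means the interfaces kept the two sides in sync.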

Overall, I found AWS DMS very easy to use. It quickly helps you wire up an integration interface in the cloud and start pumping and syncing data between source and target databases, be they on premise or in the cloud. A kind of middleware setup, AWS style. No more separate middleware tools for data migration; AWS now has its own.
Categories: DBA Blogs

Properly removing users in MySQL

Pythian Group - Thu, 2016-05-12 09:58

Removing users in MySQL is an easy task, but there are some drawbacks to check carefully before dropping a user. Not taking these possible issues into consideration can render your application unusable.

First it is important to understand the concept of user in MySQL. A user has three main functions: authentication, privileges and resources. This is different from other databases: in MySQL users don’t own objects. An object belongs to a database and there is no direct relationship between objects and users. This simplifies the process of deleting a user because no objects will be erased.

But users, as I wrote before, have an associated set of privileges that define what their database sessions can do, and those privileges also apply to stored program and view execution.

At the same time, procedures, functions, triggers, and views have two possible security contexts: invoker and definer. If they are created using the invoker security model, then the user calling the routine or view must have enough privileges to execute the code within the routine or view. But if created using the definer security model, the code can be executed if the definer has enough privileges to execute it. And yes, the default security model in MySQL is definer.

This means that, unless declared otherwise, most routines will check the privileges for the user that created the routine. If somebody removes that user, querying the view or executing the code will fail with error. All the procedures, functions, views and triggers created by that user with the default options will become unusable.

How do we fix this? The quick and dirty way is to recreate the user with a different password to prevent logins. It is not an elegant solution, but it is probably the first thing you will do while you solve the problem properly. Another alternative, if you are running MySQL 5.7, is account locking: this feature disables login for the account but still allows code and views to be executed. In any case, it is a good practice to back up the user creation scripts prior to dropping the user. Percona Toolkit has the utility pt-show-grants for that purpose.
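Both stopgaps look roughly like this. The account name `'app_user'@'%'` is a placeholder, and the pt-show-grants invocation is a sketch; check its options against your Percona Toolkit version:

```sql
-- Save the grants first with Percona Toolkit (shell command, not SQL):
--   pt-show-grants --only app_user > app_user_grants.sql

-- MySQL 5.7+ account locking: logins are blocked,
-- but definer-context code and views keep working.
ALTER USER 'app_user'@'%' ACCOUNT LOCK;
```

Locking beats the password trick because it is explicit, reversible with ACCOUNT UNLOCK, and visible in mysql.user.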

The elegant way to avoid the problem is to check that there are no routines or views using the definer security model that are configured to run with the privileges of the user you want to remove. Several tables in the information_schema database provide this information.

Unfortunately there is no easy way to change these attributes. The best thing you can do is drop and recreate those objects with different security characteristics.

Let's see an example:

Connect to the database using an account with enough privileges to create users, databases and procedures and create a new database only for testing purposes.

mysql> create database remove_test;
Query OK, 1 row affected (0,05 sec)

Create a user with privileges on the database created in previous step:

mysql> grant all privileges on remove_test.* to test_user@'%' identified by 'test';
Query OK, 0 rows affected, 1 warning (0,20 sec)

Connect to the database using the user created:

$ mysql -u test_user -ptest remove_test

Create a view, check the attributes and execute it. We are going to use a couple of functions that I will explain a bit later.

mysql> create view test_view as select current_user(),user();
Query OK, 0 rows affected (0,05 sec)
mysql> show create view test_view\G
*************************** 1. row ***************************
                View: test_view
         Create View: CREATE ALGORITHM=UNDEFINED DEFINER=`test_user`@`%` SQL SECURITY DEFINER VIEW `test_view` AS select current_user() AS `current_user()`,user() AS `user()`
character_set_client: utf8
collation_connection: utf8_general_ci
1 row in set (0,00 sec)

mysql> select * from test_view;
+----------------+---------------------+
| current_user() | user()              |
+----------------+---------------------+
| test_user@%    | test_user@localhost |
+----------------+---------------------+
1 row in set (0,00 sec)

As we can see, the view has been created with SQL SECURITY DEFINER and DEFINER=`test_user`@`%`. The view returns the value from two functions: current_user() returns the value that matches against the mysql.user table and that defines the privileges the current session or current execution context has. The function user() returns the host you are connected from and the username specified as connection identifier.

Now we reconnect to the database using the privileged account from the first step, and requery the view:

$ mysql -u root -p remove_test
mysql> select * from test_view;
+----------------+----------------+
| current_user() | user()         |
+----------------+----------------+
| test_user@%    | root@localhost |
+----------------+----------------+
1 row in set (0,00 sec)

The view is executed by root@localhost but with the privileges of test_user@%. It is important to note that if you monitor the execution of a routine or view using SHOW PROCESSLIST or a query against information_schema.processlist, the user and host shown are the same as those returned by current_user(), not the ones returned by user().

Now we will drop the user and query the view once again.

mysql> drop user test_user@'%';
Query OK, 0 rows affected (0,07 sec)
mysql> select * from test_view;
ERROR 1449 (HY000): The user specified as a definer ('test_user'@'%') does not exist

In case you don’t see the error and still get correct results, it is because the contents of the view are stored in the query cache. Clear the cache and repeat the query.

mysql> reset query cache;
Query OK, 0 rows affected (0,00 sec)

How do you validate whether it is safe to remove a user? Query the information_schema tables to find objects that can cause problems, and recreate them.

mysql> select routine_schema as db,
    ->        routine_type as object_type,
    ->        routine_name as object_name
    -> from information_schema.routines
    -> where security_type='DEFINER'
    ->   and definer='test_user@%'
    -> union all
    -> select trigger_schema as db,
    ->        'TRIGGER' as object_type,
    ->        trigger_name as object_name
    -> from information_schema.triggers
    -> where definer='test_user@%'
    -> union all
    -> select table_schema as db,
    ->        'VIEW' as object_type,
    ->        table_name as object_name
    -> from information_schema.views
    -> where security_type='DEFINER'
    ->   and definer='test_user@%';
+-------------+-------------+-------------+
| db          | object_type | object_name |
+-------------+-------------+-------------+
| remove_test | VIEW        | test_view   |
+-------------+-------------+-------------+
1 row in set (0,02 sec)

Replace test_user@% with the value of the user you want to remove.

This way you get a list of the objects you must change to avoid problems. As I said before, the best way to change the definer is to recreate the objects. To obtain the creation script you should use SHOW CREATE VIEW, SHOW CREATE TRIGGER, SHOW CREATE PROCEDURE or SHOW CREATE FUNCTION.
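For the test_view from the example, the recreate step is a one-liner. This is a sketch: in practice you would paste the full definition from SHOW CREATE VIEW, keeping any ALGORITHM clause it reports:

```sql
-- Recreate the view so privilege checks follow the invoker, not the definer
CREATE OR REPLACE
  SQL SECURITY INVOKER
  VIEW test_view AS
  SELECT current_user(), user();
```

After this, dropping test_user no longer breaks the view; callers simply need their own privileges on the underlying objects.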

If you want to avoid this problem in the future, my recommendation is to create a separate account for each application and lock it, either with an unknown password or with the account locking feature. Grant that account all the privileges needed to run the application, and make sure all the views, procedures, functions and triggers have that account in the definer field.

Categories: DBA Blogs

Log Buffer #473: A Carnival of the Vanities for DBAs

Pythian Group - Wed, 2016-05-11 16:32

This Log Buffer Edition covers some of the hottest blog posts of Oracle, SQL Server and MySQL for the previous week.


Oracle:

A recent post on the OTN database forum reminded Jonathan how easy it is to forget to keep repeating a piece of information after the first couple of hundred times you’ve explained it. No matter how “intuitively obvious” it is for one person, it’s new to someone else.

The OMS Patcher is a newer patching mechanism for the OMS specifically.

A SQLT report has all kinds of pertinent information including—to name just a few—optimizer settings, indexes, statistics, plan history, and view definitions.

Joins are fundamental in SQL, and are used in most every production query. There are three types in particular that every developer should fully understand.

Why should Oracle Database professionals care about NoSQL and where to start?

SQL Server:

Understanding OPENJSON and FOR JSON in Azure and SQL 2016

Using Data Compression in Master Data Services 2016

The most frustrating thing with any new system is often just working out how to connect to it.

There are several decisions to be made when designing indexes for Memory-optimized tables in In-Memory OLTP, and to make the best choice, it pays to understand something about the nature and usage of memory-optimised indexes.

Database Hardware and Infrastructure Trends

STRING_SPLIT() in SQL Server 2016 : Follow-Up #2


MySQL:

Why would I run MySQL/MariaDB on POWER anyway?

By default, the write() system call returns after all data has been copied from the user space file descriptor into the kernel space buffers. There is no guarantee that data has actually reached the physical storage.

OpenCPS: Vietnam’s Public Sector goes Open Source

MySQL uses metadata locking to manage concurrent access to database objects, and to ensure data consistency when performing modifications to the schema

Using Docker for development is a great way of ensuring that what you develop will be the same that you deploy in production.

Categories: DBA Blogs

Defining Digital Transformation

Pythian Group - Wed, 2016-05-11 15:33


Terminology is important—and it’s particularly important for us to define terms that are central to what we do. So when it comes to the subject of digital transformation, what exactly are we talking about?


In speaking with clients and industry thought leaders, I’ve come to realize that the term “digital transformation” has a different meaning to different people. For a term that is so widely used — and that, on its surface, seems pretty straightforward — the range of interpretation is remarkable. It’s a bit like when we say “I’ll do it later.”  “Later” to one person means “before the sun goes down today.” “Later” to another person means “sometime in the future”, and it could mean days or weeks in their mind. “Later” to a third person can mean “I have no plans to do it, and this is my way of telling you nicely.”


Because the term is so essential to the work we do for our clients, I thought it would be helpful to define what digital transformation means to us here at Pythian. There’s so much we can say on the topic, so I plan to follow up with a series of articles about how I’ve seen it implemented, or worse, not implemented or even embraced as a concept.


To start, “digital transformation” is about technology. I know that to some people it isn’t, but I disagree. These days, you can’t transform your business without technology. It’s not about which technology you choose, as much as it’s about how to use it. Even more specifically, we’ve found that the businesses that are achieving positive transformation are using technology to capitalize on data. I have yet to see a single transformation project that didn’t use data as a major component of its success.


Let’s look at the term “transformation.” This equates to change, but it doesn’t mean change for its own sake. The change we’re talking about has to benefit the business. However, the factor that can make or break successful change is people. Their attitudes, preconceptions, and ideas almost always have to be aligned with the change for successful transformation to occur. People need to get behind the initiative, people have to fund it, people have to develop it, and people have to support it once it’s developed. And we all know that getting people to change can be more difficult than developing any new technology. In short, the transformative capabilities inherent in technology can only be realized when coupled with the willingness to embrace change.

Why Digital Transformation?

Why is the concept of digital transformation important in the first place? At Pythian, we believe that it’s about using technology and data to change your business for the better. What do we mean when we say “for the better”? Therein lies the controversy.  “For the better” means different things to different people depending on their company’s key objectives.


“For the better” can mean:

  • Becoming more efficient to drive costs down so your profitability can improve
  • Reducing mistakes and improving your reputation, or the quality of your product
  • Differentiating your product to get ahead of the competition
  • Doing what you do, only faster than your competitors
  • Creating new revenue streams
  • Improving the customer experience. This is a big one, so I will dedicate an entire blog post to exploring exactly what it means.


Digital transformation is the key to achieving any one, or all of these benefits, and knowing your objectives and priorities will help you shape your digital transformation initiative. So to start, focus less on what digital transformation is, and more on what you want the outcome of a transformation to be.


Categories: DBA Blogs

select from table with no direct relation or foreign keys

Learn DB Concepts with me... - Wed, 2016-05-11 12:50
For example, we can fetch each employee's salary grade from scott.salgrade, even though there is no foreign key between the two tables, by joining on the salary range:

    SELECT      E.ename, E.sal, S.grade
    FROM        scott.emp E,
                scott.salgrade S
    WHERE       E.sal BETWEEN S.losal AND S.hisal;

Categories: DBA Blogs

GoldenGate 12.2 Big Data Adapters: part 4 – HBASE

Pythian Group - Wed, 2016-05-11 11:51

This is the next post in my series about Oracle GoldenGate Big Data adapters. Here is list of all posts in the series:

  1. GoldenGate 12.2 Big Data Adapters: part 1 – HDFS
  2. GoldenGate 12.2 Big Data Adapters: part 2 – Flume
  3. GoldenGate 12.2 Big Data Adapters: part 3 – Kafka
  4. GoldenGate 12.2 Big Data Adapters: part 4 – HBASE

In this post I am going to explore the HBase adapter for GoldenGate. Let's start by recalling what we know about HBase. Apache HBase is a non-relational, distributed database modelled after Google's Bigtable. It provides read/write access to data and runs on top of Hadoop and HDFS.

So, what does that tell us? First, we can write and change the data. Second, we need to remember that it is a non-relational database, which takes a rather different approach to data than traditional relational databases; you can think of HBase as a key-value store. We are not going deep into HBase architecture and internals here, since our main task is to test the Oracle GoldenGate adapter and see how it works. Our configuration has an Oracle database as a source with a GoldenGate extract, and a target system running Oracle GoldenGate for Big Data.

There is more information about setting up the source and target in the first post in the series, on the HDFS adapter. The source-side replication has already been configured and started: we have an initial trail file for data initialization and trails for the ongoing replication, and we capture changes for all tables in the ggtest schema on the Oracle database.
Now we need to prepare our target site. Let's start with HBase. I used pseudo-distributed mode for my tests, where all the daemons run on a single host. This is not acceptable for any production configuration, but it will suffice for our tests. On the same box I have HDFS serving as the main storage. The Oracle documentation for the adapter states that HBase is supported from version 1.0.x. In my first attempt I tried HBase version 1.0.0 (Cloudera 5.6), but it didn't work: I got errors in GoldenGate and my replicat abended.
Here is the error:

2016-03-29 11:51:31  ERROR   OGG-15051  Oracle GoldenGate Delivery, irhbase.prm:  Java or JNI exception:
java.lang.NoSuchMethodError: org.apache.hadoop.hbase.HTableDescriptor.addFamily(Lorg/apache/hadoop/hbase/HColumnDescriptor;)Lorg/apache/hadoop/hbase/HTableDescriptor;.
2016-03-29 11:51:31  ERROR   OGG-01668  Oracle GoldenGate Delivery, irhbase.prm:  PROCESS ABENDING.

So, I installed another version of HBase, and version 1.1.4 worked just fine. I used a simple, standard HBase configuration for pseudo-distributed mode, where the region server runs on the same host as the master and hbase.rootdir points to the local HDFS.
Here is an example of the configuration:

[root@sandbox conf]# cat regionservers
[root@sandbox conf]#
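The listing above lost its content in publishing; for a pseudo-distributed setup the regionservers file typically holds just the local hostname, and a minimal hbase-site.xml along these lines is enough (a sketch — the hostname and NameNode port are assumptions for this sandbox, not taken from the post):

```xml
<configuration>
  <!-- Store HBase data in the local HDFS instance -->
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://sandbox:8020/hbase</value>
  </property>
  <!-- Run in (pseudo-)distributed mode -->
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
</configuration>
```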

As soon as we have HBase set up and running, we can switch our attention to GoldenGate. We already have a trail file with the initial load. Now we need to prepare our configuration files for the initial and ongoing replication. Let's go to our GoldenGate for Big Data home directory and prepare everything. First, we need an hbase.props file copied from the $OGG_HOME/AdapterExamples/big-data/hbase directory to $OGG_HOME/dirprm. I left everything as it was in the original file, changing only the gg.classpath parameter to point to my configuration files and libs for HBase.
Here is an example of the configuration file:

[oracle@sandbox oggbd]$ cat dirprm/hbase.props








javawriter.bootoptions=-Xmx512m -Xms32m -Djava.class.path=ggjava/ggjava.jar
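The surviving javawriter.bootoptions line aside, the properties lost from the listing above roughly follow the sample file shipped with the adapter. A hedged reconstruction (the gg.classpath paths are assumptions for this sandbox):

```properties
gg.handlerlist=hbase
gg.handler.hbase.type=hbase
# column family used for all replicated columns
gg.handler.hbase.hBaseColumnFamilyName=cf
gg.handler.hbase.mode=tx
gg.log=log4j
gg.log.level=INFO
# point to the local HBase client libraries and configuration
gg.classpath=/u01/hbase/lib/*:/u01/hbase/conf
```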

Second, we have to prepare a parameter file for our initial load. I used a simple file with the minimum of parameters.

[oracle@sandbox oggbd]$ cat dirprm/irhbase.prm
-- passive REPLICAT irhbase
-- Trail file for this example is located in "./dirdat/initld" file
-- Command to add REPLICAT
-- run replicat irhbase:
-- ./replicat paramfile dirprm/irhbase.prm reportfile dirrpt/irhbase.rpt
EXTFILE /u01/oggbd/dirdat/initld
TARGETDB LIBFILE libggjava.so SET property=dirprm/hbase.props

With that configuration file we can run the replicat in passive mode from the command line and see the result.
Here is the initial status of HBase:

hbase(main):001:0> version
1.1.4, r14c0e77956f9bb4c6edf0378474264843e4a82c3, Wed Mar 16 21:18:26 PDT 2016

hbase(main):001:0> list
0 row(s) in 0.3340 seconds

=> []

Running the replicat:

oracle@sandbox oggbd]$ ./replicat paramfile dirprm/irhbase.prm reportfile dirrpt/irhbase.rpt
[oracle@sandbox oggbd]$

Now we have two tables in HBase:

hbase(main):002:0> list
2 row(s) in 0.3680 seconds


Let's have a look at the tables' structure and contents:

hbase(main):004:0> describe 'BDTEST:TEST_TAB_1'
N_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
1 row(s) in 0.2090 seconds

hbase(main):005:0> scan 'BDTEST:TEST_TAB_1'
ROW                                            COLUMN+CELL
 1                                             column=cf:ACC_DATE, timestamp=1459269153102, value=2014-01-22:12:14:30
 1                                             column=cf:PK_ID, timestamp=1459269153102, value=1
 1                                             column=cf:RND_STR, timestamp=1459269153102, value=371O62FX
 1                                             column=cf:RND_STR_1, timestamp=1459269153102, value=RJ68QYM5
 1                                             column=cf:USE_DATE, timestamp=1459269153102, value=2014-01-24:19:09:20
 2                                             column=cf:ACC_DATE, timestamp=1459269153102, value=2014-05-11:05:23:23
 2                                             column=cf:PK_ID, timestamp=1459269153102, value=2
 2                                             column=cf:RND_STR, timestamp=1459269153102, value=371O62FX
 2                                             column=cf:RND_STR_1, timestamp=1459269153102, value=HW82LI73
 2                                             column=cf:USE_DATE, timestamp=1459269153102, value=2014-01-24:19:09:20
 3                                             column=cf:ACC_DATE, timestamp=1459269153102, value=2014-01-22:12:14:30
 3                                             column=cf:PK_ID, timestamp=1459269153102, value=3
 3                                             column=cf:RND_STR, timestamp=1459269153102, value=RXZT5VUN
 3                                             column=cf:RND_STR_1, timestamp=1459269153102, value=RJ68QYM5
 3                                             column=cf:USE_DATE, timestamp=1459269153102, value=2013-09-04:23:32:56
 4                                             column=cf:ACC_DATE, timestamp=1459269153102, value=2014-05-11:05:23:23
 4                                             column=cf:PK_ID, timestamp=1459269153102, value=4
 4                                             column=cf:RND_STR, timestamp=1459269153102, value=RXZT5VUN
 4                                             column=cf:RND_STR_1, timestamp=1459269153102, value=HW82LI73
 4                                             column=cf:USE_DATE, timestamp=1459269153102, value=2013-09-04:23:32:56
4 row(s) in 0.1630 seconds

hbase(main):006:0> scan 'BDTEST:TEST_TAB_2'
ROW                                            COLUMN+CELL
 7|IJWQRO7T|2013-07-07:08:13:52                column=cf:ACC_DATE, timestamp=1459269153132, value=2013-07-07:08:13:52
 7|IJWQRO7T|2013-07-07:08:13:52                column=cf:PK_ID, timestamp=1459269153132, value=7
 7|IJWQRO7T|2013-07-07:08:13:52                column=cf:RND_STR_1, timestamp=1459269153132, value=IJWQRO7T
1 row(s) in 0.0390 seconds


Everything looks good to me. We have the structure and records as expected. Let's go ahead and set up the ongoing replication.
I created a parameter file for my replicat using the initial load parameters as a basis:

[oracle@sandbox oggbd]$ cat dirprm/rhbase.prm
-- Trail file for this example is located in "dirdat/or" directory
-- Command to add REPLICAT
-- add replicat rhbase, exttrail dirdat/or
TARGETDB LIBFILE libggjava.so SET property=dirprm/hbase.props
MAP ggtest.*, TARGET bdtest.*;

We check our trail files and start our replicat from the latest trail file. By default, a replicat looks for a trail with sequence number 0, but since the purging policy on my GoldenGate installation deletes old files, I need to tell the replicat exactly where to start.

[oracle@sandbox oggbd]$ ll dirdat/
total 4940
-rw-r-----. 1 oracle oinstall    3028 Feb 16 14:17 initld
-rw-r-----. 1 oracle oinstall 2015199 Mar 24 13:07 or000043
-rw-r-----. 1 oracle oinstall 2015229 Mar 24 13:08 or000044
-rw-r-----. 1 oracle oinstall 1018490 Mar 24 13:09 or000045
[oracle@sandbox oggbd]$ ggsci

Oracle GoldenGate Command Interpreter
Version OGGCORE_12.
Linux, x64, 64bit (optimized), Generic on Nov 10 2015 16:18:12
Operating system character set identified as UTF-8.

Copyright (C) 1995, 2015, Oracle and/or its affiliates. All rights reserved.

GGSCI (sandbox.localdomain) 1> info all

Program     Status      Group       Lag at Chkpt  Time Since Chkpt


GGSCI (sandbox.localdomain) 2> add replicat rhbase, exttrail dirdat/or,EXTSEQNO 45

GGSCI (sandbox.localdomain) 3> start replicat rhbase

Sending START request to MANAGER ...

GGSCI (sandbox.localdomain) 4> info all

Program     Status      Group       Lag at Chkpt  Time Since Chkpt

REPLICAT    RUNNING     RHBASE      00:00:00      00:00:06

GGSCI (sandbox.localdomain) 5> info rhbase

REPLICAT   RHBASE    Last Started 2016-03-29 12:56   Status RUNNING
Checkpoint Lag       00:00:00 (updated 00:00:08 ago)
Process ID           27277
Log Read Checkpoint  File dirdat/or000045
                     2016-03-24 13:09:02.000274  RBA 1018490

GGSCI (sandbox.localdomain) 6>

I inserted a number of rows into test_tab_1 on the Oracle side and all of them were successfully replicated to HBase.

hbase(main):015:0> count 'BDTEST:TEST_TAB_1'
Current count: 1000, row: 1005694
Current count: 2000, row: 442
Current count: 3000, row: 6333
3473 row(s) in 1.0810 seconds

=> 3473

Let's take a closer look at test_tab_1 and test_tab_2:

hbase(main):005:0> scan 'BDTEST:TEST_TAB_1'
ROW                                            COLUMN+CELL
 1                                             column=cf:ACC_DATE, timestamp=1459269153102, value=2014-01-22:12:14:30
 1                                             column=cf:PK_ID, timestamp=1459269153102, value=1
 1                                             column=cf:RND_STR, timestamp=1459269153102, value=371O62FX
 1                                             column=cf:RND_STR_1, timestamp=1459269153102, value=RJ68QYM5
 1                                             column=cf:USE_DATE, timestamp=1459269153102, value=2014-01-24:19:09:20
 2                                             column=cf:ACC_DATE, timestamp=1459269153102, value=2014-05-11:05:23:23
 2                                             column=cf:PK_ID, timestamp=1459269153102, value=2
 2                                             column=cf:RND_STR, timestamp=1459269153102, value=371O62FX
 2                                             column=cf:RND_STR_1, timestamp=1459269153102, value=HW82LI73
 2                                             column=cf:USE_DATE, timestamp=1459269153102, value=2014-01-24:19:09:20

hbase(main):006:0> scan 'BDTEST:TEST_TAB_2'
ROW                                            COLUMN+CELL
 7|IJWQRO7T|2013-07-07:08:13:52                column=cf:ACC_DATE, timestamp=1459269153132, value=2013-07-07:08:13:52
 7|IJWQRO7T|2013-07-07:08:13:52                column=cf:PK_ID, timestamp=1459269153132, value=7
 7|IJWQRO7T|2013-07-07:08:13:52                column=cf:RND_STR_1, timestamp=1459269153132, value=IJWQRO7T
1 row(s) in 0.0390 seconds


You can see that the row identifier for test_tab_1 is the value of pk_id, while for test_tab_2 it is a concatenation of the values of all columns. Why is that? The difference is in the constraints on the tables. Since we don't have a primary key or unique index for test_tab_2, it uses all columns as the key value. We can try adding a constraint and see the result.

select * from dba_constraints where owner='GGTEST' and table_name='TEST_TAB_2';

no rows selected

alter table ggtest.test_tab_2 add constraint pk_test_tab_2 primary key (pk_id);

Table altered.

insert into ggtest.test_tab_2 values(9,'PK_TEST',sysdate,null);

1 row created.


Commit complete.


And let's compare with the result in HBase:

hbase(main):012:0> scan 'BDTEST:TEST_TAB_2'
ROW                                           COLUMN+CELL
 7|IJWQRO7T|2013-07-07:08:13:52               column=cf:ACC_DATE, timestamp=1459275116849, value=2013-07-07:08:13:52
 7|IJWQRO7T|2013-07-07:08:13:52               column=cf:PK_ID, timestamp=1459275116849, value=7
 7|IJWQRO7T|2013-07-07:08:13:52               column=cf:RND_STR_1, timestamp=1459275116849, value=IJWQRO7T
 8|TEST_INS1|2016-03-29:15:14:37|TEST_ALTER   column=cf:ACC_DATE, timestamp=1459278884047, value=2016-03-29:15:14:37
 8|TEST_INS1|2016-03-29:15:14:37|TEST_ALTER   column=cf:PK_ID, timestamp=1459278884047, value=8
 8|TEST_INS1|2016-03-29:15:14:37|TEST_ALTER   column=cf:RND_STR_1, timestamp=1459278884047, value=TEST_INS1
 8|TEST_INS1|2016-03-29:15:14:37|TEST_ALTER   column=cf:TEST_COL, timestamp=1459278884047, value=TEST_ALTER
 9                                            column=cf:ACC_DATE, timestamp=1462473865704, value=2016-05-05:14:44:19
 9                                            column=cf:PK_ID, timestamp=1462473865704, value=9
 9                                            column=cf:RND_STR_1, timestamp=1462473865704, value=PK_TEST
 9                                            column=cf:TEST_COL, timestamp=1462473865704, value=NULL
3 row(s) in 0.0550 seconds


The adapter is fully dynamic and changed the row id column on the fly. Will it work with a unique index? Yes, it will:

delete from ggtest.test_tab_2 where pk_id=9;

1 row deleted.

alter table ggtest.test_tab_2 drop constraint pk_test_tab_2;

Table altered.

create unique index ggtest.ux_test_tab_2 on ggtest.test_tab_2 (pk_id);

Index created.

insert into ggtest.test_tab_2 values(10,'UX_TEST',sysdate,null);

1 row created.


Here is the newly inserted row.

hbase(main):017:0> scan 'BDTEST:TEST_TAB_2'
ROW                                           COLUMN+CELL
 10                                           column=cf:ACC_DATE, timestamp=1462474389145, value=2016-05-05:14:53:03
 10                                           column=cf:PK_ID, timestamp=1462474389145, value=10
 10                                           column=cf:RND_STR_1, timestamp=1462474389145, value=UX_TEST
 10                                           column=cf:TEST_COL, timestamp=1462474389145, value=NULL
 7|IJWQRO7T|2013-07-07:08:13:52               column=cf:ACC_DATE, timestamp=1459275116849, value=2013-07-07:08:13:52
 7|IJWQRO7T|2013-07-07:08:13:52               column=cf:PK_ID, timestamp=1459275116849, value=7

But it makes no difference if we just create a plain (non-unique) index on the source; that changes nothing. So, if we need to identify a key for a table, we have to have at least a unique constraint or index. Of course, this is just the default behavior for schema replication, and we may use KEYCOLS to identify keys for some tables.
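For instance, a key for a key-less table could be declared right in the replicat parameter file (a sketch; choosing pk_id as the key column is our assumption, not from the post):

```
MAP ggtest.test_tab_2, TARGET bdtest.test_tab_2, KEYCOLS (pk_id);
MAP ggtest.*, TARGET bdtest.*;
```

The more specific MAP has to come before the wildcard one so that it takes effect for test_tab_2.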

Interestingly, if we change a table's structure it affects all newly inserted rows, but it does not change existing ones, even if we update some of their values. It works this way as long as you have a unique identifier that was not changed by your DDL operation.
Here is an example. We have a column TEST_COL in the table test_tab_2. Let's drop the column and update a row. Keep in mind that our key is the column PK_ID and we are not modifying the key.

alter table ggtest.test_tab_2 drop column TEST_COL;

Table altered.

update ggtest.test_tab_2 set rnd_str_1='TEST_COL' where pk_id=9;

1 row updated.


In HBase we can see the same set of columns:

hbase(main):030:0> scan 'BDTEST:TEST_TAB_2'
ROW                                           COLUMN+CELL
 9                                            column=cf:ACC_DATE, timestamp=1462477581440, value=2016-05-05:15:46:13
 9                                            column=cf:PK_ID, timestamp=1462477794597, value=9
 9                                            column=cf:RND_STR_1, timestamp=1462477794597, value=TEST_COL
 9                                            column=cf:TEST_COL, timestamp=1462477581440, value=NULL
1 row(s) in 0.0200 seconds

We still have the deleted column TEST_COL, even though we've updated the row.
But if we insert a new row, it will have the new set of columns:

insert into ggtest.test_tab_2 values(10,'TEST_COL',sysdate);

1 row created.


Commit complete.

And in HBase:

hbase(main):031:0> scan 'BDTEST:TEST_TAB_2'
ROW                                           COLUMN+CELL
 10                                           column=cf:ACC_DATE, timestamp=1462477860649, value=2016-05-05:15:50:55
 10                                           column=cf:PK_ID, timestamp=1462477860649, value=10
 10                                           column=cf:RND_STR_1, timestamp=1462477860649, value=TEST_COL
 9                                            column=cf:ACC_DATE, timestamp=1462477581440, value=2016-05-05:15:46:13
 9                                            column=cf:PK_ID, timestamp=1462477794597, value=9
 9                                            column=cf:RND_STR_1, timestamp=1462477794597, value=TEST_COL
 9                                            column=cf:TEST_COL, timestamp=1462477581440, value=NULL
2 row(s) in 0.0340 seconds

And, as in all other cases, a truncate on the source table is not replicated to the target; the operation is simply ignored. You have to truncate the table in HBase yourself to keep the data in sync. If you insert data again, the data in HBase will be “updated”, but other rows will not be deleted; it behaves more like a “merge” operation.
Here is an example:

truncate table ggtest.test_tab_2;

Table truncated.

insert into ggtest.test_tab_2 values(10,'TEST_COL2',sysdate);

1 row created.


Commit complete.

select * from ggtest.test_tab_2;

---------------- ---------- -----------------
	      10 TEST_COL2  05/05/16 16:01:20


hbase(main):033:0> scan 'BDTEST:TEST_TAB_2'
ROW                                           COLUMN+CELL
 10                                           column=cf:ACC_DATE, timestamp=1462478485067, value=2016-05-05:16:01:20
 10                                           column=cf:PK_ID, timestamp=1462478485067, value=10
 10                                           column=cf:RND_STR_1, timestamp=1462478485067, value=TEST_COL2
 9                                            column=cf:ACC_DATE, timestamp=1462477581440, value=2016-05-05:15:46:13
 9                                            column=cf:PK_ID, timestamp=1462477794597, value=9
 9                                            column=cf:RND_STR_1, timestamp=1462477794597, value=TEST_COL
 9                                            column=cf:TEST_COL, timestamp=1462477581440, value=NULL
2 row(s) in 0.0300 seconds
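To actually bring the target back in sync after a source truncate, the table can be truncated from the HBase shell as well; HBase's truncate disables, drops and re-creates the table with the same schema:

```
hbase(main):034:0> truncate 'BDTEST:TEST_TAB_2'
```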


I spent some time testing performance and found that the main bottleneck was my Oracle source rather than GoldenGate or HBase. I was able to sustain a transaction rate of up to 60 DML operations per second, at which point my Oracle DB started to struggle to keep pace because of waits on commit, while HBase and the replicat were absolutely fine. I also checked how it handles big transactions and inserted about 2 billion rows in one transaction; it worked fine. Of course, this doesn't prove that your production configuration will be free of performance issues; to conduct real performance tests I would need a much bigger environment.
In addition, I noticed one more minor error in the Oracle documentation for the adapter, related to the “keyValuePairDelimiter” parameter: in the documentation it is replaced by “keyValueDelimiter”. It is just a small typo; “keyValueDelimiter” appears twice, correctly the first time, while the second occurrence stands where “keyValuePairDelimiter” is supposed to be. Here is the link.

In summary, I can say that despite some minor issues, the adapters and GoldenGate for Big Data are quite mature and ready for real work. I think it is good, robust technology and, hopefully, its development will continue with new releases. I am looking forward to using it in a real production environment with a significant workload. In the following posts I will try to test different DDL operations and maybe some other data types. Stay tuned.

Categories: DBA Blogs

run sql from windows cmd prompt CLI

Learn DB Concepts with me... - Wed, 2016-05-11 09:36

Method 1:

set ORACLE_HOME=D:\app\oracle\product\11.2.0\db_1

sqlplus -s "USERNAME/PASSWORD" @C:\Shell\Drop_objs.sql

sqlplus -s "USERNAME/PASSWORD" @C:\Shell\Change_pass.sql

Method 2:

set ORACLE_HOME=D:\app\oracle\product\11.2.0\db_1
(ECHO select username from dba_users where username in ('SCOTT');
ECHO exit;) | sqlplus -s "USER/PASS" > C:\Shell\test_out.txt

Categories: DBA Blogs

Generate Multiple AWR Reports Quickly

VitalSoftTech - Tue, 2016-05-10 20:40
Occasionally there is a need to generate multiple AWR reports for database analysis. In my case, a storage vendor will use a tool to extract data from all time periods from the AWR reports to find IO related specific information. Here is how I generated these reports.
Categories: DBA Blogs

Learn how to optimize text searches in SQL Server 2014 by using Full-Text Search – Part 1

Pythian Group - Tue, 2016-05-10 09:50

In this article, we'll cover the functionality available in SQL Server 2014 for textual search, known as Full-Text Search: its installation and implementation. Additionally, we will see how to develop textual searches using the predicates CONTAINS, CONTAINSTABLE, FREETEXT and FREETEXTTABLE, and how to use the FILESTREAM feature to improve the searching and storage of binary data in tables.

Search based on words and phrases is one of the main features of search tools on the web, like Google, and of digital document management systems. To perform these searches efficiently, many developers create highly complex applications that still lack the intelligence needed to find terms and phrases in the columns that store text and digital documents in database tables.

What the vast majority of these professionals don't know is that SQL Server has an advanced tool for textual search: Full-Text Search (FTS).

FTS has been present in SQL Server since version 7, and with it textual searches can be performed both on columns that store characters and on columns that store documents (for example, Office documents and PDFs) in their native form.

With options like searches for words and phrases, recognition of different languages, word stemming (for example: play, played and playing), the possibility of developing a thesaurus, the creation of ranked results, and the elimination of stopwords from a search, FTS becomes a powerful tool for textual search. The main factors driving the use of textual search are:

  • Databases are increasingly used as repositories of digital documents;
  • The cost of storing information has dropped considerably, enabling the storage of gigabytes, terabytes and even petabytes;
  • New types of digital documents are constantly being created, and the requirements for their storage and subsequent search are becoming larger and more complex;
  • Developers need a robust and reliable interface for performing intelligent textual search.

FTS has great advantages over other alternatives for textual search, such as the LIKE operator. The main tasks you can perform with FTS are:

  • Linguistic textual search. A linguistic search is based on words or phrases in a particular language, taking into consideration verb conjugation, derived words, accents, among other features. Unlike the LIKE predicate, FTS uses an efficient indexing structure to perform the search;
  • Automatic removal of stopwords from a textual search. Stopwords are words that don't add to the result of the search, such as from, to, the, a, an;
  • Assigning weights to the searched terms, making certain words more important than others within the same search;
  • Generation of rankings, giving a better view of the documents that are most relevant to the search carried out;
  • Indexing of, and searching in, the most diverse types of digital documents. With FTS you can search text files, spreadsheets, ZIP files, among others.
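As a taste of what the syntax looks like before we dig into the architecture, here is a hypothetical sketch (the Documents table, its PK_Documents key index and the catalog name are illustrative assumptions, not from this article):

```sql
-- One-time setup: a default catalog and a full-text index on a text column
CREATE FULLTEXT CATALOG ftsCatalog AS DEFAULT;

CREATE FULLTEXT INDEX ON dbo.Documents (Content LANGUAGE 1033)
    KEY INDEX PK_Documents
    WITH STOPLIST = SYSTEM;

-- Linguistic search: also matches inflectional forms such as "played", "playing"
SELECT DocumentId, Title
FROM dbo.Documents
WHERE CONTAINS(Content, 'FORMSOF(INFLECTIONAL, play)');

-- Looser search that applies stemming and the thesaurus automatically
SELECT DocumentId, Title
FROM dbo.Documents
WHERE FREETEXT(Content, 'ruby');
```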

In this article we will describe the architecture of FTS, its installation and configuration, the main T-SQL commands used in textual search, the use of FTS in conjunction with FILESTREAM, and also some techniques to optimize searches with FTS.

FTS architecture

The architecture of FTS has several components working in conjunction with the SQL Server query processor to perform textual search efficiently. Figure 1 illustrates the major components of the FTS architecture. Let's look at some of them:

  • Query client: The client application sends textual queries to the SQL Server query processor. It is the responsibility of the client application to ensure that the queries are written correctly, following the FTS syntax;
  • SQL Server process (sqlservr.exe): The SQL Server process contains the query processor and also the FTS engine, which compiles and executes the textual queries. Integrating the FTS engine into the SQL Server process offers a significant performance boost, because it allows the query processor to build much more efficient execution plans for textual searches;
  • SQL Server query processor: The query processor has multiple subcomponents that are responsible for validating syntax, compiling, generating execution plans and executing the SQL queries;
  • Full-Text Engine: When the SQL Server query processor receives an FTS query, it forwards the request to the full-text engine, which validates the FTS query syntax, checks the full-text index, and then works together with the query processor to return the textual search results;
  • Indexer: The indexer works in conjunction with other components to populate the full-text index;
  • Full-Text Index: The full-text index contains the most relevant words and their respective positions within the columns included in the index;
  • Stoplist: A stoplist is a list of stopwords. The indexer queries the stoplist during the indexing process, and during the execution of textual searches, to eliminate words that don't add value to the search. SQL Server 2014 stores stoplists within the database itself, which simplifies their administration;
  • Thesaurus: The thesaurus is an XML file (stored outside the database) in which you can define a list of synonyms to be used in textual searches. The thesaurus must be based on the language that will be used in the search. The full-text engine reads the thesaurus file at query execution time to check for synonyms that can increase the quality and coverage of the search;
  • Filter daemon host (fdhost.exe): Responsible for managing the filtering, word breaker and stemmer processes;
  • SQL Full-Text Filter Daemon Launcher (fdlauncher.exe): The process that starts the filter daemon host (fdhost.exe) when the full-text engine needs one of the processes managed by it.


Figure 1. Architecture of FTS.

For a better understanding of the process of creating, using and maintaining full-text index structures, you must also know the meaning of some important concepts. They are:

  • Term: The word, phrase or character used in the textual search;
  • Full-Text Catalog: A group of full-text indexes;
  • Word breaker: The process that determines the boundaries of every word in a sentence, based on the grammar rules of the language selected when creating the full-text index;
  • Token: A word, phrase or character defined by the word breaker;
  • Stemmer: The process that generates the different verb forms of words, based on the grammar rules of the language selected when creating the full-text index;
  • Filter: The component responsible for extracting textual information from documents stored with the varbinary(max) data type and sending this information to the word breaker process.

Indexing process

The indexing process is responsible for the initial population of a full-text index, and for updating the index when data modifications occur on the columns indexed by FTS. This process of initially populating and then updating the full-text index is named a crawl.

When the crawl process starts, the FTS component known as the protocol handler accesses the data in the table being indexed and begins loading the existing content of the table into memory, a step also known as streaming. To access the data stored on disk, the protocol handler allows FTS to communicate with the storage engine. After streaming ends, the filter daemon host process filters the data and initiates the word breaker and stemmer processes to fill in the full-text index.

During the indexing process the stoplist is queried to remove stopwords, so that the full-text index structure is filled only with words that are meaningful to textual search. The last step of the indexing process is known as the master merge, in which all the indexed words are grouped into a single full-text index.

Although the indexing process requires a lot of I/O, it does not need to block the data being indexed. However, a query that uses a full-text index while the indexing process is running can produce an incomplete result.

Full-Text query processing

Full-text query processing uses the same word and phrase delimiters that were defined by the word breaker during the indexing process. You can also use additional components, such as the stemmer and the thesaurus, depending on the full-text predicates (CONTAINS or FREETEXT) used in the search. The use of full-text predicates will be discussed later in this article.

The stemmer process generates inflectional forms of the searched words. For example, from the term “play” the terms “played” and “playing” are also searched, beyond the term “play” itself.

Through rules created in the thesaurus file you can use synonyms to replace or expand the searched terms. For example, when performing a textual search on the term “ruby”, the full-text engine can replace it with the synonym “red”, or expand the search to automatically consider the terms “red”, “wine” and “scarlet” as well as “ruby” itself.
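Such a rule could be sketched in the thesaurus file roughly as follows (an illustration of the expansion format; the actual per-language files ship with SQL Server, e.g. tsenu.xml for English):

```xml
<XML ID="Microsoft Search Thesaurus">
  <thesaurus xmlns="x-schema:tsSchema.xml">
    <diacritics>false</diacritics>
    <!-- a search for any of these terms is expanded to all of them -->
    <expansion>
      <sub>ruby</sub>
      <sub>red</sub>
      <sub>wine</sub>
      <sub>scarlet</sub>
    </expansion>
  </thesaurus>
</XML>
```

A replacement rule, by contrast, substitutes the searched pattern instead of expanding it.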

After processing the full-text query, the full-text engine provides information to the SQL query processor that assists in creating an optimized execution plan for the textual search. The tight integration between the full-text engine and the SQL query processor (both are components of the SQL Server process) enables textual searches to be executed in a much more optimized way.

In the next post of this 4-part series, we will learn how to install FTS and how to use it. Stay tuned!

Categories: DBA Blogs


Did You Know: Oracle EBS R12.2 #1 – Managing OHS

Pythian Group - Mon, 2016-05-09 15:21

For a long time now I’ve wanted to start a blog series outlining the not-so-obvious things that have changed in the new Oracle E-Business Suite R12.2. Here comes the new “Did You Know” series specifically for Oracle E-Business Suite! Let’s start this series with Apache, aka Oracle HTTP Server.

People are already aware that OHS10g/OC4J is replaced with OHS11g/WebLogic in R12.2.x. On the surface it looks like a simple change, but a lot changed under the hood. In Oracle EBS 11i/R12.1, one could change the Apache port by just updating the Context XML file and running autoconfig.

In R12.2, we have to change OHS settings like the web port by logging into the EM console URL and then running the $AD_TOP/bin/adSyncContext.pl script to sync the OHS config parameters back to the Context XML file. This script is only needed for OHS config for now. Any changes for WebLogic also need to be made in the WebLogic console URL, not in the Context XML file. But these changes in the WebLogic console are automatically propagated to the XML file by the adRegisterWLSListeners.pl script.

You can get more details of the procedures by reviewing the MOS notes below.

  • E-Business Suite 12.2 Detailed Steps To Change The R12.2 Default Port To 80 (Doc ID 2072420.1)
  • Managing Configuration of Oracle HTTP Server and Web Application Services in Oracle E-Business Suite Release 12.2 (Doc ID 1905593.1)
Categories: DBA Blogs

Compression -- 7 : Updating after BASIC Compression

Hemant K Chitale - Mon, 2016-05-09 09:57
In the first blog post on compression, I had shown BASIC Compression and also the effect of executing an UPDATE on a table with BASIC compression.

To expand on the theme of UPDATEs of BASIC compression blocks ....

SQL> select count(*) from source_data;

  COUNT(*)
----------
    367156

SQL> create table target_comp row store compress basic as select * from source_data where 1=2;

Table created.

SQL> select pct_free from user_tables where table_name = 'TARGET_COMP';

  PCT_FREE
----------
         0

SQL> insert /*+ APPEND */ into target_comp
2 select * from source_data;

367156 rows created.

SQL> commit;

Commit complete.

SQL> insert /*+ APPEND */ into target_comp
2 select * from source_data;

367156 rows created.

SQL> commit;

Commit complete.

SQL> insert /*+ APPEND */ into target_comp
2 select * from source_data;

367156 rows created.

SQL> commit;

Commit complete.

SQL> analyze table target_comp compute statistics;

Table analyzed.

SQL> select chain_cnt, blocks from user_tables where table_name = 'TARGET_COMP';

 CHAIN_CNT     BLOCKS
---------- ----------
         0       4452

SQL> exec dbms_stats.gather_table_stats('','TARGET_COMP');

PL/SQL procedure successfully completed.

SQL> select num_rows, chain_cnt, blocks from user_tables where table_name = 'TARGET_COMP';

  NUM_ROWS  CHAIN_CNT     BLOCKS
---------- ---------- ----------
   1101468          0       4452


So we have a table with 1.1million rows and no Row Chaining.

What happens if we update about 20% of the rows ?

SQL> begin
2 for rec in (select rowid from target_comp where rownum < 220001)
3 loop
4 update target_comp set owner=owner where rowid=rec.rowid;
5 commit;
6 end loop;
7 end;
8 /

PL/SQL procedure successfully completed.

SQL> commit;

Commit complete.

SQL> analyze table target_comp compute statistics;

Table analyzed.

SQL> select chain_cnt, blocks from user_tables where table_name = 'TARGET_COMP';

 CHAIN_CNT     BLOCKS
---------- ----------
    202189       7882

SQL> exec dbms_stats.gather_table_stats('','TARGET_COMP');

PL/SQL procedure successfully completed.

SQL> select num_rows, chain_cnt, blocks from user_tables where table_name = 'TARGET_COMP';

  NUM_ROWS  CHAIN_CNT     BLOCKS
---------- ---------- ----------
   1101468     202189       7882


I have updated 220,000 rows without actually increasing the notional length of each row (I set OWNER=OWNER).  Yet, the CHAIN_CNT is now 202K and the table's HighWaterMark has expanded from 4,452 blocks to 7,882 blocks.  A significant increase !
(Your mileage may vary in your tests !)

It was Jonathan Lewis who suggested getting the Chain Count (or LIST CHAINED ROWS) to understand the impact of UPDATEs on a table with BASIC compression.

Categories: DBA Blogs

Tip of the day: Always put this in your .bashrc

RDBMS Insight - Mon, 2016-05-09 06:52

If you like to scp:

# If not running interactively, don't do anything
[[ $- == *i* ]] || return

Otherwise scp will fail without error – it’s a known bug.
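To see why this works: `$-` holds the current shell's option flags and contains `i` only in interactive shells, so the guard makes the non-interactive shell that scp spawns stop reading .bashrc before anything can write to stdout. A quick check:

```shell
# A non-interactive shell (what scp spawns): $- contains no "i",
# so the guard's `return` would fire.
bash -c 'case $- in *i*) echo interactive;; *) echo non-interactive;; esac'
# prints: non-interactive
```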

Categories: DBA Blogs

MySQL InnoDB’s Full Text Search overview

Pythian Group - Fri, 2016-05-06 12:56

NOTE: If you want to read and play with the interactive application, please go to the shinnyapps article. It has been developed using Shiny/R in order to allow you to see the effects of the algorithms.

Thanks to Valerie Parham-Thompson at Pythian and Daniel Prince at Oracle.

The GitHub repository contains the code to generate and load the data, as well as the Shiny/R code.

Some initial thoughts

A couple of weeks ago one of our customers came up with a question regarding FTS over the InnoDB engine. Although the question itself is not answered in the current article, it led me to the conclusion that FTS is sometimes misunderstood.

The point of this article is to show dynamically how the search algorithms work, using non-fictional data (the data sources were downloaded from the Gutenberg project), within an easy interface (please see the bottom of the ShinnyApps post here).

In order to show the effect of field sizes on the query expansion algorithm, you will see two main tables (bookContent and bookContentByLine), both containing the same books in different layouts: by paragraph and by line. You'll see the noise generated by the QUERY EXPANSION algorithm when phrases are too large.

For the sake of simplicity, in this article we won’t go through the FTS parsers. That is possible material for a future post.

Why do I consider FTS sometimes misunderstood?

FTS is a technology that can be used for many purposes, not only simple searches. Generally, FTS engines are deployed as a service for web or document searches, which usually calls for technologies like Solr, Elasticsearch or Sphinx. However, certain business rules require complex searches, and having such a feature inside the RDBMS can be a win.

An RDBMS isn’t a good place for a massive amount of FTS queries if you aren’t using any of the join capabilities it offers, or its ACID properties.

As I said above, FTS in an RDBMS is totally acceptable if you are using at least one main RDBMS feature required by your business model.


To start showing the effects of the algorithms, the following example searches the word ‘country’ using query expansion. This means that we are not looking only for exact matches, but also for the entries that appear most frequently where an exact match has been found.

In the SELECT clause you'll see both FTS expressions, using NATURAL LANGUAGE mode with query expansion and BOOLEAN mode respectively.

View the code on Gist.
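If the embedded Gist does not load, the query is along these lines (a sketch only; table and column names follow the article, the exact statement may differ):

```sql
SELECT bookid,
       MATCH(content) AGAINST ('country' WITH QUERY EXPANSION) AS QERank,
       MATCH(content) AGAINST ('country' IN BOOLEAN MODE)      AS BoolRank
FROM   bookContentByLine
WHERE  MATCH(content) AGAINST ('country' WITH QUERY EXPANSION)
ORDER  BY QERank DESC
LIMIT  10;
```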

The noise generated by the query expansion is expected and described in the official documentation here.

The interesting case is the following row, which has 2 exact occurrences (you can see positions 1 and 63) and yet does not get the highest rank using query expansion. Remember, this is expected.

Text: "country districts. As Lucca had five gates, he divided his own country"
bookid: 1232
pos: 1,63
QERank: 80
BoolRank: 14

This is even worse with large sentences. In the example below you will see the same query against the table that stores text by paragraph. The boolean rank places some of the entries way above the others; query expansion, however, puts records at the top that do not necessarily have many exact matches.

View the code on Gist.

Query expansion is useful when you intend to search for entries that contain words frequently appearing together with the search term. Having large text fields increases the probability of having more words that appear alongside the search term. In the case of the bookContent table (the by-paragraph table), the average field size is 443.1163 characters.


There is a way to play with the contents of the FTS indexes. As you may have noticed in the previous examples, I used the set global innodb_ft_aux_table = 'ftslab/bookContent'; statement, which loads the index contents into memory for easy querying.

If you use RDS, the innodb_ft_aux_table option is not available, as it is GLOBAL and requires SUPER privileges.

i.e. You can easily get the most frequent tokens:

View the code on Gist.
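If the Gist does not load, a sketch of such a query (column names from information_schema.INNODB_FT_INDEX_TABLE; the exact statement in the article may differ):

```sql
SET GLOBAL innodb_ft_aux_table = 'ftslab/bookContent';

-- most frequent tokens: total occurrences and number of documents per word
SELECT word, COUNT(*) AS occurrences, COUNT(DISTINCT doc_id) AS docs
FROM information_schema.INNODB_FT_INDEX_TABLE
GROUP BY word
ORDER BY occurrences DESC
LIMIT 10;
```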

We can query the index contents with a simple SQL statement like the following:

View the code on Gist.

In the examples shown before, there is no intention to compare rank scores, as they are based on different algorithms. The idea is to show that QUERY EXPANSION can produce undesired results in some cases due to its mechanism.

Building custom stopwords

This probably isn’t very useful information, as most of these words appear very frequently and are modal verbs, adverbs, pronouns, determiners, etc. It could be that you are not interested in indexing those words. If that’s the case, you can add them as stopwords in your own stopwords table, especially if you are more interested in boolean searches and can afford to lose some of the language’s expressiveness.

We can build a custom stopwords table based on our current data:

View the code on Gist.

Let’s build our stopwords table using both default and new entries and keeping the alphabetical order:

View the code on Gist.
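A sketch of that build (the table name and the extra words below are illustrative; InnoDB expects the stopword table to have a single VARCHAR column named value):

```sql
CREATE TABLE ftslab.custom_stopwords (value VARCHAR(30)) ENGINE = InnoDB;

-- keep the default stopword list...
INSERT INTO ftslab.custom_stopwords (value)
SELECT value FROM information_schema.INNODB_FT_DEFAULT_STOPWORD;

-- ...and add our own high-frequency words (example entries)
INSERT INTO ftslab.custom_stopwords (value)
VALUES ('said'), ('upon'), ('shall');

SET GLOBAL innodb_ft_server_stopword_table = 'ftslab/custom_stopwords';
```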

The idea behind choosing our own stopwords is to measure how much index space we save by filtering out words that are extremely frequent and don't add necessary meaning to the search. This topic could be covered in a separate blog post.

Going ahead on choosing stop words

The full article is amazingly interesting. In brief, it says that the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, and so on (the rank-frequency distribution is an inverse relation).
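The rank-frequency relation can be sketched numerically; the frequencies below are illustrative, not taken from the article's corpus:

```python
# Zipf's law: freq(rank) ~ C / rank, so rank 1 occurs ~2x rank 2, ~3x rank 3, ...
C = 60000
freqs = [C // rank for rank in range(1, 6)]
ratios = [freqs[0] / f for f in freqs]

print(freqs)   # [60000, 30000, 20000, 15000, 12000]
print(ratios)  # [1.0, 2.0, 3.0, 4.0, 5.0]
```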

Considerations and recommendations

– Use QUERY EXPANSION only if you are interested in searching relations rather than exact matches. Remember that the field size is crucial when using this.
– FTS is not the best fit for exact string matches in single columns. You don't want to use FTS for searching emails, names or last names stored in single columns, for example. For those, you'll probably use other techniques such as reverse searches, the exact match operator (=) or hashing (CRC32 for emails or texts shorter than 255 characters).
– Keep your FTS indexes short. Do not add ALL the text columns. Parse the user's search in your application first and adapt the query.
– If you are using BOOLEAN MODE, you can use the rank score to filter rows. MySQL is clever enough to optimize the FTS functions to avoid double executions. You can do this using something like: match(content,title) against ("first (third)") > 1. Generally, scores lower than 1 can be ignored when using boolean or natural language mode searches.
– `OPTIMIZE TABLE` does a full rebuild of the table. To avoid this, set innodb_optimize_fulltext_only=1 in order to do incremental maintenance on the table.
– Recall that NATURAL LANGUAGE MODE does not take the operators the way BOOLEAN MODE does. This affects the ranking score (try +bad (thing), for example).
– If you plan to order by rank, it is not necessary to specify an `ORDER BY` clause, as InnoDB does the ordering after retrieving the doc ids. Also, the behaviour differs from the default in that it returns the heaviest at the top (like an ORDER BY rank DESC).
– If you come from MyISAM's FTS implementation, recall that the ranking scoring is different.
– Create the FULLTEXT index after the data is loaded (InnoDB Bulk Load). When restoring FTS backups, you will probably hit the "ERROR 182 (HY000) at line nn: Invalid InnoDB FTS Doc ID".
– Try to avoid using more than one FTS expression in the WHERE clause. Keep in mind that this affects the order of the results and consumes a considerable amount of CPU. InnoDB orders by the latest expression in the WHERE clause. WL#7123.
– Also, omitting the rank information from the projection (SELECT clause) and using other aggregations like count(*) will use the "no ranking" FT_hints. The LIMIT hint won't be used if an ORDER BY and the MATCH clause are invoked explicitly in the projection.

View the code on Gist.

– If you plan to use the FTS_DOC_ID column with the AUTO_INCREMENT option, keep in mind that there is a limitation regarding this: you must declare it with a single-column PRIMARY KEY constraint or a UNIQUE index. Also, the data type is restricted to `bigint unsigned`. i.e.:

View the code on Gist.
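If the Gist does not load, a valid definition looks roughly like this (hypothetical table; note the exact FTS_DOC_ID column name, the bigint unsigned type and the single-column primary key):

```sql
CREATE TABLE articles (
  FTS_DOC_ID BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
  title      VARCHAR(200),
  body       TEXT,
  PRIMARY KEY (FTS_DOC_ID),
  FULLTEXT KEY ft_title_body (title, body)
) ENGINE = InnoDB;
```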


This variable controls the number of top matches when using `WITH QUERY EXPANSION` (affects only MyISAM). Reference.

Bug 80347 – Invalid InnoDB FTS Doc ID

emanuel@3laptop ~/sandboxes/rsandbox_5_7_9 $ ./m dumpTest < full.dump
ERROR 182 (HY000) at line 73: Invalid InnoDB FTS Doc ID

emanuel@3laptop ~/sandboxes/rsandbox_5_7_9 $ ./m dumpTest < ddl.dump
emanuel@3laptop ~/sandboxes/rsandbox_5_7_9 $ ./m dumpTest < onlyData.dump
emanuel@3laptop ~/sandboxes/rsandbox_5_7_9 $ ./m dumpTest < full.dump
ERROR 182 (HY000) at line 73: Invalid InnoDB FTS Doc ID

mysqldump is not very clever if you use `FTS_DOC_ID`:

2016-02-13T22:11:53.125300Z 19 [ERROR] InnoDB: Doc ID 10002 is too big. Its difference with largest used Doc ID 1 cannot exceed or equal to 10000

It takes dumps without considering the restriction coded in `innobase/row/row0mysql.cc`:

Difference between Doc IDs are restricted within
4 bytes integer. See fts_get_encoded_len()

The fix for this is backing up the table in chunks of 10000 documents.

Other useful links

Fine tuning
Maintenance: innodb_optimize_fulltext_only
Writing FTS parser plugins

Categories: DBA Blogs


Partner Webcast – Oracle BI cloud service: Insights at Your Fingertips

As business is constantly changing, we find ourselves wondering… are we prepared for this? In all enterprises, one of the most feared questions is…”What caused this recent dip in sales?”....

We share our skills to maximize your revenue!
Categories: DBA Blogs

Comparing Common Queries Between Test and Production

Bobby Durrett's DBA Blog - Thu, 2016-05-05 13:58

The developers complained that their test database was so much slower than production that they could not use it to really test whether their batch processes would run fast enough when migrated to production. They did not give me any particular queries to check. Instead they said that the system was generally too slow. So, I went through a process to find SQL statements that they had run in test and that normally run in production and compare their run times. I thought that I would document the process that I went through here.

First I found the top 100 queries by elapsed time on both the test and production databases using this query:

column FORCE_MATCHING_SIGNATURE format 99999999999999999999

select * from
(select FORCE_MATCHING_SIGNATURE,
sum(ELAPSED_TIME_DELTA) total_elapsed
from DBA_HIST_SQLSTAT
group by FORCE_MATCHING_SIGNATURE
order by total_elapsed desc)
where rownum < 101;

The output looked like this:


Then I found the signatures that were in common between the two lists.

insert into test_sigs values (944718698451269965);
insert into test_sigs values (4634961225655610267);
insert into test_sigs values (15939251529124125793);
insert into prod_sigs values (3898230136794347827);
insert into prod_sigs values (944718698451269965);
insert into prod_sigs values (11160330134321800286);
select * from test_sigs
intersect
select * from prod_sigs;

This led to 32 values of FORCE_MATCHING_SIGNATURE which represented queries that ran on both test and production, except for the possible difference in constants.

Next I looked at the overall performance of these 32 queries in test and production using this query:

create table common_sigs
(FORCE_MATCHING_SIGNATURE number);

insert into common_sigs values (575231776450247964);
insert into common_sigs values (944718698451269965);
insert into common_sigs values (1037345866341698119);

select
sum(executions_delta) total_executions,
sum(ELAPSED_TIME_DELTA)/(sum(executions_delta)*1000) "Elapsed Average ms",
sum(CPU_TIME_DELTA)/(sum(executions_delta)*1000) "CPU Average ms",
sum(IOWAIT_DELTA)/(sum(executions_delta)*1000) "IO Average ms"
from DBA_HIST_SQLSTAT ss,common_sigs cs
where ss.FORCE_MATCHING_SIGNATURE = cs.FORCE_MATCHING_SIGNATURE;

Here is part of the output:

TOTAL_EXECUTIONS Elapsed Average ms CPU Average ms IO Average ms
---------------- ------------------ -------------- -------------
         5595295         366.185529      241.92785    59.8682797
          430763         1273.75822     364.258421    1479.83294

The top line is production and the bottom is test.

This result supported the development team’s assertion that test was slower than production. The 32 queries averaged about 3.5 times longer run times in test than in production. Also, the time spent on I/O was about 25 times worse. I am not sure why the I/O time exceeded the elapsed time on test. I guess it has something to do with how Oracle measures I/O time. But clearly on average these 32 queries are much slower on test and I/O time probably caused most of the run time difference.

After noticing this big difference between test and production I decided to get these same sorts of performance metrics for each signature to see if certain ones were worse than others. The query looked like this:

select
ss.FORCE_MATCHING_SIGNATURE,
sum(executions_delta) total_executions,
sum(ELAPSED_TIME_DELTA)/(sum(executions_delta)*1000) "Elapsed Average ms"
from DBA_HIST_SQLSTAT ss,common_sigs cs
where ss.FORCE_MATCHING_SIGNATURE = cs.FORCE_MATCHING_SIGNATURE
having sum(executions_delta) > 0
group by ss.FORCE_MATCHING_SIGNATURE
order by ss.FORCE_MATCHING_SIGNATURE;

I put together the outputs from running this query on test and production and lined the result up like this:

FORCE_MATCHING_SIGNATURE  Elapsed ms (PROD)  Elapsed ms (TEST)
------------------------ ------------------ ------------------
      575231776450247964         20268.6719         16659.4585
      944718698451269965         727534.558          3456111.6 *
     1037345866341698119         6640.87641         8859.53518
     1080231657361448615         3611.37698         4823.62857
     2879196232471320459         95723.5569         739287.601 *
     2895012443099075884         687272.949         724081.946
     3371400666194280661         1532797.66         761762.181
     4156520416999188213         109238.997         213658.722
     4634693999459450255          4923.8897         4720.16455
     5447362809447709021         2875.37308          2659.5754
     5698160695928381586         17139.6304         16559.1932
     6260911340920427003         290069.674         421058.874 *
     7412302135920006997         20039.0452         18951.6357
     7723300319489155163         18045.9756         19573.4784
     9153380962342466451         1661586.53         1530076.01
     9196714121881881832         5.48003488         5.13169472
     9347242065129163091         4360835.92         4581093.93
    11140980711532357629         3042320.88         5048356.99
    11160330134321800286         6868746.78         6160556.38
    12212345436143033196          5189.7972         5031.30811
    12776764566159396624         139150.231         614207.784  *
    12936428121692179551         3563.64537         3436.59365
    13637202277555795727          7360.0632         6410.02772
    14067042856362022182         859.732015         771.041714
    14256464986207527479         51.4042938         48.9237251
    14707568089762185958         627.586095          414.14762
    15001584593434987669         1287629.02         1122151.35
    15437049687902878835         96014.9782         996974.876  *
    16425440090840528197         48013.8912         50799.6184
    16778386062441486289         29459.0089         26845.8327
    17620933630628481201         51199.0511         111785.525  *
    18410003796880256802         581563.611         602866.609

I put an asterisk (*) beside the six queries that were much worse on test than production. I decided to focus on these six to get to the bottom of the reason between the difference. Note that many of the 32 queries ran about the same on test as prod so it really isn’t the case that everything was slow on test.

Now that I had identified the 6 queries I wanted to look at what they were spending their time on including both CPU and wait events. I used the following query to use ASH to get a profile of the time spent by these queries on both databases:

select
case session_state when 'WAITING' then event else session_state end TIME_CATEGORY,
(count(*)*10) seconds
from DBA_HIST_ACTIVE_SESS_HISTORY
where force_matching_signature in (/* the six signatures */)
group by case session_state when 'WAITING' then event else session_state end
order by seconds desc;

The profile looked like this in test:

TIME_CATEGORY            SECONDS
------------------------ -------
db file parallel read     207450
ON CPU                    141010
db file sequential read    62990
direct path read           36980
direct path read temp      29240
direct path write temp     23110

The profile looked like this in production:

TIME_CATEGORY            SECONDS
------------------------ -------
ON CPU                    433260
PX qref latch              64200
db file parallel read      35730
db file sequential read    14360
direct path read           12750
direct path write temp     12000

So, I/O waits dominate the time on test but not production. Since db file parallel read and db file sequential read were the top I/O waits for these 6 queries I used ash to see which of the 6 spent the most time on these waits.

db file parallel read:

select
sql_id,
(count(*)*10) seconds
from DBA_HIST_ACTIVE_SESS_HISTORY
where
force_matching_signature in
('944718698451269965',
'2879196232471320459',
'6260911340920427003',
'12776764566159396624',
'15437049687902878835',
'17620933630628481201') and
event='db file parallel read'
group by sql_id
order by seconds desc;

SQL_ID           SECONDS
------------- ----------
ak2wk2sjwnd34     159020
95b6t1sp7y40y      37030
brkfcwv1mqsas      11370
7rdc79drfp28a         30

db file sequential read:

select
sql_id,
(count(*)*10) seconds
from DBA_HIST_ACTIVE_SESS_HISTORY
where
force_matching_signature in
('944718698451269965',
'2879196232471320459',
'6260911340920427003',
'12776764566159396624',
'15437049687902878835',
'17620933630628481201') and
event='db file sequential read'
group by sql_id
order by seconds desc;

SQL_ID           SECONDS
------------- ----------
95b6t1sp7y40y      26840
ak2wk2sjwnd34      22550
6h0km9j5bp69t      13300
brkfcwv1mqsas        170
7rdc79drfp28a        130

Two queries stood out at the top waiters on these two events: 95b6t1sp7y40y and ak2wk2sjwnd34. Then I just ran my normal sqlstat query for both sql_ids for both test and production to find out when they last ran. Here is what the query looks like for ak2wk2sjwnd34:

select ss.sql_id,
ELAPSED_TIME_DELTA/(executions_delta*1000) "Elapsed Average ms",
CPU_TIME_DELTA/(executions_delta*1000) "CPU Average ms",
IOWAIT_DELTA/(executions_delta*1000) "IO Average ms",
CLWAIT_DELTA/(executions_delta*1000) "Cluster Average ms",
APWAIT_DELTA/(executions_delta*1000) "Application Average ms",
CCWAIT_DELTA/(executions_delta*1000) "Concurrency Average ms",
BUFFER_GETS_DELTA/executions_delta "Average buffer gets",
DISK_READS_DELTA/executions_delta "Average disk reads",
ROWS_PROCESSED_DELTA/executions_delta "Average rows processed"
from DBA_HIST_SQLSTAT ss, DBA_HIST_SNAPSHOT sn
where ss.sql_id = 'ak2wk2sjwnd34'
and ss.snap_id=sn.snap_id
and executions_delta > 0
order by ss.snap_id,ss.sql_id;

I found two time periods where both of these queries were recently run on both test and production and got an AWR report for each time period to compare them.

Here are a couple of pieces of the AWR report for the test database:



Here are similar pieces for the production database:

[AWR report excerpt: Top 5 Timed Foreground Events and elapsed times]

What really stood out to me was that the wait events were so different. In production the db file parallel read waits averaged around 1 millisecond and the db file sequential reads averaged under 1 ms. On test they were 26 and 5 milliseconds, respectively. The elapsed times for sql_ids 95b6t1sp7y40y and ak2wk2sjwnd34 were considerably longer in test.

This is as far as my investigation went. I know that the slowdown is most pronounced on the two queries and I know that their I/O waits correspond to the two wait events. I am still trying to find a way to bring the I/O times down on our test database so that it more closely matches production. But at least I have a more narrow focus with the two top queries and the two wait events.


Categories: DBA Blogs

InnoDB flushing and Linux I/O

Pythian Group - Thu, 2016-05-05 12:06

Since documentation is not very clear to me on the topic of InnoDB flushing in combination with Linux IO (specifically the write system call), I decided to put together this article in hopes of shedding some light on the matter.

How Linux does I/O

By default, the write() system call returns after all data has been copied from the user space file descriptor into the kernel space buffers. There is no guarantee that data has actually reached the physical storage.

The fsync() call is our friend here. This will block and return only after the data and metadata (e.g. file size, last update time) is completely transferred to the actual physical storage.

There is also fdatasync() which only guarantees the data portion will be transferred, so it should be faster.

There are a few options that we can specify at file open time, that modify the behaviour of write():


O_SYNC

In this case, the write() system call will still write data to kernel space buffers, but it will block until the data is actually transferred from the kernel space buffers to the physical storage. There is no need to call fsync() afterwards.


O_DIRECT

This completely bypasses any kernel space buffers, but requires that the writes are the same size as the underlying storage block size (usually 512 bytes or 4k). By itself, it does not guarantee that the data is completely transferred to the device when the call returns.


As stated above, we would need to use both options together (O_SYNC | O_DIRECT) to guarantee true synchronous I/O.
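The difference can be sketched with Python's os module (the file paths are arbitrary examples; os.O_SYNC is the Linux flag discussed above):

```python
import os

# write() + fsync(): write() returns once data is in kernel buffers;
# fsync() blocks until data and metadata reach the physical storage.
fd = os.open("/tmp/demo_fsync.dat", os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
os.write(fd, b"durable bytes")
os.fsync(fd)
os.close(fd)

# O_SYNC: each write() itself blocks until the data is on storage,
# so no explicit fsync() is needed afterwards.
fd = os.open("/tmp/demo_osync.dat",
             os.O_WRONLY | os.O_CREAT | os.O_TRUNC | os.O_SYNC, 0o644)
os.write(fd, b"durable bytes")
os.close(fd)
```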

Relation with InnoDB flushing

Innodb_flush_method parameter controls which options will be used by MySQL when writing to files:

At the time of this writing, we have the following options:


This is the default value, and is equivalent to specifying fsync option.


fsync

Both data and redo log files will be opened without any special options, and fsync() will be used when the db needs to make sure the data is transferred to the underlying storage.


O_DSYNC

This one is confusing, as O_DSYNC is actually replaced with O_SYNC within the source code before calling open(). This is reportedly due to problems on certain Unix versions. So O_SYNC will be used to open the log files, and no special options for the data files. This means fsync() is needed to flush the data files only.


O_DIRECT

Data files are opened with O_DIRECT. Log files are opened with no extra options. Some filesystems (e.g. XFS) do not guarantee metadata without the fsync() call, so it is still used as a safety measure.


O_DIRECT_NO_FSYNC

InnoDB uses O_DIRECT during flushing I/O, but skips the fsync() system call afterwards. This can provide some performance benefits if you are using a filesystem that does not require fsync() to sync metadata.

I am deliberately not mentioning the experimental options littlesync and nosync.

There is also an extra option in Percona Server:


ALL_O_DIRECT

It uses O_DIRECT to open both the log files and the data files, and uses fsync() to flush both the data and the log files.

Which InnoDB flushing method should I use?

The general consensus is that if you have a battery-backed write cache or a fast I/O subsystem (e.g. SSDs), you should use the O_DIRECT method. However, it is better practice to run tests to determine which method provides better performance for each particular environment.


One downside of using O_DIRECT is that it requires the innodb-buffer-pool-size to be configured correctly. For example, if you accidentally leave your buffer pool size at the default value of 128M, but have 16G of RAM, the buffer pool contents will at least sit in the filesystem cache. This will not be true if you have O_DIRECT enabled (I would like to thank Morgan Tocker for his contribution regarding this section of the post).



Categories: DBA Blogs


Subscribe to Oracle FAQ aggregator - DBA Blogs