Skip navigation.

DBA Blogs

Indexes and Initrans (Blackstar)

Richard Foote - Thu, 2016-01-14 00:26
It’s been a very tough week so to help keep my mind off rather sad events, thought I’ll finish off one of my unpublished articles. Initrans is a physical attribute that determines the initial number of concurrent transaction entries allocated within each data block for a given table/index/cluster. Every transaction that updates a block has to acquire an […]
Categories: DBA Blogs

Configure High Availability – Load Balancing for Hiveserver2

Pythian Group - Tue, 2016-01-12 15:21
We are noticing a steady adoption of Hive adoption among our customers. Due to the increased workload we want to make sure that performance and availability of Hive is not compromised. The following steps will ensure the smooth functioning of Hive under increased workloads.

Configuring High Availability for Hive requires the following components to be fail proof:

1. Hive Metastore underlying RDBMS
2. Zookeeper
3. Hive Metastore Server
4. Hiveserver2

For the sake of simplicity this blog will focus on enabling HA for the Hive Metastore Server and HiveServer2. We recommend that the underlying Hive Metastore underlying RDBMS be configured for High Availability and we have configured multiple Zookeeper instances on the current cluster.

Enabling High Availability for Hive Metastore Server1. Log on to Cloudera Manager2. Click on HIVE > Hive Metastore Server. Locate the host for the Hive Metastore Server.


3. SSH to Hive Metastore Server.# vi /etc/hive/conf.cloudera.hive/hive-site.xmlExpected Output below.


4. On the Cloudera Manager Console click Hive > Configuration

Select Scope > Hive Metastore Server.
Select Category > Advanced.
Locate the Hive Metastore Delegation Token Store property.
Choose org.apache.hadoop.hive.thrift.DBTokenStore
Click Save Changes.


5. On the Cloudera Manager Console click Hive > Instances. Click on Add Role Instances.

Click on Select Hosts for Hive Metastore Server.


6. Choose multiple Hosts (at least 2 more to make a total of 3) to configure Hive Metastore Server on.

Click OK and Continue.


7. Click Finish. You should now see new hosts added as the Hive Metastore Server.Click on Restart the service (or the instance) for the changes to take effect.


8. Notice that hive.metastore.uris now has multiple instances of Hive Metastore Server.

Click on Restart Stale Service.


9. Click Restart Now.


10. Review Restart Messages.


11. Notice that you now have multiple instances of Hive Metastore Server.


12. SSH again to Hive Metastore Server.# vi /etc/hive/conf.cloudera.hive/hive-site.xmlExpected Output below. Note that new instances have been added.


So how do you know the settings are working? The following is the recommended plan for testing the High Availability of Hive MetaStore.1. SSH to any DataNode. Connect to Hiveserver2 using Beeline.

# beeline -u “jdbc:hive2://ip-10-7-176-204.ec2.internal:10000”


2. On the Cloudera Manager Console click Hive > Hive MetaStore Server.Stop the first Hive MetaStore Server in the list.


Issue “Show databases” command in the beeline shell of step 1. The command should work normally.

3. Stop the second Hive Metastore Server in the list. Issue Show databases command in the beeline shell of step 1.The command should still work normally.

4. Stop the third Hive Metastore Server in the list. Issue Show databases command in the beeline shell of step 1.This command should fail which is normal.


Expected Output from beeline below.


5. Now start a random Hive Metastore Server in the list. Issue Show databases command in the beeline shell of step 1.This command should start working normally again.

6. After testing it completed make sure you start all Hive Metastore Servers in the list.

Enabling Load Balancing and High Availability for Hiveserver2To provide high availability and load balancing for HiveServer2, Hive provides a function called dynamic service discovery where multiple HiveServer2 instances can register themselves with Zookeeper. Instead of connecting to a specific HiveServer2 directly, clients connect to Zookeeper which returns a randomly selected registered HiveServer2 instance.

1. Log on to Cloudera Manager.Click Hive > Instances. Click on Add Role Instances.

Click on Select Hosts for HiveServer2.


2. Choose multiple Hosts (at least 2 more to make a total of 3) to configure HiveServer2 on.

Click OK and Continue.


3. You should now see new hosts added as HiveServer2.

Choose the newly added instances and Choose Start.


4. Click on Close. The newly added HiveServer2 instances are now ready for use.


5. Open Hive -> Configuration -> Category -> Advanced.Find “HiveServer2 Advanced Configuration Snippet (Safety Valve) for hive-site.xml”
Add a new property as below:

Value: true


6. Go to the Cloudera Manager Home Page and Restart Hive Service.


7. You should now have multiple instances of HiveServer2.


So how do you know the settings are working? Following is the recommended plan for testing the Load Balancing for Hiveserver2.

1. As mentioned before HiveServer2 High Availability is managed through Zookeeper.


The clients connecting to HiveServer2 now go through Zookeeper. An example, JDBC connect string is as follows. Notice that the JDBC now points to a list of nodes that have Zookeeper on them.

beeline -u “jdbc:hive2://ip-10-7-176-204.ec2.internal:2181,ip-10-229-16-131.ec2.internal:2181,ip-10-179-159-209.ec2.internal:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2”

2. SSH to any data node. Connect to Hiveserver2 using Beeline.# beeline -u “jdbc:hive2://ip-10-7-176-204.ec2.internal:2181,ip-10-229-16-131.ec2.internal:2181,ip-10-179-159-209.ec2.internal:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2”


3. The connection gets routed to the HiveServer2 instances in a round robin fashion.

Issue the following command on the HiveServer2 nodes.

# tail -f /var/log/hive/hadoop-cmf-hive-HIVESERVER2-ip-10-7-176-204.ec2.internal.log.out

Issue the following command on the HiveServer2 nodes.


4. You may issue the beeline command from multiple sources and monitor the HiveServer2 logs.


So how do you know the settings are working? Following is the recommended plan for testing the High Availability for Hiveserver2.

1. On the Cloudera Manager Console click Hive > HiveServer2.Stop the first HiveServer2 in the list


Connection to Beeline using command below should work normally.

# beeline -u “jdbc:hive2://ip-10-7-176-204.ec2.internal:2181,ip-10-229-16-131.ec2.internal:2181,ip-10-179-159-209.ec2.internal:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2”

2. Stop the second HiveServer2 in the list


Connection to Beeline using command below should still work normally.

3. Stop the third HiveServer2 in the list.


Connection to Beeline using command below should fail.


4. Start the third HiveServer2 in the list.


Connection to Beeline using command below should work normally again.

5. After the testing completes make sure you start all HiveServer2 in the list.


Discover more about our expertise in Big Data and Hadoop.

Categories: DBA Blogs

Log Buffer #456: A Carnival of the Vanities for DBAs

Pythian Group - Tue, 2016-01-12 07:38

This Log Buffer Edition covers many aspects discussed this week in the realms of Oracle, SQL Server and MySQL.


  • Oracle and Informatica have a very close working relationship and one of the recent results of this collaboration is the joint project done by Informatica and our Oracle ISV Engineering team to test the performance of Informatica software with Oracle Database 12c In-memory on Oracle SPARC systems.
  • The only thing you can do easily is be wrong, and that’s hardly worth the effort.
  • Enterprise Manager 13c: What’s New in Database Lifecycle Management.
  • SnoopEE is a very interesting grass roots open source Java EE ecosystem project. Akin to NetFlixOSS Eureka it enables microservices discovery, lookup and registration.
  • Docker is an open source container technology that became immensely popular in 2014. Docker itself is written in Google’s programming language “Go” and supported on all major Linux distributions (RedHat, CentOS, Oracle Linux, Ubuntu etc.).

SQL Server:

  • This blog helps you understand Graphical Execution Plans in SQL Server.
  • New DAX functions in SQL Server 2016.
  • JSON support in SQL Server 2016.
  • PowerShell Tool Time: Building Help.
  • Datetime vs. Datetime2.


  • Peter Gulutzan discusses SQL qualified names.
  • It is not new that we can store a JSON content in a normal table text field. This has always been the case in the past. But two key features were missing: filtering based on JSON content attributes and indexing of the JSON content.
  • MySQL 5.7 Multi-Source Replication – Automatically Combining Data From Multiple Databases Into One.
  • OOM killer vs. MySQL 5.7.10 epic results with over-allocating memory.
  • Apache Spark with Air ontime performance data.


Learn more about Pythian’s expertise in Oracle SQL Server & MySQL.

Categories: DBA Blogs

David Bowie 1947-2016. My Memories.

Richard Foote - Mon, 2016-01-11 20:06
In mid-April 1979, as a nerdy 13 year old, I sat in my bedroom in Sale, North England listening to the radio when a song called “Boys Keep Swinging” came on by an singer called David Bowie who I never heard of before. I instantly loved it and taped it next time it came on the radio via my […]
Categories: DBA Blogs

Pythian – Through the Eyes of a CO-OP Student

Pythian Group - Mon, 2016-01-11 15:32


I worked at Pythian as a Global Talent Acquisition Coordinator (co-op student) within the company’s human resources function in Ottawa, Canada. At the core of my role, I provided direct support to my colleagues in guiding job applicants through our hiring process. I also had the opportunity to take on and create small projects for myself during my four months there.

Since our past technical CO-OPs have written fantastic blogs about their experiences at Pythian (be sure to read them!), how about I write this one about the business side of working there? Here are my top three reasons why any business student would want to work at Pythian:

1. What better place is there to develop cross-cultural literacy?

With Pythian, I had the pleasure of working with remote and international colleagues for the first time. Top that with actively communicating with a global pool of job applicants on a daily basis. Succeeding in this kind of environment definitely requires you to be cross-culturally literate, which means that you understand how cultural differences—both inside and outside an organization—affect a business’s day-to-day practices.

In business school, we are often reminded about the importance of considering the external environment when a firm goes global (CDSTEP, anyone?), so it was quite eye-opening to see how my experience at Pythian really validated my studies. For example, processes that are of no legal concern in Canada might present a huge obstacle when hiring abroad, and pieces of information that North Americans prefer to keep private are quite openly discussed in other cultures. Talking to candidates from around the world definitely taught me how to think more critically about my communication—not only in terms of what I say, but also how I say it.

2. It feels nice not to be just “the CO-OP student”.

Upon my first day, I immediately felt that I would not be looked at as simply an intern. My team greeted me with open arms (already knowing my name!) and repeatedly emphasized the value of having me on board throughout my term. Within days, I could already understand the significance of my tasks and how they contributed not only to the team, but also to the organization as a whole.

Another great thing about not being just “the CO-OP student” is empowerment. At Pythian, you are really treated like a colleague rather than a subordinate. I never worked for my boss, but I always worked with her. During my term, my team enthusiastically invited me to explore our work processes and offer ideas to make things better. It was pretty cool to see my thoughts being listened to and put into action, even after my departure!

3.There’s nothing more rewarding than stepping out of your comfort zone.

One of the things I hated doing most before working at Pythian was using the phone. If you run a quick Google search, you will find many studies showing that Millennials are not as comfortable with making phone calls as their predecessors were—and I could speak to that, 100 percent! Want me to order a pizza? I’ll do it through the restaurant’s online platform. Am I due for a dentist appointment? I’ll ignore the receptionist’s voicemails until she sends me an e-mail (although my telephonophobia might not be the only reason for that one!).

My colleagues helped me overcome this discomfort by having me conduct reference checks. As embarrassing as it might sound, I actually had to take the greatest leap of faith to get my fingers dialling for the first time. Although I certainly had a mini heart-attack whenever I was asked to help with a reference, I eventually eased into the task with time. While I might still shy away from the telephone now and then, I really do feel accomplished in getting more comfortable with using less texting and more talking!

All in all, my experience at Pythian has been nothing less than fantastic. It has truly been a pleasure to work with a diverse group of people from around the world, and I would be thrilled to see my future take me back there one day. If you’re looking to intern for a company with a global focus and an information-sharing, empowering culture, then you would definitely love to join Pythian!

Categories: DBA Blogs

Use Case for SQL Server Table Partitioning

Pythian Group - Mon, 2016-01-11 15:31


Often we are asked by our clients about Table Partitioning, and specifically, which tables will be good candidates to be partitioned?
Here are some of the main use cases for Table Partitioning:

  1. You have the Enterprise Edition of SQL Server 2005 or higher.
  2. The table contains or will contain more than 5-6 million rows and growing fast, or its size is growing by around 1GB per month or more.
  3. The FULL backup is taking too long or the backup file is too large and older data is not being updated (i.e.: users can only update data from the last 3 months).
  4. Data needs to be archived or purged on a regular basis, potentially the current archiving or deletion of data is causing blocks or deadlocks to other processes.
  5. There is a NOT NULL date column or another NOT NULL sequential column that the table can be partitioned upon.
  6. Better if most queries are including the partitioned column in WHERE clauses (i.e: between the date range).

When partitioning a table, here are few things to look into:

  1. Create a file and a filegroup per partition (even if the files are created in the same place). This way, it is easy to backup (i.e.: FILEGROUP backup), maintain and archive/purge.
  2. The best way to partition a table is by a date column because data is usually archived or purged by a date. If you do not have such a column, you may want to consider adding a column that will contain the current date/time when the row is created. This column can contain the default GETDATE(). Using an ID or a calculated column may cause too many headaches.
    If the application returns an error when adding the date column, consider using a view on top of the underlying partitioned table.
  3. Partitioning requires maintenance:
    1. To add files, filegroups and partitions on a regular basis and in advance.
    2. Monitor data growth and potentially modify the partition architecture (i.e.: move from a monthly partition to a weekly or daily partition).
    3. Archiving/purging partitions on a regular basis. Consider using a SWITCH partition for quick archiving and purging.
    4. You can defragment indexes per partition.
    5. Remember that statistics cannot be updated by partition, however, you may want to consider FILTERED INDEXES or FILTERED STATISTICS to avoid updating statistics on the entire table, as well as improving performance in specific cases.
  4. Consider using MAXDOP <> 1 on the instance level or for specific queries that span multiple partitions in order to take advantage of parallelism. Configure parallelism with caution.

Additional links:

There is a lot of information around SQL Server Table Partitioning. Here are some of the useful links:

SQL Server Database Partitioning Myths and Truths
Table Partitioning in SQL Server – The Basics
How To Decide if You Should Use Table Partitioning
Query Processing Enhancements on Partitioned Tables and Indexes


Discover more about our expertise in SQL Server.

Categories: DBA Blogs

Oracle RAC Scalability: Slides from Oracle OpenWorld 2015

Pythian Group - Fri, 2016-01-08 10:57

Finally getting some time to post the slides from Open World 2015. This is a presentation about Oracle RAC Scalability. The presentation talks about the type of challenges on how to scale an application horizontally via Oracle RAC. Generally speaking – it’s quite easy and mostly “works”. There are only a few very specific, but quite common things to address, and mostly about write-write contention.

Here are the slides:

Christo kutrovsky oracle rac solving common scalability problems from Christo Kutrovsky
Categories: DBA Blogs

Trying Python and Pyplot for Database Performance Graphs

Bobby Durrett's DBA Blog - Wed, 2016-01-06 16:16

In the past I have used Excel to graph things related to Oracle database performance. I am trying out Python and the Pyplot library as an alternative to Excel.  I took a graph that I had done in Excel and rewrote it in Python. The graph shows the CPU usage within the database by category.  For example, I labeled the database CPU used by a group of web servers “WEBFARM1” on the graph.

Here is an example graph:


You can find most of this code in the Python section of my GitHub repository. Here is the code that I used to create the example graph above using some made up data: zip

To make this graph in Excel I was running a sqlplus script and cutting and pasting the output into a text file that I imported into Excel. Very manual. No doubt there are ways that I could have automated what I was doing in Excel. But I have studied Python as part of the edX classes I took so I thought I would give it a try.

Python let me write a program to run the graph from an icon on my desktop. I used the cx_Oracle package to pull the data from the database and Pyplot for the graph.

I’m running the Windows 32 bit version of Canopy Express for my Python development environment. This environment comes with Pylot so I just had to install cx_Oracle to have all the packages I needed to make the graph.

I think both Excel and Python/Pyplot still have value. Excel still seems easier for quick and dirty graphing. But I used Python to automate a report that I run every day with fewer manual steps.  Probably could have done the same thing in Excel but I have recently studied Python so I was able to apply what I learned in my classes without a lot more effort.




Categories: DBA Blogs

Error when starting #GoldenGate Extract against MS SQL Server

DBASolved - Wed, 2016-01-06 16:10

If you work with Oracle GoldenGate long enough, you will eventually have to setup against a Microsoft SQL Server. Being that GoldenGate is a heterogeneous application, this isn’t a problem; however there are small differences. One such difference is how the exact/replicat will connect to the MS SQL Database.

In an Oracle-to-Oracle configuration, you would just use a command line the following from the command line:

GGSCI> dblogin useridalias [ alias name]
GGSCI> dblogin userid [ user name ] password [ password ]

In a MS SQL Server environment, you can still login at the GGSCI command prompt with the following:

GGSCI> dblgoin sourcedb [ dns ]

You will notice the difference, which is the use of an ODBC DNS entry. Although setting up the ODBC DNS entry is not the point of this post, just keep it in mind that is is required when connecting to MS SQL Server with Oracle GoldenGate.

After setting up the ODBC DNS, you will need to add the following to the extract/replicat parameter file to enable the process to connect to the database.

sourcedb [ dns ]

Note: I normally put my connection information in a macro to modularize my parameter files. Please it makes it easier if it needs to change.

MACRO #logon_settings
sourcedb [ dns ]

Now, when you go to start the extract/replicat, you may get the following error:

ERROR OGG-00551 Database operation failed: Couldn’t connect to [ dns ]. ODBC error: SQLSTATE 37000 native database error 4060. [Microsoft][SQL Server Native Client 11.0][SQL Server]Cannot open database “db_name” requested by the login. The login failed.

The error message is a little bit misleading. It tells you that the process cannot connect to the database which you were able to connect to from the GGSCI command prompt with no issue. Why is this? The issue lies in the fact that the manager (MGR) process is running as a service and does not have the correct permissions to access the database from the service.

In searching MOS for this error, I was found Note ID: 1633138.1. In this note, notice that this issue is known as of Oracle GoldenGate version 12.1.2.x.x. The note also provides you a fix to this issue. In simple terms, since the manager process is running as a service; additional permissions have to be granted to manger.

To grant the SYSADMIN privilege for the manager process follow the below sequence of steps (on windows after all):

1. Manager is installed as service:
Open SQL Server Management studio -> Security ->login>select NT AUTHORITY\SYSTEM ->Right Click -> Properties–>Server Role –>Enable sysadmin role

2. ggsci>stop mgr

3. ggsci>start mgr

4. ggsci>start extract <extract-name>

After granting the sysadmin role, the extract will start.


Filed under: Golden Gate
Categories: DBA Blogs

Exporting solutions from #GoldenGate Studio

DBASolved - Tue, 2016-01-05 23:21

In doing some testing with Oracle GoldenGate Studio, I decided to create a test solution that can be moved from studio-to-studio. In order to move the test solution from studio-to-studio, it has to be exported first. This post will be about how to export a solution so it can be archived or shipped to a co-worker.

To export a solution, you will start in the Projects window. After opening the project, you will see a little red puzzle piece under the “Solutions”.

By right-clicking on the solution name, you are presented with a context menu that provides a few options for dealing with solutions within Oracle GoldenGate Studio. The option you are interested in, is at the very bottom of the context menu. This is the export option.

After selecting the “export” option, studio will open a small wizard that allows you to provide information and options for the solution that is to be exported. Everything on the export screen can be edited; however, the only thing that should not be changed is the “Advanced Options”. Provide a directory where the export should reside and provide an encryption key (optional).

When everything is filled out as you want, click “ok” and the export will be done. At the end of the export, should be pretty quick, you will receive a message saying that the export completed.

Once the export is completed, you will find the XML file in the directory you specified in the export wizard. This XML file can be opened up with any text editor and reviewed. A sample of the XML content is provided below.

The beauty in this XML file is that everything created in studio is contained within it. This makes it every simple and easy to email to co-workers or others if they want to see the architecture being worked on. Making collaboration on GoldenGate architectures easier.


Filed under: Golden Gate
Categories: DBA Blogs

How to Deploy a Cluster

Pythian Group - Tue, 2016-01-05 12:15


In this blog post I will talk about how to deploy a cluster, the methods I tried and my solution to resolving the prerequisites problem.

I’m fairly new to the big data field. Learning about Hadoop, I kept hearing the term “clusters”, deploying a cluster, and installing some services on namenode, some on datanode and so on. I also heard about Cloudera manager which helps me to deploy services on my cluster, so I set up a VM and followed several tutorials including the Cloudera documentation to install cloudera manager. However, every time I reached the “cluster installation” step my installation failed. I later found out that there are several prerequisites for a Cloudera Manager Installation, which was the reason for the failure to install.


Deploy a Cluster

Though I discuss 3 other methods in detail, ultimately I recommend method 4, provided below.

Method 1:

I ran my very first MapReduce job on “labianchin/hadoop” Docker image. This was easy to use as it comes with Hadoop pre-installed. However it did not contain any other services. I manually installed Hive and Oozie, but soon ran into configuration issues.

Method 2:

A lot of people recommend using Cloudera’s quickstart-VM. Sadly my system did not meet the memory requirements, which led to me to look for other methods.

Method 3:  

Cloudera Manager installation on Google cloud. I will be discussing this method in detail below.

Method 4:

Cloudera Docker image. Released on December 1st, 2015. This is by far the quickest and easiest method to use Cloudera services. All services are pre-installed and pre-configured. I recommend the following by installing a single Google compute engine VM instance, as described below then install Docker, pull image and run the container.


Step-by-Step Guide to Cloudera Manager Installation on Google Cloud (Method 3)

I will be creating a cluster consisting of 4 nodes. 1 node will run Cloudera Manager and remaining 3 nodes will run services.

Create a Google compute engine VM instance:

  1. From the Developers console, under Resources, select ‘Compute Engine’.
  2. Click on new instance on top of the page.
  3. Enter an instance name.
  4. Select a time zone.
  5. Select machine type as ‘2vCPUs 13GB RAM n1-highmem-2’ with 80 GB disk and CentOS 6.7 image (Change disk space according to requirements).
  6. Create.
  7. When the instance is ready, select it and open in Edit Mode.
  8. Scroll down to ssh keys and enter your public key here. If you do not have a public key run the following commands or generate one using PuTTY:
    • ssh keygen -t rsa
    • cat ~/.ssh/
  9. Save.
  10. Create 3 clones of this instance.
  11. Start all 4 instances.
  12. Follow steps from “Repartition a root persistent disk” to expand your disk to allotted size [Repeat on all instances].


  • To allow nodes to SSH to each other, edit /etc/hosts file to include hosts from each node. Below is an example [Repeat on all Instances]:

View the code on Gist.

  • Change swappiness to minimum value without disabling it [Repeat on all instances]:

View the code on Gist.

  • Disable iptables:

View the code on Gist.

  • Disable redhat_transparent_hugepage:

View the code on Gist.

  • Install MySQL:

View the code on Gist.

  • Install Java:

View the code on Gist.

  • Download mysql-connector [Repeat on all instances]:

View the code on Gist.

  • Install Cloudera manger:

View the code on Gist.

  • Create databases and users:

View the code on Gist.

  • Update database name, user and password in ‘’:

View the code on Gist.

  • Start cloudera-server and observe logs until “Jetty Server started” appears. This may take a while:

View the code on Gist.

  • Access cloudera manager from the browser to complete installation:
    • Install PuTTY.
    • Open your private key file in PuTTY pageant. By default this should be located in C:/users/username/.ssh/filename.ppk . PageAnt icon will appear in the system tray.
    • Fill in external IP of VM instance (of node where cloudera server is running) as hostname in PuTTY.
    • From the column on right, go to SSH > tunnels.
    • Enter the internal IP of VM instance in the destination textbox with port 7180. E.g.
    • For ease of remembering ports, set the source port as 7180 (Same as destination port). You can choose to redirect to another port if 7180 is not available. 7180 is the default port for Cloudera manager.
    • Apply and Open the connection.
    • Open the browser and go to “localhost:7180”.
    • Proceed with cluster installation using Cloudera manager.

Cluster Installation:

  • Login as “admin” , “admin”.
  • Accept the Terms and Conditions. Then continue.
  • Select ‘Cloudera Express” or “Cloudera Enterprise Data Hub Edition Trial”. Then continue.
  • Search for your machine’s hostnames. e.g.


  • On the “Cluster Installation” page continue with all default settings.
  • On the “JDK Installation Options” page select “Install Oracle Java SE Development Kit (JDK)” and “Install Java Unlimited Strength Encryption Policy Files”. Continue.
  • Do not select “Single User Mode”. Continue
  • On “Provide SSH Login credentials” page, select Login to all hosts as ‘Another User’ with Authentication method ‘All hosts accept same private key’. Enter the username from SSH key that was added to GCE instance (This is the same user that logged in to PuTTY session). Select the private key file stored on your local machine. Continue without passphrase.
  • On the next page, the cluster installation may take a while. (NOTE: If you install Cloudera manager without following the prerequisites, installation will fail at this step).



  • Once the installation is complete, continue to install parcels and inspect hosts. Finally, continue to the Cluster Setup.


  • Select ‘Core Hadoop’ when asked to select CDH5 services.
  • When Assigning roles, I like to assign all ‘Cloudera Management Service’ roles to one node (in my case ‘cloudera-cm’) and distribute all other roles evenly on the remaining 3 nodes. Here is one possible assignment of roles:


  • On the Database Setup page, set all ‘Database Host Name’ fields to the node running the Cloudera-server. Enter Database name, Username and Passwords that were created in MySQL earlier.


  • Review the changes. Now the cluster will be setup and services deployed. This is the final step.


  • You are now ready to use services directly from console or access Hue on port 8888. Good Luck :)


P.S. I would like to thank Manoj Kukreja for showing me the right way of deploying clusters.


Discover more about our expertise in Big Data.

Categories: DBA Blogs

SQL Server 2016 – AlwaysOn Basic Availability Group

Pythian Group - Tue, 2016-01-05 11:46

BAG2Recently I was playing with the SQL Server 2016 CTP 2 and noticed that there was a new feature introduced – Basic Availability Group (BAG) for AlwaysOn. This feature is supported in SQL Server 2016 Standard or higher edition. With this feature we can create a fail-over solution for a single database, that means, if you want to add one or more database(s) it is not permitted. While you are doing a setup using the Basic Availability Group you will notice that you do not see an option that says “Readable Secondary” to “Yes”; this is because of the fact that Basic Availability Group replaces the Database Mirroring (DBM). And that means the secondary database can not be used to off-load the reporting traffic unlike you do in the SQL Server 2012 Enterprise edition.

This made me wonder. Why had MS made an improvement in the High Availability feature by adding BAG when it was already there? I re-read the feature lists and I found my answer – Standard edition of SQL Server 2012 does not have support for AlwaysOn Availability groups. Thus MS has added a support in the SQL Server 2016, and, it now allows a Single Database per Basic Availability Group. This also means that you can create multiple Basic Availability Groups, as per requirement. I can say this, because at the time of writing this I could create at least two such groups. Though, at this moment, I am unaware of how many such groups can be created.

To conclude, let me list the features and limitations for the Basic Availability Group (BAG) that I can think of at this moment.


  1. Supports Standard Edition and Higher.
  2. You can create BAG for single database.
  3. You can create multiple (At least two, while I am writing this post) BAG.













  1. Can have only one database.
  2. Limited to Primary and Secondary, meaning two replica only.
  3. No readable Replica, that means NO to offloading reporting to a secondary replica,
  4. Can not off-load backups as well.














Discover more about our expertise in SQL Server.

Categories: DBA Blogs

FREE webinar about Advanced #Oracle Database Administration

The Oracle Instructor - Tue, 2016-01-05 03:45

On behalf of Oracle University, I will be delivering a free webinar, Tuesday 26th January 2016:

Expect live demonstrations as the major content – I promise to keep the marketing part to a minimum :-)

Categories: DBA Blogs

Big Data Co-op Experience at Pythian

Pythian Group - Thu, 2015-12-31 11:40


I joined Pythian as a Big Data co-op for duration of 4 months. At the time I joined, I had only heard of Big Data as a buzzword in the technical community. So when I was selected for the job I was quite nervous and at the same time very excited and eager to learn.

During my first 2 weeks, I attended BORG sessions with department heads to learn about all departments and how company is run. My favourite one was the session with CEO Paul Vallée where he talked about how he started the company in his basement and his vision for the future. I was repeatedly hearing “Pythian hires the top 5% talent in the industry”, which was intimidating thinking that I’m surrounded by brilliant people and I didn’t know much. However, that feeling didn’t last very long once I got to know my coworkers. Everyone at Pythian is very helpful and there are several mediums for knowledge transfer among employees. Pythian is constantly hiring so if you think you’re new and feel hesitant at first, remember there will be someone newer than you in 2 weeks or less.

Some of the exciting events I encountered during my work-term included the Pythian days, Quarterly all hands meeting, Pythian Future Leaders session for career growth, the holiday party and of course the headquarter move from St-Laurent to Westboro. I also enjoyed the weekly Pythianology sessions which are a great way to learn about best practices for dealing with data.

Being part of the Big Data team, I had to learn to work in a distributed environment. I worked from the Ottawa headquarters while my team members worked remotely from other cities in Canada and worldwide. This is undoubtedly one of the most important skills I learned during my time at Pythian.

Coming to an end of my work-term, I wouldn’t say I’m an expert in Big Data and ready to deal with customers just yet, but I sure learned the type of challenges that customers face. I like to tackle problems on my own, so I strived to get things running with help from Google, Stackoverflow and the Pythian blog before I asked team members for help. Every now and then I would hit a roadblock in which case my team members would guide me in the right direction. An important lesson I learned is that theoretical knowledge is just as important as practical knowledge. As a programmer, I had trained myself to learn by practice so I wanted to get started with my project as quickly as I could. In doing so I skipped over some of the reading I should have focused on. Due to this I struggled with errors that could have been avoided and saved me a lot of time. In my experience, dealing with Big Data technologies is very different from programming (at least, at first). Before you can get to the programming stage, you need to deal with A LOT OF configurations. I’m happy to say that I not only learned about technologies directly related to Big Data like hadoop, MapReduce, Hive, Oozie, Hue, CDH etc. but learned a lot more. In the last 4 months I have dealt with at least 4 flavours of Operating Systems including Windows 7, windows 10, Ubuntu and CentOs. I have worked with Virtual Machines running locally on my system and Virtual machines running on the cloud. I worked mostly on Google Cloud but at some point for a brief period of time explored Amazon Web Services and Microsoft Azure services. I also played around with docker images which I found to be pretty cool and plan to keep using them in future.

I confess that if I had explored Big Data technologies on my own time, I would have given up very early on, but being at Pythian allowed me to stay committed to achieving my goal. I was allowed to learn at my own pace while experimenting and failing repeatedly. I got help from my team members when needed, which helped me to achieve my goal for the work-term. Being new to Big Data can be overwhelming seeing how fast the field is growing and hearing about the large variety of technologies there are. An important piece of advice I received is to not try and learn too much at once but take one technology at a time and understand it well. Having gotten this far, I’m motivated to dive in deeper and to explore things further to improve my skills.

I highly recommend students to join Pythian for coop terms for a chance to learn from industry leaders worldwide. At Pythian you will not be treated as an intern but as a full-time employee.  You will gain knowledge about technologies that are not covered in school. You will get an opportunity to not only learn technologies relevant to your area of interest, but develop transferable skills which will be beneficial in several fields.

Categories: DBA Blogs

MySQL Benchmark in the Cloud

Pythian Group - Thu, 2015-12-31 10:50


Testing functionalities and options for a database can be challenging at times, as a live production environment might be required. As I was looking for different options, I was directed by Derek Downey to this post in the Percona blog.

The blog discussed an interesting and fun tool from Percona, tpcc-mysql. I was interested in testing the tool so I decided to play around with it in an AWS EC2 server.

In this post I will expand on the Percona blog post, since the tool lacks documentation, as well as explain how I used it to create a MySQL Benchmark in AWS.

Why tpcc-mysql?

There are various reasons why tpcc-mysql could be a good option to use for a benchmarking project. The following points highlights most of them:


  • Mimics a full DB structure of a real warehouse.
  • Simulates a real life load on the server.
  • Options and flexibility.
  • Very light footprint on the system.


  • No documentation.
Getting the Server Started

You’ll probably need to launch a new EC2 server from the AWS Console, or use an existing one that you already have up an running. Either way, you had better save the current state of your database. Luckily, AWS EBS offers really good and convenient solution to achieve this.

It is possible to create and manage sanpshots of EBS volumes in the AWS Dashboard with some very basic steps. I personally prefer to setup the MySQL base and data directories together in a different volume from from the root volume. This allows me to swap between different versions and data-sets without having to reconfigure my tools every time I load a snapshot.


Writing a  good description helps when creating new volumes.


Possible suggestions come up as you start typing based on descriptions .

 Setting up the Benchmark

Once you have taken your snapshot and configured you MySQL, move on to setup. First we’ll need to setup the prerequisites.

tpcc-mysql uses mysql_config  is part of the libmysqlclient_dev package. We also need Bazaar. So we’ll go ahead and install that:

sudo apt-get install libmysqlclient_dev
sudo apt-get install bzr


Install & Compile spcc-mysql

Use following commands to download the tpcc-mysql source code and compile it:

bzr branch lp:~percona-dev/perconatools/tpcc-mysql
cd tpcc-mysql/src
make all


Prepare the Database & Create Required Tables

Once the the tpcc-mysql has been compiled, we will need to prepare the database for the benchmark. This will consist of running a few scripts to create the required database, tables, and generate random data to use during the testing process.

Following these steps will create the database and tables made for us, they are all part of the tpcc-mysql package:

cd ~/tpcc-mysql
# 1. Create Database to be load data in 
  mysql -u root -p -e "CREATE DATABASE tpcc1000;"
# 2. Create the required table definitions 
  mysql -u root -p tpcc1000 < create_table.sql
# 3. Add foreign keys and indexes  
  mysql -u root -p tpcc1000 < add_fkey_idx.sql

The following tables are created from the previous step:

$ mysql -u root -p tpcc1000 -e "SHOW TABLES;"
Enter password:
| Tables_in_tpcc1000 |
| customer           |
| district           |
| history            |
| item               |
| new_orders         |
| order_line         |
| orders             |
| stock              |
| warehouse          |

As you can see, tpcc-mysql mimics a warehouse’s database that tracks clients, items, orders, stock, … etc

Prepare the Database & Create Required Tables

The last step remaining before we can start our test is to populate some data into the tables. For that, tpcc-mysql has a script, tpcc_load, that does the job.

The tpcc_load script generates random dummy data in the tables created in the previous steps. The script also have a parameter that allows to specify how many warehouses you want to simulate.

The script usage is as follow:

tpcc_load [server] [DB] [user] [pass] [warehouse]

In our example, we’ll use the following:

./tpcc-mysql/tpcc_load tpcc1000 root "$pw" 2
Beginning the Benchmarking Process

This would be a good time to take a snapshot of your server/dataset, so you can come back to it. Also, before we get started, let’s get familiar with the script we need to use for starting the benchmarking process, tpcc_start. 

The script will start creating transactions that would execute various statements like SELECT, UPDATE, DELETE, and INSERT. The script will also be generating a detailed output of the progress and a summary in the end. You can redirect this output to a file to run some analysis, compare it later on, or use it to run an analysis.

The script comes with various parameters to give you flexibility to configure it as you desire:

tpcc_start -h[server] -P[port] -d[DB] -u[mysql_user] -p[mysql_password] -w[# of warehouses] -c[# of connections] -r[warmup_time] -l[running_time]

Now let’s get to the fun part!

We’ll be using the following command will start a simulation of warehouse transactions, and record the output in the file tpcc-output-01.log

./tpcc_start -h127.0.0.1 -dtpcc1000 -uroot -p -w2 -c16 -r10 -l1200 > ~/tpcc-output-01.log
Analyzing the Output

tpcc-mysql comes with different scripts that could be used for analysis. Check the tpcc-mysql/scripts folder. Example of some scripts are:

$ ls ~/tpcc-mysql/scripts/    

Visual Analysis of the Output

We can always take these tests a step further in many different directions. Since plotted data is a lot of fun, why not do a quick experiment with it?

The same blog post I used as my reference for this post also has a modified version of script that comes with tpcc-mysql. The script is named What this script does is that it extracts the time and # of transactions for each time block in a format that gnuplot can read for plotting the data. So let’s use the script on the output file:

./ tpcc-logs/tpcc-output-01.log tpcc-analyzed/time_tr_data_01.txt

To install gnuplot you simply run:

sudo apt-get install gnuplot

Then, we can create the plot using the  script (from here as well) as follows:

./ tpcc-analyzed/time_tr_data_01.txt tpcc-graphs/graph01.jpg

 And this generated the following plot for me:


I hope this was helpful. As you can see, there is a lot of potential of things that can be done using tpcc-mysql. If there is anything that you come up with or experiment with, I would love to hear it from you.


Discover more about our expertise in MySQL and the Cloud.

Categories: DBA Blogs

SQL On The Edge #6 – SQL AlwaysEncrypted

Pythian Group - Thu, 2015-12-31 10:42

Security is on everyone’s mind these days in the IT (and the real) world. Either because they’re dealing with compliance, risks or mitigation, etc. at work or because they just saw on the news yet another item about some big leak/breach happening. It is said that it’s not a question of if your systems will be attacked but when. As part of the SQL product family, Microsoft has now released a new feature called AlwaysEncrypted to continue risk mitigation and strengthen the security story of the product. And I mentioned the SQL ‘product family’ instead of just SQL Server because this feature is also available on Azure SQL Database.


What is it?
AlwaysEncrypted is the latest in the set of features that enables encryption inside SQL Server. Let’s look at the list so far:

  • Column level encryption
    This targets specific columns in specific tables, with the encryption/decryption happening at the server.
  • Transparent Database Encryption (A.K.A TDE): This targets entire databases and is transparent to the calling application. It’s also transparent to any user with proper access to the data.
  • AlwaysEncrypted: This also targets specific columns in specific tables, with the encryption/decryption happening ON THE CLIENT.

This is the big difference of this new feature, that the operations to encrypt/decrypt happen on the client NOT on SQL Server. That means that if your SQL Server is compromised, the key pieces to reveal the data are NOT with the server. This means that even if your DBA wants to see the data, if they don’t have access to the CLIENT application then they won’t be able to see the values.


How Does it Work?
This feature can be enabled through T-SQL or through a wizard in Management Studio. The actual data manipulation is done by the latest version of the ADO .NET client and during configuration, the client will read all of the data, perform the encryption and send it back to SQL Server for storage. The latest 4.6 release of the .NET framework is required. There’s a Column Master Key that will have to be stored in a Windows certificate store, Azure Key Vault or other 3rd party key storage software. During normal application operation, the ADO client will read this master key and use it to decrypt and encrypt the values.

There are two options for this type of encryption:

  1. Randomized
    This will make the same source values encrypt into DIFFERENT encrypted values. Useful for columns that could be correlated by looking at them and won’t be used for searching.
  2. Deterministic: This will make the same source values encrypt into the SAME encrypted values, thus allowing for indexing and searching.


For the demo, check the video below where we’ll use the SSMS Wizard to enable AlwaysEncrypted on a column and will show the decryption happening in SSIS using the ADO .NET client!



Discover more about our expertise in SQL Server.

Categories: DBA Blogs

How to Add Two Ethernet Interfaces with Azure VM

Pythian Group - Thu, 2015-12-31 09:55


Recently, working with my test VM on Azure, I needed a second network interface on my VM, but found no way to add one. I was using a standard A2 size Oracle Linux VM. I tried to search it in GUI interface and settings but could find no obvious way to add it. This surprised me so I continued to search and I found some blog posts on how to do it using Azure PowerShell. I discovered that there is no way to add the interface when the VM was already created and that you have to use Azure PowerShell for that. So, if you are a Mac user, as I am, and have only Azure CLI for Mac then you need to find a Windows box. I hope it will be fixed in future releases and that you will be able to manage networks from GUI or using any command line tool provided for Mac, Linux or any other platform. Thus, I will try to explain how you can create a new VM with 2 or more NIC.

First, we need to get a Windows machine to run Azure PowerShell commands. I created a small Windows box on Azure itself, explicitly to run the PowerShell when I need it. I’ve chosen basic A1 size Widows 2012 and installed the PowerShell there. It worked fine except you need to be careful if you use more than one monitor and rdp client for Mac. By default it was trying to use all monitors and in my case I got 1.5 screens (one was cut by half because it could not fit to my monitor). I removed check “Use all monitors” in my configuration for that connection in the rdp client. So, the first obstacle was resolved and I continued to work with the next steps.

Next, we will need “ImageName” to create a new machine. It can be checked using “Get-AzureVMImage” command. For Oracle linux it looks like:

PS C:\> Get-AzureVMImage | Where-Object { $_.Label -like "*Oracle Linux*" }
VERBOSE: 5:50:58 PM - Begin Operation: Get-AzureVMImage

ImageName : c290a6b031d841e09f2da759bbabe71f__Oracle-Linux-6-12-2014
OS : Linux
MediaLink :

Using the given ImageName we can now proceed. Keep in mind that you cannot create a VM with two or more NICs for an A2 size box. For two NICs you need at least a Large (A3) for 2 interfaces or an ExtraLarge(A4) if you need 4 NICs.
Let’s set up the image name:

PS C:\> $image = Get-AzureVMImage -ImageName "c290a6b031d841e09f2da759bbabe71f__Oracle-Linux-6-12-2014"

You need to setup your subscription ID for the session in PowerShell and a storage account:

PS C:\> Set-AzureSubscription -SubscriptionId "321265e2-ffb5-66f9-9e07-96079bd7e0a6" -CurrentStorageAccount "oradb5"

Create a custom config for our VM :

PS C:\> $vm = New-AzureVMConfig -name "oradb5" -InstanceSize "Large" -Image $image.ImageName
PS C:\> Add-AzureProvisioningConfig -VM $vm -Linux -LinuxUser "otochkin" -Password "welcome1"

I’ve created a virtual network “Multi-VNet” with two subnets for my VM and named the subnets as “Public” and “Private”. The Virtual network and subnets you can create in a GUI portal or using command line. I am going to use those subnets for my NICs.
Adding the first subnet to our VM configuration:

PS C:\> Set-AzureSubnet -SubnetName "Public" -VM $vm

Setting static IP for the network:

PS C:\> Set-AzureStaticVNetIP -IPAddress -VM $vm

Adding the second interface to the configuration.

PS C:\> Add-AzureNetworkInterfaceConfig -name "eth1" -SubnetName "Private" -StaticVNetIPAddress -VM $vm

And we can deploy our custom VM now:

PS C:\> New-AzureVM -ServiceName "test-1" -VNetName "Multi-VNet" -VM $vm
WARNING: No deployment found in service: 'test-1'.
VERBOSE: 6:59:03 PM - Begin Operation: New-AzureVM - Create Deployment with VM oradb5

OperationDescription OperationId OperationStatus
——————– ———– —————
New-AzureVM b7fcb2de-eac7-3684-aa8b-d1e9addc4587 Succeeded
VERBOSE: 7:01:08 PM – Completed Operation: New-AzureVM – Create Deployment with VM oradb5

The VM is created and you can check and connect to it. You don’t need your Windows box anymore and can shut it down to save money:

MacBook:~ otochkin$ azure vm list -v
info: Executing command vm list
verbose: Getting virtual machines
data: Name Status Location DNS Name IP Address
data: -------- ------------------ -------- --------------------- ----------
data: winman1 ReadyRole East US
data: oradb5 ReadyRole East US
info: vm list command OK
MacBook:~ otochkin$

You can connect to your freshly created VM and check the network:

[root@oradb5 ~]# ifconfig -a
eth0 Link encap:Ethernet HWaddr 00:0D:3A:11:A3:71
inet addr: Bcast: Mask:
inet6 addr: fe80::20d:3aff:fe11:a371/64 Scope:Link
RX packets:337 errors:0 dropped:0 overruns:0 frame:0
TX packets:353 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:54281 (53.0 KiB) TX bytes:49802 (48.6 KiB)

eth1 Link encap:Ethernet HWaddr 00:0D:3A:11:AC:92
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)

lo Link encap:Local Loopback
inet addr: Mask:
inet6 addr: ::1/128 Scope:Host
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)

[root@oradb5 ~]#

And that’s it. In summary I can say that it is not difficult to create more than one interface on an Azure VM, but I think it can be a good addition to GUI (Azure portal). In my next blog post I will try to check other aspects of using Oracle and Linux on Microsoft Azure. Stay tuned.


Discover more about our expertise in the Cloud.

Categories: DBA Blogs

Good Habits & Gratitude

Pythian Group - Thu, 2015-12-31 09:40


One of my favourite books is The Power of Habit: Why We Do What We Do in Life and Business. With the New Year ahead and many of us focused on how we can continue to “up our game” by creating or changing our habits, it serves as a great read!

When most of us reflect on our habits, we tend to focus on our “bad” habits as opposed to the habits that are “good” or positive. I try to take a different approach by considering all the things that I do well, and incorporating those habits into my daily routine, including being grateful.

Two years ago I received the gift of a leather bound book. There were many use cases for the book, including note taking for work, collecting my favourite quotes, random sketches or jotting down ideas. I chose to use my book as a gratitude journal. At that time, a number of books about gratitude and the art of keeping a gratitude journal hit several bestseller lists. While I didn’t begin one then, I was keen on the new daily habit of documenting my gratitude.

As far as new habits go, this one was fairly easy to adopt:

  1. Find journal.
  2. Have journal in a convenient place.
  3. Pick a time of day to write in journal.
  4. Be grateful.
  5. Write it down.

My entries have covered everything from lessons I’ve learned, celebrating wins at work, special moments I’ve experienced, feelings I’ve felt, acknowledging good fortunes like health and wellness, etc. On days when I’m really pressed for time, I mindfully think about what I’m grateful for and log it in the next day. Sometimes the entries are short like a note about health, happiness, family, friends, a chocolate brownie, a great book, warm boots, etc.

This habit continues to help me remember and recognize the things that really matter to me. In times of stress or challenge my journal entries serve as a reminder that it’s important to take a few moments to find something to be grateful for. And while you don’t need a journal to be grateful, it’s wonderful to flip back and read what you were grateful for eight months ago, six weeks ago or even four days ago.

Today I’m grateful for the free speech that allows me to write this blog post, the winter tires on my car, the collective talents of the Pythian HR team, for the amazing 400+ colleagues in 36 countries who we get to work with every day and the opportunities that lay ahead for all of us in 2016.

What new habit(s) will you form in 2016?

Categories: DBA Blogs

Happy New Year 2016!

The Oracle Instructor - Thu, 2015-12-31 02:08

Another year has passed. I take the opportunity to thank you for visiting and to wish you a Happy New Year 2016!

Happy New Year 2016!

In case you didn’t recognize: That is supposed to look like fireworks, The Oracle Instructor style ;-)

2015 was a great year for with 345,000+ views and the crossing of the one million hits threshold. Top countries with more than 4,000 views in 2015 were

Visitors 2015

Visitors came from 202 countries, even China is on the list this year with 1,500+ views.

Hope to see all of you again here in 2016 :-)

Categories: DBA Blogs