Pythian Group

Official Pythian Blog - Love Your Data

Making it Easier to Graph Your Infrastructure’s Performance Data

Tue, 2014-07-08 07:46

Today I would like to share a story with you about the development of a Puppet module for Grafana and its release on the Puppet Forge. But first, I’d like to provide some context so you can understand where Grafana fits in and why I feel this is important.

Those of you who know me have likely heard me talk about the importance of data-driven decision making, and more specifically about some of the tools that can be used to help individuals make smart decisions about their IT infrastructure. A common approach is to deploy a graphing system such as Graphite, which stores performance data about your infrastructure to aid you in performing a number of functions including problem diagnosis, performance trending, capacity planning, and data analytics.

If you are unfamiliar with the software, I’ll briefly describe its architecture. Graphite consists of a daemon, called carbon, which listens for time series data and writes it to a fixed-size database called whisper. It also provides a web application to expose the data and allow the user to create and display graphs on demand using a powerful API.
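To make this concrete, carbon's plaintext listener (by default on TCP port 2003) accepts one metric per line in the form `metric.path value timestamp`. A minimal sketch, with an illustrative metric name and hostname:

```shell
# Compose a metric line in carbon's plaintext format: "path value timestamp".
metric="servers.web01.loadavg.1min 0.42 $(date +%s)"
echo "$metric"

# Shipping it to carbon is then a one-liner (assumes a reachable Graphite host):
#   echo "$metric" | nc graphite.example.com 2003
```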

While Graphite does a good job of storing time series data and providing a rich API for visualizing it, one of the things it does not really focus on is providing a dashboard for the data. Thankfully we have Grafana to fill this role and it happens to do it quite well.

If you have ever worked with the ELK stack (Elasticsearch, Logstash, and Kibana) before, Grafana’s interface should be familiar to you, as it is based on Kibana. It is a frontend for Graphite or InfluxDB, and runs as a client side application in your browser. Its only (optional) external dependency is Elasticsearch, as it can use it to store, load and search for dashboards.

Below are some of Grafana’s most notable features (see its feature highlights for a more comprehensive list):

  • Dashboard search
  • Templated dashboards
  • Save / load from Elasticsearch and / or JSON file
  • Quickly add functions (search, typeahead)
  • Direct link to Graphite function documentation
  • Graph annotation
  • Multiple Graphite or InfluxDB data sources
  • Ability to switch between data sources
  • Show graphs from different data sources on the same dashboard

We like to make use of IT automation software whenever possible to deploy tools for our clients. Most tools already have Puppet modules or Chef cookbooks available for them, including the other components of the graphing system: Graphite itself, and a great Python-based collector named Diamond. Grafana, however, had no Puppet module available so I decided to rectify the situation by creating one and publishing it to the Puppet Forge.

The module would be pretty simple: all that is required is to download and extract Grafana into an installation directory, and ensure appropriate values for the Elasticsearch, Graphite and InfluxDB servers / data sources are inserted into its configuration.

I decided to offload the work of downloading and extracting the software to another module, namely gini/archive. Managing the configuration file, config.js, would be done with a combination of module parameters and an ERB template.
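For illustration, the rendered config.js might end up containing something like the following. This is a sketch from memory of Grafana 1.x's datasource map; exact field names varied between releases, and the URLs are placeholders that the ERB template would fill in from module parameters:

```js
// config.js (sketch) -- values interpolated from Puppet module parameters
datasources: {
  graphite: {
    type: 'graphite',
    url: "http://graphite.example.com"
  },
  elasticsearch: {
    type: 'elasticsearch',
    url: "http://elasticsearch.example.com:9200"
  }
}
```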

The only real complication arose when it came time to test serving Grafana with a web server such as Apache or Nginx. I decided not to have my module manage the web server in any way, so I would leverage Puppet Labs’ own Apache module for this purpose.

My test environment consisted of a CentOS virtual machine provisioned by Vagrant and Puppet, with Graphite and Grafana on the same server. I decided to use Daniel Werdermann’s module to deploy Graphite on my virtual machine as it had worked well for me in the past.

I quickly ran into problems with duplicate resources, however, due to the Graphite module managing Apache for creation of its virtual host etc. I moved to separate virtual machines for Graphite and Grafana, and that made my life easier. If you do decide to run both pieces of software on the same server, and are also using Daniel’s module, you can work around the problem by setting gr_web_server to ‘none’ like this:

class { 'graphite':
  gr_web_server              => 'none',
  gr_web_cors_allow_from_all => true,
}

Since my module does not manage Apache (or Nginx), it is necessary to add something like the following to your node’s manifest to create a virtual host for Grafana:

# Grafana is to be served by Apache
class { 'apache':
  default_vhost   => false,
}

# Create Apache virtual host
apache::vhost { 'grafana.example.com':
  servername      => 'grafana.example.com',
  port            => 80,
  docroot         => '/opt/grafana',
  error_log_file  => 'grafana-error.log',
  access_log_file => 'grafana-access.log',
  directories     => [
    {
      path            => '/opt/grafana',
      options         => [ 'None' ],
      allow           => 'from All',
      allow_override  => [ 'None' ],
      order           => 'Allow,Deny',
    }
  ]
}

And the Grafana declaration itself:

class { 'grafana':
  elasticsearch_host  => 'elasticsearch.example.com',
  graphite_host       => 'graphite.example.com',
}

Now that my module was working, it was time to publish it to the Puppet Forge. I converted my Modulefile to metadata.json, added a .travis.yml file to my repository and enabled integration with Travis CI, built the module and uploaded it to the Forge.

Since its initial release, I have updated the module to deploy Grafana version 1.6.1 by default, including updating the content of the config.js ERB template, and have added support for InfluxDB. I am pretty happy with the module and hope that you find it useful.

I do have plans to add more capabilities to the module, including support of more of Grafana’s configuration file settings, having the module manage the web server’s configuration similar to how Daniel’s module does it, and adding a stronger test suite so I can ensure compatibility with more operating systems and Ruby / Puppet combinations.

I welcome any questions, suggestions, bug reports and / or pull requests you may have. Thanks for your time and interest!

Project page: https://github.com/bfraser/puppet-grafana
Puppet Forge URL: https://forge.puppetlabs.com/bfraser/grafana

Categories: DBA Blogs

Pro-active AWR Data Mining to Find Change in SQL Execution Plan

Mon, 2014-07-07 11:11

Many times we have been called in for the poor performance of a database, and it has been narrowed down to a SQL statement. Subsequent analysis has shown that the execution plan changed and a wrong execution plan was being used.

The resolution, in 11g, is normally to fix the execution plan by running:

variable x number
begin
:x :=
    dbms_spm.load_plans_from_cursor_cache(
    sql_id=>'&sql_id',
    plan_hash_value=>&plan_hash,
    fixed=>'YES');
end;
/

or, for 10g, a SQL_PROFILE is created, as mentioned in Carlos Sierra’s blog.

A pro-active approach can be to mine AWR data for any SQL execution plan changes.

The following query against dba_hist_sqlstat retrieves the list of SQL IDs whose plans have changed. It orders the SQL IDs so that those for which the maximum gains can be achieved by fixing the plan are listed first.

 
spool sql_with_more_than_1plan.txt
set lines 220 pages 9999 trimspool on
set numformat 999,999,999
column plan_hash_value format 99999999999999
column min_snap format 999999
column max_snap format 999999
column min_avg_ela format 999,999,999,999,999
column avg_ela format 999,999,999,999,999
column ela_gain format 999,999,999,999,999
select sql_id,
       min(min_snap_id) min_snap,
       max(max_snap_id) max_snap,
       max(decode(rw_num,1,plan_hash_value)) plan_hash_value,
       max(decode(rw_num,1,avg_ela)) min_avg_ela,
       avg(avg_ela) avg_ela,
       avg(avg_ela) - max(decode(rw_num,1,avg_ela)) ela_gain,
       -- max(decode(rw_num,1,avg_buffer_gets)) min_avg_buf_gets,
       -- avg(avg_buffer_gets) avg_buf_gets,
       max(decode(rw_num,1,sum_exec))-1 min_exec,
       avg(sum_exec)-1 avg_exec
from (
  select sql_id, plan_hash_value, avg_buffer_gets, avg_ela, sum_exec,
         row_number() over (partition by sql_id order by avg_ela) rw_num , min_snap_id, max_snap_id
  from
  (
    select sql_id, plan_hash_value , sum(BUFFER_GETS_DELTA)/(sum(executions_delta)+1) avg_buffer_gets,
    sum(elapsed_time_delta)/(sum(executions_delta)+1) avg_ela, sum(executions_delta)+1 sum_exec,
    min(snap_id) min_snap_id, max(snap_id) max_snap_id
    from dba_hist_sqlstat a
    where exists  (
       select sql_id from dba_hist_sqlstat b where a.sql_id = b.sql_id
         and  a.plan_hash_value != b.plan_hash_value
         and  b.plan_hash_value > 0)
    and plan_hash_value > 0
    group by sql_id, plan_hash_value
    order by sql_id, avg_ela
  )
  order by sql_id, avg_ela
  )
group by sql_id
having max(decode(rw_num,1,sum_exec)) > 1
order by 7 desc
/
spool off
clear columns
set numformat 9999999999

The sample output of this query looks like this:

SQL_ID        MIN_SNAP MAX_SNAP PLAN_HASH_VALUE          MIN_AVG_ELA              AVG_ELA             ELA_GAIN     MIN_EXEC     AVG_EXEC
------------- -------- -------- --------------- -------------------- -------------------- -------------------- ------------ ------------
ba42qdzhu5jb0    65017    67129      2819751536       11,055,899,019       90,136,403,552       79,080,504,532           12            4
2zm7y3tvqygx5    65024    67132       362220407       14,438,575,143       34,350,482,006       19,911,906,864            1            3
74j7px7k16p6q    65029    67134      1695658241       24,049,644,247       30,035,372,306        5,985,728,059           14            7
dz243qq1wft49    65030    67134      3498253836        1,703,657,774        7,249,309,870        5,545,652,097            1            2

MIN_SNAP and MAX_SNAP are the minimum/maximum snap IDs in which the SQL statement occurs.

PLAN_HASH_VALUE is the hash value of the plan with the best elapsed time.

ELA_GAIN is the estimated improvement in elapsed time from using this plan compared to the average execution time.

Using the output of the above query, SQL execution plans can be fixed after proper testing. This method can help DBAs pinpoint and resolve problems with SQL execution plans faster.

Categories: DBA Blogs

Salt Stack for Remote Parallel Execution of Commands

Mon, 2014-07-07 11:08

There are many scenarios when a SysAdmin has to do a “box walk” of the entire infrastructure to execute a command across many servers. This is universally accepted as one of the less glamorous parts of our job. The larger the infrastructure, the longer these box walks take, and the greater chance that human error will occur.

Even giving this task to a junior resource, as is often the case, is not sustainable as the infrastructure grows, and does not represent the best value to the business in terms of resource utilization. Additionally, too much of this type of “grind” work can demoralize even the most enthusiastic team member.

Thankfully, the days of having to do these box walks are over. Thanks to configuration management and infrastructure automation tools, the task has been automated and no longer requires the investment of time by a human SysAdmin that it once did. At a high level, these tools allow you to offload this repetitive work, with the computer doing the heavy lifting for you.

 

Introducing Salt Stack

Salt Stack is a distributed remote execution system used to execute commands and query data on remote nodes, either individually or by arbitrary selection criteria. Salt Stack is also a configuration management system in its own right, but this post will focus on Salt from a “command and control” point of view.

Salt has two main components: the “salt master” (server) and the “salt minions” (clients). Once the minions are accepted by the master, commands can be executed on them directly from the central salt master server.

Once you have installed your packages, the minion needs to be configured to know where its master is. This can be accomplished through a DNS or hosts-file entry, or by setting the variable in the /etc/salt/minion config file:


master: XXX.XXX.XXX.XXX

Where “XXX.XXX.XXX.XXX” is the IP address of your master server. Once that is done and the salt-minion service has been started, the minion will generate a key pair and ship the public key back to the master so that all communication is secure.
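Setting that line lends itself to scripting. Below is a sketch that performs the edit against a scratch copy of the file, so it is safe to run anywhere; in real use the target is /etc/salt/minion and the IP is your master's:

```shell
# Work on a scratch copy of the minion config so the demo is side-effect free.
conf=$(mktemp)
printf '#master: salt\n' > "$conf"

# Uncomment the master: line (if commented) and point it at the master's IP.
sed -i 's/^#\{0,1\}master:.*/master: 10.0.0.10/' "$conf"

newline=$(grep '^master:' "$conf")
echo "$newline"
rm -f "$conf"
```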

The master must accept the key from the minion before any control can begin.


# Listing the Keys

[root@ip-10-154-193-216 ~]# salt-key -L
Accepted Keys:
Unaccepted Keys:
ip-10-136-76-163.ec2.internal
Rejected Keys:

# Adding The Key

[root@ip-10-154-193-216 ~]# salt-key -A
The following keys are going to be accepted:
Unaccepted Keys:
ip-10-136-76-163.ec2.internal
Proceed? [n/Y] y
Key for minion ip-10-136-76-163.ec2.internal accepted.

# Nailed It! Now the Master can control the Minion!

[root@ip-10-154-193-216 ~]# salt-key -L
Accepted Keys:
ip-10-136-76-163.ec2.internal
Unaccepted Keys:
Rejected Keys:

Note: Not Shown – I added a 2nd Minion

Now that your master has minions, the fun begins. From your master you can now query information from your minions, such as disk space:


[root@ip-10-154-193-216 ~]# salt '*' disk.percent

ip-10-136-76-163.ec2.internal:
----------
/:
15%
/dev/shm:
0%
ip-10-147-240-208.ec2.internal:
----------
/:
14%
/dev/shm:
0%

You can also execute remote commands, such as finding out service status and restarting services.


[root@ip-10-154-193-216 ~]# salt '*' cmd.run "service crond status"

ip-10-136-76-163.ec2.internal:
crond (pid 1440) is running...
ip-10-147-240-208.ec2.internal:
crond (pid 1198) is running...

[root@ip-10-154-193-216 ~]# salt '*' cmd.run "service crond restart"
ip-10-136-76-163.ec2.internal:
Stopping crond: [ OK ]
Starting crond: [ OK ]
ip-10-147-240-208.ec2.internal:
Stopping crond: [ OK ]
Starting crond: [ OK ]

These are only the most basic use cases for what Salt Stack can do, but even from these examples it is clear that Salt can become a powerful tool, reducing the potential for human error and increasing the efficiency of your SysAdmin team.

By implementing configuration management and infrastructure automation tools such as Salt Stack, you free up your team members’ time for higher-quality work that delivers more business value.

Salt Stack (depending on your setup) can be deployed in minutes. On RHEL/CentOS/Amazon Linux using the EPEL repo, I was able to get up and running with Salt in about five minutes on the three nodes used for the examples in this post. Salt can be deployed using another configuration management tool, baked into your provisioning environment, or built into base images. If all else fails, you can (ironically) do a box walk to install the package on your existing servers.

Even if you already have another configuration management solution deployed, using Salt for parallel command execution rather than the configuration management system can often prove a much simpler and more lightweight approach, depending on what you are trying to accomplish.

Salt is also a great choice for giving other teams the ability to execute commands on a subset of boxes without requiring them to have shell access to all of the servers. This lets those teams get their jobs done without the SysAdmin team becoming a bottleneck.
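In the Salt releases of this era, that kind of delegation was configured with the `client_acl` option in /etc/salt/master (later renamed `publisher_acl`). A sketch, where the user name and module list are illustrative:

```yaml
# /etc/salt/master (sketch): allow the non-root user "webteam"
# to run only connectivity checks and service-management functions.
client_acl:
  webteam:
    - test.ping
    - service.*
```

With this in place, members of that team can run their allowed functions through the salt CLI without ever holding shell accounts on the target servers.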

Categories: DBA Blogs

Log Buffer #378, A Carnival of the Vanities for DBAs

Fri, 2014-07-04 08:43

New technologies, new ideas, and new tips are forthcoming in abundance in numerous blog posts across Oracle, SQL Server, and MySQL. This Log Buffer Edition covers many of the salient ones.

Oracle:

Whether you use a single OEM and are migrating to a new one, or have multiple OEMs, the need to move templates between environments will arise.

Oracle Coherence is the industry’s leading in-memory data grid solution that enables applications to predictably scale by providing fast, reliable and scalable access to frequently used data.

Needless to say, some ATG applications are more complex than others.  Some ATG applications support a single site, single language, single catalog, single currency, have a single development staff, single business team, and a relatively simple business model.

The purpose of this article is to describe some of the important foundational concepts of ATG.

You can use Ops Center to perform some very complex tasks. For instance, you might use it to provision several operating systems across your environment, with multiple configurations for each OS.

SQL Server:

SSRS In a Flash – Level 1 in the Stairway to Reporting Services.

The “Numbers” or “Tally” Table: What it is and how it replaces a loop.

Arshad Ali demonstrates granular level encryption in detail and explains how it differs from Transparent Data Encryption (TDE).

There were many new DMVs added in SQL Server 2012, and some that have changed since SQL Server 2008 R2.

There are some aspects of tables in SQL Server that a lot of people get wrong, purely because they seem so obvious that one feels embarrassed about asking questions.

MySQL:

A much awaited release from the MariaDB project is now stable (GA) – MariaDB Galera Cluster 10.0.12.

Failover with the MySQL Utilities: Part 2 – mysqlfailover.

HowTo: Integrating MySQL for Visual Studio with Connector/Net.

Single database backup and restore with MEB.

Externally Stored Fields in InnoDB.

Categories: DBA Blogs

What’s New with Apps Password Change in R12.2 E-Business Suite ?

Wed, 2014-07-02 08:39

The apps password change routine in Release 12.2 of E-Business Suite has changed a little. We now have extra options for changing the password, as well as some manual steps to perform after changing the password with FNDCPASS.

There is a new utility called AFPASSWD. Unlike FNDCPASS, this utility doesn’t require you to enter the apps and system user passwords, making it possible to separate duties between the database administrator and the application administrator. In most cases both roles are filled by the same DBA, but in large organizations there may be different teams managing the database and the application. You can read about the different options available in AFPASSWD in the EBS maintenance guide.

Whether you use FNDCPASS or AFPASSWD to change the APPLSYS/APPS password, you must also perform some additional steps. This is because in R12.2 the old AOL/J connection pooling is replaced with a WebLogic connection pool (JDBC data source). Currently this procedure is not automated; it would be good if it could be automated with some WLS scripting.

  • Shut down the application tier services
  • Change the APPLSYS password, as described for the utility you are using.
  • Start AdminServer using the adadminsrvctl.sh script from your RUN filesystem
  • Do not start any other application tier services.
  • Update the “apps” password in WLS Datasource as follows:
    • Log in to WLS Administration Console.
    • Click Lock & Edit in Change Center.
    • In the Domain Structure tree, expand Services, then select Data Sources.
    • On the “Summary of JDBC Data Sources” page, select EBSDataSource.
    • On the “Settings for EBSDataSource” page, select the Connection Pool tab.
    • Enter the new password in the “Password” field.
    • Enter the new password in the “Confirm Password” field.
    • Click Save.
    • Click Activate Changes in Change Center.
  • Start all the application tier services using the adstrtal.sh script.

I will be posting more of these “What’s new with R12.2” articles in the future. Post your experiences changing passwords in Oracle EBS in the comments section; I will be happy to hear your stories and give my input.

Categories: DBA Blogs

Essential Hadoop Concepts for Systems Administrators

Wed, 2014-07-02 08:38

Of course, everyone knows Hadoop as the solution to Big Data. What’s the problem with Big Data? Mostly, it’s just that Big Data is too big to access and process in a timely fashion on a conventional enterprise system. Even a really large, optimally tuned, enterprise-class database system has conventional limits on its maximum I/O, and there is a scale of data that outstrips this model and requires parallelism at the system level to make it accessible. While Hadoop is associated in many ways with advanced transaction-processing pipelines, analytics, and data science, these applications sit on top of a much simpler paradigm: that we can spread our data across a cluster and provision I/O and processor in a tunable ratio along with it. The tunability is directly related to the hardware specifications of the cluster nodes, since each node has processing, I/O, and storage capabilities in a specific ratio. At this level, we don’t need Java software architects and data scientists to take advantage of Hadoop. We’re solving a fundamental infrastructure engineering issue: “How can we scale our I/O and processing capability along with our storage capacity?” In other words, how can we access our data?

The Hadoop ecosystem at its core is simply a set of RESTfully interacting Java processes communicating over a network. The base system services, such as the data node (HDFS) and task tracker (MapReduce), run on each node in the cluster, register with an associated service master, and execute assigned tasks in parallel that would normally be localized on a single system (such as reading some data from disk and piping it to an application or script). The result of this approach is a loosely coupled system that scales in a very linear fashion. In real life, the service masters (famously, NameNode and JobTracker) are a single point of failure and a potential performance bottleneck at very large scales, but much has been done to address these shortcomings. In principle, Hadoop uses the MapReduce algorithm to extend parallel execution from a single computer to an unlimited number of networked computers.

MapReduce is conceptually a very simple system. Here’s how it works. Given a large data set (usually serialized), broken into blocks (as for any filesystem) and spread among the HDFS cluster nodes, feed each record in a block to STDIN of a local script, command or application, and collect the records from STDOUT that are emitted. This is the “map” in MapReduce. Next, sort each record by key (usually just the first field in a tab-delimited record emitted by the mapper, but compound keys are easily specified). This is accomplished by fetching records matching each specific key over the network to a specific cluster node, and accounts for the majority of network I/O during a MapReduce job. Finally, process the sorted record sets by feeding the ordered records to STDIN of a second script, command or application, collecting the result from STDOUT and writing them back to HDFS. This is the “reduce” in MapReduce. The reduce phase is optional, and usually takes care of any aggregations such as sums, averages and record counts. We can just as easily pipe our sorted map output straight to HDFS.
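Because the map, shuffle/sort, and reduce phases are just streams of records, the whole data flow can be simulated locally with ordinary pipes. A word-count sketch (the canonical MapReduce example, not tied to any particular cluster):

```shell
# Simulate MapReduce with plain pipes:
#   map:    emit one "key" record per word
#   shuffle: sort brings identical keys together
#   reduce: sum the counts per key
counts=$(printf 'apple banana apple\nbanana apple\n' \
  | tr ' ' '\n' \
  | awk '{print $1 "\t" 1}' \
  | sort \
  | awk -F'\t' '{sum[$1] += $2} END {for (k in sum) print k, sum[k]}' \
  | sort)
echo "$counts"
```

Swap the two awk one-liners for any script or command that reads and writes STDIO, and you have the shape of a streaming MapReduce job.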

Any Linux or Unix systems administrator will immediately recognize that using STDIO to pass data means that we can plug any piece of code into the stream that reads and writes to STDIO… which is pretty much everything! To be clear on this point, Java development experience is not required. We can take advantage of Linux pipelines to operate on very large amounts of data. We can use ‘grep’ as a mapper. We can use the same pipelines and commands that we would use on a single system to filter and process data that we’ve stored across the cluster. For example,

grep -i "${RECORD_FILTER}" | cut -f2 | cut -d'=' -f2 | tr '[:upper:]' '[:lower:]'

We can use Python, Perl and any other languages with support configured on the task tracker systems in the cluster, as long as our scripts and applications read and write to STDIO. To execute these types of jobs, we use the Hadoop Streaming jar to wrap the script and submit it to the cluster for processing.

What does this mean for us enterprise sysadmins? Let’s look at a simple, high level example. I have centralized my company’s log data by writing it to a conventional enterprise storage system. There’s lots of data and lots of people want access to it, including operations teams, engineering teams, business intelligence and marketing analysts, developers and others. These folks need to search, filter, transform and remodel the data to shake out the information they’re looking for. Conventionally, I can scale up from here by copying my environment and managing two storage systems. Then 4. Then 8. We must share and mount the storage on each system that requires access, organize the data across multiple appliances and manage access controls on multiple systems. There are many business uses for the data, and so we have many people with simultaneous access requirements, and they’re probably using up each appliance’s limited I/O with read requests. In addition, we don’t have the processor available… we’re just serving the data at this point, and business departments are often providing their own processing platforms from a limited budget.

Hadoop solves this problem of scale above the limits of conventional enterprise storage beautifully. It’s a single system to manage that scales in a practical way to extraordinary capacities. But the real value is not the raw storage capacity or advanced algorithms and data analytics available for the platform… it’s about scaling our I/O and processing capabilities to provide accessibility to the data we’re storing, thereby increasing our ability to leverage it for the benefit of our business. The details of how we leverage our data is what we often leave for the data scientists to figure out, but every administrator should know that the basic framework and inherent advantages of Hadoop can be leveraged with the commands and scripting tools that we’re already familiar with.

Categories: DBA Blogs

Differences Between R12.1 and R12.2 Integration with OAM

Wed, 2014-07-02 08:37

With the revamp of the technology stack in R12.2 of Oracle E-Business Suite (EBS), the way we integrate Oracle Access Manager (OAM) has changed. R12.2 is now built on the WebLogic tech stack, which drastically changes how it integrates with other Fusion Middleware products like OAM.

Here is an overview of the steps to configure OAM with EBS R12.1:

  • Install Oracle HTTP Server (OHS) 11g
  • Deploy & Configure Webgate on OHS 11g
  • Install Weblogic
  • Deploy & Configure Accessgate on Weblogic
  • Integrate Webgate and Accessgate with EBS and OAM/OID

R12.2 has both OHS and Weblogic built in, so we no longer have to install OHS and Weblogic for Webgate and Accessgate. All we have to do is deploy and configure them: Webgate is deployed on top of the R12.2 OHS 11g home, and Accessgate is deployed as a separate managed server (oaea_server1) on top of the R12.2 Weblogic.

Here are the two architectures in pictorial form:

R12.1 and 11i EBS integration with OAM/OID

[Image: 11iand12_Reference_Architecture]

 

R12.2 Integration with OAM/OID

[Image: 12.2_Reference_Architecture_2]

Basically, R12.2 reduces the number of moving parts in the OAM integration with EBS. It saves DBAs a lot of time, as it reduces the number of servers to manage.

References:

Integrating Oracle E-Business Suite Release 12.2 with Oracle Access Manager 11gR2 (11.1.2) using Oracle E-Business Suite AccessGate (Doc ID 1576425.1)

Integrating Oracle E-Business Suite Release 12 with Oracle Access Manager 11gR2 (11.1.2) using Oracle E-Business Suite AccessGate (Doc ID 1484024.1)

Images are courtesy of Oracle from note “Overview of Single Sign-On Integration Options for Oracle E-Business Suite (Doc ID 1388152.1)”

 

Categories: DBA Blogs

vCPU sharing in EC2: HVM to the rescue?

Fri, 2014-06-27 07:51

I’ve been doing some testing to clarify what a vCPU in Amazon Web Services actually is. Over the course of the testing, I experienced inconsistent results on a 2-thread test on a 4-vCPU m3.xlarge system, due to the mislabeling of the vCPUs as independent single-core processors by the Linux kernel. This issue manifests itself in a CPU-bound, multithreaded workload where there is idle CPU time.

My test environment used a paravirtualized (PV) kernel, which moves some of the virtualization logic into the Linux kernel, reducing the need for high-overhead hardware emulation. One drawback is that the kernel cannot be modified to, for example, resolve the CPU mislabeling. But there is an alternative: an HVM system relying on virtualization extensions in the CPU hardware, allowing custom kernels or even non-Linux operating systems to run. Historically the drawback has been a performance hit, though I read a very interesting post on Brendan Gregg’s blog indicating that what’s called HVM in Amazon EC2 is actually a hybrid of PV and HVM, combining aspects of both. A test run by Phoronix on EC2 showed HVM performance on par with PV, and in some cases even better. So it definitely seems worth repeating my earlier tests on an HVM instance.

As before, I fire up an instance, but this time using the latest HVM Amazon Linux image:

$ ec2-describe-images ami-76817c1e -H
Type    ImageID Name    Owner   State   Accessibility   ProductCodes    Architecture    ImageType       KernelId        RamdiskId Platform        RootDeviceType  VirtualizationType      Hypervisor
IMAGE   ami-76817c1e    amazon/amzn-ami-hvm-2014.03.2.x86_64-ebs        amazon  available       public          x86_64  machine                           ebs     hvm     xen
BLOCKDEVICEMAPPING      /dev/xvda               snap-810ffc56   8
$ ec2-run-instances ami-76817c1e -k marc-aws --instance-type m3.xlarge --availability-zone us-east-1d
RESERVATION     r-a4f480da      462281317311    default
INSTANCE        i-c5d5b6ef      ami-76817c1e                    pending marc-aws        0               m3.xlarge       2014-06-23T19:02:18+0000  us-east-1d                              monitoring-disabled                                     ebs                                       hvm     xen             sg-5fc61437     default

Checking in on CPUs:

[ec2-user@ip-10-145-187-117 ~]$ egrep '(processor|model name|cpu MHz|physical id|siblings|core id|cpu cores)' /proc/cpuinfo
processor       : 0
model name      : Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
cpu MHz         : 2593.949
physical id     : 0
siblings        : 4
core id         : 0
cpu cores       : 4
processor       : 1
model name      : Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
cpu MHz         : 2593.949
physical id     : 0
siblings        : 4
core id         : 1
cpu cores       : 4
processor       : 2
model name      : Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
cpu MHz         : 2593.949
physical id     : 0
siblings        : 4
core id         : 2
cpu cores       : 4
processor       : 3
model name      : Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
cpu MHz         : 2593.949
physical id     : 0
siblings        : 4
core id         : 3
cpu cores       : 4

It’s the same 2.6GHz E5-2670 processor, but it is reported as a single-socket, non-hyperthreaded, quad-core processor rather than the dual-core hyperthreaded processor we’re actually getting.

Time to run a few tests.

[ec2-user@ip-10-145-187-117 ~]$ taskset -pc 0 $$
pid 1768's current affinity list: 0-3
pid 1768's new affinity list: 0
[ec2-user@ip-10-145-187-117 ~]$ dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2 ) | gzip -c > /dev/null
2170552320 bytes (2.2 GB) copied, 18.1955 s, 119 MB/s
[ec2-user@ip-10-145-187-117 ~]$ for i in {1..2}; do (dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2 ) | gzip -c > /dev/null &) done
2170552320 bytes (2.2 GB) copied, 36.4968 s, 59.5 MB/s
2170552320 bytes (2.2 GB) copied, 36.506 s, 59.5 MB/s

In the same range as with PV, but 1-2% slower, meaning we’re seeing a small amount of HVM overhead. Let’s try across processors:

[ec2-user@ip-10-145-187-117 ~]$ taskset -pc 0,1 $$
pid 1768's current affinity list: 0
pid 1768's new affinity list: 0,1
[ec2-user@ip-10-145-187-117 ~]$ for i in {1..2}; do (dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2 ) | gzip -c > /dev/null &) done
2170552320 bytes (2.2 GB) copied, 27.8401 s, 78.0 MB/s
2170552320 bytes (2.2 GB) copied, 27.8398 s, 78.0 MB/s
[ec2-user@ip-10-145-187-117 ~]$ taskset -pc 0,2 $$
pid 1768's current affinity list: 0,1
pid 1768's new affinity list: 0,2
[ec2-user@ip-10-145-187-117 ~]$ for i in {1..2}; do (dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2 ) | gzip -c > /dev/null &) done
2170552320 bytes (2.2 GB) copied, 18.1849 s, 119 MB/s
2170552320 bytes (2.2 GB) copied, 18.2014 s, 119 MB/s

Again, a tiny bit slower than with PV. To test variability, I’ll kick off 20 consecutive runs, and print a histogram of output:

[ec2-user@ip-10-145-187-117 ~]$ taskset -pc 0-3 $$
pid 1768's current affinity list: 0,2
pid 1768's new affinity list: 0-3
[ec2-user@ip-10-145-187-117 ~]$ for run in {1..20}; do
>  for i in {1..2}; do dd if=/dev/zero bs=1M count=2070 2>> output | gzip -c > /dev/null & done
>  wait
> done
...
[ec2-user@ip-10-145-187-117 ~]$ cat output | awk '/bytes/ {print $8,$9}' | sort -n | uniq -c
      1 113 MB/s
      3 114 MB/s
      4 115 MB/s
      6 116 MB/s
     10 117 MB/s
     10 118 MB/s
      6 119 MB/s

Running between 113 and 119 MB/s per thread: much less variability than before. In chart form:
[Chart: aws-cpu-hvm]
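As a quick numeric complement to the chart, the histogram can be reduced to a weighted mean with awk. The counts and values below are copied from the uniq -c output above; the calculation itself is an addition of mine:

```shell
# Weighted mean of the histogram: column 1 is the run count,
# column 2 the per-thread throughput in MB/s.
printf '%s\n' '1 113' '3 114' '4 115' '6 116' '10 117' '10 118' '6 119' |
awk '{ n += $1; sum += $1 * $2 }
     END { printf "runs=%d mean=%.1f MB/s\n", n, sum / n }'
# -> runs=40 mean=116.9 MB/s
```

Forty samples (20 runs of 2 threads each) averaging just under 117 MB/s, consistent with the narrow spread shown in the chart.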

Looking at “top”:

[ec2-user@ip-10-145-187-117 ~]$ cat > ~/.toprc <<-EOF
> RCfile for "top with windows"           # shameless braggin'
> Id:a, Mode_altscr=0, Mode_irixps=1, Delay_time=3.000, Curwin=0
> Def     fieldscur=AEHIOQTWKNMbcdfgjplrsuvyzX
>         winflags=25913, sortindx=10, maxtasks=2
>         summclr=1, msgsclr=1, headclr=3, taskclr=1
> Job     fieldscur=ABcefgjlrstuvyzMKNHIWOPQDX
>         winflags=62777, sortindx=0, maxtasks=0
>         summclr=6, msgsclr=6, headclr=7, taskclr=6
> Mem     fieldscur=ANOPQRSTUVbcdefgjlmyzWHIKX
>         winflags=62777, sortindx=13, maxtasks=0
>         summclr=5, msgsclr=5, headclr=4, taskclr=5
> Usr     fieldscur=ABDECGfhijlopqrstuvyzMKNWX
>         winflags=62777, sortindx=4, maxtasks=0
>         summclr=3, msgsclr=3, headclr=2, taskclr=3
> EOF
[ec2-user@ip-10-145-187-117 ~]$ top -b -n20 -U ec2-user

top - 20:31:51 up 28 min,  2 users,  load average: 1.37, 1.17, 0.63
Tasks:  82 total,   4 running,  78 sleeping,   0 stopped,   0 zombie
Cpu0  : 22.9%us,  0.3%sy,  0.0%ni, 76.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  : 74.0%us,  3.0%sy,  0.0%ni, 23.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  : 59.7%us,  4.0%sy,  0.0%ni, 36.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  : 33.7%us,  2.7%sy,  0.0%ni, 63.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 1951 ec2-user  20   0  4444  608  400 R 97.1  0.0   0:08.92 gzip
 1953 ec2-user  20   0  4444  608  400 R 97.1  0.0   0:08.92 gzip                                                         

top - 20:31:54 up 28 min,  2 users,  load average: 1.37, 1.17, 0.63
Tasks:  82 total,   4 running,  78 sleeping,   0 stopped,   0 zombie
Cpu0  : 72.3%us,  4.3%sy,  0.0%ni, 23.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  :  0.0%us,  0.3%sy,  0.0%ni, 99.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  : 94.4%us,  5.6%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  : 21.3%us,  2.0%sy,  0.0%ni, 76.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 1953 ec2-user  20   0  4444  608  400 R 97.1  0.0   0:11.84 gzip
 1951 ec2-user  20   0  4444  608  400 R 96.8  0.0   0:11.83 gzip                                                         

top - 20:31:57 up 28 min,  2 users,  load average: 1.34, 1.17, 0.64
Tasks:  82 total,   3 running,  79 sleeping,   0 stopped,   0 zombie
Cpu0  : 95.3%us,  4.7%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  : 92.4%us,  7.6%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 1951 ec2-user  20   0  4444  608  400 R 96.8  0.0   0:14.74 gzip
 1953 ec2-user  20   0  4444  608  400 R 96.8  0.0   0:14.75 gzip                                                         

top - 20:32:00 up 28 min,  2 users,  load average: 1.32, 1.17, 0.64
Tasks:  82 total,   4 running,  78 sleeping,   0 stopped,   0 zombie
Cpu0  : 29.9%us,  1.7%sy,  0.0%ni, 68.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  : 63.0%us,  3.7%sy,  0.0%ni, 33.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  : 40.5%us,  2.3%sy,  0.0%ni, 57.1%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  : 55.3%us,  3.7%sy,  0.0%ni, 41.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 1951 ec2-user  20   0  4444  608  400 R 97.1  0.0   0:17.66 gzip
 1953 ec2-user  20   0  4444  608  400 R 97.1  0.0   0:17.67 gzip

We see that work is split between adjacent CPUs, but that the scheduler is doing a good job of keeping the adjacent CPUs near 100% usage between them.

So based on these tests, it looks like, even though the CPU is still mislabeled, HVM has almost entirely avoided the issue of variability due to shared-core scheduling, at the cost of a small reduction in overall throughput.

Categories: DBA Blogs

Log Buffer #377, A Carnival of the Vanities for DBAs

Fri, 2014-06-27 07:49

Improvement and progress are all about growth. Pythian recently grew again by entering into an agreement to acquire Blackbird.io, as announced by Paul Vallee. This Log Buffer also adds one more edition to its growing history.

Oracle:

WebLogic 12.1.3 is a major step forward: SOA Suite 12c, BPM Suite 12c, and all the other Fusion Middleware components will run on it!

The Internet of Things is changing everything. From the way you start your day in the morning, to the way products are manufactured and cities operate their services, machines all around you will collect data, share it with each other, and use it to recommend and even make decisions along the way.

With Pivotal Cloud Foundry installed, let’s create an ORG / USER to get started.

An in-line sub process can best be thought of as a callable scope. It is a scope that is not part of the normal process flow, but rather a unit of encapsulated logic that can be called zero, one or multiple times from anywhere within the BPEL process

Do you need to create PDF reports from PL/SQL?

SQL Server:

Unique indexes are the database developer’s responsibility. Non-unique indexes can be more easily maintained directly on the production database by an automated process.

Continuous Delivery is fairly generally understood to be an effective way of tackling the problems of software delivery and deployment by making build, integration and delivery into a routine.

When you are writing TSQL code there are times when you want to perform the same logic over and over again. To accomplish this you can use the WHILE keyword.

What you need for a Multi Subnet Configuration for AlwaysOn in SQL Server 2012

All the wonderful functionality that in-memory tables and natively compiled procedures provide in SQL Server 2014 is pretty cool. But, changes to core of the engine results in changes in things that we may have developed a level of comfort with. Grant Fritchey explains.

MySQL:

Optimistic updates for Edit Data operations in MySQL for Excel

Here are a few videos those of you who are just getting started with Sphinx may find useful.

Harnessing the power of master/slave clusters to operate data-driven businesses on MySQL

Every SELECT from your Python program may acquire a metadata lock!

Why %util number from iostat is meaningless for MySQL capacity planning

Categories: DBA Blogs

Welcome to Blackbird.io Employees and Clients

Thu, 2014-06-26 11:29

Today, we announced that Pythian has entered into an agreement to acquire Blackbird.io, itself the result of a recent merger between PalominoDB and DriveDev.

I want to start with a hearty welcome to the 40+ new esteemed collaborators joining our firm today. Simultaneously, I want to welcome Blackbird.io’s valued clients to the Pythian family.

I am looking forward to cultivating a long-lasting collaboration and friendship with each one of you, many of whom I have already counted as friends for years.

To that point, I want to highlight my longstanding friendship and collaboration with Laine Campbell, the CEO of Blackbird.io. I first met Laine in 2007 and was impressed by her intelligence, her energy, her charisma and, most of all, her remarkable passion for doing the right thing by her team, her clients, and her community.

In February 2008, I sent Laine an email with the subject “Join us?”, the most important line of which was “I’m looking for a founder for a new office in the Bay Area.”

Laine was gracious in her reply: “At this point, I’m absolutely packed with long-term clients.  I’m quite comfortable with revenue and challenge and location.  I really am flattered you’d consider me for the position, but I’m going to have to pass.” That was only about a year after she had founded PalominoDB.

Laine and I have been friends ever since and have made a discipline of trading notes and advice about our respective businesses.

As we fast-forward six years to the present, Laine and her team have achieved what many might have thought impossible. Out of thin air, with no venture capital, and in only eight short years, they have built Blackbird.io into a business eerily reminiscent of Pythian in 2008… a feat that took us 11 years.

Earlier this year, PalominoDB joined forces with DriveDev, itself a highly successful DevOps business transformation company founded in 2007 to create Blackbird.io. Blackbird.io delivers a coherent and differentiated vision that helps transform businesses through breakthrough velocity, availability, security, performance, and cost.

In what has to be one of the longest corporate romances our niche has known, Laine reached out to me in May indicating that she’d like to accept my original offer and join forces with us. It was my turn to be flattered and go through a week’s soul searching.  I was not alone in the exercise. A lot of soul searching, strategic thinking, and sheer hard work has gone into this announcement today. By the end of our efforts, it became clear that joining forces would dramatically accelerate our ability to reshape the enterprise IT services landscape.

I would like to specifically thank Laine Campbell, Aaron Lee, and Vicki Vance as owners of Blackbird.io for their courage, vision, and determination through these demanding weeks. On the Pythian side, I would like to especially thank  Andrew Waitman, without whom this deal would be impossible to contemplate, Alain Tardif and Siobhan Devlin, and the rest of the executive team at Pythian who’ve moved mountains on our behalf to make it real. I don’t want to forget to highlight as well the external support of Bob Ford at Kelly Santini and our financing partners.

We have months of hard work ahead of us integrating our businesses. It’s our goal and imperative to listen and learn from each other, and pick and choose the best parts of each respective business as we weave a coherent and integrated whole. This will be the first meaningful merger Pythian undertakes.

Together we are almost 350 strong and are home to the industry’s largest open-source database managed services capability. Together we will accelerate the adoption of Enterprise DevOps and help countless SaaS, retail, media, and online businesses leave their competitors in the dust. And that is a vision worth getting excited about.

Categories: DBA Blogs

Oracle Database: Query to List All Statistics Tables

Wed, 2014-06-25 08:00

If you were a big fan of manual database upgrade steps, perhaps you would have come across this step many times in your life while reading MOS notes, upgrade guides, etc.

Upgrade Statistics Tables Created by the DBMS_STATS Package
If you created statistics tables using the DBMS_STATS.CREATE_STAT_TABLE procedure, then upgrade these tables by executing the following procedure:
EXECUTE DBMS_STATS.UPGRADE_STAT_TABLE('SYS','dictstattab');

In my experience, statistics tables can be created as far back as Oracle RDBMS version 8i, which is why this step has remained part of the database upgrade documents to this day. I also noticed that the structure of the statistics table stayed the same up to 10gR2, but Oracle modified it slightly in the 11g and 12c versions.

I have been using this single query to list all statistics tables that exist in a database, and it can still be used despite the changes to the table structure.

SQL> select owner,table_name from dba_tab_columns where COLUMN_NAME='STATID' AND DATA_TYPE='VARCHAR2';

Though this is not a critical step, it is required as part of the post-upgrade tasks. Here is a small action plan to generate and run the commands needed to upgrade all statistics tables.

Connect as SYS database user and run these steps:
SQL> set pages 1000
SQL> set head off
SQL> set feedback off
SQL> spool /home/oracle/stattab_upg.sql
SQL> select 'EXEC DBMS_STATS.UPGRADE_STAT_TABLE('''||owner||''','''||table_name||''');' from dba_tab_columns where COLUMN_NAME='STATID' AND DATA_TYPE='VARCHAR2';
SQL> spool off
SQL> @/home/oracle/stattab_upg.sql
SQL> exit

Categories: DBA Blogs

Virtual CPUs with Amazon Web Services

Tue, 2014-06-24 15:41

Some months ago, Amazon Web Services changed the way they measure CPU capacity on their EC2 compute platform. In addition to the old ECUs, there is a new unit to measure compute capacity: vCPUs. The instance type page defines a vCPU as “a hyperthreaded core for M3, C3, R3, HS1, G2, and I2.” The description seems a bit confusing: is it a dedicated CPU core (which has two hyperthreads in the E5-2670 v2 CPU platform being used), or is it a half-core, single hyperthread?

I decided to test this out for myself by setting up one of the new-generation m3.xlarge instances (with thanks to Christo for technical assistance). It is stated to have 4 vCPUs running on an E5-2670 v2 processor at 2.5GHz on the Ivy Bridge-EP microarchitecture (or sometimes 2.6GHz in the case of xlarge instances).

Investigating for ourselves

I’m going to use paravirtualized Amazon Linux 64-bit for simplicity:

$ ec2-describe-images ami-fb8e9292 -H
Type    ImageID Name    Owner   State   Accessibility   ProductCodes    Architecture    ImageType       KernelId        RamdiskId Platform        RootDeviceType  VirtualizationType      Hypervisor
IMAGE   ami-fb8e9292    amazon/amzn-ami-pv-2014.03.1.x86_64-ebs amazon  available       public          x86_64  machine aki-919dcaf8                      ebs     paravirtual     xen
BLOCKDEVICEMAPPING      /dev/sda1               snap-b047276d   8

Launching the instance:

$ ec2-run-instances ami-fb8e9292 -k marc-aws --instance-type m3.xlarge --availability-zone us-east-1d
RESERVATION     r-cde66bb3      462281317311    default
INSTANCE        i-b5f5a2e6      ami-fb8e9292                    pending marc-aws        0               m3.xlarge       2014-06-16T20:23:48+0000  us-east-1d      aki-919dcaf8                    monitoring-disabled                              ebs                                      paravirtual     xen             sg-5fc61437     default

The instance is up and running within a few minutes:

$ ec2-describe-instances i-b5f5a2e6 -H
Type    ReservationID   Owner   Groups  Platform
RESERVATION     r-cde66bb3      462281317311    default
INSTANCE        i-b5f5a2e6      ami-fb8e9292    ec2-54-242-182-88.compute-1.amazonaws.com       ip-10-145-209-67.ec2.internal     running marc-aws        0               m3.xlarge       2014-06-16T20:23:48+0000        us-east-1d      aki-919dcaf8                      monitoring-disabled     54.242.182.88   10.145.209.67                   ebs                      paravirtual      xen             sg-5fc61437     default
BLOCKDEVICE     /dev/sda1       vol-1633ed53    2014-06-16T20:23:52.000Z        true

Logging in as ec2-user. First of all, let’s see what /proc/cpuinfo says:

[ec2-user@ip-10-7-160-199 ~]$ egrep '(processor|model name|cpu MHz|physical id|siblings|core id|cpu cores)' /proc/cpuinfo
processor       : 0
model name      : Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
cpu MHz         : 2599.998
physical id     : 0
siblings        : 4
core id         : 0
cpu cores       : 1
processor       : 1
model name      : Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
cpu MHz         : 2599.998
physical id     : 0
siblings        : 4
core id         : 0
cpu cores       : 1
processor       : 2
model name      : Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
cpu MHz         : 2599.998
physical id     : 0
siblings        : 4
core id         : 0
cpu cores       : 1
processor       : 3
model name      : Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
cpu MHz         : 2599.998
physical id     : 0
siblings        : 4
core id         : 0
cpu cores       : 1

Looks like I got some of the slightly faster 2.6GHz CPUs. /proc/cpuinfo shows four processors, each with physical id 0 and core id 0. Or in other words, one single-core processor with 4 threads. We know that the E5-2670 v2 processor is actually a 10-core processor, so the information reported at the OS level does not correspond to the physical hardware.
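This interpretation can be checked mechanically. The helper below is a sketch of my own (not from the original post): it reduces /proc/cpuinfo-style output to a count of logical CPUs versus distinct (physical id, core id) pairs. Here it is fed a sample mimicking the listing above, where all four processors claim the same core:

```shell
# Count logical CPUs and distinct (physical id, core id) pairs.
cpu_topology() {
  awk -F': *' '
    /^processor/   { threads++ }              # one entry per logical CPU
    /^physical id/ { phys = $2 }
    /^core id/     { seen[phys "," $2] = 1 }  # distinct socket/core pairs
    END {
      cores = 0
      for (k in seen) cores++
      printf "%d logical CPUs on %d reported core(s)\n", threads, cores
    }'
}

# Sample input mimicking the instance above: 4 processors, all core id 0.
printf 'processor\t: %d\nphysical id\t: 0\ncore id\t\t: 0\n' 0 1 2 3 | cpu_topology
# -> 4 logical CPUs on 1 reported core(s)
```

On a bare-metal machine, piping the real /proc/cpuinfo through the same function would show the actual thread-to-core ratio.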

Nevertheless, we’ll proceed with a few simple tests. I’m going to run “gzip”, an integer-compute-intensive compression test, on 2.2GB of zeroes from /dev/zero. By using synthetic input and discarding output, we can avoid the effects of disk I/O. I’m going to combine this test with taskset commands to impose processor affinity on the process.

A simple test

The simplest case: a single thread, on processor 0:

[ec2-user@ip-10-7-160-199 ~]$ taskset -pc 0 $$
pid 1531's current affinity list: 0-3
pid 1531's new affinity list: 0
[ec2-user@ip-10-7-160-199 ~]$ dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2 ) | gzip -c > /dev/null
2170552320 bytes (2.2 GB) copied, 17.8837 s, 121 MB/s

With the single processor, we can process 121 MB/sec. Let’s try running two gzips at once. Sharing a single processor, we should see half the throughput.

[ec2-user@ip-10-7-160-199 ~]$ for i in {1..2}; do dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2 ) | gzip -c > /dev/null & done
2170552320 bytes (2.2 GB) copied, 35.8279 s, 60.6 MB/s
2170552320 bytes (2.2 GB) copied, 35.8666 s, 60.5 MB/s
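The two-pipeline pattern above recurs throughout these tests, so it can be wrapped in a small helper. This is a convenience sketch of my own; dd prints its throughput summary on stderr, which is why it is collected into a temporary log:

```shell
# gzip_bench N [MB]: run N concurrent dd|gzip pipelines and print each
# pipeline's dd throughput summary once they have all finished.
gzip_bench() {
  n=$1
  mb=${2:-2070}                 # default matches the 2.2 GB tests above
  log=$(mktemp)
  i=0
  while [ "$i" -lt "$n" ]; do
    ( dd if=/dev/zero bs=1M count="$mb" 2>> "$log" | gzip -c > /dev/null ) &
    i=$((i + 1))
  done
  wait                          # block until every pipeline is done
  grep bytes "$log"             # one "bytes ... copied" line per pipeline
  rm -f "$log"
}
```

gzip_bench 2 reproduces the two-thread run above; gzip_bench 2 8 makes a quick smoke test on 8 MB of input.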

Sharing those cores

Now, let’s make things more interesting: two threads, on adjacent processors. If they are truly dedicated CPU cores, we should get a full 121 MB/s each. If our processors are in fact hyperthreads, we’ll see throughput drop.

[ec2-user@ip-10-7-160-199 ~]$ taskset -pc 0,1 $$
pid 1531's current affinity list: 0
pid 1531's new affinity list: 0,1
[ec2-user@ip-10-7-160-199 ~]$ for i in {1..2}; do dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2 ) | gzip -c > /dev/null & done
2170552320 bytes (2.2 GB) copied, 27.1704 s, 79.9 MB/s
2170552320 bytes (2.2 GB) copied, 27.1687 s, 79.9 MB/s

We have our answer: throughput has dropped by a third, to 79.9 MB/sec, showing that processors 0 and 1 are threads sharing a single core. (But note that hyperthreading is giving performance benefits here: 79.9 MB/s on a shared core is higher than the 60.5 MB/s we see when sharing a single hyperthread.)
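To put the hyperthreading benefit in concrete terms, here is a back-of-the-envelope aggregate comparison. The throughput figures are copied from the runs above; the arithmetic is my addition:

```shell
awk 'BEGIN {
  solo        = 121    # MB/s, one thread alone on a dedicated core
  shared_ht   = 60.5   # MB/s per thread, two threads on one hyperthread
  shared_core = 79.9   # MB/s per thread, two threads on sibling hyperthreads
  printf "one hyperthread, 2 threads : %5.1f MB/s aggregate\n", 2 * shared_ht
  printf "shared core, 2 threads     : %5.1f MB/s aggregate\n", 2 * shared_core
  printf "hyperthreading gain        : %.0f%%\n", (2 * shared_core / solo - 1) * 100
}'
```

Two threads on sibling hyperthreads deliver about a third more aggregate throughput than the single core can produce alone, which is in line with typical hyperthreading gains for integer workloads.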

Trying the exact same test, but this time, non-adjacent processors 0 and 2:

[ec2-user@ip-10-7-160-199 ~]$ taskset -pc 0,2 $$
pid 1531's current affinity list: 0,1
pid 1531's new affinity list: 0,2
[ec2-user@ip-10-7-160-199 ~]$ for i in {1..2}; do dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2 ) | gzip -c > /dev/null & done
2170552320 bytes (2.2 GB) copied, 17.8967 s, 121 MB/s
2170552320 bytes (2.2 GB) copied, 17.8982 s, 121 MB/s

All the way up to full-speed, showing dedicated cores.

What does this all mean? Let’s go back to Amazon’s vCPU definition:

Each vCPU is a hyperthreaded core

As our tests have shown, a vCPU is most definitely not a core. It’s half of a shared core, or one hyperthread.

A side effect: inconsistent performance

There’s another issue at play here too: the shared-core behavior is hidden from the operating system. Going back to /proc/cpuinfo:

[ec2-user@ip-10-7-160-199 ~]$ grep 'core id' /proc/cpuinfo
core id         : 0
core id         : 0
core id         : 0
core id         : 0

This means that the OS scheduler has no way of knowing which processors share a core, and cannot schedule tasks around it. Let’s go back to our two-thread test, but instead of restricting it to two specific processors, we’ll let it run on any of them.

[ec2-user@ip-10-7-160-199 ~]$ taskset -pc 0-3 $$
pid 1531's current affinity list: 0,2
pid 1531's new affinity list: 0-3
[ec2-user@ip-10-7-160-199 ~]$ for i in {1..2}; do dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2 ) | gzip -c > /dev/null & done
2170552320 bytes (2.2 GB) copied, 18.041 s, 120 MB/s
2170552320 bytes (2.2 GB) copied, 18.0451 s, 120 MB/s
[ec2-user@ip-10-7-160-199 ~]$ for i in {1..2}; do dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2 ) | gzip -c > /dev/null & done
2170552320 bytes (2.2 GB) copied, 21.2189 s, 102 MB/s
2170552320 bytes (2.2 GB) copied, 21.2215 s, 102 MB/s
[ec2-user@ip-10-7-160-199 ~]$ for i in {1..2}; do dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2 ) | gzip -c > /dev/null & done
2170552320 bytes (2.2 GB) copied, 26.2199 s, 82.8 MB/s
2170552320 bytes (2.2 GB) copied, 26.22 s, 82.8 MB/s

We see throughput varying between 82 MB/sec and 120 MB/sec for the exact same workload. To get some more performance information, we’ll configure top to take ten 3-second samples with per-processor usage information:

[ec2-user@ip-10-7-160-199 ~]$ cat > ~/.toprc <<-EOF
RCfile for "top with windows"           # shameless braggin'
Id:a, Mode_altscr=0, Mode_irixps=1, Delay_time=3.000, Curwin=0
Def     fieldscur=AEHIOQTWKNMbcdfgjplrsuvyzX
        winflags=25913, sortindx=10, maxtasks=2
        summclr=1, msgsclr=1, headclr=3, taskclr=1
Job     fieldscur=ABcefgjlrstuvyzMKNHIWOPQDX
        winflags=62777, sortindx=0, maxtasks=0
        summclr=6, msgsclr=6, headclr=7, taskclr=6
Mem     fieldscur=ANOPQRSTUVbcdefgjlmyzWHIKX
        winflags=62777, sortindx=13, maxtasks=0
        summclr=5, msgsclr=5, headclr=4, taskclr=5
Usr     fieldscur=ABDECGfhijlopqrstuvyzMKNWX
        winflags=62777, sortindx=4, maxtasks=0
        summclr=3, msgsclr=3, headclr=2, taskclr=3
EOF
[ec2-user@ip-10-7-160-199 ~]$ top -b -n10 -U ec2-user
top - 21:07:50 up 43 min,  2 users,  load average: 0.55, 0.45, 0.36
Tasks:  86 total,   4 running,  82 sleeping,   0 stopped,   0 zombie
Cpu0  : 96.7%us,  3.3%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  :  0.0%us,  1.4%sy,  0.0%ni, 97.9%id,  0.0%wa,  0.3%hi,  0.0%si,  0.3%st
Cpu2  : 96.0%us,  4.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  :  0.0%us,  1.0%sy,  0.0%ni, 97.9%id,  0.0%wa,  0.7%hi,  0.0%si,  0.3%st

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 1766 ec2-user  20   0  4444  608  400 R 99.7  0.0   0:06.08 gzip
 1768 ec2-user  20   0  4444  608  400 R 99.7  0.0   0:06.08 gzip

Here two non-adjacent CPUs are in use. But 3 seconds later, the processes are running on adjacent CPUs:

top - 21:07:53 up 43 min,  2 users,  load average: 0.55, 0.45, 0.36
Tasks:  86 total,   4 running,  82 sleeping,   0 stopped,   0 zombie
Cpu0  : 96.3%us,  3.7%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  : 96.0%us,  3.6%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.3%hi,  0.0%si,  0.0%st
Cpu2  :  0.0%us,  0.0%sy,  0.0%ni, 99.3%id,  0.0%wa,  0.3%hi,  0.0%si,  0.3%st
Cpu3  :  0.3%us,  0.0%sy,  0.0%ni, 99.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.3%st

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 1766 ec2-user  20   0  4444  608  400 R 99.7  0.0   0:09.08 gzip
 1768 ec2-user  20   0  4444  608  400 R 99.7  0.0   0:09.08 gzip

Although usage percentages are similar, we’ve seen earlier that throughput drops by a third when cores are shared, and we see varied throughput as the processes are context-switched between processors.

This type of situation arises where compute-intensive workloads are running, and when there are fewer processes than total CPU threads. If only AWS reported correct core IDs to the system, this problem wouldn’t happen: the OS scheduler would make sure processes did not share cores unless necessary.
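Until that is fixed, one workaround is to impose affinity yourself. The sketch below is my own and assumes the pairing measured above (vCPUs 0/1 and 2/3 are hyperthread siblings); it prints one taskset pinning per worker so that compute-heavy processes land on distinct physical cores before doubling up on siblings:

```shell
# Print a pinning plan: workers are assigned vCPUs in the order 0, 2, 1, 3,
# i.e. one hyperthread per physical core first, siblings only after that.
pin_plan() {
  nworkers=$1
  i=0
  while [ "$i" -lt "$nworkers" ]; do
    cpu=$(( (2 * i) % 4 + (2 * i / 4) % 2 ))   # yields 0, 2, 1, 3, 0, 2, ...
    echo "taskset -c $cpu worker-$i"
    i=$((i + 1))
  done
}

pin_plan 2
# -> taskset -c 0 worker-0
#    taskset -c 2 worker-1
```

In practice you would replace the echo with an exec of the actual workload under taskset; worker-N is a placeholder name.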

Here’s a chart summarizing the results:

[Chart: aws-cpu]

Summing up

Over the course of the testing I’ve learned two things:

  • A vCPU in an AWS environment actually represents only half a physical core. So if you’re looking for equivalent compute capacity to, say, an 8-core server, you would need a so-called 4xlarge EC2 instance with 16 vCPUs. So take it into account in your costing models!
  • The mislabeling of the CPU threads as separate single-core processors can result in performance variability as processes are switched between threads. This is something the AWS and/or Xen teams should be able to fix in the kernel.

Readers: what has been your experience with CPU performance in AWS? If any of you has access to a physical machine running E5-2670 processors, it would be interesting to see how the simple gzip test runs.

Categories: DBA Blogs

Working in Pythian’s Advanced Technology Consulting Group

Mon, 2014-06-23 08:22

Before I joined Pythian, I had the fortune of having a lot of good jobs across various industries. My favorite jobs were the ones that were fast paced and required me to ramp up my skills on the fly while learning new technology. My least favorite jobs were the ones where my skills were not used and the challenges were few and far between. When I joined Pythian, I hadn’t yet realized that I had found my first great job.

In April 2012, I joined Pythian’s Professional Consulting Group (PCG). The members of PCG represented some of the world’s leading data experts, but the name did not adequately represent the skills of the members. Members of PCG were experts in many complementary technologies and many, if not all, were quickly becoming experts in emerging technologies such as Big Data. Because of this, the Professional Consulting Group became the Advanced Technology Consulting Group (ATCG).

As a member of ATCG, my main responsibility is to deliver consulting services to our customers either on site or remotely. Examples of some of the work we might do include: troubleshooting performance problems, migrating databases into Exadata, setting up replication with Oracle GoldenGate, and data integration with numerous sources using Oracle Data Integrator. While all of the items I mentioned deal with Oracle technologies, ATCG also has members who specialize in Microsoft SQL Server and MySQL.

The services we provide to our customers do not stop at traditional database services; ATCG also delivers Big Data services using Hadoop. Examples of some of the Hadoop work I have been involved with include: installing and configuring Cloudera Hadoop, securing Hadoop with Kerberos, and troubleshooting performance. As you can see, ATCG has the opportunity to gain valuable experience across a broad range of technologies.

All of our projects begin with a call with the potential customer. Members of ATCG serve as a technical resource on the call. It is our responsibility to understand the customer’s issue and estimate the effort required to perform the work. Sometimes this can be challenging because the customer might not have a technical resource on their side who can articulately convey the issue. Even if there is a technical resource on the customer’s side, we have to be mindful not to alienate others on the call, so it is vitally important that we are able to convey our message in a way everybody on the call can understand.

You might be thinking “I am not a salesperson!” and “I have never used some of these technologies.” You would not be alone. ATCG members are not salespeople; we simply assist Sales by providing our technical knowledge on a call. Imagine that you are speaking with your boss or a customer about a problem or issue: it really is no different. Dealing with new technology is also little different at Pythian from your current job. If you don’t understand something today, you can talk to a few coworkers or research it on the net; at Pythian we can reach out to 200+ coworkers and find quite a few who have experience with the technology in question. We can search our internal technical documents, which are quite vast as they detail all of the work we have done, and as a last resort we can search the net. At Pythian, you are never alone and you are never without resources.

There are times when we might not have a project to work on, a.k.a. downtime. During our downtime, we can build our knowledge of technologies that interest us or that we need a refresher on. We can practice our newfound knowledge by assisting other teams. We can help build the Pythian knowledge base by posting blogs and contributing to our internal documentation.

The work in ATCG is very challenging and you are always learning something new, whether it is a new technology or a new way of thinking about a particular topic. Being bored or pigeonholed is not a problem in ATCG; we are involved in some of the toughest problems and work with the latest technologies. And when we are not, we are in control of our workday, so we can pursue interests in new and emerging database technologies.

Categories: DBA Blogs

Ambari Blueprints and One-Touch Hadoop Clusters

Fri, 2014-06-20 11:11

For those who aren’t familiar, Apache Ambari is the best open source solution for managing your Hadoop cluster: it’s capable of adding nodes, assigning roles, managing configuration and monitoring cluster health. Ambari is HortonWorks’ version of Cloudera Manager and MapR’s Warden, and it has been steadily improving with every release. As of version 1.5.1, Ambari added support for a declarative configuration (called a Blueprint) which makes it easy to automatically create clusters with many ecosystem components in the cloud. I’ll give an example of how to use Ambari Blueprints, and compare them with existing one-touch deployment methods for other distributions.

Why would I want that?

I’ve been working on improving the methodology used by the Berkeley Big Data Benchmark. Right now spinning up the clusters is a relatively manual process, where the user has to step through the web interfaces of Cloudera Manager and Ambari, copy-paste certificates and IPs, and assign roles to nodes. The benchmark runs on EC2 instances, so I’ve been focused on automatic ways to create clusters on Amazon:

  • Apache Whirr can create a Hadoop cluster (or a number of other Big Data technologies), including CDH5, MapR and HDP. Documentation is sparse, and there doesn’t appear to be support for installing ecosystem projects like Hive automatically.
  • Amazon EMR supports installing Hive and Impala natively, and other projects like Shark via bootstrap actions. These tend to be older versions which aren’t suitable for my purposes.
  • MapR’s distribution is also available on EMR, but I haven’t used that since the different filesystem (MapRFS vs. HDFS) would impact results.

Hive-on-Tez is only supported on HDP at the moment, so it’s crucial that I have a one-touch command to create both CDH5 and HDP clusters. Ambari Blueprints provide a crucial piece of the solution.

The Blueprint

Blueprints themselves are just JSON documents you send to the Ambari REST API. Every Ambari Blueprint has two main parts: a list of “host groups”, and configuration.

Host Groups

Host groups are a set of machines with the same agents (“components” in Ambari terms) installed – a typical cluster might have host groups for the NameNode, SecondaryNameNode, ResourceManager, DataNodes and client nodes for submitting jobs. The small clusters I’m creating have a “master” host group with the NameNode, ResourceManager, and HiveServer components on a single server, and then a collection of “slaves” running the NodeManager and DataNode components. Besides a list of software components to install, every host group has a cardinality. Right now this is a bit of a pain, since the cardinality is exact: a blueprint with 5 slave nodes must have exactly 5 slaves. Hopefully the developers will add an option for “many”, so we don’t have to generate a new blueprint for every different-sized cluster. Thanks to John from HortonWorks for a correction: cardinality is an optional hint which isn’t validated by Ambari. This wasn’t clear from the docs.

To provide a concrete example, the sample host groups I’m using look like this:

"host_groups" : [
 {
 "name" : "master",
 "components" : [
 {
 "name" : "NAMENODE"
 },
 {
 "name" : "SECONDARY_NAMENODE"
 },
 {
 "name" : "RESOURCEMANAGER"
 },
 {
 "name" : "HISTORYSERVER"
 },
 {
 "name" : "ZOOKEEPER_SERVER"
 },
 {
 "name" : "HIVE_METASTORE"
 },
 {
 "name" : "HIVE_SERVER"
 },
 {
 "name" : "MYSQL_SERVER"
 }
 ],
 "cardinality" : "1"
 },
{
 "name" : "slaves",
 "components" : [
 {
 "name" : "DATANODE"
 },
 {
 "name" : "HDFS_CLIENT"
 },
 {
 "name" : "NODEMANAGER"
 },
 {
 "name" : "YARN_CLIENT"
 },
 {
 "name" : "MAPREDUCE2_CLIENT"
 },
 {
 "name" : "ZOOKEEPER_CLIENT"
 },
 {
 "name" : "TEZ_CLIENT"
 },
 {
 "name" : "HIVE_CLIENT"
 }
 ],
 "cardinality" : "5"
 }
]

This host_groups section describes a single node with all of the “master” components installed, and five slaves with just the DataNode, NodeManager and clients installed. Note that some components have dependencies: it’s possible to build an invalid blueprint which contains a HIVE_METASTORE but not a MYSQL_SERVER. The REST API provides appropriate error messages when such a blueprint is submitted.

Configuration

Configuration allows you to override the defaults for any services you’re installing, and it comes in two varieties: global and service-specific. Global parameters are required by some services: to my knowledge Nagios and Hive require global parameters to be specified. These parameters apply to multiple roles within the cluster, and the API will tell you if any are missing. Most cluster configuration (your typical core-site.xml, hive-site.xml, etc. parameters) can be overridden in the blueprint by specifying a configuration with the leading part of the file name, and then providing a map of the keys to overwrite. The configuration below provides a global variable that Hive requires, and it also overrides some of the default parameters in hive-site.xml. These changes will be propagated to the cluster as if you changed them in the Ambari UI.

"configurations": [
  {
    "global": {
      "hive_metastore_user_passwd": "p"
    }
  },
  {
    "hive-site": {
      "javax.jdo.option.ConnectionPassword": "p",
      "hive.security.authenticator.manager": "org.apache.hadoop.hive.ql.security.HadoopDefaultAuthenticator",
      "hive.execution.engine": "tez",
      "hive.exec.failure.hooks": "",
      "hive.exec.pre.hooks": "",
      "hive.exec.post.hooks": ""
    }
  }
]

This config will override some parameters in hive-site.xml, as well as setting the metastore password to ‘p’. Note that you can specify more configuration files to override (core-site.xml, hdfs-site.xml, etc.), but each file must be its own object in the configurations array, similar to how global and hive-site are handled above.
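Since each file has to be its own single-key object, it’s easy to get the shape wrong when writing blueprints by hand. A small helper like the hypothetical one below (not part of Ambari or my script) can build the configurations array from an ordinary dict of file-name to overrides:

```python
import json

def build_configurations(overrides):
    """Build the blueprint "configurations" array: one single-key
    object per config file, as the blueprint format expects."""
    return [{name: params} for name, params in overrides.items()]

configs = build_configurations({
    "global": {"hive_metastore_user_passwd": "p"},
    "hive-site": {"hive.execution.engine": "tez"},
})
print(json.dumps(configs, indent=2, sort_keys=True))
```

The output is a list of objects like `{"global": {...}}` and `{"hive-site": {...}}`, ready to drop into the blueprint document.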

Once you’ve specified the host groups and any configuration overrides, the blueprint also needs a stack – the versions of software to install. Right now Ambari only supports HDP – see this table for the stack versions supported in each Ambari release. As a weird constraint, the blueprint name is inside the blueprint itself, along with the stack information, and it must be the same as the name you provide to the REST endpoint. To upload a new blueprint to an Ambari server you can use:

$ curl -X POST -H 'X-Requested-By: Pythian' <ambari-host>/api/v1/blueprints/<blueprint name> -d @<blueprint file>

The X-Requested-By header is required, and as noted, the blueprint name in the URL must match the name inside the file.

You can see the entire blueprint file from this example here; feel free to use it as a baseline for your cluster.
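The curl call above can also be made from Python, which is handy if you’re scripting the whole cluster build. This is a minimal sketch using only the standard library; the hostname and blueprint contents are placeholders, and `blueprint_request` is a hypothetical helper of mine, not an Ambari API:

```python
import json
import urllib.request

def blueprint_request(ambari_host, name, blueprint):
    """Build the POST request for /api/v1/blueprints/<name>.
    The name in the URL must match the blueprint_name inside
    the JSON document itself."""
    url = "http://%s/api/v1/blueprints/%s" % (ambari_host, name)
    body = json.dumps(blueprint).encode("utf-8")
    # The X-Requested-By header is required by Ambari.
    return urllib.request.Request(
        url, data=body, method="POST",
        headers={"X-Requested-By": "Pythian"})

req = blueprint_request(
    "ambari.example.com:8080", "hadoop-benchmark",
    {"Blueprints": {"blueprint_name": "hadoop-benchmark",
                    "stack_name": "HDP", "stack_version": "2.1"}})
# urllib.request.urlopen(req)  # uncomment to actually submit
print(req.full_url)
```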

Creating the Cluster

Once you’ve written a blueprint with the services and configuration you want, you need to:

  • Create EC2 instances with the correct security groups
  • Install ambari-master on one, and ambari-agent on the others
  • Configure the agents to report to the master
  • Write a file mapping hosts to host groups
  • Push both files (the blueprint and the mapping) to the REST API

Fortunately, we have a Python script that can do that for you! This script will create a benchmarking cluster with a specific number of data nodes, an Ambari master and a separate Hadoop master. It can easily be modified to create multiple classes of machines, if you want to have more host groups than “master” and “slave”. The core of the script (the EC2 interaction and Ambari RPM installation) is based on work by Ahir Reddy from Databricks, with the Ambari Blueprints support added by yours truly.

If you’re curious about the host mapping file: it has the blueprint name, and an array of host names for every host_group. Corresponding to the example above, the cluster definition would be:

{
  "blueprint": "hadoop-benchmark",
  "host_groups": [
    {
      "name": "master",
      "hosts": [{"fqdn": "host-1"}]
    },
    {
      "name": "slaves",
      "hosts": [
        {"fqdn": "host-2"},
        {"fqdn": "host-3"}
      ]
    }
  ]
}

You could replace “host-n” with the real domain names for your Amazon instances (use the internal ones!), and create a new cluster over those machines using:

$ curl -X POST -H 'X-Requested-By: Pythian' <ambari-host>/api/v1/clusters/<cluster name> -d @<mapping file>
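Since the mapping file is the one piece that changes with every cluster size, it’s worth generating rather than hand-writing. A sketch of a hypothetical helper (the two-group “master”/“slaves” layout matches the blueprint above, but the function name is my own):

```python
import json

def host_mapping(blueprint, master_fqdn, slave_fqdns):
    """Build the cluster-creation document for one master host
    and any number of slave hosts."""
    return {
        "blueprint": blueprint,
        "host_groups": [
            {"name": "master", "hosts": [{"fqdn": master_fqdn}]},
            {"name": "slaves",
             "hosts": [{"fqdn": h} for h in slave_fqdns]},
        ],
    }

mapping = host_mapping("hadoop-benchmark", "host-1", ["host-2", "host-3"])
print(json.dumps(mapping, indent=2))
```

Write the result to a file and POST it as shown in the curl command.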
Conclusion

Ambari Blueprints have some rough edges right now, but they provide a convenient way to deploy all of the services supported by the HDP stack. Watch this space for more posts about my effort to create a repeatable, one-touch, cross-distribution Hadoop SQL benchmark on EC2.

Categories: DBA Blogs

Log Buffer #376, A Carnival of the Vanities for DBAs

Fri, 2014-06-20 08:52

This Log Buffer Edition collects and presents various blog posts from the Oracle, SQL Server and MySQL arenas.

Oracle:

Oracle SOA Suite Achieves Significant Customer Adoption and Industry Recognition

The Perks of Integrated Business Planning

Why are we so excited about Oracle Database XE R2?

Skipping ACS ramp-up using a SQL Patch

Moving datafiles has always been a pain.

SQL Server:

Using DBCC DROPCLEANBUFFERS When Testing Performance

Data Mining Introduction Part 9: Microsoft Linear Regression

PowerShell One-Liners: Accessing, Handling and Writing Data

Stairway to SQL Server Security: Level 2, Authentication

Matthew Flatt was trying to measure the performance of a tool interacting with Microsoft Azure

MySQL:

Using UDFs for geo-distance search in MySQL

Amidst various blog postings on Docker, a security issue announced yesterday that detailed an exploit of Docker that makes it possible to do container breakout.

Discover the MySQL Central @ OpenWorld Content Catalog

Shinguz: Replication channel fail-over with Galera Cluster for MySQL

SQL queries preview before sent to server in MySQL for Excel

Categories: DBA Blogs

Service Reliability: How Systems Administration is Evolving – Part Two

Thu, 2014-06-19 08:06

In part one of this mini-series, I wrote about how the old ways of systems administration are broken and shared a story from my personal experiences. Today I’ll be talking about how it’s evolved with Site Reliability Engineering (also known as Service Reliability Engineering or SRE).

SRE is born

Interestingly, this is a concept I was pushing back around 2008-2010, but unfortunately my ideas fell on deaf ears. How do Netflix, Facebook, Dropbox, and Google provide such reliable service? First I can tell you what they don’t do: they don’t throw more hardware or RAM at the problem.

So how do you do it?

First everyone has to agree that service reliability is everyone’s problem.  It is the reason our jobs exist! Management must agree, because the way a company thinks and operates needs to change from the top down, and the focus needs to be on service reliability.

Operations and Development teams need to be aligned so that outages are everyone’s problem. Remember the story in my previous blog post? What do you think would have happened if I had forced the developers to reboot the servers every 12 hours, then every 24 hours? Would it have taken weeks or months to resolve the problem? Of course not.

You need your Ops team to know how to code, and understand development. I’m not saying they have to actually be on the Dev team, but they need to understand coding, speak the language, and script their way out of issues. Most importantly, they must be on the hook if things go wrong.

At Google and Facebook, their developers are on call. If a release suddenly causes a significant amount of paging/failures, they don’t hire more Ops employees. The developers fill the gap until the issue is fixed, which clearly happens very quickly.

No one wants to be getting paged all night, especially developers. If they’re getting paged due to a bug in their software, you can bet the issue will be resolved in days, not weeks. Making developers responsible for the service reliability means they are required to think about failure in their software development – they’ll have to design in graceful failure modes, and expect their underlying infrastructure to be unreliable. If their little widget takes down the entire site, you can be sure they’re going to be up until it’s corrected.

The bottom line is that software and hardware come together to provide a service, and one isn’t more important than the other. Major wins in reliability can be realized if you align software design, automation, and hardware design.

At Pythian, we offer this as a managed service. Our team has the experience, background, and skills to help get you to this promised land of highly reliable, scalable systems. We are building a team of Site Reliability Engineers to help companies benefit and transition to this new paradigm in systems administration, where service reliability is everyone’s problem.

Categories: DBA Blogs

Service Reliability: How Systems Administration is Evolving – Part One

Wed, 2014-06-18 07:52

The world of systems administration is changing, and it’s affecting everyone involved. Today’s blog post is the first of two in a mini-series dedicated to Service Reliability: How Systems Administration is Evolving.

The days of formulas that tell us you need “X” system admins for every “Y” physical servers and every “Z” VMs are coming to a close. Even the world of IT management is changing.

Why? Because as scale increases, it’s simply impossible to continue at the pace we were at. Google saw this in the mid 2000’s and began the next evolution of systems administration. They recognized that there was no way they could scale up the way things had been managed for decades. In fact, Netflix came to a similar conclusion.

Interestingly, what Google did has nothing to do with technology, but rather with the philosophy of systems administration. They started a new group which they originally called Production Engineering, and was later renamed Site Reliability Engineering, also known as Service Reliability Engineering or SRE.  At its core, SRE changes the fundamental thinking of IT management. It recognizes site reliability as everyone’s responsibility. Some might say that’s obvious, but in the past it wasn’t.


The old way is broken

Most companies have two very separate and distinct groups: Operations and Development. Historically these two groups are highly siloed, and in some cases, do not get along very well. Why? It comes down to philosophy, really.

Operations folks are driven to ensure systems are up, secure, and reliable. Developers, on the other hand, are driven to create cool new features and applications. Here lies one of the biggest problems.

Years back I worked as an Operations Director, and had a counterpart on the development side who was the Software Engineering Director. We had just completed releasing a major update for one of our platforms, and very quickly we saw we had major issues. Our primary application servers (25+ physical boxes) were becoming unstable after about 12 hours of production load (I won’t go into why this happened, that’s a story for another day.) We quickly identified this, so the Ops team began rebooting these boxes in a rolling fashion. They were boxes that had some specialized hardware in them, and starting/stopping, then testing them took about 15-30 minutes each.  We had a team of about 5 people, which was not a 24/7 group. Clearly this caused significant pain for our Operations staff. We determined that part of the problem was a memory leak. Due to the nature of the release, rolling back simply was not an option.

The initial response I received was that we would just have to deal with it for now, as there were a few other pressing issues they wanted to resolve first. After many sleepless nights and lost weekends, we were finally able to get an update so the systems only needed to be rebooted daily, 7 days a week. It stayed this way for months.

Crazy, right?

But why? It was because the software team, and the management we both reported to, was far more interested in hitting deadlines for features, and new functionality – not how much sleep, or how many days off our Ops employees were getting. I was told on more than one occasion that high availability and recovery were Ops problems, not Development problems.

The core of this problem is simple. Development felt that service reliability was 100% an Operations problem. Our new release takes 2x more RAM? Add more RAM to 100 servers! Our new application requires 20 new servers? Sure, with some work it could be cut down to 2-3, but just get the 20 servers. That’s easy!

Without naming names, has anyone else faced this issue? Comment below. Stay tuned for part two, where I’ll be discussing the birth of SRE, how it’s allowed systems administration to evolve, and how to achieve it.

Categories: DBA Blogs

Room for Improvement – Using DBMS_REDEFINITION with Spatial Data

Tue, 2014-06-17 07:53
Resizing a Column in a Table With a Spatial Column

Recently a client asked for help with using DBMS_REDEFINITION to modify a column in a table.

As the intent was to reduce the precision of a numeric column, simply modifying the column would not work, as there was already data in the column:

13:51:44 ora11203fs.jks.com - jkstill@js01 SQL> /
alter table t modify(id number(6,2))
*
ERROR at line 1:
ORA-01440: column to be modified must be empty to decrease precision or scale

The client had made a few attempts to use DBMS_REDEFINITION, but ran into several issues with each attempt.

In this case the owner of the destination table was different than the owner of the source table.
As per My Oracle Support, this situation requires several permissions to be granted that may not normally be necessary for using DBMS_REDEFINITION.
However I have seen similar problems when the source and destination user are the same.

The version of database for the client is 11.2.0.3 running on Solaris.
These tests were also performed on 11.2.0.3, but on Linux 5 rather than Solaris.

Most of the issues involve permissions that must be granted to the owner of the destination object.
This is true even if the job is being run as SYSDBA.
These errors were due to the table in question having a column of spatial data with a corresponding spatial index.

17:27:49 ora11203fs.jks.com - sys@js01 SQL> desc spdata
 Name                                      Null?    Type
 ----------------------------------------- -------- --------------------------------
 ID                                        NOT NULL NUMBER(28)
 RADIUS                                    NOT NULL NUMBER(16,8)
 LOCATION                                  NOT NULL MDSYS.SDO_GEOMETRY

Workflow

The information in this article will be presented somewhat backwards to the way that is usually seen.
First I will show a working example of using DBMS_REDEFINITION to redefine the column.
Following that some of the problems will be highlighted, and then some technical references shown.

Doing so will make this article a little more user-friendly, I think.
If you are facing a similar issue then the parts you need the most are right up front.

A Working Example

Following are the steps to create the test data, set up DBMS_REDEFINITION, and complete the process.

Create the Test Data

The first step is to create a test user. The permissions granted to this user are quite important, as will be demonstrated later on.

-- redefdest_user.sql

create user redefdest identified by redefdest;

grant resource, connect, create session, unlimited tablespace to redefdest;

grant alter session to redefdest;

grant execute on dbms_redefinition to redefdest;

grant create any sequence  to redefdest;
grant create any table  to redefdest;
grant alter any table  to redefdest;
grant drop any table  to redefdest;
grant lock any table  to redefdest;
grant select any table to redefdest;
grant create any index to redefdest;
grant create any trigger  to redefdest;

Once this process is completed, remember to revoke any privileges that are not needed on a permanent basis.

Now let’s create a simple table with some spatial data:

-- tab_create.sql

create table redefdest.spdata
(
   id number(28,0) not null,
   radius number(16,8) not null,
   location mdsys.sdo_geometry not null
);

The next step is to setup the Geometry Metadata. This must be run as the owner of the SPDATA table.

-- insert_geo_data.sql

delete from user_sdo_geom_metadata where  table_name = 'SPDATA' and column_name = 'LOCATION';

insert into user_sdo_geom_metadata
(table_name, column_name, diminfo, srid)
values
(
   'SPDATA', 'LOCATION',
   sdo_dim_array (
      sdo_dim_element ('Longitude', -180, 180, 0.005),
      sdo_dim_element ('Latitude', -90, 90, 0.005)
   ),
   8307 -- SRID - Spatial Reference IDentifier - see mdsys.cs_srs
);

commit;

select * from user_sdo_geom_metadata;

Now we can insert some test data into the table.

-- insert.sql

insert into spdata(id, radius, location)
select
   id,
   dbms_random.value(10000,20000),
   sdo_geometry(2001, 8307,
       sdo_point_type (floor(dbms_random.value(-180,180)),floor(dbms_random.value(-90,90)) , null),
       null, null
   )
from (
select level id
from dual
connect by level <= 100
) data;

commit;

Now create indexes on the ID column and the spatial data column.

-- idx_create.sql

create index redefdest.spdata_id_idx on redefdest.spdata (id);

create index redefdest.spdata_location_idx on redefdest.spdata (location)
   indextype is mdsys.spatial_index  parameters ('SDO_DML_BATCH_SIZE=2000');
Configure DBMS_REDEFINITION

The goal of this exercise is to change the scale and precision of the RADIUS column.

That cannot be done directly via ALTER TABLE, as any attempt to do so will cause ORA-01440.

The first task is to create the interim table. This table will be created nearly identical to the source table. The RADIUS column in the interim table will have a smaller scale and precision than the source table.

The columns for the table SPDATA_INTERIM are all set as nullable.
The reason for this will be explained later on.

-- create_interim_table.sql

create table redefdest.spdata_interim
(
   id number(28,0),
   radius number(12,4),
   location mdsys.sdo_geometry
);

Now geo data for the interim table must be created. This is not the same script as seen previously. Though the script name is the same, it is stored in a different directory. The only difference from the previous script is the name of the table used – SPDATA_INTERIM rather than SPDATA.
As before, this SQL must be run as the owner of the table in question, SPDATA_INTERIM in this case.

-- insert_geo_data.sql

insert into user_sdo_geom_metadata
(table_name, column_name, diminfo, srid)
VALUES
(
   'SPDATA_INTERIM', 'LOCATION',
   sdo_dim_array (
      sdo_dim_element ('Longitude', -180, 180, 0.005),
      sdo_dim_element ('Latitude', -90, 90, 0.005)
   ),
   8307 -- SRID  - see mdsys.cs_srs
)
/

commit;

Now that the interim table has been created and the geo data inserted, the redefinition process can begin:

-- redefine.sql

  1  declare
  2  v_col_map varchar(2048) := 'ID ,RADIUS ,LOCATION';
  3  begin
  4     dbms_redefinition.start_redef_table (
  5              uname          => 'REDEFDEST'
  6             ,orig_table     => 'SPDATA'
  7             ,int_table      => 'SPDATA_INTERIM'
  8             ,col_mapping    => v_col_map
  9             ,options_flag   => dbms_redefinition.cons_use_rowid
 10             ,orderby_cols   => null
 11             ,part_name      => null
 12  );
 13* end;
17:34:51 ora11203fs.jks.com - jkstill@js01 SQL> /

PL/SQL procedure successfully completed.

Once the redefinition process has completed, the dependent objects can be created.

-- copy_dependent_objects.sql

Wrote file afiedt.buf

  1  declare
  2     v_number_of_errors number := 0;
  3  begin
  4     dbms_redefinition.copy_table_dependents(
  5             uname             => 'REDEFDEST'
  6             ,orig_table       => 'SPDATA'
  7             ,int_table        => 'SPDATA_INTERIM'
  8             ,copy_indexes     =>  dbms_redefinition.cons_orig_params
  9             ,copy_triggers    =>  true
 10             ,copy_constraints =>  true
 11             ,copy_privileges  =>  true
 12             ,ignore_errors    => false
 13             ,num_errors       => v_number_of_errors
 14             ,copy_statistics  => true
 15             ,copy_mvlog       => true
 16     );
 17     dbms_output.put_line('Number of Errors' || v_number_of_errors);
 18* END;
17:35:58 ora11203fs.jks.com - jkstill@js01 SQL> /

PL/SQL procedure successfully completed.

And now finish the redefinition process, check the SPDATA.RADIUS column to ensure it is now NUMBER(12,4), and drop the interim table (now the original table).

-- finish_redef.sql

  1  begin
  2    dbms_redefinition.finish_redef_table (
  3      uname          => 'REDEFDEST'
  4     ,orig_table     => 'SPDATA'
  5     ,int_table      => 'SPDATA_INTERIM'
  6  );
  7* end;
17:35:59 ora11203fs.jks.com - jkstill@js01 SQL> /

17:36:43 ora11203fs.jks.com - jkstill@js01 SQL> alter table redefdest.spdata modify (id not null);

Table altered.

17:44:05 ora11203fs.jks.com - jkstill@js01 SQL>  alter table redefdest.spdata modify (radius not null);
alter table redefdest.spdata modify (radius not null)
                                     *
ERROR at line 1:
ORA-01442: column to be modified to NOT NULL is already NOT NULL

17:44:05 ora11203fs.jks.com - jkstill@js01 SQL> alter table redefdest.spdata modify (location not null);

Table altered.

PL/SQL procedure successfully completed.

16:01:40 ora112304a.jks.com - redefdest@js02 SQL> desc spdata
 Name                                      Null?    Type
 ----------------------------------------- -------- ----------------------------
 ID                                        NOT NULL NUMBER(28)
 RADIUS                                             NUMBER(12,4)
 LOCATION                                  NOT NULL MDSYS.SDO_GEOMETRY

16:01:40 ora112304a.jks.com - redefdest@js02 SQL> select count(*) from spdata where radius is not null;

  COUNT(*)
----------
       100

With all of the required permissions in place, everything works as expected.

Getting to that point however required reading a few Oracle Support notes and some experimentation.

There are several ways this process can fail if one of the requirements is not met.

Alter/Create/Drop Any Table

If the owner of the tables is missing any of ALTER/CREATE/DROP ANY TABLE, the process will fail when copying dependent objects.
This is due to a bug in Oracle.

In this test the CREATE ANY TABLE privilege has been revoked, and the entire test re-run:

18:04:25 ora11203fs.jks.com - sys@js01 SQL> revoke create any table from redefdest;

Revoke succeeded.

## Copy Dependents ##

declare
*
ERROR at line 1:
ORA-29855: error occurred in the execution of ODCIINDEXCREATE routine
ORA-13249: internal error in Spatial index: [mdidxrbd]
ORA-13249: Error in Spatial index: index build failed
ORA-13249: Error in R-tree: [mdrcrtscrt]
ORA-13231: failed to create index table [MDRT_190DB$] during R-tree creation
ORA-13249: Stmt-Execute Failure: CREATE TABLE "REDEFDEST".MDRT_190DB$ (NODE_ID NUMBER, NODE_LEVEL NUMBER, INFO BLOB)
LOB (INFO) STORE AS (NOCACHE)  PCTFREE 2
ORA-29400: data cartridge error
ORA-01031: insufficient privileges
ORA-06512: at "SYS.DBMS_REDEFINITION", line 1155
ORA-06512: at "SYS.DBMS_REDEFINITION", line 1885
ORA-06512: at line 4

Likewise if CREATE ANY SEQUENCE is revoked, the process will fail.
(CREATE ANY TABLE had already been re-granted)


18:12:23 ora11203fs.jks.com - sys@js01 SQL> revoke create any sequence from redefdest;

Revoke succeeded.

## Copy Dependents ##

declare
*
ERROR at line 1:
ORA-29855: error occurred in the execution of ODCIINDEXCREATE routine
ORA-13249: internal error in Spatial index: [mdidxrbd]
ORA-13249: Error in Spatial index: index build failed
ORA-13249: Error in R-tree: [mdrcrtscrt]
ORA-13231: failed to create index table [MDRS_190F9$] during R-tree creation
ORA-13249: Stmt-Execute Failure: CREATE SEQUENCE "REDEFDEST".MDRS_190F9$ ORDER START WITH 1 CACHE 100
ORA-29400: data cartridge error
ORA-01031: insufficient privileges
ORA-06512: at "SYS.DBMS_REDEFINITION", line 1155
ORA-06512: at "SYS.DBMS_REDEFINITION", line 1885
ORA-06512: at line 4

So each of these permissions is required for this process to succeed, even when running as SYSDBA.

Permissions Required

The test objects created for this article are quite simple.
As such there are other potential errors that could occur due to the destination account not having all required permissions for a successful use of dbms_redefinition on a table with spatial data.
Before granting new privileges to an account, be sure to audit the current privileges.
That way you will know which privileges can be safely revoked when the table modifications are complete.

Constraint Issues

There are some bugs associated with DBMS_REDEFINITION.

The COPY_DEPENDENT_OBJECTS subprogram seemingly does not handle some constraints very well.

The original table SPDATA was created with all columns set to NOT NULL.

The interim table SPDATA_INTERIM was created with all columns set to NULL.

If the interim table is created with one or more columns as NOT NULL, the following error occurs:

## Copy Dependents ##
declare
*
ERROR at line 1:
ORA-01442: column to be modified to NOT NULL is already NOT NULL
ORA-06512: at "SYS.DBMS_REDEFINITION", line 984
ORA-06512: at "SYS.DBMS_REDEFINITION", line 1899
ORA-06512: at line 4

Part of the job of the procedure is to create check constraints as found on the original table.

That part of the process doesn’t seem to work quite correctly.

When the process has completed, the columns appear as NULLable to the sqlplus DESC command, but that is not the case.

18:21:15 ora11203fs.jks.com - jkstill@js01 SQL> desc spdata
 Name								   Null?    Type
 ----------------------------------------------------------------- -------- --------------------------------------------
 ID									    NUMBER(28)
 RADIUS 								    NUMBER(12,4)
 LOCATION								    MDSYS.SDO_GEOMETRY

18:21:17 ora11203fs.jks.com - jkstill@js01 SQL> insert into spdata values(null,null,null);
insert into spdata values(null,null,null)
                          *
ERROR at line 1:
ORA-01400: cannot insert NULL into ("REDEFDEST"."SPDATA"."ID")

18:21:35 ora11203fs.jks.com - jkstill@js01 SQL> @check_cons

TABLE NAME		       CONSTRAINT_NAME		      C SEARCH_CONDITION	       STATUS
------------------------------ ------------------------------ - ------------------------------ --------
SPDATA			       SYS_C0018231		      C "ID" IS NOT NULL	       ENABLED
SPDATA			       SYS_C0018232		      C "RADIUS" IS NOT NULL	       ENABLED
SPDATA			       SYS_C0018233		      F "LOCATION" IS NOT NULL	       ENABLED

3 rows selected.

The output of the DESC command and the contents of DBA_CONSTRAINTS do not agree.

While it would be interesting to dig into the lower level reasons for why this is happening, it is for now just an inconvenience while there are more pressing matters to work on.

References

Following are the Oracle Notes considered when troubleshooting this issue.

Registering constraints

The DBMS_REDEFINITION.REGISTER_DEPENDENT_OBJECT procedure could be used to better deal with constraints, at least according to this Oracle Note:
HOW TO USE DBMS_REDEFINITION.REGISTER_DEPENDENT_OBJECT (Doc ID 1304838.1)
I have not yet tried this.

In regard to desc table not showing constraints

This may be the issue: Bug 16023293 : DBMS_REDEFINITION.COPY_TABLE_DEPENDENTS DOES NOT VALIDATE ALL CONSTRAINTS
This bug affects Oracle 11.2.0.3 on Solaris, so it seems likely to be an issue for the client, and may well affect other platforms as well.

Direct privs granted to table owner

How to Re-Organize a Table Online (Doc ID 177407.1)
ORA-13233 and ORA-01031 When Creating a Spatial Index in Another Schema (Doc ID 251593.1)

Test Code: spatial_redef_dist.zip

Categories: DBA Blogs

Instant REST API For Any Database

Mon, 2014-06-16 07:48

Not so long ago, I was playing with ElasticSearch, which has the interesting characteristic of having a REST API as its primary interface. Sure, it’s a little more stilted and awkward than any native interface but, on the other hand, it’s a nice universal type of API. Any language that can make a http request can talk to it and, hey, worse comes to worst, even ‘curl’ will do. It would be kinda cool if other databases had such a web service.

And then I began to think…

Don’t we have DBIx::Class::Schema::Loader, which can connect to a database and auto-generate its DBIx::Class schema?


package MyDB;

use parent 'DBIx::Class::Schema::Loader'; 

...;

# later on

my $schema = MyDB->connect( 'dbi:SQLite:foo.db' ); # boom, we have our schema

And once we have a DBIx::Class representation of a schema, can’t we introspect it and pretty much get everything there is to know about it?


use Data::Printer;

# get all the table names
my @tables = $schema->sources;

# and all the columns of all the tables
for my $table ( $schema->sources ) {
    say "Table $table";
    p $schema->source($table)->columns_info;
}

That is, that’s if we want to do it manually, considering that there’s already SQL::Translator that can do most of the job for us.


use SQL::Translator;

print SQL::Translator->new (
    parser      => 'SQL::Translator::Parser::DBIx::Class',
    parser_args => {
        dbic_schema => $schema,
    },
    producer    => 'JSON',
)->translate;

Of course, since we are talking web service, we will want to pass everything back and forth using JSON, including database entries. Well, that’s hardly a problem if we use DBIx::Class::Helper::Row::ToJSON.

So it seems we have the database side covered. For the web framework? You’ll probably not be surprised to see me go with Dancer. Not only can we leverage the serializers and plugins like Dancer::Plugin::DBIC, but setting up routes is ridiculously easy.


get '/_tables' => sub {
    return [ schema->sources ];
};

Even niftier: remember that Dancer routes are defined at runtime, so we can introspect that schema as much as we want and come up with any route we can dream of.


my @primary_key = schema->source($table)->primary_columns;
my $row_url = join '/', undef, $table, ( '*' ) x @primary_key;
 # GET /<table>/<primary key values>
get $row_url => sub {
    my @ids = splat;
    return $schema->resultset($table)->find({
        zip @primary_key, @ids
    });
};
 # GET /<table>
get "/$table" => sub {
    my @things = $schema->resultset($table)->search({ params() })->all;
    return \@things;
};
 # create new entry
post "/$table" => sub {
    $schema->resultset($table)->create({ params() });
};

Added bonus: the way Dancer’s params() conglomerate parameters defined in the query string and in the serialized body of the request plays in our favor: simple queries can be passed directly via the url, and more complicated ones can be defined as JSON structures.

So, you put all of this together, and you obtain waack. All it needs is a dsn pointing to the right database (and credentials, if needed). To illustrate, let’s try with my Digikam SQLite database.


$ waack dbi:SQLite:digikam4.db
>> Dancer 1.3124 server 28914 listening on http://0.0.0.0:3000
>> Dancer::Plugin::DBIC (0.2100)
== Entering the development dance floor ...

And now, let’s fire up App::Presto as our REST client.


$ presto http://enkidu:3000

http://enkidu:3000> type application/json

First, we can retrieve all the table names.


http://enkidu:3000> GET /_tables
[
   "TagsTree",
   "ImageMetadata",
   "Tag",
   "Setting",
   "ImageRelation",
   "ImageTag",
   "ImageProperty",
   "ImageInformation",
   "ImageHaarMatrix",
   "ImageCopyright",
   "VideoMetadata",
   "ImageHistory",
   "DownloadHistory",
   "Search",
   "ImageTagProperty",
   "Image",
   "Album",
   "ImagePosition",
   "TagProperty",
   "AlbumRoot",
   "ImageComment"
]

We can also get the whole schema.


http://enkidu:3000> GET /_schema
{
   "translator" : {
      "producer_args" : {},
      "show_warnings" : 0,
      "add_drop_table" : 0,
      "parser_args" : {
         "dbic_schema" : null
      },
      "filename" : null,
      "no_comments" : 0,
      "version" : "0.11018",
      "parser_type" : "SQL::Translator::Parser::DBIx::Class",
      "trace" : 0,
      "producer_type" : "SQL::Translator::Producer::JSON"
   },
   "schema" : {
      "tables" : {
         "ImageRelations" : {
            "options" : [],
            "indices" : [],
            "order" : "12",
            "name" : "ImageRelations",
            "constraints" : [
               {
                  "type" : "UNIQUE",
                  "deferrable" : 1,
                  "name" : "subject_object_type_unique",
                  "on_delete" : "",
                  "reference_fields" : [],
                  "fields" : [
                     "subject",
                     "object",
                     "type"
                  ],
                  "match_type" : "",
                  "reference_table" : "",
                  "options" : [],
                  "expression" : "",
                  "on_update" : ""
               }
            ],
...

Too much? We can get the columns of a single table.


http://enkidu:3000> GET /Tag/_schema
{
   "iconkde" : {
      "is_nullable" : 1,
      "data_type" : "text",
      "is_serializable" : 1
   },
   "name" : {
      "is_serializable" : 1,
      "data_type" : "text",
      "is_nullable" : 0
   },
   "id" : {
      "is_nullable" : 0,
      "data_type" : "integer",
      "is_auto_increment" : 1,
      "is_serializable" : 1
   },
   "icon" : {
      "is_nullable" : 1,
      "data_type" : "integer",
      "is_serializable" : 1
   },
   "pid" : {
      "is_serializable" : 1,
      "is_nullable" : 1,
      "data_type" : "integer"
   }
}

Query that table, with a simple condition…


http://enkidu:3000> GET /Tag id=1
[
   {
      "name" : "orchid",
      "icon" : null,
      "id" : 1,
      "pid" : 0,
      "iconkde" : null
   }
]

… or with something a little more oomphie.


$ curl -XGET -H Content-Type:application/json --data '{"name":{"LIKE":"%bulbo%"}}' http://enkidu:3000/Tag
[
   {
      "pid" : 1,
      "name" : "Bulbophyllum 'Melting Point'",
      "icon" : null,
      "id" : 32,
      "iconkde" : "/home/yanick/Pictures/My Plants/IMG_0461.JPG"
   },
   {
      "id" : 56,
      "iconkde" : "tag",
      "icon" : null,
      "pid" : 39,
      "name" : "Bulbophyllum ebergardetii"
   },
   {
      "name" : "bulbophyllum",
      "pid" : 564,
      "iconkde" : null,
      "id" : 565,
      "icon" : 0
   }
]

Btw: I cheated for that last one. Presto doesn’t send a body with GET requests, and Dancer doesn’t deserialize GET bodies either. Patches will be written tonight.

Anyway, back to the show. We can also select specific rows by primary key.


http://enkidu:3000> GET /Tag/1
{
   "id" : 1,
   "iconkde" : null,
   "pid" : 0,
   "icon" : null,
   "name" : "orchid"
}

Create new rows.


http://enkidu:3000> POST /Tag '{"name":"nepenthes","pid":0}'
{
   "pid" : 0,
   "name" : "nepenthes",
   "iconkde" : null,
   "icon" : null,
   "id" : 569
}

And do updates.


http://enkidu:3000> PUT /Tag/569 '{"icon":"img.png"}'
{
   "icon" : "img.png",
   "iconkde" : null,
   "pid" : 0,
   "name" : "nepenthes",
   "id" : 569
}

Not too shabby, is it? Especially considering that, if you look at the source of waack, you’ll see it barely clocks in at over 100 lines of code. Take a minute and let that sink in.

One hundred lines of code. For a universal database REST web service.

If that’s not standing on the shoulders of giants, then I don’t know what is.

Categories: DBA Blogs

Internet Scale Design: Part Two

Fri, 2014-06-13 08:01

In my previous blog post, I emphasized that internet scale design can be implemented for any type of company, whether it’s a small, bootstrapped startup or a rapidly growing, well-funded tier 2. But if it suits that many companies, why isn’t everyone moving into the cloud? In my opinion, there are two reasons.

First, the model of utility computing doesn’t work for every business model. It is most effective where demand fluctuates, where larger-scale systems see peaks and valleys. It also works well as a way to get a startup or project off the ground with little-to-no capital investment. In the story I began in my previous blog post, the seasonality of the client’s environment made them a perfect candidate.

The second is more of a people problem. In many companies, IT leadership, SysAdmins, Developers, DBAs, and everyone else involved in service management have been working with whatever technology stack the company has used for years. It’s important to remember that most SysAdmins see their primary job as keeping things up and running, so we typically prefer working with things we know versus things we don’t.

If a C-level executive or VP returns from a conference about cloud and issues a mandate that they need to “move everything to the cloud!” to remain “competitive,” the SysAdmins will likely fail. Why? Not because they’re not smart enough, but because they simply don’t know enough about it.

While it would be ideal for the COO to say, “I want us to look into moving our platform into AWS, so I’m going to send you to get Amazon certified,” it rarely happens. Usually it sounds more like, “You’re smart, you’ll figure it out. Oh sure, you already have a full-time job keeping the lights on, but just squeeze it in when you can. We’ll need to see a POC by the end of the quarter.”

I don’t need to tell you how this ends; it’s obvious. It will fail almost every time.

One of the amazing benefits of the Pythian model is that our teams are exposed to a wide variety of systems. We have built auto-scaling systems in AWS, OpenStack systems, and VMware systems, as well as the legacy physical systems we support. Our teams are not relegated to whichever tech stack they happened to be stuck with for the last five years.

The bottom line here is that it doesn’t matter what kind of company you’re at. Whether it’s a small retailer, a midsize tier 3, or a larger tier 2, if you’re willing to sign on for the concept of site reliability engineering and commit to it, together we can accomplish some amazing things, all for a price you can afford.

Categories: DBA Blogs