Kevin Closson

Subscribe to Kevin Closson feed Kevin Closson
Platform, Database and Storage Topics
Updated: 17 hours 59 min ago

Expecting Sum-Of-Parts Performance From Shared Solid State Storage? I Didn’t Think So. Neither Should Exadata Customers. Here’s Why.

Tue, 2016-06-28 19:55



Last month I had the privilege of delivering the key note session to the quarterly gathering of Northern California Oracle User Group. My session was a set of vignettes in a theme regarding modern storage advancements. I was mistaken on how much time I had for the session so I skipped over a section about how we sometimes still expect systems performance to add up to a sum of its parts. This blog post aims to dive in to this topic.

To the best of my knowledge there is no marketing literature about XtremIO Storage Array that suggests the array performance is due to the number of solid state disk (SSD) drives found in the device. Generally speaking, enterprise all-flash storage arrays are built to offer features and performance–otherwise they’d be more aptly named Just a Bunch of Flash (JBOF).  The scope of this blog post is strictly targeting enterprise storage.

Wild, And Crazy, Claims

Lately I’ve seen a particular slide–bearing Oracle’s logo and copyright notice–popping up to suggest that Exadata is vastly superior to EMC and Pure Storage arrays because of Exadata’s supposed unique ability to leverage all flash bandwidth in the Exadata X6 family. You might be able to guess by now that I aim to expose how invalid this claim is. To start things off I’ll show a screenshot of the slide as I’ve seen it. Throughout the post there will be references to materials I’m citing.

DISCLAIMER: The slide I am about to show was not a fair use sample of content from and it therefore may not, in fact, represent the official position of Oracle on the matter. That said, these slides do bear logo and copyright! So, then, the slide:


I’ll start by listing a few objections. My objections are always based on science and fact so objecting to content–in particular–is certainly appropriate.

  1. The slide suggests an EMC XtremIO 4 X-Brick array is limited to 60 megabytes per second per “flash drive.”
    1. An XtremIO 4 X-Brick array has 100 Solid State Disks (SSD)–25 per X-Brick. I don’t know where the author got the data but it is grossly mistaken. No, a 4 X-Brick array is not limited to 60 * 100 megabytes per second (6,000MB/s). An XtremIO 4 X-Brick array is a 12GB/s array: click here. In fact, even way back in 2014 I used Oracle Database 11g Real Application Clusters to scan at 10.5GB/s with Parallel Query (click here). Remember, Parallel Query spends a non-trivial amount of IPC and work-brokering setup time at the beginning of a scan involving multiple Real Application cluster nodes. That query startup time impacts total scan elapsed time thus 10.5 GB/s reflects the average scan rate that includes this “dead air” query startup time. Everyone who uses Parallel Query Option is familiar with this overhead.
  2. The slide suggests that 60 MB/s is “spinning disk level throughput.”
    1. Any 15K RPM SAS (12Gb) or FC hard disk drive easily delivers sequential scans throughput of more than 200 MB/s.
  3. The slide suggests XtremIO cannot scale out.
    1. XtremIO architecture is 100% scale out. One can start with a single X-Brick and add up to 7 more. In the current generation scaling out in this fashion with XtremIO adds 25 more SSDs, storage controllers (CPU) and 4 more Fibre Channel ports per X-Brick.
  4. The slide suggests “bottlenecks at server inputs” further retard throughput when using Fibre Channel.
    1. This is just silly. There are 4 x 8GFC host-side FC ports per XtremIO X-Brick. I routinely test Haswell-EP 2-socket hosts with 6 active 8GFC ports (3 cards) per host. Can a measly 2-socket hosts really drive 12 GB/s Oracle scan bandwidth? Yes! No question. In fact, challenge me on that and I’ll show AWR proof of a single 2-socket host sustaining Oracle table scan bandwidth at 18 GB/s. No, actually, I won’t make anyone go to that much trouble. Click the following link for AWR proof that a single host with 2 6-core Haswell-EP (2s12c24t) processors can sustain Oracle Database 12c scan bandwidth of 18 GB/s: click here. I don’t say it frequently enough, but it’s true; you most likely do not know how powerful modern servers are!
  5. The slide says Exadata achieve “full flash throughput.”
    1. I’m laughing, but that claim is, in fact, the perfect segue to the next section.
Full Flash Throughput Scan Bandwidth

The slide accurately states that the NVMe flash card in the Exadata X6 model are rated at 5.5GB/s. This can be seen in the F320 datasheet. Click the following link for a screenshot of the datasheet: click here. So the question becomes, can Exadata really achieve full utilization of all of the NVMe flash cards configured in the Exadata X6? The answer no, but sort of. Please allow me to explain.

The following graph depicts the reality of how close a full-rack Exadata X6 comes to realizing full flash potential. As we know a full-rack Exadata has 14 storage servers. The High Capacity (HC) model has 4 NVMe cards per storage server purposed as a flash cache. The HC model also comes with 12 7,200 RPM hard drives per storage server as per the datasheet.  The following graph shows that yes, indeed Exadata X6 does realize full flash potential when performing a fully-offloaded scan (Smart Scan). After all, 4 * 14 * 5.5 is 308 and the datasheet cites 301 GB/s scan performance for the HC model. This is fine and dandy but it means you have to put up with 168 (12 * 14) howling 7,200 RPM hard disks if you are really intent on harnessing the magic power of full-flash potential! Why the sarcasm? It’s simple really–just take a look at the graph and notice that the all-flash EF model realizes just a slight bit more than 50% of the full flash performance potential. Indeed, the EF model has 14 * 8 * 5.5 == 616 GB/s of full potential available–but not realizable.

No, Exadata X6 does not–as the above slide suggests–harness the full potential of flash. Well, not unless you’re willing to put up with 168 round, brown, spinning thingies in the configuration. Ironically, it’s the HDD-Flash hybrid HC model that enjoys the “full flash potential.” I doubt the presenter points this bit out when slinging the above slide.



The above slide doesn’t actually suggest that Exadata X6 achieves full flash potential for IOPS, but since these people made me crack open the datasheets and use my brain for a moment or two I took it upon myself to do the calculations. The following graph shows the delta bewteen full flash IOPS potential for the full-rack HC and EF Exadata X6 models.

No…it doesn’t realize full flash potential in terms of IOPS either.



Here is a link to the full slide deck containing the slide I focused on in this post:

Just in case that copy of the deck disappears, I pushed a copy up the the WayBack Machine: click here.


XtremIO Storage Array literature does not suggest that the performance characteristics of the array are a simple product of how many component SSDs the array is configured with. To the best of my knowledge neither does Pure Storage suggest such a thing.

Oracle shouldn’t either. I made that point clear.

Filed under: oracle

You Scratch Your Head And Ponder Why It Is You Go With Maximum Core Count Xeons. I Can’t Explain That, But This Might Help.

Tue, 2016-06-14 00:36

Folks that have read my blog for very long know that I routinely point out that Intel Xeon processors with fewer cores (albeit same TDP) get more throughput per core. Recently I had the opportunity to do some testing of a 2-socket host with 6-core Haswell EP Xeons (E5-2643v3) connected to networked all-flash storage. This post is about host capability so I won’t be elaborating on the storage. I’ll say that it was block storage, all-flash and networked.

Even though I test myriads of systems with modern Xeons it isn’t often I get to test the top-bin parts that aren’t core-packed.  The Haswell EP line offers up to 18-core parts in a 145w CPU.  This 6-core part is 135w and all cores clock up to 3.7GHz–not that clock speed is absolutely critical for Oracle Database performance mind you.

Taking It For a Spin

When testing for Oracle OLTP performance the first thing to do is measure the platform’s ability to deliver random single-block reads (db file sequential read). To do so I loaded 1TB scale SLOB 2.3 in the single-schema model. I did a series of tests to find a sweet-spot for IOPS which happened to be at 160 sessions. The following is a snippet of the AWR report from a 5-minute SLOB run with UPDATE_PCT=0. Since this host has a total of 12 cores I should think 8KB read IOPS of 625,000 per second will impress you. And, yes, these are all db file sequential reads.


At 52,093 IOPS per CPU core I have to say this is the fastest CPU I’ve ever tested. It takes a phenomenal CPU to handle this rate of db file sequential read payload. So I began to wonder how this would compare to other generations of Xeons. I immediately thought of the Exadata Database Machine data sheets.

Before I share some comparisons I’d like to point out that there was a day when the Exadata data sheets made it clear that IOPS through the Oracle Database buffer cache costs CPU cycles–and, in fact, CPU is often the limiting factor. The following is a snippet from the Exadata Database Machine X2 data sheet that specifically points out that IOPS are generally limited by CPU. I know this. It is, in fact, why I invented SLOB way back in the early 2000s. I’ve never seen an I/O testing kit that can achieve more IOPS per DB CPU than is possible with SLOB.


Oracle stopped using this foot note in the IOPS citations for Exadata Database Machine starting with the X3 generation. I have no idea why they stopped using this correct footnote. Perhaps they thought it was a bit like stating the obvious. I don’t know. Nonetheless, it is true that host CPU is a key limiting factor in a platform’s ability to cycle IOPS through the SGA. As an aside, please refer to this post about calibrate_io for more information about the processor ramifications of SGA versus PGA IOPS.

So, in spite of the fact that Oracle has stopped stating the limiting nature of host CPU on IOPS, I will simply assert the fact in this blog post. Quote me on this:

Everything is a CPU problem

And cycling IOPS through the Oracle SGA is a poster child for my quotable quote.

I think the best way to make my point is to simply take the data from the Exadata Database Machine data sheets and put it in a table that has a row for my E5-2643v3 results as well. Pictures speak thousands of words. And here you go:


AWR Report

If you’d like to read the full AWR report from the E5-2643v3 SLOB test that achieved 625,000 IOPS please click on the following link: AWR (click here).


X2 data sheet
X3 data sheet
X4 data sheet
X5 data sheet
X6 data sheet


Filed under: oracle

Yes, You Must Use CALIBRATE_IO. No, You Mustn’t Use It To Test Storage Performance.

Fri, 2016-06-03 00:08

I occasionally get questions from customers and colleagues about performance expectations for the Oracle Database procedure called calibrate_io on XtremIO storage. This procedure must be executed in order to update the data dictionary. I assert, however, that it shouldn’t be used to measure platform suitability for Oracle Database physical I/O. The main reason I say this is because calibrate_io is a black box, as it were.

The procedure is, indeed, documented so it can’t possibly be a “black box”, right? Well, consider the fact that the following eight words are the technical detail provided in the Oracle documentation regarding what calibrate_io does:

This procedure calibrates the I/O capabilities of storage.

OK, I admit it. I’m being too harsh. There is also this section of the Oracle documentation that says a few more words about what this procedure does but not enough to make it useful as a platform suitability testing tool.

A Necessary Evil?

Yes, you must run calibrate_io. The measurements gleaned by calibrate_io are used by the query processing runtime (specifically involving Auto DOP functionality). The way I think of it is similar to how I think of gathering statistics for CBO. Gathering statistics generates I/O but I don’t care about the I/O it generates. I only care that CBO might have half a chance of generating a reasonable query plan given a complex SQL statement, schema and the nature of the data contained in the tables. So yes, calibrate_io generates I/O—and this, like I/O generated when gathering statistics, is I/O I do not care about. But why?

Here are some facts about the I/O generated by calibrate_io:

  • The I/O is 100% read
  • The reads are asynchronous
  • The reads are buffered in the process heap (not shared buffers in the SGA)
  • The code doesn’t even peek into the contents of the blocks being read!
  • There is limited control over what tablespaces are accessed for the I/O
  • The results are not predictable
  • The results are not repeatable
My Criticisms

Having provided the above list of calibrate_io characteristics, I feel compelled to elaborate.

About Asynchronous I/O

My main issue with calibrate_io is it performs single-block random reads with asynchronous I/O calls buffered in the process heap. This type of I/O has nothing in common with the main reason random single-block I/O is performed by Oracle Database. The vast majority of single-block random I/O is known as db file sequential read—which is buffered in the SGA and is synchronous I/O. The wait event is called db file sequential read because each synchronous call to the operating system is made sequentially, one after the other by foreground processes. But there is more to SGA-buffered reads than just I/O.

About Server Metadata and Mutual Exclusion

Wrapped up in SGA-buffered I/O is all the necessary overhead of shared-cache management. Oracle can’t just plop a block of data from disk in the SGA and expect that other processes will be able to locate it later. When a process is reading a block into the SGA buffer cache it has to navigate spinlocks for the protected cache contents metadata known as cache buffers chains. Cache buffers chains tracks what blocks are in the buffer cache by their on-disk address.  Buffer caches, like that in the SGA, also need to track the age of buffers. Oracle processes can’t just use any shared buffer. Oracle maintains buffer age in metadata known as cache buffers lru—which is also spinlock-protected metadata.

All of this talk about server metadata means that as the rate of SGA buffer cache block replacement increases—with newly-read blocks from storage—there is also increased pressure on these spinlocks. In other words, faster storage means more pressure on CPU. Scaling spinlocks is a huge CPU problem. It always has been—and even more so on NUMA systems. Testing I/O performance without also involving these critical CPU-intensive code paths provides false comfort when trying to determine platform suitability for Oracle Database.

Since applications to not drive random single-block asynchronous reads in Oracle Database, why measure it? I say don’t! Yes, execute calibrate_io, for reasons related to Auto DOP functionality, but not for a relevant reading of storage subsystem performance.

About User Data

This is one that surprises me quite frequently. It astounds me how quick some folks are to dismiss the importance of test tools that access user data. Say what?  Yes, I routinely point out that neither calibrate_io nor Orion access the data that is being read from storage. All Orion and calibrate_io do is perform the I/O and let the data in the buffer remain untouched.  It always seems strange to me when folks dismiss the relevance of this fact. Is it not database technology we are talking about here? Databases store your data. When you test platform suitability for Oracle Database I hold fast that it is best to 1) use Oracle Database (thus an actual SQL-driven toolkit as opposed to an external kit like Orion or fio or vdbench or any other such tool) and 2) that the test kit access rows of data in the blocks! I’m just that way.

Of course SLOB (and other SQL-driven test kits such as Swingbench do indeed access rows of data). Swingbench handily tests Oracle Database transaction capabilities and SLOB uses SQL to perform maximum I/O per host CPU cycle. Different test kits for different testing.

A Look At Some Testing Results

The first thing about calibrate_io I’ll discuss in this section is how the user is given no control or insight into what data segments are being accessed. Consider the following screenshot which shows:

  1. Use of the calibrate.sql script found under the misc directory in the SLOB kit (SLOB/misc/calibrate.sql) to achieve 371,010 peak IOPS and zero latency. This particular test was executed with a Linux host attached to an XtremIO array. Um, no, the actual latencies are not zero.
  2. I then created a 1TB tablespace. What is not seen in the screenshot is that all the tablespaces in this database are stored in an ASM disk group consisting of 4 XtremIO volumes. So the tablespace called FOO resides in the same ASM disk group. The ASM disk group uses external redundancy.
  3. After adding a 1TB tablespace to the database I once again executed calibrate_io and found that the IOPS increased 13% and latencies remained at zero. Um, no, the actual latencies are not zero!
  4. I then offlined the tablespace called FOO and executed calibrate_io to find that that IOPS fell back to within 1% of the first sample.
  5. Finally, I onlined the tablespace called FOO and the IOPS came back to within 1% of the original sample that included the FOO tablespace.
A Black Box

My objections to this result is calibrate_io is a black box. I’m left with no way to understand why adding a 1TB tablespace improved IOPS. After all, the tablespace was created in the same ASM disk group consisting of block devices provisioned from an all-flash array (XtremIO). There is simply no storage-related reason for the test result to improve as it did.



More IOPS, More Questions. I Prefer Answers.

I decided to spend some time taking a closer look at calibrate_io but since I wanted more performance capability I moved my testing to an XtremIO array with 4 X-Bricks and used a 2-Socket Xeon E5-2699v3 (HSW-EP 2s36c72t) server to drive the I/O.

The following screenshot shows the result of calibrate_io. This test configuration yielded 572,145 IOPS and, again, zero latency. Um, no, the real latency is not zero. The latencies are sub-millisecond though. The screen shot also shows the commands in the SLOB/misc/calibrate.sql file. The first two arguments to DBMS_RESOURCE_MANAGER.CALIBRATE_IO are “in” parameters. The value seen for parameter 2 is not the default. The next section of this blog post shows a variety of testing with varying values assigned to these parameters.



As per the documentation, the first parameter to calibrate_io is “approximate number of physical disks” being tested and the second parameter is “the maximum tolerable latency in milliseconds” for the single-block I/O.


As the table above shows I varied the “approximate number of physical disks” from 1 to 10,000 and the “maximum tolerable latency” from 10 to 20 and then 100. For each test I measured the elapsed time.

The results show us that the test requires twice the elapsed time with 1 approximate physical disk as it does for with 10,000 approximate physical disks. This is a nonsensical result but without any documentation on what calibrate_io actually does we are simply left scratching our heads. Another oddity is that with 10,000 approximate disks the throughput in megabytes per second is reduced by nearly 40% and that is without regard for the “tolerable latency” value. This is clearly a self-imposed limited within calibrate_io but why is the big question.

I’ll leave you, the reader, to draw your own conclusions about the data in the table. However, I use the set of results with “tolerable latency” set to 20 as validation for one of my indictments above. I stated calibrate_io is not predictable. Simply look at the set of results in the 20 “latency” parameter case and you too will conclude calibrate_io is not predictable.

So How Does CALIBRATE_IO Compare To SLOB?

I get this question quite frequently. Jokingly I say it compares in much the same way a chicken compares to a snake. They both lay eggs. Well, I should say they both perform I/O.

I wrote a few words above about how calibrate_io uses asynchronous I/O calls to test single-block random reads. I also have pointed out that SLOB performs the more correct synchronous single block reads. There is, however, an advanced testing technique many SLOB users employ to test PGA reads with SLOB as opposed to the typical SLOB reads into the SGA. What’s the difference? Well, revisit the section above where I discuss the server metadata management overhead related to reading blocks into the SGA. If you tweak SLOB to perform full scans you will test the flow of data through the PGA and thus the effect of eliminating all the shared-cache overhead. The difference is dramatic because, after all, “everything is a CPU problem.”

In a subsequent blog post I’ll give more details on how to configure SLOB for direct path with single-block reads!

To close out this blog entry I will show a table of test results comparing some key time model data. I collected AWR reports when calibrate_io was running as well as SLOB with direct path reads and then again with the default SLOB with SGA reads. Notice how the direct path SLOB increased IOPS by 19% just because blocks flowed through the PGA as opposed to the SGA. Remember, both of the SLOB results are 100% single-block reads. The only difference is the cache management overhead is removed. This is clearly seen by the difference in DB CPU. When performing the lightweight PGA reads the host was able to drive 29,884 IOPS per DB CPU but the proper SLOB results (SGA buffered) shows the host could only drive 19,306 IOPS per DB CPU. Remember DB CPU represents processor threads utilization on a threaded-processor. These results are from a 2s36c72t (HSW-EP) so these figures could also be stated as per DB CPU or per CPU thread.

If you are testing platforms suitability for Oracle it’s best to not use a test kit that is artificially lightweight. Your OLTP/ERP application uses the SGA so test that!

The table also shows that calibrate_io achieved the highest IOPS but I don’t care one bit about that!


AWR Reports

I’d like to offer the following links to the full AWR reports summarized in the above table:

Additional Reading Summary

Use calibrate_io. Just don’t use it to test platform suitability for Oracle Database.

Filed under: oracle

Is SLOB AWR Generation Really, Really, Really Slow on Oracle Database Yes, Unless…

Tue, 2016-03-08 16:36

If you are testing SLOB against and find that the AWR report generation phase of is taking an inordinate amount of time (e.g., more than 10 seconds) then please be aware that, in the SLOB/awr subdirectory, there is a remedy script rightly called 11204-awr-stall-fix.sql.

Simply execute this script when connected to the instance with sysdba privilege and the problem will be solved.


Filed under: oracle Tagged:, Automatic Workload Repository, AWR, SLOB, SLOB Testing

Reblogged: Providing A Persistent Data Volume to EMC XtremIO Using ClusterHQ Flocker, Docker And Marathon

Thu, 2016-02-25 15:53

I don’t reblog very often–if ever. However, this blog post from EMC’s Itzik Reich is a jewel. If you are like everyone else in IT and are starting to take an interest in Docker I recommend viewing Itzik’s post!

Itzikr's Blog


Containers are huge, that’s not a secret to anyone in the IT industry, customers are testing the waters and looking for many ways to utilize containers technologies, it is also not a secret that the leading vendor in this technology is docker.

But docker itself isn’t perfect yet, while it’s as trendy as trendy can get, there are many ways around the docker runtime to provide cluster management etc’..

It all started with a customer request some weeks ago, their request was “can you show us how do you integrate with docker, Marathon (to provide containers orchestration) and ClusterHQ Flocker to provide a persistent data volume..sounds easy right? J

Wait a second, isn’t containers technologies supposed to be completely lossless, designed to fail and do not need any persistent data that will survive a failure in case a container dies??

Well, that’s exactly where customers are asking things that…

View original post 323 more words

Filed under: oracle

Performance Data Visualization for SLOB. The SLOB Expert Community is Vibrant!

Mon, 2016-02-22 13:58

Thanks to Nikolay Savvinov (@oradiag) for his excellent post on how to wrap his scripts around the SLOB test driver ( to capture and produce performance data visualization graphs.  I recommend a visit to his post here:

Performance Data Visualization with SLOB


As always, the link for SLOB is: Obtain the SLOB Kit and Helpful Information Here

Filed under: oracle

Little Things Doth Crabby Make – Part XVIV: Enterprise Manager 12c Cloud Control Install Problem.

Thu, 2015-11-12 14:42

This is a short post to help out any possible “googlers” looking for an answer to why their EM Cloud Control install is failing in the make phase with

Note, this EM install was taking place on an Oracle Linux 7.1 host.

The following snippet shows the text that was displayed in the dialogue box when the error was hit:

INFO: 11/12/15 12:10:37 PM PST: ----------------------------------
INFO: 11/12/15 12:10:37 PM PST: Exception thrown from action: make
Exception Name: MakefileException
Exception String: Error in invoking target 'install' of makefile '/home/oracle/app/oracle/oms12cr5/Oracle_WT/webcache/lib/'. See '/home/oracle/oraInventory/logs/cloneActions2015-11-12_12-10-18-PM.log' for details.
Exception Severity: 1
INFO: 11/12/15 12:10:37 PM PST: POPUP WARNING:Error in invoking target 'install' of makefile '/home/oracle/app/oracle/oms12cr5/Oracle_WT/webcache/lib/'. See '/home/oracle/oraInventory/logs/cloneActions2015-11-12_12-10-18-PM.log' for details.

Click "Retry" to try again.
Click "Ignore" to ignore this error and go on.
Click "Cancel" to stop this installation.
INFO: 11/12/15 12:20:14 PM PST: The output of this make operation is also available at: '/home/oracle/app/oracle/oms12cr5/Oracle_WT/install/make.log'

The following shows the simple fix:

$ diff ./app/oracle/oms12cr5/Oracle_WT/lib/sysliblist.orig ./app/oracle/oms12cr5/Oracle_WT/lib/sysliblist
< -ldl -lm -lpthread -lnsl -lirc -lipgo --- > -ldl -lm -lpthread -lnsl -lirc -lipgo -ldms2

So if this error hath made at least one googler less crabby I’ll consider this installment in the Little Things Doth Crabby Make series all worth it.

Filed under: oracle

Oracle OpenWorld 2015. Additional Attractions to Consider: OakTable World and EMC Rocks Oracle OpenWorld.

Thu, 2015-10-22 02:17

This is a quick blog entry to to share some information with readers who are attending Oracle OpenWorld 2015.

EMC Rocks Oracle OpenWorld

EMC has a concurrent event at the Elan Event Center (directly across the street from Moscone West) during OpenWorld. This event is a great opportunity to come see the most unique and powerful solutions and products EMC has to offer to folks using Oracle Database. You can register for the event at the following link or just show up with your OpenWorld badge. Link to register for EMC Rocks Oracle OpenWorld. Please visit the link to get more information about the event. I hope to see you there.


Oaktable World 2015

Folks who are aware of the Oaktable Network organization will be pleased to hear that Oaktable World is once again being held concurrently with Oracle OpenWorld. Please visit the following link to get more information about Oaktable Word. You’ll find that no registration is necessary and the speaker list is quite attractive. Link to information about Oaktable World 2015

Shameless Plug

Well, it is my blog after all!  I’ll be delivering another installment of my Modern Platform Topics for Modern DBAs track. This session will show how to use SLOB to study how CPU-intensive your Oracle OLTP-related *waits* are. I’ll also be showing CPU costs associated with key DW/BI/Analytics processing “underpinnings” like scan, filtration and projection on modern 2-socket servers running Linux and Oracle Database 12c. Please join us on Monday October 26 at 3PM as per the Oaktable World schedule (see below or follow the links).


Here is a screenshot of the EMC Rocks Oracle OpenWorld:

Screen Shot 2015-10-21 at 5.02.28 PM

The following is the schedule for technical sessions at EMC Rocks Oracle OpenWorld. I’ve highlighted the XtremIO related sessions since that is the business unit of EMC I work in.

Screen Shot 2015-10-21 at 5.24.24 PM


The following is the Oaktable World schedule:

Screen Shot 2015-10-21 at 4.59.00 PM

Screen Shot 2015-10-21 at 5.28.02 PM


Filed under: oracle

Copy Data Management for Oracle Database with EMC AppSync and XtremIO

Wed, 2015-10-14 12:08

This is a quick blog entry to invite readers to view this little demonstration video I created. The topic is Copy Data Management in an Oracle Database environment. We all know the pains involved with the number of database copies needed in today’s Oracle environment. Well, how about technology with these characteristics:

  1. 100% space efficient. There is no need for any full-copy “donor” in this solution. You can create 8192 XtremIO Virtual Copies of volumes in an XtremIO array and there is no reduction in user-capacity at the storage level. For example, 512 copies of a 1TB volume with Oracle tablespaces in it takes exactly 1TB from the array.
  2. Self service. With EMC AppSync permissions can be set up so that developers can create their own copies, refresh their own copies and expire their own copies.
  3. Speed. AppSync copy operations such as creation and refresh are measured in seconds.
  4. Data Services. All XtremIO Virtual Copies enjoy data reduction services. So as users begin to make changes to their database copies the modified blocks are first treated with de-duplication and then compression.

You more than likely need XtremIO in any cose. However, now it’s also time to think about the ease of provisioning copies of Oracle databases to test/dev and other functions the XtremIO way.

It only takes minutes so please give this a view:

Filed under: oracle

Focusing on Ext4 and XFS TRIM Operations – Part I.

Sun, 2015-07-19 09:29

I’ve been doing some testing that requires rather large file systems. I have an EMC XtremIO Dual X-Brick array from which I provision a 10 terabyte volume. Volumes in XtremIO are always thinly provisioned. The testing I’m doing required me to scrutinize default Linux mkfs(8) behavior for both Ext4 and XFS. This is part 1 in a short series and it is about Ext4.

Discard the Discard Option

The first thing I noticed in this testing was the fantastical “throughput” demonstrated at the array while running the mkfs(8) command with the “-t ext4” option/arg pair. As the following screen shot shows the “throughput” at the array level was just shy of 72GB/s.

That’s not real I/O…I’ll explain…

EMC XtremIO Dual X-Brick Array During Ext4 mkfs(8). Default Options.

EMC XtremIO Dual X-Brick Array During Ext4 mkfs(8). Default Options.

The default options for Ext4 include the discard (TRIM under the covers) option. The mkfs(8) manpage has this to say about the discard option :

Attempt to discard blocks at mkfs time (discarding blocks initially is useful on solid state devices and sparse / thin-provisioned storage). When the device advertises that discard also zeroes data (any subsequent read after the discard and before write returns zero), then mark all not-yet-zeroed inode tables as zeroed. This significantly speeds up filesystem initialization. This is set as default.

I’ve read that quoted text at least eleventeen times but the wording still sounds like gibberish-scented gobbledygook to me–well, except for the bit about significantly speeding up filesystem initialization.

Since XtremIO volumes are created thin I don’t see any reason for mkfs to take action to make it, what, thinner?  Please let me share test results challenging the assertion that the discard mkfs option results in faster file system initialization. This is the default functionality after all.

In the following terminal output you’ll see that the default mkfs options take 152 seconds to make a file system on a freshly-created 10TB XtremIO volume:

# time mkfs -t ext4 /dev/xtremio/fs/test
mke2fs 1.43-WIP (20-Jun-2013)
Discarding device blocks: done
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=2 blocks, Stripe width=16 blocks
335544320 inodes, 2684354560 blocks
134217728 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=4294967296
81920 block groups
32768 blocks per group, 32768 fragments per group
4096 inodes per group
Superblock backups stored on blocks:
32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
102400000, 214990848, 512000000, 550731776, 644972544, 1934917632,

Allocating group tables: done
Writing inode tables: done
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done
real 2m32.055s
user 0m3.648s
sys 0m17.280s

The mkfs(8) Command Without Default Discard Functionality

Please bear in mind that default 152 second result is not due to languishing on pathetic physical I/O. The storage is fast. Please consider the following terminal output where I passed in the non-default -E option with the nodiscard argument. The file system creation took 4.8 seconds:

# time mkfs -t ext4 -E nodiscard /dev/xtremio/fs/test
mke2fs 1.43-WIP (20-Jun-2013)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=2 blocks, Stripe width=16 blocks
335544320 inodes, 2684354560 blocks
134217728 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=4294967296
81920 block groups
32768 blocks per group, 32768 fragments per group
4096 inodes per group
Superblock backups stored on blocks:
 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
 4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
 102400000, 214990848, 512000000, 550731776, 644972544, 1934917632,

Allocating group tables: done
Writing inode tables: done
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done 

real 0m4.856s
user 0m4.264s
sys 0m0.415s

I think 152 seconds down to 4.8 makes the point that with proper, thinly-provisioned storage the mkfs discard option does not “significantly speed up filesystem initialization.” But initializing file systems is not something one does frequently so investigation into the discard mount(8) option was in order.

Taking Ext4 For A Drive

Since I had this 10TB Ext4 file system–and a fresh focus on file system discard (storage TRIM) features–I thought I’d take it for a drive.

Discarded the Default Discard But Added The Non-Default Discard

While the default mkfs(8) command includes discard, the mount(8) command does not. I decided to investigate this option while unlinking a reasonable number of large files. To do so I ran a simple script (shown below) that copies 64 files of 16 gigabytes each–in parallel–into the Ext4 file system. I then timed a single invocation of the rm(1) command to remove all 64 of these files. Unlinking file in a Linux file system is a metadata operation, however, when the discard option is used to mount the file system each unlink operation includes TRIM operations being sent to storage. The following screen shot of the XtremIO performance dashboard was taken while the rm(1) command was running. The discard mount option turns a metadata operation into a rather costly storage operation.

Array Level Activity During Bulk rm(1) Command Processing. Ext4 (discard mount option)

Array Level Activity During Bulk rm(1) Command Processing. Ext4 (discard mount option)

The following terminal output shows the test step sequence used to test the discard mount option:

# umount /mnt ; mkfs -t ext4 -E nodiscard /dev/xtremio/fs/test; mount -t ext4 -o discard /dev/xtremio/fs/test /mnt
mke2fs 1.43-WIP (20-Jun-2013)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=2 blocks, Stripe width=16 blocks
335544320 inodes, 2684354560 blocks
134217728 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=4294967296
81920 block groups
32768 blocks per group, 32768 fragments per group
4096 inodes per group
Superblock backups stored on blocks:
 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
 4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
 102400000, 214990848, 512000000, 550731776, 644972544, 1934917632,

Allocating group tables: done
Writing inode tables: done
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done 

# cd mnt
# cat &gt; cpit
for i in {1..64}; do ( dd if=/data1/tape of=file$i bs=1M oflag=direct )& done
# time sh ./cpit &gt; /dev/null 2&gt;&1 

real 5m31.530s
user 0m2.906s
sys 8m45.292s
# du -sh .
1018G .
# time rm -f file*

real 4m52.608s
user 0m0.000s
sys 0m0.497s

The following terminal output shows the same test repeated with the file system being mounted with the default (thus no discard) mount options:

# cd ..
# umount /mnt ; mkfs -t ext4 -E nodiscard /dev/xtremio/fs/test; mount -t ext4 /dev/xtremio/fs/test /mnt
mke2fs 1.43-WIP (20-Jun-2013)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=2 blocks, Stripe width=16 blocks
335544320 inodes, 2684354560 blocks
134217728 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=4294967296
81920 block groups
32768 blocks per group, 32768 fragments per group
4096 inodes per group
Superblock backups stored on blocks:
 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
 4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
 102400000, 214990848, 512000000, 550731776, 644972544, 1934917632,

Allocating group tables: done
Writing inode tables: done
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done 

# cd mnt
# cat &gt; cpit
for i in {1..64}; do ( dd if=/data1/tape of=file$i bs=1M oflag=direct )& done
# time sh ./cpit &gt; /dev/null 2&gt;&1 

real 5m31.526s
user 0m2.957s
sys 8m50.317s
# time rm -f file*

real 0m16.398s
user 0m0.001s
sys 0m0.750s

This testing shows that mounting an Ext4 file system with the discard mount option dramatically impacts file removal operations. The default mount options (thus no discard option) performed the rm(1) command in 16 seconds whereas the same test took 292 seconds when mounted with the discard mount option.

So how can one perform the important house-cleaning that comes with TRIM operations?

The fstrim(8) Command

Ext4 supports user-invoked, online TRIM operations on mounted file systems. I would advise people to forego the discard mount option and opt for occasionally running the fstrim(8) command. The following is an example of  how long it takes to execute fstrim on the same 10TB file system stored in an EMC XtremIO array. I think that foregoing the taxation of commands like rm(1) is a good thing–especially since running fstrim is allowed on mounted file systems and only takes roughly 11 minutes on a 10TB file system.

# time fstrim -v /mnt
/mnt: 10908310835200 bytes were trimmed

real 11m29.325s
user 0m0.000s
sys 2m31.370s

If you use thinly-provisioned storage and want file deletion in Ext4 to return space to the array you have a choice. You can choose to take serious performance hits when you create the file system (default mkfs(8) options) and when you delete files (optional discard mount(8) option) or you can occasionally execute the fstrim(8) command on a mounted file system.

Up Next

The next post in this series will focus on XFS.

Filed under: oracle

Announcing “SLOB Recipes”

Fri, 2015-07-17 11:28

I’ve started updating the SLOB Resources page with links to “recipes” for certain SLOB testing. The first installment is the recipe for loading 8TB scale SLOB 2.3 Multiple Schema Model with a 2-Socket Linux host attached to EMC XtremIO. Recipes will include (at a minimum) the relevant SLOB program output (e.g., or, init.ora and slob.conf.

Please keep an eye on the SLOB Resources page for updates…and don’t miss the first installment. It’s quite interesting.


Filed under: oracle

This Is Not Glossy Marketing But You Still Won’t Believe Your Eyes. EMC XtremIO 4.0 Snapshot Refresh For Agile Test / Dev Storage Provisioning in Oracle Database Environments.

Tue, 2015-07-14 19:18

This is just a quick blog post to direct readers to a YouTube video I recently created to help explain to someone how flexible EMC XtremIO Snapshots are. The power of this array capability is probably most appreciated in the realm of provisioning storage for Test and Development environments.

Although this is a silent motion picture I think it will speak volumes–or at least 1,000 words.

Please note: This is just a video demonstration to show the base mechanisms and how they relate to Oracle Database with Automatic Storage Management. This is not a scale demonstration. XtremIO snapshots are supported to in the thousands and extremely powerful “sibling trees” are fully supported.

Not Your Father’s Snapshot Technology

No storage array on the market is as flexible as XtremIO in the area of writable snapshots. This video demonstration shows how snapshots allow the administrator of a “DEV” host–using Oracle ASM–to quickly refresh to current or past versions of ASM disk group contents from the “PROD” environment.

The principles involved in this demonstration are:

  1. XtremIO snapshots are crash consistent.
  2. XtremIO snapshots are immediately created, writeable and space efficient. There is no fixed “donor” relationship. Snapshots can be created from other snapshots and refreshes can go in any direction.
  3. XtremIO snapshot refresh does not involve the host operating system. Snapshot and volume contents can be immediately “swapped” (refreshed) at the array level without any action on the host.

Regarding number 3 on that list, I’ll point out that while the operating system does not play a role in the snapshot operations per se, applications will be sensitive to contents of storage immediately changing. It is only for this reason that there are any host actions at all.

Are Host Operations Involved? Crash Consistent Does Not Mean Application-Coherent

The act of refreshing XtremIO snapshots does not change the SCSI WWN information so hosts do not have any way of knowing the contents of a LUN have changed. In the Oracle Database use case the following must be considered:

  1. With a file system based database one must unmount the file systems before refreshing a snapshot otherwise the file system will be corrupted. This should not alarm anyone. A snapshot refresh is an instantaneous content replacement at the array level. Operationally speaking, file system based databases only require database instance shutdown and the unmounting of the file system in preparation for application-coherent snapshot refresh.
  2. With an ASM based database one must dismount the ASM disk group in preparation for snapshot refresh. To that end, ASM database snapshot restore does not involve system administration in any way.

The video is 5 minutes long and it will show you the following happenings along a timeline:

  1. “PROD” and “DEV” database hosts (one physical and one virtual) each showing the same Oracle database (identical DBID) and database creation time as per dictionary views. This establishes the “donor”<->clone relationship. DEV is a snapshot of PROD. It is begat of a snapshot of a PROD consistency group
  2. A single-row token table called  “test” in the PROD database has value “1.” The DEV database does not even have the token table (DEV is independent of PROD…it’s been changing..but its origins are rooted in PROD as per point #1)
  3. At approximately 41 seconds into the video I take a snapshot of the PROD consistency group with “value 1” in the token table. This step prepares for “time travel” later in the demonstration
  4. I then update the PROD token table to contain the value “42”
  5. At ~2:02 into the video I have already dismounted DEV ASM disk groups and started clobbering DEV with the current state of PROD via a snapshot refresh. This is “catching up to PROD”
    1. Please note: No action at all was needed on the PROD side. The refresh of DEV from PROD is a logical, crash-consistent point in time image
  6. At ~2:53 into the video you’ll see that the DEV database instance has already been booted and that it has value “42” (step #4). This means DEV has “caught up to PROD”
  7. At ~3:32 you’ll see that I use dd(1) to copy the redo LUN over the data LUN on the DEV host to introduce ASM-level corruption
  8. At 3:57 the DEV database is shown as corrupted. In actuality, the ASM disk group holding the DEV database is corrupted
  9. In order to demonstrate traveling back in time, and to recover from the dd(1) corrupting of the ASM disk group,  you’ll see at 4:31 I chose to refresh from the snapshot I took at step #3
  10. At 5:11 you’ll see that DEV has healed from the dd(1) destruction of the ASM disk group, the database instance is booted, and the value in the token table is reverted to 1 (step #3) thus DEV has traveled back in time

Please note: In the YouTube box you can click to view full screen or on if the video quality is a problem:

More Information

For information on the fundamentals of EMC XtremIO snapshot technology please refer to the following EMC paper: The fundamentals of XtremIO snapshot technology

For independent validation of XtremIO snapshot technology in a highly-virtualized environment with Oracle Database 12c please click on the following link: Principled Technologies, Inc Whitepaper

For a proven solution whitepaper showing massive scale data sharing with XtremIO snapshots please click on the following link: EMC Whitepaper on massive scale database consolidation via XtremIO

Filed under: oracle

Announcing SLOB 2.3. Tarry Not, Get It While It’s Hot!

Sun, 2015-07-12 12:48

BLOG UPDATE 2015.07.16: SLOB is now the current version.

This is just a quick post to announce SLOB 2.3. Please visit the SLOB Resources page to download the gzipped tar archive. The SLOB Resources page also has a link the SLOB 2.3 Documentation. SLOB Resources Page: Click Here. New in this release:

  1. The documentation is now also included in the tar archive under SLOB/doc in PDF form.
  2. SLOB 2.3 introduces the SLOB Single Schema feature. Please see the documentation.
  3. Because of SLOB Single Schema the kit now supports SLOB Threads. Note, however, SLOB Threads can be used in either Single or Multiple Schema Model.
  4. SLOB 2.3 has two types of “Hot Spots”
    1. In Multiple Schema Model there are both per-schema Hot Spots and a Hot Schema. Please see the SLOB 2.3 documentation for descriptions of these features.
  5. Improved error handling for both the SLOB Data Loader ( and Test Execution program (
  6. Licensing. Prior releases of SLOB consisted of copyrighted programs with unclear licensing. Please don’t be alarmed. SLOB is still free to use. The LICENSE file defines the word “use.”

Filed under: oracle

SLOB 2.3 User Guide

Fri, 2015-07-10 20:13

SLOB 2.3 is releasing within the next 48 hours. In case anyone wants to read about all the new features here is a link to the SLOB 2.3 User Guide:

SLOB 2.3 User Guide (pdf)


Filed under: oracle

SLOB 2.3 Is Getting Close!

Thu, 2015-05-28 15:30

SLOB 2.3 is soon to be released. This version has a lot of new, important features but also a significant amount of tuning in the data loading kit. Before sharing where the progress is on that front, I’ll quickly list some of the new important features that will be in SLOB 2.3:

  1. Single Schema Support. SLOB historically avoids application-level contention by having database sessions perform the SLOB workload against a private schema. The idea behind SLOB is to exert maximum I/O pressure on storage while utilizing the minimum amount of host CPU possible. This lowers the barrier to entry for proper testing as one doesn’t require dozens of processors festering in transactional SQL code just to perform physical I/O. That said, there are cases where a single, large active data set is desirable–if not preferred. SLOB 2.3 allows one to load massive data sets quickly and run large numbers of SLOB threads (database sessions) to drive up the load on the system.
  2. Advanced Hot Spot Testing. SLOB 2.3 supports configuring each SLOB thread such that every Nth SQL statement operates on a hot spot sized in megabytes as specified in the slob.conf file. Moreover, this version of SLOB allows one to dictate the offset for the hot spot within the active data set. This allows one to easily move the hot spot from one test execution to the next. This sort of testing is crucial for platform experts studying hybrid storage arrays that identify and promote “hot” data into flash for example.
  3. Threaded SLOB. SLOB 2.3 allows one to have either multiple SLOB schemas or the new Single Schema and to drive up the load one can specify how many SLOB threads per schema will be active.


To close out this short blog entry I’ll make note that the SLOB 2.3 data loader is now loading 1TB scale Single Schema in just short of one hour (55.9 minutes exactly). This procedure includes data loading, index creation and CBO statistics gathering. The following was achieved with a moderate IVB-EP 2s20c40t server running Oracle Linux 6.5 and Oracle Database 12c and connected to an EMC XtremIO array via 8GFC Fibre Channel. I think this shows that even the data loader of SLOB is a worthwhile workload in its own right.

SLOB 2.3 Data Loading 1TB/h

Filed under: oracle

Lab Report: Oracle Database on EMC XtremIO. A Compression Technology Case Study.

Tue, 2015-05-26 02:26

If you are interested in array-level data reduction services and how such technology mixes with Oracle Database application-level compression (such as Advanced Compression Option), I offer the link below to an EMC Lab Report on this very topic.

To read the entire Lab Report please click the following link:   Click Here.

The following is an excerpt from the Lab Report:

Executive Summary
EMC XtremIO storage array offers powerful data reduction features. In addition to thin provisioning, XtremIO applies both deduplication and compression algorithms to blocks of data when they are ingested into the array. These features are always on and intrinsic to the array. There is no added licensing, no tuning nor configuration involved when it comes to XtremIO data reduction.

Oracle Database also supports compression. The most common form of Oracle Database compression is the Advanced Compression Option—commonly referred to as ACO. With Oracle Database most “options” are separately licensed features and ACO is one such option. As of the publication date of this Lab Report, ACO is licensed at $11,000 per processor core on the database host1. Compressing Oracle Database blocks with ACO can offer benefits beyond simple storage savings. Blocks compressed with ACO remain compressed as they pass through the database host. In short, blocks compressed with ACO will hold more rows of data per block. This can be either a blessing or a curse. Allowing Oracle to store more rows per block has the positive benefit of caching more application data in main memory (i.e., the Oracle SGA buffer pool). On the other hand, compacting more data into each block often results in increased block-contention.

Oracle offers tuning advice to address this contention in My Oracle Support note 1223705.12. However, the tuning recommendations for reducing block contention with ACO also lower the compression ratios. Oracle also warns users to expect higher CPU overhead with ACO as per the following statement in the Oracle Database product documentation:

Compression technology uses CPU. Ensure that you have enough available CPU to handle the additional load.

Application vendors, such as SAP, also produce literature to further assist database administrators in making sensible choices about how and when to employ Advanced Compression Option. The importance of understanding the possible performance impact of ACO are made quite clear in such publications as SAP Note 14363524 which states the following about SAP performance with ACO:

Overall system throughput is not negatively impacted and may improve. Should you experience very long runtimes (i.e. 5-10 times slower) for certain operations (like mass inserts in BW PSA or ODS tables/partitions) then you should set the event 10447 level 50 in the spfile/init.ora. This will reduce the overhead for insertion into compressed tables/partitions.

The SAP note offers further words of caution regarding transaction logging (a.k.a., redo) in the following quote:

Amount of redo data generated can be up to 30% higher

Oracle Database Administrators, with prior ACO experience, are largely aware of the trade-offs where ACO is concerned. Database Administrators who have customarily used ACO in their Oracle Database deployments may wish to continue to use ACO after adopting EMC XtremIO. For this reason Database Administrators are interested in learning how XtremIO compression and Advanced Compression Option interact.

This Lab Report offers an analysis of space savings with and without ACO on XtremIO. In addition, a performance characterization of an OLTP workload manipulating the same application data in ACO and non-ACO tablespaces will be covered…please click the link above to continue reading…


Filed under: oracle

Whitepaper: Oracle Database 11g and 12c Consolidation and Workload Scalability with EMC XtremIO 3.0

Wed, 2015-04-29 15:30

This is a just a quick blog post to direct readers to the best Oracle-related paper detailing the value EMC XtremIO brings to Oracle Database use cases.  I’ve been looking forward to the availability of this paper for quite some time as I supported (minimally, really) the EMC Global Solutions Engineering group in this effort. They really did a great job with this testing! I highly recommend this paper for readers who are interested in:

  • Leveraging immediate, space efficient, zero overhead storage snapshots for productivity
  • All-Flash Array performance
  • Database workload consolidation

Click the following link to access the whitepaper: click here.   wp-1 Abstract:

This white paper describes the deployment of the XtremIO® all-flash array with Oracle RAC 11g and 12c databases in both physical and virtual environments. It describes optimal performance while scaling up in a physical environment, the effect of adding multiple virtualized database environments, and the impact of using XtremIO Compression with Oracle Advanced Compression. The white paper also demonstrates the physical space efficiency and low performance impact of XtremIO snapshots.

Filed under: oracle Tagged: Oracle Database performance XtremIO flash, Oracle Performance, Random I/O, XtremIO

Adding An EMC XtremIO Volume As An ASM Disk With Oracle Database 12c On Linux – It Does Not Get Any Easier Than This.

Wed, 2015-03-04 19:07
When Something Is Simple It Must Be Simple To Prove

Provisioning high-performance storage has always been a chore. Care and concern over spindle count, RAID type, RAID attributes, number of controller arms involved and a long list of other complexities have burdened storage administrators. Some of these troubles were mitigated by the advent of Automatic Storage Management–but not entirely.

Wouldn’t it be nice if the complexity of storage provisioning could be boiled down to but a single factor? Wouldn’t it be nice if that single factor was, simply, capacity? With EMC XtremIO the only factor storage administrators need to bear in mind when provisioning storage is, indeed, capacity.

With EMC XtremIO a storage administrator hears there is a need for, say, one terabyte of storage and that is the entirety of information needed. No more questions about the I/O pattern (e.g., large sequential writes ala redo logging, etc). The Database Administrator simply asks for capacity with a very short sentence and the Storage Administrator clicks 3 buttons in the XtremIO GUI and that’s all there is to it.

Pictures Speak Thousands of Words

I too enjoy the simplicity of XtremIO in my engineering work. Just the other day I ran short on space in a tablespace while testing Oracle Database 12c intra-node parallel query. I was studying a two-node Real Application Clusters setup attached to an EMC XtremIO array via 8 paths of 8GFC Fibre Channel. The task at hand was a single parallel CTAS (Create Table As Select) but the command failed because my ASM disk group ran out of space when Oracle Database tried to extend the BIGFILE tablespace.

Since I had to add some space I thought I’d take a few screen shots to show readers of this blog how simple it is to perform the full cycle of tasks required to add space to an active cluster with ASM in an XtremIO environment.

The following screen shot shows the error I was reacting to:


Since the following example shows host configuration steps please note the Linux distribution (Oracle Linux) and kernel version (UEK) I was using:


The following screenshot shows the XtremIO GUI configuration tab. I selected “Add” and then typed a name and size (1TB) of the volume I wanted to create:

NOTE: Right click the embedded images for greater clarity


The following screenshot shows how I then selected the initiators (think hosts) from the right-hand column that I wanted to see the new volume:


After I clicked “apply” I could see my new volume in my “12C” folder. With the folder construct I can do things like create zero-overhead, immediate, writable snapshots with a single mouse click. As the following screenshot shows, I highlighted “data5” so I could get details about the volume in advance of performing tasks on the host. The properties tab shows me the only information I need to proceed–the NAA Identifier. Once I had the NAA Identifier I moved on to the task of discovering the new volume on the hosts.



Host Discovery

Host discovery consists of three simple steps:

  1. Multipath discovery
  2. Updating the udev rules file with a text editor
  3. Updating udev state with udevadm commands
Multipath Discovery

On both nodes of the cluster I executed the following series of commands. This series of commands generates a lot of terminal output so I won’t show that in this blog post.

# multipath -F ;service multipathd restart ; -r

After executing the multipath related commands I was able to see the new volume (0002a) on both nodes of the cluster. Notice how the volume has different multipath names (mpathab, mpathai) on the hosts. This is not an issue since the volumes will be controlled by udev:


Updating Udev Rules File and Udev State

After verifying the volumes were visible under DM-MPIO I moved on to the udev actions. The following screenshot shows how I added an ACTION line in the udev rules file and copied it to the other RAC host and then executed the udev update commands on both RAC hosts:


I then could see “/dev/asmdisk6” on both RAC hosts:


Adding The New XtremIO Volume As An ASM Disk

The next task was to use ASMCA (ASM Configuration Assistant) to add the XtremIO volume to the ASM disk group called “DATA”:


As the following screenshot shows the volume is visible as /dev/asmdisk6:


I selected asmdisk6 and the task was complete:


I then saw evidence of ASM rebalancing in the XtremIO GUI Performance tab:




With EMC XtremIO you provision capacity and that allows you to speak in very short sentences with the application owners that share space in the array.

It doesn’t get any easier than this.

Filed under: oracle

Little Things Doth Crabby Make – Part XVIII. Automatic Storage Management Won’t Let Me Use My Disk For My Files! Yes, It Will!

Fri, 2015-02-06 14:52

It’s been a long time since my last installment in the Little Things Doth Crabby Make series and to be completely honest this particular topic isn’t really all that fit for a LTDCM installment because it covers something that is possible but less than expedient.  That said, there are new readers of this blog and maybe it’s time they google “Little Things Doth Crabby Make” to see where this series has been. This post might rustle up that curiosity!

So what is this blog post about? It’s about stuffing any file system file into Automatic Storage Management space. OK, so maybe this is just morbid curiosity or trivial pursuit. Maybe it’s just a parlor trick. I would agree with any of those descriptions. Nonetheless maybe there are 42 or so people out there who didn’t know this. If so, this post is for them.

ASMCMD cp Command

The cp sub-command of ASM lets you stuff certain database files into ASM. We all know this. However, just to make it all fresh in people’s minds I’ll show a screen shot of me trying to push a compressed tar archive of $ORACLE_HOME/bin/oracle up into ASM:


Well, that’s not surprising. But what happens if I take heed of the error message and attempt to placate? The block size is 8KB so the following screen shot shows me rounding up the size of the compressed tar archive to an 8192B blocking factor:


ASMCMD still won’t gobble up the file. That’s still not all that surprising because after ASMCMD checked the geometry of the file it then read the file looking for a header or any file magic it could understand.  As you can see ASMCMD doesn’t see a file type it understands. The following screen shot shows me pre-pending the tar archive with file magic I know ASMCMD must surely understand. I have a database with a tablespace called foo that I created in a non-Oracle Disk Manager naming convention (foo.dbf). The screen shot shows me:

  1. Extracting the foo.dbf file
  2. “Borrowing” 1MB from the head of the file
  3. Creating a compressed tar archive of the Oracle Database executable
  4. Rounding up the size of the compressed tar archive to an 8192B blocking factor



So now I have a file that has the “shape” of a datafile and the necessary header information from a datafile. The next screen shot shows:

  1. ASMCMD cp command pushing my file into ASM
  2. Removal of all of my current working directory files
  3. ASMCMD cp command pulling the file form ASM and into my current working directory
  4. Extracting the contents of the “embedded” tar archive
  5. md5sum(1) proof the file contents survived the journey


OK, so that’s either a) something nobody would ever do or b) something that can be done with some elegant execution of some internal database package in a much less convoluted way or c) a combination of both “a” and “b” or d) a complete waste of my time to post, or, finally, e) a complete waste of your time reading the post. I’m sorry for “a”,”b”,”c” and certainly “e” if the case should be so.

Now you must wonder why I put this in the Little Things Doth Crabby Make series. That’s simple. I don’t like any “file system” imposing restrictions on file types:)


Filed under: oracle

Scrutinizing Exadata X5 Datasheet IOPS Claims…and Correcting Mistakes

Mon, 2015-02-02 19:37

I want to make these two points right out of the gate:

  1. I do not question Oracle’s IOPS claims in Exadata datasheets
  2. Everyone makes mistakes
Everyone Makes Mistakes

Like me. On January 21, 2015, Oracle announced the X5 generation of Exadata. I spent some time studying the datasheets from this product family and also compared the information to prior generations of Exadata namely the X3 and X4. Yesterday I graphed some of the datasheet numbers from these Exadata products and tweeted the graphs. I’m sorry  to report that two of the graphs were faulty–the result of hasty cut and paste. This post will clear up the mistakes but I owe an apology to Oracle for incorrectly graphing their datasheet information. Everyone makes mistakes. I fess up when I do. I am posting the fixed slides but will link to the deprecated slides at the end of this post.

We’re Only Human

Wouldn’t IT be a more enjoyable industry if certain IT vendors stepped up and admitted when they’ve made little, tiny mistakes like the one I’m blogging about here? In fact, wouldn’t it be wonderful if some of the exceedingly gruesome mistakes certain IT vendors make would result in a little soul-searching and confession? Yes. It would be really nice! But it’ll never happen–well, not for certain IT companies anyway. Enough of that. I’ll move on to the meat of this post. The rest of this article covers:

  • Three Generations of Exadata IOPS Capability
  • Exadata IOPS Per Host CPU
  • Exadata IOPS Per Flash SSD
  • IOPS Per Exadata Storage Server License Cost
Three Generations of Exadata IOPS Capability

The following chart shows how Oracle has evolved Exadata from the X3 to the X5 EF model with regard to IOPS capability. As per Oracle’s datasheets on the matter these are, of course, SQL-driven IOPS. Oracle would likely show you this chart and nothing else. Why? Because it shows favorable,  generational progress in IOPS capability. A quick glance shows that read IOPS improved just shy of 3x and write IOPS capability improved over 4x from the X3 to X5 product releases. These are good numbers. I should point out that the X3 and X4 numbers are the datasheet citations for 100% cached data in Exadata Smart Flash Cache. These models had 4 Exadata Smart Flash Cache PCIe cards in each storage server (aka, cell). The X5 numbers I’m focused on reflect the performance of the all-new Extreme Flash (EF) X5 model. It seems Oracle has started to investigate the value of all-flash technology and, indeed, the X5 EF is the top-dog in the Exadata line-up. For this reason I choose to graph X5 EF data as opposed to the more pedestrian High Capacity model which has 12 4TB SATA drives fronted with PCI Flash cards (4 per storage server). exadata-evolution-iops-gold-1 The tweets I hastily posted yesterday with the faulty data points aimed to normalize these performance numbers to important factors such as host CPU, SSD count and Exadata Storage Server Software licensing costs.  The following set of charts are the error-free versions of the tweeted charts.

Exadata IOPS Per Host CPU

Oracle’s IOPS performance citations are based on SQL-driven workloads. This can be seen in every Exadata datasheet. All Exadata datasheets for generations prior to X4 clearly stated that Exadata IOPS are limited by host CPU. That is a very important fact to understand because SQL-driven IOPS is a host metric no matter what your storage is.

Indeed, anyone who studies Oracle Database with SLOB knows how all of that works. SQL-driven IOPS requires host CPU. Sadly, however, Oracle ceased stating the fact that IOPS are host-CPU bound in Exadata as of the advent of the X4 product family. I presume Oracle stopped correctly stating the factual correlation between host CPU and SQL-driven IOPS for only the most honorable of reasons with the best of customers’ intentions in mind.

In case anyone should doubt my assertion that Oracle historically associated Exadata IOPS limitations with host CPU I submit the following screen shot of the pertinent section of the X3 datasheet:   X3-datasheet-truth Now that the established relationship between SQL-driven IOPS and host CPU has been demystified, I’ll offer the following chart which normalizes IOPS to host CPU core count: exadata-evolution-iops-per-core-gold I think the data speaks for itself but I’ll add some commentary. Where Exadata is concerned, Oracle gives no choice of host CPU to customers. If you adopt Exadata you will be forced to take the top-bin Xeon SKU with the most cores offered in the respective Intel CPU family.  For example, the X3 product used 8-core Sandy Bridge Xeons. The X4 used 12-core Ivy Bridge Xeons and finally the X5 uses 18-core Haswell Xeons. In each of these CPU families there are other processors of varying core counts at the same TDP. For example, the Exadata X5 processor is the E5-2699v3 which is a 145w 18-core part. In the same line of Xeons there is also a 145w 14c part (E5-2697v3) but that is not an option to Exadata customers.

All of this is important since Oracle customers must license Oracle Database software by the host CPU core. The chart shows us that read IOPS per core from X3 to X4 improved 18% but from X4 to X5 we see only a 3.6% increase. The chart also shows that write IOPS/core peaked at X4 and has actually dropped some 9% in the X5 product. These important trends suggest Oracle’s balance between storage plumbing and I/O bandwidth in the Storage Servers is not keeping up with the rate at which Intel is packing cores into the Xeon EP family of CPUs. The nugget of truth that is missing here is whether the 145w 14-core  E5-2697v3 might in fact be able to improve this IOPS/core ratio. While such information would be quite beneficial to Exadata-minded customers, the 22% drop in expensive Oracle Database software in such an 18c versus 14c scenario is not beneficial to Oracle–especially not while Oracle is struggling to subsidize its languishing hardware business with gains from traditional software.

Exadata IOPS Per Flash SSD

Oracle uses their own branded Flash cards in all of the X3 through X5 products. While it may seem like an implementation detail, some technicians consider it important to scrutinize how well Oracle leverages their own components in their Engineered Systems. In fact, some customers expect that adding significant amounts of important performance components, like Flash cards, should pay commensurate dividends. So, before you let your eyes drift to the following graph please be reminded that X3 and X4 products came with 4 Gen3 PCI Flash Cards per Exadata Storage Server whereas X5 is fit with 8 NVMe flash cards. And now, feel free to take a gander at how well Exadata architecture leverages a 100% increase in Flash componentry: exadata-evolution-iops-per-SSD-gold This chart helps us visualize the facts sort of hidden in the datasheet information. From Exadata X3 to Exadata X4 Oracle improved IOPS per Flash device by just shy of 100% for both read and write IOPS. On the other hand, Exadata X5 exhibits nearly flat (5%) write IOPS and a troubling drop in read IOPS per SSD device of 22%.  Now, all I can do is share the facts. I cannot change people’s belief system–this I know. That said, I can’t imagine how anyone can spin a per-SSD drop of 22%–especially considering the NVMe SSD product is so significantly faster than the X4 PCIe Flash card. By significant I mean the NVMe SSD used in the X5 model is rated at 260,000 random 8KB IOPS whereas the X4 PCIe Flash card was only rated at 160,000 8KB read IOPS. So X5 has double the SSDs–each of which is rated at 63% more IOPS capacity–than the X4 yet IOPS per SSD dropped 22% from the X4 to the X5. That means an architectural imbalance–somewhere.  However, since Exadata is a completely closed system you are on your own to find out why doubling resources doesn’t double your performance. All of that might sound like taking shots at implementation details. If that seems like the case then the next section of this article might be of interest.

IOPS Per Exadata Storage Server License Cost

As I wrote earlier in this article, both Exadata X3 and Exadata X4 used PCIe Flash cards for accelerating IOPS. Each X3 and X4 Exadata Storage Server came with 12 hard disk drives and 4 PCIe Flash cards. Oracle licenses Exadata Storage Server Software by the hard drive in X3/X4 and by the NVMe SSD in the X5 EF model. To that end the license “basis” is 12 units for X3/X5 and 8 for X5. Already readers are breathing a sigh of relief because less license basis must surely mean less total license cost. Surely Not! Exadata X3 and X4 list price for Exadata Storage Server software was $10,000 per disk drive for an extended price of $120,000 per storage server. The X5 EF model, on the other hand, prices Exadata Storage Server Software at $20,000 per NVMe SSD for an extended price of $160,000 per Exadata Storage Server. With these values in mind feel free to direct your attention to the following chart which graphs the IOPS per Exadata Storage Server Software list price (IOPS/license$$). exadata-evolution-iops-per-license-cost-gold The trend in the X3 to X4 timeframe was a doubling of write IOPS/license$$ and just short of a 100% improvement in read IOPS/license$$. In stark contrast, however, the X5 EF product delivers only a 57% increase in write IOPS/license$$ and a troubling, tiny, 17% increase in read IOPS/license$$. Remember, X5 has 100% more SSD componentry when compared to the X3 and X4 products.


No summary needed. At least I don’t think so.

About Those Faulty Tweeted Graphs

As promised, I’ve left links to the faulty graphs I tweeted here: Faulty / Deleted Tweet Graph of Exadata IOPS/SSD: Faulty / Deleted Tweet Graph of Exadata IOPS/license$$:


Exadata X3-2 datasheet: Exadata X4-2 datasheet: Exadata X5-2 datasheet: X4 SSD info: X5 SSD info: Engineered Systems Price List: ,

Filed under: oracle