Re: I/O performance

From: Karl Arao <karlarao_at_gmail.com>
Date: Thu, 21 Jun 2012 14:36:03 -0500
Message-ID: <CACNsJnetoRiviPwiuQ=Pg5fVxjXcv0BH8oZx67juE0KUqzryCg_at_mail.gmail.com>



I've got a couple of points here..
Calibrate IO -

Yes, I agree that calibrate IO does the 8K reads first and then the large reads at the end of it.. and if you increase "num_physical_disks" to 128, or some value way larger than your actual number of disks, it will do a longer sustained IO workload, which you can see here
https://lh6.googleusercontent.com/-sOsWu7Pic6Y/T-Np0FfGMTI/AAAAAAAABp0/LgvdL6-kF8A/s2048/20120621_calibrateio.PNG
That's a run with 8, 16, and 128 num_physical_disks; the 128 run reached the max bandwidth of the storage.
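
For reference, the kind of run I'm talking about is roughly this (a minimal sketch, not my exact script; you need SYSDBA or equivalent privileges, timed_statistics, and async IO enabled):

sqlplus -s / as sysdba <<'EOF'
set serveroutput on
declare
  l_max_iops pls_integer;
  l_max_mbps pls_integer;
  l_latency  pls_integer;
begin
  dbms_resource_manager.calibrate_io(
    num_physical_disks => 128,  -- deliberately way above the real spindle count
    max_latency        => 20,
    max_iops           => l_max_iops,
    max_mbps           => l_max_mbps,
    actual_latency     => l_latency);
  dbms_output.put_line('max_iops='||l_max_iops||' max_mbps='||l_max_mbps||' latency='||l_latency);
end;
/
EOF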

Short Stroking -



Also, on your earlier reply: short stroking the disks really helps IO performance.. and that's what Exadata is actually doing when it allocates cell disks out of the outer layer of the platters. I've got an R&D server with 8 x 1TB disks and I short stroked them to the 320GB outer layer, but it took me a while to find the short stroke sweet spot
http://www.facebook.com/photo.php?fbidH9469633028&lfe0cb72e
I believe I used HD Tach there, but I also did a bunch of test cases in Linux as I grew the area size of the disk, and it also depends on the size of the data area that you need.. so I laid mine out as the outer 320GB x 8 for the DATA ASM diskgroup, the next 320GB x 8 for the LVM that I striped for my VirtualBox guests, and the rest for my RECO area, which I use for backups.

Stripe size -



Aside from short stroking the disks, the larger the stripe size I used for my LVM, the greater the performance I got from sequential reads and writes:

# VBOX
pvcreate /dev/sda6 /dev/sdb3 /dev/sdc3 /dev/sdd3 /dev/sde3 /dev/sdf3 /dev/sdg3 /dev/sdh3
vgcreate vgvbox /dev/sda6 /dev/sdb3 /dev/sdc3 /dev/sdd3 /dev/sde3 /dev/sdf3 /dev/sdg3 /dev/sdh3
lvcreate -n lvvbox -i 8 -I 4096 vgvbox -l 625008   <-- striped across 8 disks with a 4MB (4096KB) stripe size, which behaves like ASM: it writes 4MB chunks to each physical volume, so every IO operation keeps all of your spindles working
mkfs.ext3 /dev/vgvbox/lvvbox
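
A quick sanity check that the LV really came out striped 8-wide at a 4MB stripe (a generic check, not something from the original runs):

lvdisplay -m /dev/vgvbox/lvvbox | egrep -i 'stripe|physical volume'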

And check out the 1MB and 4MB stripe size comparison here:

#### 320GB LVM STRIPE 1MB VS 4MB ON UEK

$ less 16_ss_320GB_LVM_lvvm-1MBstripe/orion.log | grep +++ | grep -v RUN
+++ localhost.localdomain params_dss_randomwrites Maximum Large MBPS=286.61 @ Small=0 and Large=256
+++ localhost.localdomain params_dss_seqwrites Maximum Large MBPS=285.16 @ Small=0 and Large=256 <---- at 1MB stripe
+++ localhost.localdomain params_dss_randomreads Maximum Large MBPS=460.69 @ Small=0 and Large=256
+++ localhost.localdomain params_dss_seqreads Maximum Large MBPS=459.27 @ Small=0 and Large=256 <---- at 1MB stripe
+++ localhost.localdomain params_oltp_randomwrites Maximum Small IOPS=754 @ Small=256 and Large=0 Minimum Small Latency=338.54 @ Small=256 and Large=0
+++ localhost.localdomain params_oltp_seqwrites Maximum Small IOPS=735 @ Small=256 and Large=0 Minimum Small Latency=347.24 @ Small=256 and Large=0
+++ localhost.localdomain params_oltp_randomreads Maximum Small IOPS01 @ Small=256 and Large=0 Minimum Small Latency2.61 @ Small=256 and Large=0
+++ localhost.localdomain params_oltp_seqreads Maximum Small IOPS96 @ Small=256 and Large=0 Minimum Small Latency3.24 @ Small=256 and Large=0
+++ localhost.localdomain params_dss Maximum Large MBPS=314.26 @ Small=0 and Large
+++ localhost.localdomain params_oltp Maximum Small IOPS=791 @ Small and Large=0 Minimum Small Latency.51 @ Small=1 and Large=0
oracle@desktopserver.local:/reco/orion:dw

$ less 15_ss_320GB_LVM_lvvm-4MBstripe/orion.log | grep +++ | grep -v RUN
+++ localhost.localdomain params_dss_randomwrites Maximum Large MBPS=283.53 @ Small=0 and Large=256
+++ localhost.localdomain params_dss_seqwrites Maximum Large MBPS=426.95 @ Small=0 and Large=256 <---- at 4MB stripe
+++ localhost.localdomain params_dss_randomreads Maximum Large MBPS=462.84 @ Small=0 and Large=256
+++ localhost.localdomain params_dss_seqreads Maximum Large MBPS=614.10 @ Small=0 and Large=256 <---- at 4MB stripe
+++ localhost.localdomain params_oltp_randomwrites Maximum Small IOPS=753 @ Small=256 and Large=0 Minimum Small Latency=338.63 @ Small=256 and Large=0
+++ localhost.localdomain params_oltp_seqwrites Maximum Small IOPS=731 @ Small=256 and Large=0 Minimum Small Latency=349.22 @ Small=256 and Large=0
+++ localhost.localdomain params_oltp_randomreads Maximum Small IOPS98 @ Small=256 and Large=0 Minimum Small Latency3.03 @ Small=256 and Large=0
+++ localhost.localdomain params_oltp_seqreads Maximum Small IOPS06 @ Small=256 and Large=0 Minimum Small Latency1.97 @ Small=256 and Large=0
+++ localhost.localdomain params_dss Maximum Large MBPS=315.68 @ Small=0 and Large
+++ localhost.localdomain params_oltp Maximum Small IOPS=792 @ Small and Large=0 Minimum Small Latency.42 @ Small=1 and Large=0
oracle@desktopserver.local:/reco/orion:dw

You can see the difference: 614.10 MB/s versus 459.27 MB/s on sequential reads.. that's a lot!
You'll find more details here --> LVM stripe size / AU size / UEK kernel test case comparison -
http://www.evernote.com/shard/s48/sh/36636b46-995a-4812-bd07-e88fa0dfd191/d36f37565243025e7b5792f496dc5a37
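
For reference, those test names (params_dss_*, params_oltp_*) look like the -testname of separate orion runs; I don't have the exact command lines pasted here, but a stripped-down equivalent using the canned run levels would look roughly like this:

# orion reads the device list from <testname>.lun
echo /dev/vgvbox/lvvbox > lvvbox.lun

# large-IO sweep (the "Maximum Large MBPS" lines)
./orion -run dss -testname lvvbox -num_disks 8

# small-IO sweep (the "Maximum Small IOPS" / latency lines)
./orion -run oltp -testname lvvbox -num_disks 8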

UEK vs Regular Kernel -



And not only that.. I noticed that the UEK kernel gives me more MB/s on sequential reads and writes, possibly because of kernel optimizations
http://www.oracle.com/us/technologies/linux/uek-for-linux-177034.pdf
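
(When comparing kernels it's worth capturing at least the kernel release and the IO scheduler each disk was using for the run; the scheduler default is one of the knobs that can differ between kernels. Something like this is enough:)

uname -r
for d in /sys/block/sd?/queue/scheduler; do echo "$d: $(cat $d)"; done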

#### NON-UEK VS UEK ON LVM - *the regular kernel gives lower MB/s on sequential reads/writes*

$ cat 23_ss_320GB_LVM_lvvbox-4MBstripe-regularkernel/orion.log | grep +++ | grep -v RUN
+++ localhost.localdomain params_dss_randomwrites Maximum Large MBPS=258.40 @ Small=0 and Large=256
+++ localhost.localdomain params_dss_seqwrites Maximum Large MBPS=343.02 @ Small=0 and Large=256 <---- regular kernel
+++ localhost.localdomain params_dss_randomreads Maximum Large MBPS=413.60 @ Small=0 and Large=256
+++ localhost.localdomain params_dss_seqreads Maximum Large MBPS=550.17 @ Small=0 and Large=256 <---- regular kernel
+++ localhost.localdomain params_oltp_randomwrites Maximum Small IOPS=734 @ Small=256 and Large=0 Minimum Small Latency=347.84 @ Small=256 and Large=0
+++ localhost.localdomain params_oltp_seqwrites Maximum Small IOPS=716 @ Small=256 and Large=0 Minimum Small Latency=356.26 @ Small=256 and Large=0
+++ localhost.localdomain params_oltp_randomreads Maximum Small IOPS45 @ Small=256 and Large=0 Minimum Small Latency0.21 @ Small=256 and Large=0
+++ localhost.localdomain params_oltp_seqreads Maximum Small IOPS40 @ Small=256 and Large=0 Minimum Small Latency0.91 @ Small=256 and Large=0
+++ localhost.localdomain params_dss Maximum Large MBPS=310.54 @ Small=0 and Large
+++ localhost.localdomain params_oltp Maximum Small IOPS=780 @ Small and Large=0 Minimum Small Latency.47 @ Small=1 and Large=0
oracle@desktopserver.local:/reco/orion:dw

$ less 15_ss_320GB_LVM_lvvm-4MBstripe/orion.log | grep +++ | grep -v RUN
+++ localhost.localdomain params_dss_randomwrites Maximum Large MBPS=283.53 @ Small=0 and Large=256
+++ localhost.localdomain params_dss_seqwrites Maximum Large MBPS=426.95 @ Small=0 and Large=256 <---- UEK kernel
+++ localhost.localdomain params_dss_randomreads Maximum Large MBPS=462.84 @ Small=0 and Large=256
+++ localhost.localdomain params_dss_seqreads Maximum Large MBPS=614.10 @ Small=0 and Large=256 <---- UEK kernel
+++ localhost.localdomain params_oltp_randomwrites Maximum Small IOPS=753 @ Small=256 and Large=0 Minimum Small Latency=338.63 @ Small=256 and Large=0
+++ localhost.localdomain params_oltp_seqwrites Maximum Small IOPS=731 @ Small=256 and Large=0 Minimum Small Latency=349.22 @ Small=256 and Large=0
+++ localhost.localdomain params_oltp_randomreads Maximum Small IOPS98 @ Small=256 and Large=0 Minimum Small Latency3.03 @ Small=256 and Large=0
+++ localhost.localdomain params_oltp_seqreads Maximum Small IOPS06 @ Small=256 and Large=0 Minimum Small Latency1.97 @ Small=256 and Large=0
+++ localhost.localdomain params_dss Maximum Large MBPS=315.68 @ Small=0 and Large
+++ localhost.localdomain params_oltp Maximum Small IOPS=792 @ Small and Large=0 Minimum Small Latency.42 @ Small=1 and Large=0
oracle@desktopserver.local:/reco/orion:dw

ASM redundancy / SAN redundancy -


I'll pull in a conversation I had with a really good friend of mine.. his question was: "Quick question on your AWR mining script AWR-gen-wl..., is IOPs calculated before or after ASM mirroring? For example on Exadata, if I see 10,000 write IOPs, did the cells do 10k or did they do 20k (normal redundancy)?"

and here's my response..

The scripts awr_iowl.sql and awr_iowlexa.sql have columns that account for RAID1.. that is, a read penalty of 1 and a write penalty of 2.

Read the section "the IOPS RAID penalty" at this link http://www.zdnetasia.com/calculate-iops-in-a-storage-array-62061792.htm and the "real life examples" at this link
http://www.yellow-bricks.com/2009/12/23/iops/


So those computations should also apply to Exadata, since normal redundancy is essentially RAID1, which is a write penalty of 2, and high redundancy is a penalty of 3.


Now I remember this sizing exercise I had with an EMC engineer on a project bid before
https://www.evernote.com/shard/s48/sh/03602b99-3274-4c64-b5d1-bbe7bd961f8d/95be02ccf9aa75cf863bb19115353eb0


And that's why I created those columns to get the data directly from AWR.. so for every snapshot you've got the "hardware IOPS needed" and "number of disks needed". What's good about that is that as your workload varies, those two numbers stay representative of that workload. And since you have a lot of data samples, I usually make a histogram of those two columns and take the top percentile numbers, because those are most likely my peak periods, and I can investigate them by drilling down on the snap_ids, looking into the SQLs, and validating with the app owners what the application was running at that time.
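
It's not the actual awr_iowl.sql, but the idea behind those columns is roughly this kind of query against AWR, with the RAID1 / normal-redundancy write penalty of 2 baked into the "hardware IOPS" column:

sqlplus -s / as sysdba <<'EOF'
set pages 100 lines 200
with s as (
  select sn.snap_id,
         sn.instance_number,
         cast(sn.end_interval_time as date) end_time,
         (cast(sn.end_interval_time as date)
          - cast(sn.begin_interval_time as date)) * 86400 secs,
         st.stat_name,
         st.value - lag(st.value) over
           (partition by st.dbid, st.instance_number, st.stat_name
            order by st.snap_id) delta
  from   dba_hist_snapshot sn
         join dba_hist_sysstat st
           on  st.dbid            = sn.dbid
           and st.instance_number = sn.instance_number
           and st.snap_id         = sn.snap_id
  where  st.stat_name in ('physical read total IO requests',
                          'physical write total IO requests')
)
select snap_id, instance_number, end_time,
       round(max(case when stat_name like 'physical read%'  then delta end) / max(secs)) read_iops,
       round(max(case when stat_name like 'physical write%' then delta end) / max(secs)) write_iops,
       round((max(case when stat_name like 'physical read%'  then delta end)
         + 2 * max(case when stat_name like 'physical write%' then delta end)) / max(secs)) hw_iops_normal
from   s
where  delta >= 0  -- skip the first snapshot and instance restarts
group  by snap_id, instance_number, end_time
order  by snap_id, instance_number;
EOF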


I've attached an Excel sheet where you just plug the total workload IOPS into the yellow box. So in your case, let's say you have 10K workload IOPS... that's equivalent to 15K hardware IOPS for normal redundancy and 20K hardware IOPS for high redundancy.

the excel screenshot is actually here ----> https://lh6.googleusercontent.com/-00PkzwfwnOE/T-N0oo2Q-FI/AAAAAAAABqE/EbTOnHBlpmQ/s2048/20120621_IOPS.png
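
To make the arithmetic explicit (this assumes a 50/50 read/write mix, which is what reproduces the 15K / 20K numbers above; plug in your own mix):

WORKLOAD_IOPS=10000
READ_PCT=50
READS=$(( WORKLOAD_IOPS * READ_PCT / 100 ))
WRITES=$(( WORKLOAD_IOPS - READS ))
echo "normal redundancy (write penalty 2): $(( READS + 2 * WRITES )) hardware IOPS"   # 15000
echo "high redundancy   (write penalty 3): $(( READS + 3 * WRITES )) hardware IOPS"   # 20000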

Note that I'm particular about the words "workload IOPS" and "hardware IOPS", so on this statement:

*if I see 10,000 write IOPs, did the cells do 10k or did they do 20k (normal redundancy)?* <-- if this 10,000 is what you pulled from AWR, then it's the database that did the 10K IOPS, so that's the "workload IOPS".. and that's essentially your "IO workload requirement".


Then let's say you haven't migrated to Exadata yet.. you have to take into account the penalty computation shown above.. so you'll arrive at 15,000 "hardware IOPS" needed (normal redundancy). Say each disk does 180 IOPS, then you need at least 83 disks, and 83 disks / 12 disks per cell = 6.9 storage cells... and that's a Half Rack Exadata. But looking at the data sheet
https://www.dropbox.com/s/ltvr7caysvfmvkr/dbmachine-x2-2-datasheet-175280.pdf
it seems like you could fit the 15,000 on a quarter rack (because of the flash).. mmm.. well, I'm not that confident about that, because if, say, 50% of the 15,000 IOPS are writes (7.5K IOPS), then I would investigate the write mix: whether most of it is DBWR related (v$sysstat "physical write IO requests") or LGWR related (v$sysstat "redo writes"). If most of it is DBWR related, then I don't think you'll benefit much from the Smart Flash Log. So I would still go with the Half Rack (12,500 disk IOPS) or Full Rack (25,000 disk IOPS) for my "hardware IOPS" capacity. And I'd also take into consideration the MB/s needed for that database, but that should be augmented by the flash cache.
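
The back-of-the-envelope version of that sizing, using the same 180 IOPS per disk and 12 disks per cell figures as above:

HW_IOPS=15000
DISK_IOPS=180
DISKS_PER_CELL=12
DISKS=$(( (HW_IOPS + DISK_IOPS - 1) / DISK_IOPS ))           # ceiling -> 84 (I rounded to 83 above)
CELLS=$(( (DISKS + DISKS_PER_CELL - 1) / DISKS_PER_CELL ))   # ceiling -> 7 cells, i.e. Half Rack territory
echo "$DISKS disks across $CELLS storage cells"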


The effect of ASM redundancy on read/write IOPS – SLOB test case!



I'm currently writing a blog post about this, but I'll give you bits of it right now.. So my statement above holds: the ASM redundancy (mirroring) affects the workload write IOPS number, but it will not affect the workload read IOPS.

As you can see here, as I change the redundancy the read IOPS stayed in the range of 2400+ IOPS (128R):
https://lh4.googleusercontent.com/-QEFEQkc3iy4/T-Npy9FRfpI/AAAAAAAABpk/VfvcxgN9D0k/s2048/20120621_128R.png

While on the writes, as I moved to "normal" redundancy it went down to half, and with "high" redundancy it went down to 1/3 (128W):
https://lh4.googleusercontent.com/-H7q6OpJnhRA/T-Npy8t5DDI/AAAAAAAABpo/jd2_Cp4exAc/s2048/20120621_128W.png

This behavior is the same even in a regular SAN environment... something you have to be careful about and aware of when sizing storage.
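
The back of the envelope for why the writes scale that way while the reads don't: every workload write has to land on 2 (normal) or 3 (high) disks, while a read is served from a single copy. The aggregate write capacity below is a made-up number just to show the ratios:

DISK_WRITE_IOPS=7200       # hypothetical aggregate disk write capacity of the diskgroup
for PENALTY in 1 2 3; do   # external, normal, high redundancy
  echo "write penalty $PENALTY -> ~$(( DISK_WRITE_IOPS / PENALTY )) workload write IOPS"
done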

-- 
Karl Arao
karlarao.wordpress.com
karlarao.tiddlyspot.com
