RE: To use SAME or NOT for High End Storage Setup ? .... StripeUnit Size 32 MB Vs. 64 KB ?


From: Mark W. Farnham <mwf_at_rsiz.com>
Date: Mon, 15 May 2006 12:35:51 -0400
Message-ID: <KNEIIDHFLNJDHOOCFCDKCECAIBAA.mwf@rsiz.com>

At this point I think reviewing the input data to the question is in order:

Quoting VIVEK_SHARMA <VIVEK_SHARMA_at_infosys.com>:

> Folks
>
>
>
> 1) IBM is recommending SAME (Stripe across all the 46 LUNs) + 2 separate
> LUNs for online redo logfiles
>
>
>
> 2) IBM is recommending 32 MB Stripe Unit size across the 46 LUNs using
> Volume Manager.
>
> NOTE - Each underlying LUN has 8 Disks (Hardware Raid 1+0 with Stripe
> Unit Size 64 KB - This is NOT changeable)
>
> Qs. Any feedback on impact of 32 MB Stripe Unit Size(across LUNs) on
> Performance of OLTP / Batch Transactions?
>

I *think* this means that a LUN has 4 mirrored pairs of disks and that the stripe width is 256 KB (4 * 64 KB) at the hardware level. At the individual disk level, the OS sees each 64 KB as 128 sectors of 512 bytes, so unless you're consistently perfectly aligned with the disk blocks, you've got only a small chance that any individual multiblock read request that actually reaches the spinning platters will stay on a single head. If you're reading from the array cache it probably doesn't matter much, and the penalty you'll pay at the physical disk is gated by how well (and whether) the hardware overlaps seeks within a hardware-managed LUN.
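
A minimal Python sketch of that geometry, assuming my reading above (8 disks as 4 mirrored pairs, 64 KB stripe unit) and a simple round-robin layout of stripe units across the pairs. The helper name `pairs_touched` is made up for illustration, and a real array may map addresses differently.

```python
KB = 1024
SECTOR_BYTES = 512
STRIPE_UNIT = 64 * KB      # hardware stripe unit from the original post
MIRRORED_PAIRS = 8 // 2    # assumption: 8 disks in RAID 1+0 = 4 mirrored pairs

def pairs_touched(offset, length):
    """Return the mirrored-pair indexes hit by a read of `length` bytes at `offset`."""
    first_unit = offset // STRIPE_UNIT
    last_unit = (offset + length - 1) // STRIPE_UNIT
    # Assumption: stripe units are laid out round-robin across the pairs.
    return sorted({unit % MIRRORED_PAIRS for unit in range(first_unit, last_unit + 1)})

stripe_width = STRIPE_UNIT * MIRRORED_PAIRS      # 256 KB
sectors_per_unit = STRIPE_UNIT // SECTOR_BYTES   # 128 sectors

print(pairs_touched(0, 128 * KB))       # aligned 128 KB read: two pairs
print(pairs_touched(8 * KB, 128 * KB))  # same read shifted by 8 KB: three pairs
```

Under these assumptions a 128 KB multiblock read can never stay on one pair, and shifting it off the 64 KB boundary adds a third pair.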

Since you wrote 32 MB stripe unit size across the 46 LUNs using the Volume Manager, I *think* this means a stripe WIDTH of 32 MB * 46, which is huge. (If you meant a stripe WIDTH of 32 MB, then your stripe unit size is a little less than three quarters of a MB.) I also *think* this means that if an object is 16 MB in size, then it has a 50-50 chance of living in one chunk on a single LUN, and likewise a 50-50 chance of being in 2 pieces on two LUNs. Objects that are small compared to 32 MB will reside on 1 or 2 LUNs in this configuration, while objects that are big compared to 32 MB will be spread across more LUNs.
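
The arithmetic behind both readings, and the 50-50 point, can be sketched in Python. This assumes the object is one contiguous piece whose starting offset is uniformly random within a stripe unit, which is a simplification; `p_single_lun` is a name made up for illustration.

```python
MB = 1024 * 1024
LUNS = 46

def p_single_lun(object_bytes, stripe_unit_bytes):
    """Chance that a contiguous object with a uniformly random start
    offset fits entirely inside one stripe unit (that is, on one LUN)."""
    return max(0.0, 1.0 - object_bytes / stripe_unit_bytes)

# Reading 1: 32 MB is the stripe UNIT, so the width is 32 MB * 46.
width_if_unit_is_32mb = 32 * MB * LUNS
# Reading 2: 32 MB is the stripe WIDTH, so the unit is 32 MB / 46.
unit_if_width_is_32mb = 32 * MB / LUNS

print(width_if_unit_is_32mb // MB, "MB stripe width under reading 1")
print(round(unit_if_width_is_32mb / MB, 3), "MB stripe unit under reading 2")
print(p_single_lun(16 * MB, 32 * MB), "chance a 16 MB object stays on one LUN")
```

The same function shows the trend at either extreme: the chance approaches 1 for objects much smaller than the stripe unit and drops to 0 once the object is as large as the unit.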

So if you have hot objects that are relatively small, you've got a good chance they'll be on 1 or 2 LUNs, and it won't take too much bad luck to get several small hot objects on the same LUN.
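
How little bad luck that takes can be estimated with a birthday-problem style calculation. This Python sketch assumes each small hot object sits entirely on one LUN chosen uniformly at random from the 46, and it only answers "do at least two share a LUN", which is weaker than "several"; `p_shared_lun` is a hypothetical helper, not anything from a volume manager.

```python
def p_shared_lun(hot_objects, luns=46):
    """Chance that at least two of `hot_objects` land on the same LUN,
    assuming each object picks one LUN uniformly at random."""
    p_all_distinct = 1.0
    for i in range(hot_objects):
        p_all_distinct *= (luns - i) / luns
    return 1.0 - p_all_distinct

for n in (2, 5, 9, 15):
    print(n, "hot objects:", round(p_shared_lun(n), 3))
```

Under these assumptions the odds of at least one shared LUN pass 50% at about nine small hot objects.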

There is a good chance that any such potential hot spots are handled in the cache, because they only pertain to relatively small objects. Those objects will be in cache if they are hot (unless of course they are hot with regard to writing).

But you're still only going to see the write degradation if the hot objects on a single LUN overrun the cache and the ability of four drives to keep up with DB_WRITER. Then again, I'm not sure what overhead you incur if a single read references multiple LUNs, anyway. At that point it is the software volume manager, and it is not clear to me whether there is any different penalty from referencing two physical platters from a single LUN versus the last platter of one LUN and the first platter of the next LUN. That would depend on whether the hardware is capable of overlapping seeks and whether the chain of software reaching the platter triggers that capability.

If I'm wrong, we need to clarify your meaning of the terminology as regards "stripe unit size" with regard to both the hardware LUN creation and the volume manager creation of volumes as seen at the file system level or raw partition level.

I'm also curious whether you're re-mirroring on the 46 LUNs. I'm guessing NOT, but I could take your meaning that way from your indicating that you're using SAME across the 46 LUNs. I *think* you've mirrored pairwise at the underlying hardware level and you're simply creating volumes striped across 46 of the resulting LUNs.

Finally, if the stripe WIDTH across the 46 LUNs is 32 MB, that could work out very well with the first LUN in each logical stripe rotating as the volume manager creates the storage. That is *if* there is any overhead to cross-LUN reads within a logical volume. The other way to do this is to make each LUN be presented as 4 LUNs, rotating the starting drive on each quarter. Again, with a large cache you probably will never see the difference, but in the old days of actually depending on reading and writing from the physical platters, my observation was that the first drive of a raid set tended to get beat on by Oracle, so splitting up a raid set into what I defined as "stripe sets", using a round robin allocation of the "first" platter in each stripe set, made quite a difference. When Oracle handled much smaller volumes, you had to split up the raid sets into multiple volumes to use them raw anyway, and the only overhead to rotating the starting point was thinking of it (if the volume manager supported telling it where to start), so it was worth it if there was any chance of a gain.
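
The round robin idea is simple enough to show in a few lines of Python. This is only an illustration of the rotation, assuming a raid set of 4 drives (or mirrored pairs) split into 4 stripe sets; `stripe_set_orders` is a made-up name, and whether a given volume manager or array lets you choose the starting drive is not something this sketch can tell you.

```python
def stripe_set_orders(drives=4):
    """For each stripe set, list the drive order with the starting drive
    rotated by one position relative to the previous stripe set."""
    base = list(range(drives))
    return [base[start:] + base[:start] for start in range(drives)]

for set_no, order in enumerate(stripe_set_orders()):
    print("stripe set", set_no, "uses drives in order", order)
```

Each stripe set then begins on a different drive, so whatever Oracle hammers at the front of a volume is spread over all four drives instead of always landing on drive 0.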

I apologize for writing too much. I didn't have time to make it shorter.

Kevin's comment about 32 KB being a trainwreck is probably an UNDERSTATEMENT. Kevin's comment that "the planets do not align that way" is exactly correct, and I think Cary and others have written whole papers on the math of what lines up usefully and when, which I vastly oversimplified above, just giving you the 50-50 point.

Regards,

mwf

-----Original Message-----
From: oracle-l-bounce_at_freelists.org [mailto:oracle-l-bounce_at_freelists.org]On Behalf Of Kevin Closson
Sent: Monday, May 15, 2006 10:49 AM
To: oracle-l_at_freelists.org
Subject: RE: To use SAME or NOT for High End Storage Setup ? .... StripeUnit Size 32 MB Vs. 64 KB ?

>>>array. In both cases performance was fine and there were no "hot"
>>>disks. The logic for choosing 4 MB was to ensure that any
>>>full table scans (in our case 128 KB) would avoid having the
>>>multi-block reads split into two reads due to the required
>>>blocks existing in more than 1 "stripe".

A 4 MB stripe width will reduce the odds of cross-stripe reads, but in no way eliminates them. The planets do not align that way.

>>>
>>>IMHO I'm not sure why IBM are recommending why you should go
>>>as large as 32 MB.

I'm still surprised to hear there is such a thing as a 32MB stripe width on a DSXXXX array... maybe they meant 32KB (which would be a trainwreck) ?

--
http://www.freelists.org/webpage/oracle-l



Received on Mon May 15 2006 - 11:35:51 CDT