
Re: Raid 50

From: Matthew Zito <mzito_at_gridapp.com>
Date: Thu, 8 Jul 2004 16:29:02 -0400
Message-Id: <737A7008-D11D-11D8-9E6E-000393D3B578@gridapp.com>

Comments inline.

--
Matthew Zito
GridApp Systems
Email: mzito_at_gridapp.com
Cell: 646-220-3551
Phone: 212-358-8211 x 359
http://www.gridapp.com


On Jul 8, 2004, at 10:33 AM, Craig I. Hagan wrote:
<snip>

> Next, your statement talks about reads, which don't have the stripe
> width
> problem (just chunk size/individual disk) save when operating in
> degraded mode
> and a read is performed against data on the failed disk. Raid5 isn't
> all that
> bad for random reads -- it is just that most random-read systems also
> come with
> random writes, which you didn't address.
>
> this leaves you with two sets of io possibilities (one if the array's
> minimum
> io size is a stripe):
>
> 1) read just the chunk(s) requested if the data being read is less than
> stripe width and no drives have failed
>
> send io to sub-disk(s), return result
>
> NB: this is comparable to raid1 (one iop per disk)
While it is technically true that it's comparable to RAID-1 in a "single-read" environment, it has a very different performance profile from RAID-1 when there are multiple IOPs. Most sane/reasonable/modern RAID-1 implementations allow for "detached reads" on the sides of a mirror - where the two sides of the mirror can service two independent reads at a time. The even more intelligent implementations are smart enough to look at where the read heads of the various disks are and make a determination about which head is closest to the incoming read request.

This creates a huge random-read performance boost over RAID-5, especially for small I/Os. With truly random workloads, the disks will settle into a territorial system, each disk basically servicing half the block address space. Of course, every time there's a write, both heads seek to the same point, so the system resets itself.
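(Not part of the original exchange: a rough Python sketch of the detached-read scheduling described above, under the simplifying assumption that seek distance is the only cost that matters; the class and disk names here are made up.)

class MirrorSide:
    def __init__(self, name):
        self.name = name
        self.head_lba = 0                  # last LBA this side's head visited

    def seek_and_io(self, lba, length):
        seek_distance = abs(self.head_lba - lba)   # crude cost: distance moved
        self.head_lba = lba + length
        return seek_distance

class Raid1Mirror:
    def __init__(self):
        self.sides = [MirrorSide("disk0"), MirrorSide("disk1")]

    def read(self, lba, length):
        # "Detached read": send the I/O to whichever side's head is closest.
        side = min(self.sides, key=lambda s: abs(s.head_lba - lba))
        return side.name, side.seek_and_io(lba, length)

    def write(self, lba, length):
        # A write hits both sides, so both heads end up at the same LBA,
        # which resets the territorial split of the address space.
        return [(s.name, s.seek_and_io(lba, length)) for s in self.sides]

mirror = Raid1Mirror()
print(mirror.read(9_000_000, 8))   # ('disk0', ...) - tie broken by order
print(mirror.read(1_000, 8))       # ('disk1', ...) - its head is much closer
print(mirror.write(5_000_000, 8))  # both heads seek to the same spot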
>
> 2) read the entire stripe
> if drives have failed:
> read stripe's chunks from surviving subdisks. unless chunk w/ crc
> has failed, use it to compute missing data
<snip>
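(Not part of the original exchange: a minimal Python sketch of the degraded-mode reconstruction the quoted text describes - the missing chunk is just the XOR of the surviving data chunks plus the parity chunk; the helper names are made up.)

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def reconstruct_missing(surviving_chunks):
    """surviving_chunks: the readable data chunks plus the parity chunk,
    all the same size. Returns the chunk that was on the failed disk."""
    missing = bytes(len(surviving_chunks[0]))
    for chunk in surviving_chunks:
        missing = xor_bytes(missing, chunk)
    return missing

# quick check with three data chunks and their parity, "losing" the middle one
data = [bytes([1] * 4), bytes([2] * 4), bytes([3] * 4)]
parity = xor_bytes(xor_bytes(data[0], data[1]), data[2])
assert reconstruct_missing([data[0], data[2], parity]) == data[1]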
> You discussed reads earlier, which is an area that raid5 often does
> quite well
> at. Writes can be a different matter. The way to achieve writes the size
> of a stripe is to issue them to the OS either as a single large write, or
> (for OSes/storage which are smart enough to coalesce) a series of adjacent
> smaller writes.

> When your submitted writes are less than stripe size and are random so
> no
> coalescing can be performed (think oltp with blocksize < stripesize),
> then you
> will see this:
>
> read stripe in.
> modify the 8k region
> compute checksum for stripe
> write out to disks
>
> This requires two operations against all disks in the set as well as a
> checksum computation. This is inferior to raid1, which would have
> emitted one
> iop to each disk. This is a major reason why raid5 isn't chosen for
> truly
> random io situations unless the sustained writes are below that which
> can be
> sunk to disk and the cache can handle the burst workload.
Okay, there are a couple of problems with the scenario you describe. I'm not sure what you mean by "stripe" versus "chunk". The terminology is universally a little amorphous, but it seems as though you mean a "stripe" to be "the same stripe column across every disk", while a "chunk" is "a single stripe column on a single disk". I'm not sure if this is how you intended it, but either way, the situation described is not really accurate.

The particular phrase I take issue with is "this requires two operations against all disks in the set". If it is a normal/san RAID-5 implementation, and the size of the I/O is such that it fits within a single stripe column on a single disk, only the parity disk for that column and the data disk are involved in the I/O. The complete *logical* flow for this operation is:

1. Accept incoming data.
2. Read current contents of the data column off disk.
3. XOR the current contents of the data column against the incoming data; cache this new value.
4. Write the new data to disk (not the XOR'ed data from the last step, the original incoming data block).
5. Read current contents of the parity column off disk.
6. XOR the cached XOR value from step 3 against the parity column.
7. Write the new parity column to disk.
8. Acknowledge the write to the host.

So, only two disks are involved - the parity disk and the lone data disk. All of the other disks in the RAID group are free to carry on with their business (unless, of course, the RAID set is degraded).

The performance of this can be improved greatly through a couple of different mechanisms. First, a large percentage of the writes in normal environments are "read->modify->write", which makes it likely the original unmodified block will be in cache already, so no disk I/O is needed. If it's not in cache, the least elegant way to complete the transaction is to read the existing block into cache. Newer SCSI/FC drives, though, have support for XOR commands - XDRWRITE, for example, has the controller send the drive the new data block. The drive then internally reads the existing block, XORs the new against the current, writes the new block to disk, and returns the XORed value to the controller. This actually reduces the number of reads/writes for RAID-5 down to the same as RAID-1, though the writes have to be consecutive, not concurrent, which does mean a performance hit relative to RAID-1.
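(Not part of the original exchange: a toy Python sketch of the read-modify-write parity update walked through above, using a trivial in-memory Disk stand-in; note that only the one data disk and the parity disk are touched.)

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

class Disk:
    """Stand-in for one stripe column on one physical disk."""
    def __init__(self, contents):
        self.contents = contents
    def read(self):
        return self.contents
    def write(self, data):
        self.contents = data

def raid5_small_write(data_disk, parity_disk, new_data):
    old_data = data_disk.read()                # read current data column
    delta = xor_bytes(old_data, new_data)      # XOR old against new, cache it
    data_disk.write(new_data)                  # write the original new block
    old_parity = parity_disk.read()            # read current parity column
    parity_disk.write(xor_bytes(old_parity, delta))  # fold the delta into parity
    # ...then acknowledge the write to the host

# quick check: parity still equals the XOR of all data columns afterwards
data = [Disk(bytes([1] * 4)), Disk(bytes([2] * 4)), Disk(bytes([3] * 4))]
parity = Disk(xor_bytes(xor_bytes(data[0].read(), data[1].read()), data[2].read()))
raid5_small_write(data[1], parity, bytes([7] * 4))
assert parity.read() == xor_bytes(xor_bytes(data[0].read(), data[1].read()), data[2].read())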
> This is an area where raid5 tends to do quite well -- often better
> than raid1
> pair because you're splitting the load across more disks (similar # of
> iops)
> rather than duplicating it (write speed of raid1 pair == write speed
> of single
> disk).
If the writes are coalesced, the speed could be roughly equivalent to a RAID-1 pair, since the I/Os are sequential, and hence optimal for a single-disk effective I/O - the question is the size of the RAID group. The larger the RAID group, the larger the stripe, and the more equivalent data would have to be written to a single spindle in a RAID-1 group.

In many cases, though, any slight performance gain from there being less work for each disk to do is offset by the fact that the parity I/O cannot be completed until the data blocks are written, plus the fact that the data needs to be XOR'ed before being written to parity.

Thanks,
Matt
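(Not part of the original exchange: a small Python sketch of the coalesced full-stripe case discussed above - parity is just the XOR of all the new data columns, so no old data or old parity needs to be read, but the parity column cannot be computed until every data column for the stripe is in hand.)

from functools import reduce

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def full_stripe_write(data_columns):
    """data_columns: one new block per data disk in the RAID-5 group.
    Parity is computed from the fresh data alone - no reads needed - but it
    cannot be written until every data column for the stripe is available."""
    parity = reduce(xor_bytes, data_columns)
    return list(data_columns) + [parity]       # what actually hits the spindles

stripe = full_stripe_write([bytes([1] * 4), bytes([2] * 4), bytes([3] * 4)])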
Received on Thu Jul 08 2004 - 15:24:50 CDT
