
Re: Raid 50

From: Matthew Zito <mzito_at_gridapp.com>
Date: Thu, 8 Jul 2004 16:29:02 -0400
Message-Id: <737A7008-D11D-11D8-9E6E-000393D3B578@gridapp.com>

Comments inline.

--
Matthew Zito
GridApp Systems
Email: mzito_at_gridapp.com
Cell: 646-220-3551
Phone: 212-358-8211 x 359
http://www.gridapp.com


On Jul 8, 2004, at 10:33 AM, Craig I. Hagan wrote:
<snip>

> Next, your statement talks about reads, which don't have the stripe
> width
> problem (just chunk size/individual disk) save when operating in
> degraded mode
> and a read is performed against data on the failed disk. Raid5 isn't
> all that
> bad for random reads -- it is just that most random-read systems also
> come with
> random writes, which you didn't address.
>
> this leaves you with two sets of io possibilities (one if the array's
> minimum
> io size is a stripe):
>
> 1) read just the chunk(s) requested if the data being read is less than
> stripe width and no drives have failed
>
> send io to sub-disk(s), return result
>
> NB: this is comparable to raid1 (one iop per disk)
While it is technically true that it's comparable to RAID-1 in a "single-read" environment, it has a very different performance profile from RAID-1 when there are multiple IOPs. Most sane/reasonable/modern RAID-1 implementations allow for "detached reads" on the sides of a mirror - where the two sides of the mirror can service two independent reads at a time. The even more intelligent implementations are smart enough to look at where the read heads of the various disks are and make a determination about which head is closest to the incoming read request.

This creates a huge random-read performance boost over RAID-5, especially for small I/Os. With truly random workloads, the disks will settle into a territorial system, each disk basically servicing half the block address space. Of course, every time there's a write, both heads seek to the same point, so the system resets itself.
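(Not part of the original exchange: a rough Python sketch of the detached-read scheduling described above, under the simplifying assumption that seek distance is the only cost that matters; the class and disk names here are made up.)

class MirrorSide:
    def __init__(self, name):
        self.name = name
        self.head_lba = 0                  # last LBA this side's head visited

    def seek_and_io(self, lba, length):
        seek_distance = abs(self.head_lba - lba)   # crude cost: distance moved
        self.head_lba = lba + length
        return seek_distance

class Raid1Mirror:
    def __init__(self):
        self.sides = [MirrorSide("disk0"), MirrorSide("disk1")]

    def read(self, lba, length):
        # "Detached read": send the I/O to whichever side's head is closest.
        side = min(self.sides, key=lambda s: abs(s.head_lba - lba))
        return side.name, side.seek_and_io(lba, length)

    def write(self, lba, length):
        # A write hits both sides, so both heads end up at the same LBA,
        # which resets the territorial split of the address space.
        return [(s.name, s.seek_and_io(lba, length)) for s in self.sides]

mirror = Raid1Mirror()
print(mirror.read(9_000_000, 8))   # ('disk0', ...) - tie broken by order
print(mirror.read(1_000, 8))       # ('disk1', ...) - its head is much closer
print(mirror.write(5_000_000, 8))  # both heads seek to the same spot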
>
> 2) read the entire stripe
> if drives have failed:
> read stripe's chunks from surviving subdisks. unless chunk w/ crc
> has failed, use it to compute missing data
<snip>
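(Not part of the original exchange: a minimal Python sketch of the degraded-mode reconstruction the quoted text describes - the missing chunk is just the XOR of the surviving data chunks plus the parity chunk; the helper names are made up.)

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def reconstruct_missing(surviving_chunks):
    """surviving_chunks: the readable data chunks plus the parity chunk,
    all the same size. Returns the chunk that was on the failed disk."""
    missing = bytes(len(surviving_chunks[0]))
    for chunk in surviving_chunks:
        missing = xor_bytes(missing, chunk)
    return missing

# quick check with three data chunks and their parity, "losing" the middle one
data = [bytes([1] * 4), bytes([2] * 4), bytes([3] * 4)]
parity = xor_bytes(xor_bytes(data[0], data[1]), data[2])
assert reconstruct_missing([data[0], data[2], parity]) == data[1]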
> You discussed reads earlier, which is an area that raid5 often does
> quite well
> at. Writes can be a different matter. The way to achieve writes the size
> of a stripe is to issue them to the OS either as a single large write, or
> (for OSes/storage which are smart enough to coalesce) a series of adjacent
> smaller writes.

> When your submitted writes are less than stripe size and are random so
> no
> coalescing can be performed (think oltp with blocksize < stripesize),
> then you
> will see this:
>
> read stripe in.
> modify the 8k region
> compute checksum for stripe
> write out to disks
>
> This requires two operations against all disks in the set as well as a
> checksum computation. This is inferior to raid1, which would have
> emitted one
> iop to each disk. This is a major reason why raid5 isn't chosen for
> truly
> random io situations unless the sustained writes are below that which
> can be
> sunk to disk and the cache can handle the burst workload.
Okay, there are a couple of problems with the scenario you describe. I'm not sure what you mean by "stripe" versus "chunk". The terminology is universally a little amorphous, but it seems as though you mean a "stripe" to be "the same stripe column across every disk", while a "chunk" is "a single stripe column on a single disk". I'm not sure if this is how you intended it, but either way, the situation described is not really accurate.

The particular phrase I take issue with is "this requires two operations against all disks in the set". If it is a normal/san RAID-5 implementation, and the size of the I/O is such that it fits within a single stripe column on a single disk, only the parity disk for that column and the data disk are involved in the I/O. The complete *logical* flow for this operation is:

1. Accept incoming data.
2. Read current contents of the data column off disk.
3. XOR the current contents of the data column against the incoming data; cache this new value.
4. Write the new data to disk (not the XOR'ed data from the last step, the original incoming data block).
5. Read current contents of the parity column off disk.
6. XOR the cached XOR value from step 3 against the parity column.
7. Write the new parity column to disk.
8. Acknowledge the write to the host.

So, only two disks are involved - the parity disk and the lone data disk. All of the other disks in the RAID group are free to carry on with their business (unless, of course, the RAID set is degraded).

The performance of this can be improved greatly through a couple of different mechanisms. First, a large percentage of the writes in normal environments are "read->modify->write", which makes it likely the original unmodified block will be in cache already, so no disk I/O is needed. If it's not in cache, the least elegant way to complete the transaction is to read the existing block into cache. Newer SCSI/FC drives, though, have support for XOR commands - XDRWRITE, for example, has the controller send the drive the new data block. The drive then internally reads the existing block, XORs the new against the current, writes the new block to disk, and returns the XORed value to the controller. This actually reduces the number of reads/writes for RAID-5 down to the same as RAID-1, though the writes have to be consecutive, not concurrent, which does mean a performance hit relative to RAID-1.
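(Not part of the original exchange: a toy Python sketch of the read-modify-write parity update walked through above, using a trivial in-memory Disk stand-in; note that only the one data disk and the parity disk are touched.)

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

class Disk:
    """Stand-in for one stripe column on one physical disk."""
    def __init__(self, contents):
        self.contents = contents
    def read(self):
        return self.contents
    def write(self, data):
        self.contents = data

def raid5_small_write(data_disk, parity_disk, new_data):
    old_data = data_disk.read()                # read current data column
    delta = xor_bytes(old_data, new_data)      # XOR old against new, cache it
    data_disk.write(new_data)                  # write the original new block
    old_parity = parity_disk.read()            # read current parity column
    parity_disk.write(xor_bytes(old_parity, delta))  # fold the delta into parity
    # ...then acknowledge the write to the host

# quick check: parity still equals the XOR of all data columns afterwards
data = [Disk(bytes([1] * 4)), Disk(bytes([2] * 4)), Disk(bytes([3] * 4))]
parity = Disk(xor_bytes(xor_bytes(data[0].read(), data[1].read()), data[2].read()))
raid5_small_write(data[1], parity, bytes([7] * 4))
assert parity.read() == xor_bytes(xor_bytes(data[0].read(), data[1].read()), data[2].read())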
> This is an area where raid5 tends to do quite well -- often better
> than raid1
> pair because you're splitting the load across more disks (similar # of
> iops)
> rather than duplicating it (write speed of raid1 pair == write speed
> of single
> disk).
If the writes are coalesced, the speed could be roughly equivalent to a RAID-1 pair, since the I/Os are sequential, and hence optimal for a single-disk effective I/O - the question is the size of the RAID group. The larger the RAID group, the larger the stripe, and the more equivalent data would have to be written to a single spindle in a RAID-1 group.

In many cases, though, any slight performance gain from there being less work for each disk to do is offset by the fact that the parity I/O cannot be completed until the data blocks are written, plus the fact that the data needs to be XOR'ed before being written to parity.

Thanks,
Matt
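(Not part of the original exchange: a small Python sketch of the coalesced full-stripe case discussed above - parity is just the XOR of all the new data columns, so no old data or old parity needs to be read, but the parity column cannot be computed until every data column for the stripe is in hand.)

from functools import reduce

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def full_stripe_write(data_columns):
    """data_columns: one new block per data disk in the RAID-5 group.
    Parity is computed from the fresh data alone - no reads needed - but it
    cannot be written until every data column for the stripe is available."""
    parity = reduce(xor_bytes, data_columns)
    return list(data_columns) + [parity]       # what actually hits the spindles

stripe = full_stripe_write([bytes([1] * 4), bytes([2] * 4), bytes([3] * 4)])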
Received on Thu Jul 08 2004 - 15:24:50 CDT
