RAID Reliability Calculations (Was Storage array advice anyone?)

From: <chris_at_thedunscombes.f2s.com>
Date: Thu, 6 Jan 2005 10:38:14 +0000
Message-ID: <1105007894.41dd15166458a@webmail.freedom2surf.net>

Well, it's the new year now and I've completed the data loss calculations to "update" those in section 3.4.5 of the paper "RAID: High Performance, Reliable Secondary Storage" mentioned by Cary.

http://www.eecs.umich.edu/CoVirt/papers/diskArraySurvey.pdf

First a quick summary of the original results:

Double disk failure

285 years mean time to data loss (MTTDL)
3.4% probability of data loss over 10 years (PDL10Y)

Disk failure + unrecoverable read error (bit error) during reconstruction of failed disk

36 years mean time to data loss (MTTDL)
24.4% probability of data loss over 10 years (PDL10Y)

Based on:

500 GB of data, 5 GB drives with 200,000 hrs MTTF, 16 disks per RAID set
1 unrecoverable bit error per 10^14 bits read

Now with only 8 disks per RAID set:

Double disk failure                     - 571 years MTTDL,  1.7% PDL10Y

Disk failure + unrecoverable read error - 71 years MTTDL, 13.1% PDL10Y

Finally 2 disks per RAID set i.e. mirroring:

Double disk failure                     - 2,283 years MTTDL, 0.44% PDL10Y

Disk failure + unrecoverable read error - 285 years MTTDL, 3.44% PDL10Y
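The PDL10Y percentages above are consistent with treating data loss as an exponentially distributed event, so that the probability of loss within t years is 1 - exp(-t / MTTDL). That relationship is an assumption here, not something quoted from the paper, and the function name pdl is made up for illustration. A minimal Python sketch:

```python
import math

def pdl(mttdl_years, horizon_years=10.0):
    """Probability of at least one data-loss event within the horizon.

    Assumes time to data loss is exponentially distributed with mean
    mttdl_years (an assumption, not taken from the spreadsheet).
    """
    return 1.0 - math.exp(-horizon_years / mttdl_years)

# MTTDL values (in years) are the double disk failure figures quoted above.
for label, mttdl in [("16 disks", 285), ("8 disks", 571), ("mirroring", 2283)]:
    print(f"{label}: {100 * pdl(mttdl):.2f}% over 10 years")
```

Because the input MTTDL values are already rounded, the output can differ from the quoted percentages in the last digit.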

Now the figures using a modern Seagate Cheetah 15K.4 Ultra SCSI 320 drive (http://www.seagate.com/docs/pdf/datasheet/disc/ds_cheetah15k.4.pdf)

Based on:

10 TB of data, 72 GB drives with 1,400,000 hrs MTTF
1 unrecoverable bit error per 10^15 sectors read

8 disks per RAID set:

Double disk failure                     -    20,173 years MTTDL, 0.050% PDL10Y

Disk failure + unrecoverable read error - 1,023,650 years MTTDL, 0.001% PDL10Y

2 disks per RAID set i.e. mirroring:

Double disk failure                     -    80,584 years MTTDL, 0.0124% PDL10Y

Disk failure + unrecoverable read error - 4,094,597 years MTTDL, 0.0002% PDL10Y
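For the "disk failure + unrecoverable read error" case, the sketch below assumes the formula has the form MTTDL = MTTF / (N x (1 - p^(G-1))), where N is the total number of disks including parity, G is the RAID set size and p is the probability of reading every sector on one disk without error. The 512-byte sector size, decimal gigabytes and the function name are all assumptions made for illustration; the spreadsheet may round differently.

```python
import math

HOURS_PER_YEAR = 24 * 365

def mttdl_read_error_years(data_bytes, disk_bytes, mttf_hours, group_size,
                           sector_error_prob, sector_bytes=512):
    """MTTDL (years) for one disk failure plus a read error during rebuild."""
    data_disks = data_bytes / disk_bytes
    # Total disks including one parity disk per RAID set (assumption).
    total_disks = data_disks * group_size / (group_size - 1)
    sectors_per_disk = disk_bytes / sector_bytes
    sectors_to_read = (group_size - 1) * sectors_per_disk
    # 1 - (1 - e)^sectors_to_read, written with log1p/expm1 because e can be
    # far smaller than floating-point epsilon.
    p_rebuild_fails = -math.expm1(sectors_to_read * math.log1p(-sector_error_prob))
    return mttf_hours / (total_disks * p_rebuild_fails) / HOURS_PER_YEAR

# Original paper parameters: 1 bit error per 10^14 bits, 512-byte sectors.
old_sector_error = 512 * 8 / 1e14
for g in (16, 8, 2):
    years = mttdl_read_error_years(500e9, 5e9, 200_000, g, old_sector_error)
    print(f"{g} disks per RAID set: {years:.0f} years")
```

With the Cheetah parameters (10e12 bytes of data, 72e9-byte drives, 1,400,000 hrs MTTF, a per-sector error probability of 1e-15), the same function lands within about 1% of the figures quoted above.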

All the calculations assume a mean time to repair (MTTR), i.e. the time to reconstruct a failed disk, of 1 hour and a correlated disk error factor of 10. These are the figures used in the original paper, so that we are always comparing "apples with apples" as far as possible.
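To show where the MTTR and the correlation factor enter, here is a minimal Python sketch of the double disk failure case. It assumes the formula MTTDL = MTTF x (MTTF / factor) / (N x (G-1) x MTTR), with N the total number of disks including parity and G the RAID set size. The function and parameter names are made up for illustration and are not taken from the spreadsheet.

```python
HOURS_PER_YEAR = 24 * 365

def mttdl_double_failure_years(data_disks, group_size, mttf_hours,
                               mttr_hours=1.0, correlation_factor=10.0):
    """MTTDL (years) for a second disk failing while the first is rebuilt."""
    # Total disks including one parity disk per RAID set (assumption).
    total_disks = data_disks * group_size / (group_size - 1)
    # The second failure is assumed to be correlation_factor times more likely.
    mttf_second = mttf_hours / correlation_factor
    mttdl_hours = (mttf_hours * mttf_second) / (
        total_disks * (group_size - 1) * mttr_hours)
    return mttdl_hours / HOURS_PER_YEAR

# Original paper parameters: 500 GB on 5 GB drives is 100 data disks.
for g in (16, 8, 2):
    years = mttdl_double_failure_years(100, g, 200_000)
    print(f"{g} disks per RAID set: {years:,.0f} years")
```

MTTDL scales inversely with MTTR in this formula, so a 10 hour rebuild instead of 1 hour would divide every double disk failure figure by 10.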

I've ignored the case of a system crash followed by a disk failure mentioned in the original paper, as that applies to software RAID and not to hardware RAID with non-volatile cache storage, as exists in all modern mid-range to high-end RAID solutions.

Also, I've not used the harmonic sum approach found in the original paper, as I was unable to work out exactly what was being done.

I hope some people find this useful. It helps to provide some science towards the question of RAID reliability and is certainly much better than my original statement along the lines of:

"you'd have to be very unlucky to suffer data loss with a modern RAID 5 solution"

I used a basic Excel spreadsheet to do the calculations, which I've put up on my company's website (to avoid clogging up the list server). If anyone is interested in looking further at the calculations or using different parameters, e.g. disk MTTF, then please download the spreadsheet and use it however you see fit.

http://www.christallize.com/download/diskfailurecalc.xls

Of course, I provide no warranty or support and accept no liability for whatever you may use the spreadsheet for.

Cheers,

Chris Dunscombe

Christallize Ltd

--
http://www.freelists.org/webpage/oracle-l

Received on Thu Jan 06 2005 - 04:33:29 CST