RE: Solid State Drives

From: Mark W. Farnham <mwf_at_rsiz.com>
Date: Sat, 2 May 2009 08:46:29 -0400
Message-ID: <F79772792ACE41279EE0EE8C64331DDA_at_rsiz.com>



We seem to have adopted an SSD==Flash assumption on this thread. Given the faster cost drop in flash than in other types of solid state memory, that may be appropriate. Still, there are other choices, and whether or not they remain economic going forward, it has long been the case that if you really needed isolated throughput to persistent storage for particular modest-sized objects (like redo logs in a high-update Oracle database, or UNDO space under a load of many transactions and long queries), then media choices superior to spinning rust were available. I suggest reading Kevin Closson's fine contributions on that topic to avoid being disappointed when the real achievable throughput improvement falls short of the ratio of service times between traditional disk farms and SSD. Kevin's analysis of where you hit other bottlenecks in the total throughput picture is spot on. His mission of debunking hyperbole in this area is, by my observation, scientifically complete.

I have long held that the biggest throughput-per-dollar improvement from selective use of SSD in an economic deployment is not the outright acceleration of i/o to the objects on the SSD (yes, Virginia, there is still acceleration; it is just bounded by other bottlenecks, not the magical-sounding ratio of device speeds where an address calculation replaces a mechanical seek), but rather the "deheating" of the rest of the disk farm. As spinning rust capacities have grown, the drop in cost per unit storage, i.e. dollars per terabyte, has been truly impressive, but the drop in cost per spindle has been much less so. So isolating a few spindles to segregate the really hot i/o from the rest of the farm is now often more expensive than segregating that hot i/o onto some flavor of SSD that meets or exceeds the mean time between failures of traditional disk.
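
To make that concrete, here is a rough back-of-the-envelope sketch in Python. Every price and size in it is a made-up placeholder, not a figure from this thread, so substitute your own vendor quotes.

# Back-of-the-envelope comparison of isolating hot i/o on dedicated spindles
# versus a small SSD. All prices and sizes below are hypothetical placeholders,
# not numbers from this thread -- plug in your own vendor quotes.

HOT_DATA_GB = 50                 # e.g. redo logs plus headroom (assumed)
COST_PER_SPINDLE = 300.0         # assumed cost of one enterprise HDD
SPINDLES_TO_ISOLATE = 4          # a small dedicated mirror/stripe set (assumed)
COST_PER_SSD_GB = 10.0           # assumed enterprise flash dollars per GB

spindle_cost = SPINDLES_TO_ISOLATE * COST_PER_SPINDLE
ssd_cost = HOT_DATA_GB * COST_PER_SSD_GB

print(f"Dedicated spindles for hot i/o: ${spindle_cost:,.0f}")
print(f"SSD sized to the hot objects:   ${ssd_cost:,.0f}")
# The point: the spindle route also removes those spindles' capacity and IOPS
# from the general farm, while the SSD route leaves the whole farm intact.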

Especially when paired with stripe-and-mirror-everything on the rest of the disk farm, removing this hot, interrupting i/o from the disk farm reduces service times for everything else. It also reduces wear and tear on the mechanical components of the traditional disk farm.
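
As a rough illustration of why pulling the hot i/o off helps everything else, here is a minimal sketch assuming a simple M/M/1 queue per spindle; the model and the rates are my assumptions, not anything measured on a real farm.

# Minimal M/M/1 sketch (an assumption for illustration, not from this post)
# showing why moving hot i/o off a spindle shortens service for what remains.

def mm1_response_time(arrival_rate, service_rate):
    """Mean response time W = 1 / (mu - lambda) for an M/M/1 queue."""
    assert arrival_rate < service_rate, "queue must be stable"
    return 1.0 / (service_rate - arrival_rate)

SERVICE_RATE = 150.0   # IOPS one spindle can sustain (assumed)
COLD_IO = 90.0         # background i/o rate against this spindle (assumed)
HOT_IO = 50.0          # redo/undo style hot i/o before it moves to SSD (assumed)

before = mm1_response_time(COLD_IO + HOT_IO, SERVICE_RATE)
after = mm1_response_time(COLD_IO, SERVICE_RATE)
print(f"Response time with hot i/o on the spindle: {before*1000:.1f} ms")
print(f"Response time after hot i/o moves to SSD:  {after*1000:.1f} ms")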

I think Tanel's analysis of the duration of wear-out is on target as to the shape of wear-out patterns for "flash" SSD. Knowing how much reserve is built into a given manufacturer's "flash" SSD offering, and whether it provides a routine utility to tell you how much reserve remains free, is also useful. If you track your peak i/o requirement and verify your free margin, you will have plenty of insurance in scheduling migration to new plexes. Unlike the crash of spinning rust, degradation of flash is incremental and can be watched.
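
For watching that reserve, something like the following works on a Linux host with smartmontools installed, assuming the drive exposes a wear attribute over SMART. The attribute names vary by vendor, so treat the ones below as examples to adapt rather than a guarantee.

# Rough monitoring sketch: list any wear-related SMART attributes smartctl
# reports for a drive. Attribute names below are examples; check your vendor.
import subprocess

WEAR_ATTRIBUTES = ("Media_Wearout_Indicator", "Wear_Leveling_Count",
                   "Percent_Lifetime_Remain")

def wear_indicators(device="/dev/sda"):
    """Return wear-related attribute lines from 'smartctl -A <device>'."""
    out = subprocess.run(["smartctl", "-A", device],
                         capture_output=True, text=True, check=False).stdout
    return [line for line in out.splitlines()
            if any(attr in line for attr in WEAR_ATTRIBUTES)]

if __name__ == "__main__":
    for line in wear_indicators():
        print(line)   # trend these values over time to schedule migration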

I confess I didn't read this entire thread, but I've consistently found Poder, Zito, Morle, and Closson to be speakers of the truth whose detailed experiments accurately predict the results when you apply the resulting logical suggestions to the construction of an Oracle Server complex. Apologies in advance to those I've left out.

I wonder how long it will be before the best economic solution for storage is completely non-mechanical. I just hope we don't "Rollerball" the 14th century. Probably not a concern; there will probably be 2.6 billion copies of everything, including all classified material, in the cloud, and on the moon and Mars, too.

Regards,

mwf

-----Original Message-----

From: oracle-l-bounce_at_freelists.org [mailto:oracle-l-bounce_at_freelists.org] On Behalf Of Tanel Poder
Sent: Friday, May 01, 2009 2:34 PM
To: jeremy.schneider_at_ardentperf.com; mzito_at_gridapp.com
Cc: andrew.kerber_at_gmail.com; dofreeman_at_state.pa.us; oracle-l_at_freelists.org
Subject: RE: Solid State Drives

Well, even without wear levelling and copy-on-write, assuming you have loaded the SSD 100% full with redo logs only, you could still write into this redo log space many times over.

So let's do an exercise, assuming that:

  1. you have 8 x 1 GB redo logs for a database on an SSD
  2. it takes 1 hour to fill these 8 GB of logs ( 8 GB per hour = 192 GB per 24 hours ) - so you will be writing to the same block once per hour (twice for the redo header block though, as it's updated when the log fills up to mark the end SCN in the file)
  3. it's possible to write to an SSD disk block "only" 100,000 times

So, if you write to a block at most 2 times per hour, it would still take 50,000 hours to reach that limit - still over 5 years.
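
The same arithmetic as a short Python sketch, using the assumptions from the list above:

# Reproducing the arithmetic above: 8 x 1 GB logs filled once per hour,
# ~100,000 writes per flash block, header-like blocks written twice per cycle.

REDO_GB = 8                      # total online redo on the SSD
HOURS_PER_CYCLE = 1              # time to fill and wrap around all 8 GB
WRITES_PER_BLOCK_LIMIT = 100_000 # assumed flash endurance per block
WRITES_PER_BLOCK_PER_CYCLE = 2   # worst case: the header block is written twice

gb_per_day = REDO_GB * 24 / HOURS_PER_CYCLE
hours_to_limit = WRITES_PER_BLOCK_LIMIT / WRITES_PER_BLOCK_PER_CYCLE
years_to_limit = hours_to_limit / 24 / 365

print(f"Redo generated per day: {gb_per_day:.0f} GB")       # 192 GB
print(f"Hours until the limit:  {hours_to_limit:,.0f}")     # 50,000 hours
print(f"That is about {years_to_limit:.1f} years")          # ~5.7 years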

The controlfiles and temp tablespace files (and sometimes undo) see many more writes to the "same" blocks than redo logs do, so they would be the first ones hitting problems :)

But there IS wear levelling, which avoids writing to the same physical blocks too often by physically writing somewhere else and updating the virtual-to-physical location translation table. Much depends on the algorithm used...
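
As a toy illustration of that translation-table idea (not any vendor's actual algorithm), a simplified wear leveller might look like this:

# Toy wear leveller: repeated writes to the same logical block land on
# different physical blocks, tracked via a logical-to-physical mapping.

class ToyWearLeveler:
    def __init__(self, physical_blocks):
        self.free = list(range(physical_blocks))   # unused physical blocks
        self.mapping = {}                          # logical -> physical
        self.write_counts = [0] * physical_blocks

    def write(self, logical_block):
        old = self.mapping.get(logical_block)
        new = self.free.pop(0)                     # take the next free block
        self.mapping[logical_block] = new
        self.write_counts[new] += 1
        if old is not None:
            self.free.append(old)                  # old copy is free again

ssd = ToyWearLeveler(physical_blocks=8)
for _ in range(100):
    ssd.write(logical_block=0)                     # hammer one logical block
print(ssd.write_counts)                            # wear spreads across blocks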

Regarding whether mirrored SSDs would wear out at the same time - probably not, as the number of writes before wearing out is not some fixed discrete number; it will probably vary quite a lot. And you would not want to wait until these disks fail anyway, but rather replace them before a known "expire time". And this expire time would be measured in number of write operations rather than wall clock time.
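
A quick sketch of budgeting by writes instead of wall clock time; the endurance rating and write rates below are placeholders to replace with your drive's spec sheet and your own monitoring:

# Estimate replacement timing from write volume, not calendar age.
# All three inputs are assumed placeholders, not values from this thread.

RATED_TBW = 300.0        # terabytes-written rating from the spec sheet (assumed)
host_writes_tb = 120.0   # total writes so far, read from the drive (assumed)
tb_written_per_day = 0.4 # measured from your own monitoring (assumed)

remaining_tb = RATED_TBW - host_writes_tb
days_left = remaining_tb / tb_written_per_day
print(f"Plan replacement within roughly {days_left:.0f} days at the current write rate")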

--

Regards,
Tanel Poder
http://blog.tanelpoder.com

--

http://www.freelists.org/webpage/oracle-l
