RE: Solid State Drives

From: Matthew Zito <mzito_at_gridapp.com>
Date: Fri, 1 May 2009 13:58:55 -0400
Message-ID: <C0A5E31718FC064A91E9FD7BE2F081B10201B7B3_at_exchange.gridapp.com>


Well, so, unfortunately today is a little too busy for me to go back through and track down some of the really good nuts-and-bolts level discussions that have been going on in the storage community about the lifespan of SSDs vs. traditional hard drives. I'll see if I have some time on the plane this weekend to pull together a summary and post it to the list.

However, as I recall, there are a couple of things people are doing to minimize the issue, above and beyond vanilla wear leveling:
- Reserved blocks - all of the high-end SSD devices keep a percentage of reserved blocks that are not visible to the OS as usable capacity. As blocks begin to fail, they can be seamlessly remapped to the reserved pool. This mitigates the edge case where some blocks start failing earlier than others, thanks to the magic of manufacturing defects or other fun.
- Write cache - most of the folks implementing SSD support are tweaking their caching algorithms to manage the write process. For example, a filesystem with a 4KB blocksize and a database with an 8KB blocksize may sit on an SSD that uses 128KB blocks internally. A naive implementation would let a user sequentially write 16 8KB blocks without realizing that this generates 16 write cycles on the same 128KB block inside the SSD. Arrays are now batching writes to match the blocksize of the SSD, even in high-I/O environments where cache pressure would normally force an immediate destage (see the sketch after this list).
- Drive-level write cache - the better SSDs have a much larger write cache than a traditional disk, protected by battery backup in case of power failure. This way, high-volume or write-read-overwrite patterns don't necessarily need to generate an immediate write cycle on a given block.
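
To make the batching point concrete, here's a rough back-of-the-envelope sketch in Python. This isn't any vendor's actual firmware or array code - the 8KB and 128KB figures are just the ones from the example above - it only counts how many write cycles land on one internal SSD block with and without coalescing:

# Why batching writes up to the SSD's internal 128KB block size reduces wear.
# Numbers are the illustrative ones from the example above, not real specs.

DB_BLOCK = 8 * 1024
SSD_BLOCK = 128 * 1024
WRITES_PER_SSD_BLOCK = SSD_BLOCK // DB_BLOCK   # 16

def naive_write_cycles(num_db_writes):
    # each 8KB write triggers its own read-modify-write of the 128KB block
    return num_db_writes

def batched_write_cycles(num_db_writes):
    # writes are held in cache until a full 128KB block can be destaged;
    # a trailing partial block still costs one cycle (ceiling division)
    return -(-num_db_writes // WRITES_PER_SSD_BLOCK)

if __name__ == "__main__":
    n = 16   # the sixteen sequential 8KB writes from the example above
    print("naive  :", naive_write_cycles(n), "write cycles on that SSD block")    # 16
    print("batched:", batched_write_cycles(n), "write cycle on that SSD block")   # 1

Sixteen cycles vs. one on the same flash block, for the identical logical workload - which is the whole argument for matching destage size to the SSD's internal blocksize.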

The other thing to consider is that even if you have 10GB of active redo logs, the odds are high that the array you're using will be striping that write workload across at least two much larger (several hundred GB) devices. In addition, since there's no rotational latency penalty, it is very feasible to mix workloads, such that there's 20GB allocated for the redo logs, then 150GB for an archive log dest, and the writes will get leveled across the whole set of flash chips inside the drive.
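
As a rough illustration of what leveling across the whole device buys you - and to be clear, the capacity, endurance, and write-rate numbers below are made up for the example, not specs for any particular drive:

# Back-of-the-envelope wear estimate: a small, hot redo area whose writes are
# leveled across a whole large SSD lasts far longer than if the same writes
# were confined to the 10GB the redo logs actually occupy.
# All figures are assumptions for illustration only.

drive_capacity_gb = 256         # one of the "several hundred GB" devices
pe_cycles_per_block = 100000    # assumed endurance rating per flash block
redo_write_rate_mb_s = 10       # assumed sustained redo write rate
hot_area_gb = 10                # the active redo logs

seconds_per_year = 365 * 24 * 3600
gb_written_per_year = redo_write_rate_mb_s * seconds_per_year / 1024.0

def years_to_wear_out(leveled_over_gb):
    # total data the flash can absorb before every block in that region
    # hits its program/erase limit, divided by the yearly write volume
    total_writable_gb = leveled_over_gb * pe_cycles_per_block
    return total_writable_gb / gb_written_per_year

print("writes confined to the 10GB redo area: %.1f years" % years_to_wear_out(hot_area_gb))
print("writes leveled over the whole 256GB:   %.1f years" % years_to_wear_out(drive_capacity_gb))

With those (invented) numbers it's roughly 3 years vs. 80+ years - the exact figures don't matter, the point is that wear leveling over the whole device changes the math by more than an order of magnitude.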

Finally, the reality is that two traditional disks dedicated to high-I/O-volume redo logs will fail faster than disks serving a traditional mixed workload anyway. And while your failure rates may be higher in those scenarios, the same truths about hot spares, etc. apply as in "vanilla" environments. Your vendor will eat some of the cost of higher failure rates in certain environments, same as they do today with regular disks.

Thanks,
Matt

-----Original Message-----
From: Jeremy Schneider [mailto:jeremy.schneider_at_ardentperf.com]
Sent: Friday, May 01, 2009 1:24 PM
To: Matthew Zito
Cc: andrew.kerber_at_gmail.com; dofreeman_at_state.pa.us; oracle-l_at_freelists.org
Subject: Re: Solid State Drives

Matthew Zito wrote:
>
> As far as the upgrade path, the lifespan is comparable for a "spinning
> rust" hard drive.
>

I'm curious whether this is actually true? (What is it based on?) I would think that lifespan would depend on I/O patterns (because of hardware wear leveling) -- and filesystem vs. redo logs could have very different access patterns. In particular, redo could easily pound every single block on a smaller SSD (so wear leveling becomes fairly meaningless), which is rather different from a filesystem where some blocks may not get accessed that heavily. I'm not sure one way or the other, just something I've been wondering about.

Similarly, if you mirrored two of them for redo, then isn't there a high likelihood that they would wear out around the same time?

-Jeremy

-- 
Jeremy Schneider
Chicago, IL
http://www.ardentperf.com

--
http://www.freelists.org/webpage/oracle-l