Re: sequential disk read speed

From: Brian Selzer <>
Date: Thu, 28 Aug 2008 09:34:47 -0400
Message-ID: <hQxtk.19226$>

"David BL" <> wrote in message
> On Aug 28, 10:47 am, "Brian Selzer" <> wrote:
>> "David BL" <> wrote in message
>> > On Aug 24, 12:39 pm, "Brian Selzer" <> wrote:
>> >> If you have a 100GB database and you put it on a single 100GB disk
>> >> drive, your best average seek time is the average seek time of the
>> >> disk drive, but if you put the database on four 100GB disk drives,
>> >> the best average seek time will only be a fraction of the seek time
>> >> of the single disk. Suppose that the full-stroke seek time on the
>> >> 100GB disk is 7ms and the track-to-track seek time is 1ms. Well,
>> >> with four disks, instead of an average 4ms seek time, the individual
>> >> seek time of each disk is reduced to roughly 2.5ms
>> > Is this because less of the disk is actually being used so on a given
>> > platter the head doesn't have such a large range of tracks to move
>> > over?
>> Yes. And the bit density is generally greater at the outside of the
>> platter, so it generally takes fewer tracks to store the same information
>> there as opposed to near the center; consequently, simply dividing the
>> difference of the full-stroke seek and the track-to-track seek by four is
>> a perhaps overly conservative method of estimation. I want to stress that
>> this is not just a harebrained theory of mine: I've had significant
>> success using this mechanism to boost performance. In one application, by
>> installing a disk that was seven times the size required and creating a
>> partition on the outer edge of the disk, performance improved by over
>> 6000%: batch processes that had been taking over 25 hours to complete
>> were finishing in under 25 minutes.
> How do you explain a 60 fold increase?

Fewer and shorter seeks is my guess.
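
For reference, the arithmetic behind the figures in this exchange can be written out as a small sketch. The function names are made up for illustration. The subsystem figure (roughly .625ms, quoted further down) assumes the four drives seek fully independently, which is the point under dispute, and the 60-fold figure simply takes 25 hours and 25 minutes at face value.

```python
def per_disk_seek_ms(full_stroke_ms, track_to_track_ms, n_disks):
    """Estimate from the post: divide the difference between the full-stroke
    and track-to-track seek times by the number of disks."""
    return track_to_track_ms + (full_stroke_ms - track_to_track_ms) / n_disks


def subsystem_seek_ms(per_disk_ms, n_disks):
    """Effective seek time if all disks seek independently (an assumption)."""
    return per_disk_ms / n_disks


def speedup(before_minutes, after_minutes):
    """How many times faster the batch run became."""
    return before_minutes / after_minutes


per_disk = per_disk_seek_ms(7.0, 1.0, 4)   # 7ms full stroke, 1ms track-to-track
print("per-disk estimate (ms):", per_disk)
print("subsystem estimate (ms):", subsystem_seek_ms(per_disk, 4))
print("batch speedup (x):", speedup(25 * 60, 25))
```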

>> >> , and since there are four disks, the average seek
>> >> time for the disk subsystem is reduced to a quarter of that or roughly
>> >> .625ms.
>> > In order for the effective seek time to be reduced to a quarter the
>> > seeking must be independent. To achieve that I think the striping
>> > would need to be very coarse (eg 512kb or 1Mb).
>> Drives that support disconnection or some other command queueing mechanism
>> are all that is needed for seeking to be independent.
> If stripes are somewhat smaller than the DBMS block size, then every
> drive (in the RAID 0) will be involved in the reading of each and
> every DBMS block. No matter how you order those reads, each drive
> needs to read a large amount of scattered data and the head will seek
> around a lot. If that is the case then the only advantage arises
> from your previously mentioned reduction in the overall range of
> tracks over which the data resides on a given platter.
> Alternatively if the stripe size is larger then each drive will read a
> somewhat independent set of the DBMS blocks, and the effective seek
> time can be reduced assuming the DBMS is able to issue overlapping
> read requests for the DBMS blocks.

Your argument rests on the assumption that data is randomly distributed across the stripes on the disk and doesn't take into account the fact that a high-end caching controller eliminates rotational latency by reading an entire track at once. Isn't it true that there is a physical affinity between related data? Isn't it more likely that an index will occupy contiguous stripes than some random set--regardless of stripe size? Can you show that the number of tracks accessed by, say, 128 coarse stripe reads is any less than the number of tracks accessed by 1024 fine stripe reads?
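
The closing question can be made concrete with a rough sketch. Everything numeric here is an assumption chosen for illustration: a plain RAID 0 layout over four disks, a constant 512 KiB per track (real drives vary this by zone), and stripe sizes of 8 KiB and 64 KiB so that 1024 fine reads and 128 coarse reads cover the same 8 MiB of contiguous data. The function name is hypothetical.

```python
KIB = 1024


def tracks_touched(start, length, stripe_size, n_disks, track_size):
    """Return the set of (disk, track) pairs touched when reading
    `length` bytes from logical offset `start` on a simple RAID 0 array."""
    touched = set()
    first_stripe = start // stripe_size
    last_stripe = (start + length - 1) // stripe_size
    for s in range(first_stripe, last_stripe + 1):
        disk = s % n_disks            # stripes rotate across the disks
        row = s // n_disks            # position of this stripe on its disk
        stripe_base = s * stripe_size
        lo = max(start, stripe_base) - stripe_base
        hi = min(start + length, stripe_base + stripe_size) - stripe_base
        phys_lo = row * stripe_size + lo
        phys_hi = row * stripe_size + hi - 1
        for track in range(phys_lo // track_size, phys_hi // track_size + 1):
            touched.add((disk, track))
    return touched


data = 8 * KIB * KIB  # 8 MiB of contiguous data, e.g. an index
fine = tracks_touched(0, data, 8 * KIB, 4, 512 * KIB)     # 1024 stripe reads
coarse = tracks_touched(0, data, 64 * KIB, 4, 512 * KIB)  # 128 stripe reads
print(len(fine), len(coarse), fine == coarse)
```

Under these assumptions both stripe sizes touch the same set of tracks, because contiguous logical data ends up contiguous on each member disk whatever the stripe size. The comparison changes if the data is scattered rather than contiguous.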

>> I think using a coarse stripe is counterproductive. There would be a
>> bigger chance that a seek in the middle of the read would be required.
>> Consider: if 3.5 stripes fit on a track in one zone of the disk, then on
>> average every fourth read would require an additional seek to get the
>> remaining half stripe. If on the other hand, 28 stripes fit on a track,
>> then no additional seeks would be necessary. Even if it were 28.5
>> stripes instead of 28, one additional seek for every 29 reads is a whole
>> lot better than one for every 4.
> Firstly, hard-disks are quite good at stepping onto the next track in
> the manner normally used for very large "contiguous" reads or writes.

The best track-to-track seek time I've seen is 0.2ms for reads, 0.4ms for writes. That's phenomenal but can still add up.
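
A rough sketch of how those track steps add up, under a deliberately simple model: stripes are packed back to back, every track holds the same (possibly fractional) number of stripes, and each stripe that straddles a track boundary costs one extra 0.2ms step. The model and the function name are assumptions for illustration only.

```python
import math
from fractions import Fraction


def straddling_stripes(stripes_per_track, n_stripes):
    """Count stripes whose first and last byte fall on different tracks."""
    s = Fraction(stripes_per_track)   # exact arithmetic, e.g. "3.5" -> 7/2
    count = 0
    for i in range(n_stripes):
        first_track = i // s                      # track of the first byte
        last_track = math.ceil((i + 1) / s) - 1   # track of the last byte
        if last_track > first_track:
            count += 1
    return count


STEP_MS = 0.2  # best-case track-to-track read seek quoted above

for per_track in ("3.5", "28", "28.5"):
    n = 5700
    extra = straddling_stripes(per_track, n)
    print(per_track, "stripes/track:", extra, "extra steps in", n,
          "stripe reads, about", extra * STEP_MS, "ms")
```

Counted per stripe, this model gives one straddling stripe in seven at 3.5 stripes per track and one in 57 at 28.5, so the absolute rates come out lower than the per-read figures in the quoted text. The roughly eightfold gap between the coarse and fine cases is the same.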

> Secondly your analysis misses the point that coarser granularity
> stripes lead to fewer overall seeks, not more! Seeks per read is
> not a very useful stat.

Coarser granularity stripes lead to fewer overall reads, not necessarily fewer overall seeks--and not necessarily reduced overall seek time.

A finer granularity means more commands to be processed. More commands to be processed increases the likelihood that the read of one track will satisfy more than one command. More commands to be processed increases the likelihood that elevator seeking can be used to reduce overall seek time.
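
To illustrate the elevator point, here is a minimal sketch that compares total head travel for a queue of requests served in arrival order against the same queue served in one upward sweep followed by one downward sweep. The track numbers and starting position are made up, and head travel in tracks is only a stand-in for seek time; a real controller's scheduling is more involved.

```python
def head_travel(start, tracks):
    """Total distance (in tracks) the head moves serving requests in order."""
    position, total = start, 0
    for track in tracks:
        total += abs(track - position)
        position = track
    return total


def elevator_order(start, tracks):
    """Serve everything at or above the head on the way up, then sweep down."""
    upward = sorted(t for t in tracks if t >= start)
    downward = sorted((t for t in tracks if t < start), reverse=True)
    return upward + downward


queue = [98, 183, 37, 122, 14, 124, 65, 67]  # hypothetical queued requests
start = 50                                   # hypothetical head position

print("arrival order:", head_travel(start, queue))
print("elevator order:", head_travel(start, elevator_order(start, queue)))
```

The deeper the queue, the more requests there are to pick up along each sweep, which is the sense in which more outstanding commands can help.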

In fact, coarser stripes could lead to more overall seeks. Suppose, for example, that many of the stripes on disk are less than half populated with data--in much the same way that a FAT16 file system with a huge number of tiny files can fill up the disk even though the sum of the actual file sizes can be less than a quarter of the formatted capacity. Any seek that is needed in order to read the rest of a stripe when the rest of the stripe isn't populated with data would be unnecessary if a smaller stripe size were used. In much the same way, with a high-end processor, it is often possible to improve performance by setting the compressed attribute on a file. A compressed file typically occupies half the space of an uncompressed file, and with a high-end CPU, it can actually take less time to read and uncompress data than to read uncompressed data.
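
The compression trade-off can be sketched as a simple break-even calculation. The throughput figures are hypothetical, the 0.5 ratio is the "half the space" figure above, and the model assumes decompression happens after the read rather than overlapping with it.

```python
def plain_read_s(size_mb, disk_mb_per_s):
    """Time to read the file uncompressed."""
    return size_mb / disk_mb_per_s


def compressed_read_s(size_mb, ratio, disk_mb_per_s, decompress_mb_per_s):
    """Time to read the smaller compressed file, then decompress it.
    decompress_mb_per_s is measured in MB of output produced per second."""
    return size_mb * ratio / disk_mb_per_s + size_mb / decompress_mb_per_s


size, ratio, disk = 100.0, 0.5, 50.0   # hypothetical file size and disk rate

print("plain:", plain_read_s(size, disk))
print("fast CPU:", compressed_read_s(size, ratio, disk, 200.0))
print("slow CPU:", compressed_read_s(size, ratio, disk, 80.0))
```

In this model compression wins only when the decompression rate exceeds the disk rate divided by (1 - ratio), which is 100 MB/s for the numbers used here.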

> The following
> discusses the ideal stripe size and the formulae indicate that ~1Mb
> would be appropriate for a modern disk.

I am not convinced, given what I know and my experience with storage subsystems. I would have to read the papers Mike Ault vaguely refers to.

Received on Thu Aug 28 2008 - 15:34:47 CEST
