Re: sequential disk read speed

From: Brian Selzer <brian_at_selzer-software.com>
Date: Fri, 29 Aug 2008 07:47:54 -0400
Message-ID: <LlRtk.38668$ZE5.1987_at_nlpi061.nbdc.sbc.com>


"David BL" <davidbl_at_iinet.net.au> wrote in message news:f6ef3678-c7e9-4cd2-acaf-13cac28819d6_at_a1g2000hsb.googlegroups.com...
>On Aug 28, 9:34 pm, "Brian Selzer" <br..._at_selzer-software.com> wrote:
>> "David BL" <davi..._at_iinet.net.au> wrote in message
>>
>> news:40d67c8b-d516-4721-a52d-20579c2ca9ac_at_r35g2000prm.googlegroups.com...
>>
>> > On Aug 28, 10:47 am, "Brian Selzer" <br..._at_selzer-software.com> wrote:
>> >> "David BL" <davi..._at_iinet.net.au> wrote in message
>>
>> >>news:b3a7632f-de18-46e8-8ce3-3c5aaf83d4b9_at_a3g2000prm.googlegroups.com...
>>
>>
>> >> >> , and since there are four disks, the average seek time for the
>> >> >> disk subsystem is reduced to a quarter of that, or roughly 0.625ms.
>>
>> >> > In order for the effective seek time to be reduced to a quarter the
>> >> > seeking must be independent. To achieve that I think the striping
>> >> > would need to be very coarse (e.g. 512KB or 1MB).
>>
>> >> Drives that support disconnection or some other command queueing
>> >> mechanism are all that is needed for seeking to be independent.
>>
>> > If stripes are somewhat smaller than the DBMS block size, then every
>> > drive (in the RAID 0) will be involved in the reading of each and
>> > every DBMS block. No matter how you order those reads, each drive
>> > needs to read a large amount of scattered data and the head will seek
>> > around a lot. If that is the case then the only advantage arises
>> > from your previously mentioned reduction in the overall range of
>> > tracks over which the data resides on a given platter.
>>
>> > Alternatively if the stripe size is larger then each drive will read a
>> > somewhat independent set of the DBMS blocks, and the effective seek
>> > time can be reduced assuming the DBMS is able to issue overlapping
>> > read requests for the DBMS blocks.
>>
>> Your argument rests on the assumption that data is randomly distributed
>> in the stripes on the disk and doesn't take into account the fact that a
>> high-end caching controller eliminates latency by reading an entire track
>> at once. Isn't it true that there is a physical affinity between related
>> data? Isn't it more likely that an index will occupy contiguous stripes
>> than some random set--regardless of stripe size? Can you show that the
>> number of tracks accessed by, say, 128 coarse stripe reads is any less
>> than the number of tracks accessed by 1024 fine stripe reads?
>
>Yes, sometimes the DBMS manages to cluster all the necessary data so
>there is very little seeking required, and in that case it won’t
>matter what stripe size is used.
>
>However, that is not always possible. For example consider a B+Tree
>on 1 billion records and in a short period of time the DBMS needs to
>read 100 records for given index values that are effectively at random
>with respect to the ordering on that data type. To keep it simple
>ignore the reading of the internal nodes of the B+Tree. Typically
>those 100 records will appear in roughly 100 different leaf nodes of
>the B+Tree. Furthermore due to the sheer size of the overall data
>those leaf nodes will tend to reside on different tracks. The
>unfortunate reality is that it isn’t possible to read these records
>without a lot of head seeking, even if the reads are ordered according
>to track position (i.e. elevator seeking). Now if RAID0 is used and the
>stripes are smaller than the B+Tree leaf nodes, then every drive will
>need to contribute to the reading of every leaf node. Each drive can
>read the stripes in any order it likes but it won’t avoid the fact
>that each drive performs ~100 seeks. If instead, each B+Tree leaf
>node resides in a single stripe (and therefore on a single drive) then
>with four drives in the RAID0, each drive will only need to perform
>~25 seeks.
>

You're oversimplifying. With a stripe size of 64K, it is highly unlikely that a leaf node will span more than one stripe; therefore, it is highly unlikely that every drive will contribute to the reading of every leaf node. Also, you appear to be discounting concurrency. Environments where concurrency matters, such as typical OLTP systems, are exactly where techniques such as elevator seeking are most effective.
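
To make the stripe-size point concrete, here is a rough Python sketch. It is only a model under assumptions I am choosing for illustration: leaf nodes are 8K and aligned on 8K boundaries, stripes are laid out round-robin across the drives, and every drive touched by a read pays one seek for it. The function names are made up; they do not come from any product.

```python
import random

def drives_touched(offset, length, stripe_size, n_drives):
    """Return the set of RAID 0 member drives holding any byte of the read.

    Assumes simple round-robin striping: stripe k lives on drive k % n_drives.
    """
    first = offset // stripe_size
    last = (offset + length - 1) // stripe_size
    return {s % n_drives for s in range(first, last + 1)}

def seeks_per_drive(n_reads, leaf_size, stripe_size, n_drives, seed=0):
    """Count, per drive, how many of n_reads random leaf reads involve it.

    Assumes leaf nodes are aligned on leaf_size boundaries and that each
    drive touched by a read performs one seek for that read.
    """
    rng = random.Random(seed)
    counts = [0] * n_drives
    for _ in range(n_reads):
        leaf_no = rng.randrange(10**9)  # hypothetical leaf position
        offset = leaf_no * leaf_size
        for d in drives_touched(offset, leaf_size, stripe_size, n_drives):
            counts[d] += 1
    return counts

if __name__ == "__main__":
    K = 1024
    print("2K stripes: ", seeks_per_drive(100, 8 * K, 2 * K, 4))
    print("64K stripes:", seeks_per_drive(100, 8 * K, 64 * K, 4))
```

Under those assumptions, 2K stripes make every 8K leaf read touch all four drives, so each drive sees all 100 reads. With 64K stripes an aligned 8K leaf always falls inside a single stripe, so the 100 reads are divided among the four drives rather than repeated on each.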

By the way, Oracle documentation states that an 8K block size is optimal for most systems and defaults DB_FILE_MULTIBLOCK_READ_COUNT to 8. 8K * 8 = 64K. Interestingly, SQL Server uses 8K pages organized into 64K extents, which happen to be the unit of physical storage allocation. Do you know something they don't?

>> >> I think using a coarse stripe is counterproductive. There would be a
>> >> bigger chance that a seek in the middle of the read would be required.
>> >> Consider: if 3.5 stripes fit on a track in one zone of the disk, then on
>> >> average every fourth read would require an additional seek to get the
>> >> remaining half stripe. If, on the other hand, 28 stripes fit on a track,
>> >> then no additional seeks would be necessary. Even if it were 28.5
>> >> stripes instead of 28, one additional seek for every 29 reads is a whole
>> >> lot better than one for every 4.
>>
>> > Firstly, hard-disks are quite good at stepping onto the next track in
>> > the manner normally used for very large "contiguous" reads or writes.
>>
>> The best track-to-track seek time I've seen is 0.2ms for reads, 0.4ms for
>> writes. That's phenomenal but can still add up.
>
>It’s insignificant when reading or writing 1MB at a time.
>
Received on Fri Aug 29 2008 - 13:47:54 CEST
