Re: sequential disk read speed

From: David BL <davidbl_at_iinet.net.au>
Date: Fri, 29 Aug 2008 07:23:35 -0700 (PDT)
Message-ID: <6463ace9-eddf-4889-8e65-17d070220a94_at_t1g2000pra.googlegroups.com>


On Aug 29, 7:47 pm, "Brian Selzer" <br..._at_selzer-software.com> wrote:
> "David BL" <davi..._at_iinet.net.au> wrote in message
>
> news:f6ef3678-c7e9-4cd2-acaf-13cac28819d6_at_a1g2000hsb.googlegroups.com...
>
> >On Aug 28, 9:34 pm, "Brian Selzer" <br..._at_selzer-software.com> wrote:
> >> "David BL" <davi..._at_iinet.net.au> wrote in message
>
> >>news:40d67c8b-d516-4721-a52d-20579c2ca9ac_at_r35g2000prm.googlegroups.com...
>
> >> > On Aug 28, 10:47 am, "Brian Selzer" <br..._at_selzer-software.com> wrote:
> >> >> "David BL" <davi..._at_iinet.net.au> wrote in message
>
> >> >>news:b3a7632f-de18-46e8-8ce3-3c5aaf83d4b9_at_a3g2000prm.googlegroups.com...
>
> >> >> , and since there are four disks, the average seek time for the
> >> >> disk subsystem is reduced to a quarter of that or roughly .625ms.
>
> >> >> > In order for the effective seek time to be reduced to a quarter the
> >> >> > seeking must be independent. To achieve that I think the striping
> >> >> > would need to be very coarse (eg 512kb or 1Mb).
>
> >> >> Drives that support disconnection or some other command queueing
> >> >> mechanism are all that is needed for seeking to be independent.
>
> >> > If stripes are somewhat smaller than the DBMS block size, then every
> >> > drive (in the RAID 0) will be involved in the reading of each and
> >> > every DBMS block. No matter how you order those reads, each drive
> >> > needs to read a large amount of scattered data and the head will seek
> >> > around a lot. If that is the case then the only advantage arises
> >> > from your previously mentioned reduction in the overall range of
> >> > tracks over which the data resides on a given platter.
>
> >> > Alternatively if the stripe size is larger then each drive will read a
> >> > somewhat independent set of the DBMS blocks, and the effective seek
> >> > time can be reduced assuming the DBMS is able to issue overlapping
> >> > read requests for the DBMS blocks.
>
> >> Your argument rests on the assumption that data is randomly
> >> distributed in the stripes on the disk and doesn't take into account
> >> the fact that a high-end caching controller eliminates latency by
> >> reading an entire track at once. Isn't it true that there is a
> >> physical affinity between related data? Isn't it more likely that an
> >> index will occupy contiguous stripes than some random set--regardless
> >> of stripe size? Can you show that the number of tracks accessed by
> >> say, 128 coarse stripe reads is any less than the number of tracks
> >> accessed by 1024 fine stripe reads?
>
> >Yes, sometimes the DBMS manages to cluster all the necessary data so
> >there is very little seeking required, and in that case it won’t
> >matter what stripe size is used.
>
> >However, that is not always possible. For example consider a B+Tree
> >on 1 billion records and in a short period of time the DBMS needs to
> >read 100 records for given index values that are effectively at random
> >with respect to the ordering on that data type. To keep it simple
> >ignore the reading of the internal nodes of the B+Tree. Typically
> >those 100 records will appear in roughly 100 different leaf nodes of
> >the B+Tree. Furthermore due to the sheer size of the overall data
> >those leaf nodes will tend to reside on different tracks. The
> >unfortunate reality is that it isn’t possible to read these records
> >without a lot of head seeking, even if the reads are ordered according
> >to track position (ie elevator seeking). Now if RAID0 is used and the
> >stripes are smaller than the B+Tree leaf nodes, then every drive will
> >need to contribute to the reading of every leaf node. Each drive can
> >read the stripes in any order it likes but it won’t avoid the fact
> >that each drive performs ~100 seeks. If instead, each B+Tree leaf
> >node resides in a single stripe (and therefore on a single drive) then
> >with four drives in the RAID0, each drive will only need to perform
> >~25 seeks.
>
> You're oversimplifying. With a stripe size of 64K, it is highly unlikely
> that a leaf node will span more than one stripe; therefore, it is highly
> unlikely for every drive to contribute to the reading of every leaf node.

I don't see how I'm oversimplifying.

My point is that stripes need to be at least as coarse as the DBMS block size. Do you agree?

The choice of DBMS block size is another question entirely.
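
To make the seek counts concrete, here's a rough back-of-envelope sketch
in Python. The numbers (4 drives, 100 scattered leaf reads, 64k leaf
nodes, 16k vs 512k stripes) are my own illustrative assumptions, not
measurements:

    NUM_DRIVES = 4
    NUM_READS = 100          # scattered B+Tree leaf nodes to fetch
    BLOCK_SIZE = 64 * 1024   # DBMS block (leaf node) size in bytes

    def seeks_per_drive(stripe_size):
        # Approximate seeks each drive performs for NUM_READS
        # scattered block reads on a RAID 0 array.
        if stripe_size < BLOCK_SIZE:
            # Each block spans every drive, so every drive must help
            # read every block: the seeking is not independent.
            return NUM_READS
        # Each block lies wholly on one drive, so the reads spread
        # roughly evenly across the array.
        return NUM_READS / NUM_DRIVES

    for stripe in (16 * 1024, 512 * 1024):
        print("stripe %4dk -> ~%.0f seeks per drive"
              % (stripe // 1024, seeks_per_drive(stripe)))

With 16k stripes every drive performs ~100 seeks; with 512k stripes each
performs only ~25, which is where the quarter figure comes from.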

> Also, you appear to be discounting concurrency, and environments where
> concurrency is important such as typical OLTP environments are where
> technologies such as elevator seeking are most effective.

Concurrency doesn't change the fact that if the stripe size is too small, the seeking of the drives won't be independent.

> By the way, Oracle documentation states that an 8K block size is optimal for
> most systems and defaults DB_FILE_MULTIBLOCK_READ_COUNT to 8. 8K * 8 = 64K.
> Interestingly, Sql Server uses 8K pages organized into 64K extents, which
> happens to be the unit of physical storage allocation. Do you know
> something they don't?

Sql Server 6.5 used 2k pages; this changed to 8k pages in Sql Server 7.0, released in 1998. Do you expect that 64k extents are still optimal a decade later, given that the product of transfer rate and seek time has been steadily increasing?
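
As a rough illustration of that trend (the drive figures here are my
ballpark assumptions, not vendor specs):

    # Break-even block size (transfer time = average seek time) scales
    # with transfer rate x seek time. Assumed figures: ~10MB/s and
    # ~10ms seeks in 1998, ~80MB/s and ~8ms seeks in 2008.
    for year, rate_mb_s, seek_ms in ((1998, 10, 10), (2008, 80, 8)):
        kb_per_seek = rate_mb_s * seek_ms   # (MB/s) * ms = kB
        print("%d: ~%dk transferred in one average seek"
              % (year, kb_per_seek))

On those assumptions, the block size at which transfer time catches up
with seek time has grown from the order of 100k to the order of 640k
over the decade.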

64k blocks are generally too small on modern disks: a 64k block can be transferred in about a tenth of the time it takes to seek to it.
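
To check that ratio with the same assumed figures: at ~80MB/s sustained, a 64k block transfers in 64/80000 s, about 0.8ms, against an average seek of roughly 8ms, i.e. ten times longer.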
