Re: sequential disk read speed

From: Brian Selzer <>
Date: Sat, 30 Aug 2008 15:32:48 -0400
Message-ID: <Wfhuk.19458$>

"David BL" <> wrote in message

> On Aug 30, 9:14 am, "Brian Selzer" <> wrote:

>> "David BL" <> wrote in message
>> >On Aug 29, 7:47 pm, "Brian Selzer" <> wrote:
>> >> "David BL" <> wrote in message
>> >>
>> [snip]
>> > > You're oversimplifying. With a stripe size of 64K, it is highly
>> > > unlikely that a leaf node will span more than one stripe; therefore,
>> > > it is highly unlikely for every drive to contribute to the reading of
>> > > every leaf node.
>> > I don't see how I'm oversimplifying.
>> > My point is that stripes need to be at least as coarse as the DBMS
>> > block size. Do you agree?
>> Yes, I think the stripe size should be a multiple of the DBMS block size.
>> > The choice of DBMS block size is another question entirely.
>> > > Also, you appear to be discounting concurrency, and environments
>> > > where
>> > > concurrency is important such as typical OLTP environments are where
>> > > technologies such as elevator seeking are most effective.
>> > Concurrency has nothing to do with the fact that if the stripe size is
>> > too small the seeking of the drives won't be independent.
>> That wasn't why I brought up concurrency. You dismissed elevator
>> seeking as an optimization mechanism with respect to the number of
>> seeks required by focusing on what appeared to be a single query.
>> Ultimately the same number of seeks will be required for a particular
>> query, but when combined with ninety-nine other queries, some of those
>> seeks can be shared with other queries, thus reducing the total number
>> of seeks required to satisfy the hundred.
> I'm not sure how I implied that elevator seeking isn't worthwhile.

You didn't, and I didn't say that you did. But the implication that elevator seeking doesn't affect the number of seeks, while true for a single query in isolation, is an oversimplification: in a concurrent environment, many of those seeks can be shared by other queries.
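A toy model makes the sharing concrete. (The block addresses, query counts, and the uniform-random access pattern below are made up for illustration; this is a sketch of SCAN-order request merging, not any DBMS's actual scheduler.)

```python
# Toy model of elevator (SCAN) seeking: requests from many concurrent
# queries are merged and serviced in one sweep across the platter, so a
# seek to a block wanted by several queries is paid only once.
import random

random.seed(1)
NUM_QUERIES = 100
BLOCKS_PER_QUERY = 10
DISK_BLOCKS = 1000

queries = [
    {random.randrange(DISK_BLOCKS) for _ in range(BLOCKS_PER_QUERY)}
    for _ in range(NUM_QUERIES)
]

# Serviced independently, each query pays for all of its own seeks.
seeks_independent = sum(len(q) for q in queries)

# Merged into one elevator sweep, each distinct block position is
# visited (and hence sought) at most once.
sweep = sorted(set().union(*queries))
seeks_shared = len(sweep)

print(seeks_independent, seeks_shared)
```

With a thousand requests spread over a thousand block positions, many positions are requested by more than one query, so the merged sweep needs noticeably fewer seeks than servicing the hundred queries independently.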

>> > > By the way, Oracle documentation states that an 8K block size is
>> > > optimal for most systems and defaults DB_FILE_MULTIBLOCK_READ_COUNT
>> > > to 8. 8K * 8 = 64K.
>> > > Interestingly, Sql Server uses 8K pages organized into 64K extents,
>> > > which happens to be the unit of physical storage allocation. Do you
>> > > know something they don't?
>> > Sql Server 6.5 used 2k pages and this changed to 8k pages in Sql
>> > Server 7.0 released in 1998. Do you expect that 64k extents are still
>> > optimal a decade later given that the product of transfer rate and
>> > seek time has been steadily increasing?
>> Sql Server 2008 still uses 8K pages and 64K extents. Also, the Oracle
>> documentation that cited an 8K block size as being optimal was for
>> their latest version, 11g.
>> I think that 64K extents are still optimal because the technology
>> employed for serializing updates is still locking, and in a concurrent
>> environment with an escalating locking hierarchy, a page size or
>> extent size that is too large will cause transactions to block more
>> often. Do you know of a sound and practicable alternative to an
>> escalating locking hierarchy for serializing updates?

> I think these systems lock pages not extents, and anyway locking
> granularity can in principle be orthogonal to the unit of I/O or unit
> of allocation.

I don't see how, since there is clearly a correlation between the size of each unit of I/O and the contention for what that unit contains.

>> > 64k blocks are generally too small on modern disks. A 64k block can
>> > be transferred in a tenth of the time it takes to seek to it.
>> Why is that a problem? Isn't it more efficient if I can satisfy the same
>> query by reading 100 64K blocks instead of 100 1M blocks?

> Yes, but a DBMS will often be able to satisfy a given query by reading
> fewer blocks.   For example full table scans are much more efficient
> with 1M blocks.  Also, if you increase the block size by 10x then the
> height of a B+Tree will tend to be smaller.  For example a tree of height
> 3 may be able to index a trillion rather than only a billion records.
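The fanout arithmetic behind that claim can be checked quickly under an assumed index entry size of roughly 100 bytes per key/pointer pair (an illustrative figure, not taken from any particular product):

```python
# Rough B+Tree capacity check: a tree of height h over blocks holding
# `fanout` entries each can index about fanout**h records.
# The 100-byte entry size is an illustrative assumption.
ENTRY_BYTES = 100

def capacity(block_bytes, height=3):
    fanout = block_bytes // ENTRY_BYTES
    return fanout ** height

print(capacity(64 * 1024))    # on the order of 10**8 records
print(capacity(1024 * 1024))  # on the order of 10**12 records
```

So under these assumptions, moving from 64K to 1M blocks does push a height-3 tree from the hundreds-of-millions range into the trillion range.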

I think that if the disk subsystem is sophisticated enough, the performance benefit of an increased block size is lost. For example, if the controller caches entire tracks to eliminate latency, then a larger block size would not improve performance one bit--in fact, it would tend to reduce it because of the vastly increased volume of data that must be transferred from the cache to RAM. In the example above, you would have to move 16 times as much data from the cache to RAM to answer the same query.
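The back-of-envelope arithmetic behind both sides of this exchange can be sketched as follows. (The drive figures--8 ms average seek and 100 MB/s sustained transfer--are illustrative assumptions for a circa-2008 disk, not measurements of any particular drive.)

```python
# Read-time model for 100 blocks at two block sizes.
# Assumed drive parameters (illustrative, circa 2008):
SEEK_MS = 8.0             # average seek + rotational latency
TRANSFER_MB_PER_S = 100.0 # sustained sequential transfer rate

def read_time_ms(num_blocks, block_bytes):
    transfer_ms = num_blocks * block_bytes / (TRANSFER_MB_PER_S * 1e6) * 1e3
    return num_blocks * SEEK_MS + transfer_ms

t_64k = read_time_ms(100, 64 * 1024)    # one 64K transfer (~0.66 ms) << one seek (8 ms)
t_1m = read_time_ms(100, 1024 * 1024)

# With seeks included, the 1M reads cost only about twice the 64K reads,
# but if track caching hides the seeks entirely, what remains is a 16x
# difference in data moved from cache to RAM.
data_ratio = (1024 * 1024) / (64 * 1024)
print(t_64k, t_1m, data_ratio)
```

Under these assumed figures, the transfer time of a single 64K block really is a small fraction of a seek, and the 16x data-volume penalty is exactly the factor at stake once caching removes the seeks.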

> The advantages of a larger block size are more apparent in a database
> storing data where there is a greater tendency for locality based on
> affinity to be useful.  For example, it would be rather silly to use
> 64k blocks to store multi-resolution terapixel images.

It would be equally silly to physically store multi-resolution terapixel images alongside scalar data that can be joined or restricted on. If necessary, place the image heap on a separate disk subsystem with a separate stripe size and depth, but store the scalar data on a disk subsystem with a stripe size optimal for computing joins and restrictions on it.
