Re: sequential disk read speed

From: Brian Selzer <>
Date: Mon, 1 Sep 2008 09:00:13 -0400
Message-ID: <yHRuk.19573$>

"David BL" <> wrote in message

> >
> > > I'm not sure how I implied that elevator seeking isn't worthwhile.
> >
> > You didn't, and I didn't say that you did. But your implication that it
> > doesn't affect the number of seeks, while true, is an oversimplification
> > in
> > that it doesn't take into account that in a concurrent environment, many
> > of
> > those seeks can be shared by other queries.
> If seeks have a reasonable probability of being shared, doesnít that
> necessarily mean the size of memory is approaching the size of
> secondary storage, and a large read cache will achieve the same
> effect?
> Iíve heard of techniques like buffer trees that take buffering of
> reads and writes to an extreme level, allowing a system to get much
> closer to the theoretical optimum for I/O. However such techniques
> donít seem compatible with OLTP where transactions are using strict
> 2PL and need to be completed quickly in order release locks. I can
> see that MVCC will provide far more opportunity for long running read
> only transactions to share reads (and seeks), but I doubt whether it
> would allow much sharing of reads when the size of memory is only a
> tiny fraction of the total size of the database.

I think you would be surprised. In a typical OLTP database, only a tiny fraction of the total size of the database is being manipulated. Most accesses involve what happened yesterday and what is happening today. Most of the remainder involve what is happening this month and possibly what happened last month. For example, in a manufacturing system, those manufacturing orders that are currently running in the plant are accessed most often. Those orders that ran yesterday are the next most often queried, followed by those that have run so far this month and then last month. Only an occasional query will look at older information--perhaps looking at the last time a particular part was produced, for example.

A bigger block size will cause a lot of information to be transferred to memory that isn't required to answer any queries.


> > > I think these systems lock pages not extents, and anyway locking
> > > granularity can in principle be orthogonal to the unit of I/O or unit
> > > of allocation.
> >
> > I don't see how since there is clearly a correlation between the size of
> > each unit of I/O and the contention for what is on each unit of I/O.
> Locking doesnít even need to be page based. Are you aware of the
> distinction between latches and locks?

Yes, I am. A latch guarantees the integrity of a row during the reading of that row.

The overhead required for row-level locking can drastically reduce performance. For an update that touches a few rows, row-level locking is optimal. For an update that touches several thousand rows, locking pages may make more sense. For an update that touches millions of rows, extent locks or even table locks are more efficient. Another factor is the unit of I/O for the transaction logs. If the transaction log contains the pages that are different instead of just the rows that are different, then pages must be locked along with any rows, because if one transaction changes one row on a page, and another transaction changes another row on the same page, and if one of those transactions is rolled back, then the rollback could overwrite the change made by the transaction that will ultimately commit, or if both roll back, the database could end up having one of the changes recorded. It gets a lot more complicated when there are a lot of index pages recorded in the log as well as just rows, because indexes must also be locked, or at least invalidated until the updates propogate to them.

> > >> > 64k blocks are generally too small on modern disks. A 64k block
> > >> > can
> > >> > be transferred in a tenth of the time it takes to seek to it.
> >
> > >> Why is that a problem? Isn't it more efficient if I can satisfy the
> > >> same
> > >> query by reading 100 64K blocks instead of 100 1M blocks?
> >
> > > Yes, but a DBMS will often be able to satisfy a given query by reading
> > > fewer blocks. For example full table scans are much more efficient
> > > with 1M blocks. Also, if you increase the block size by 10x then the
> > > height of a B+Tree will tend to smaller. For example a tree of height
> > > 3 may be able to index a trillion rather than only a billion records.
> >
> > I think that if the disk subsystem is sophisticated enough, the
> > performance
> > benefit of an increased block size is lost. For example, if the
> > controller
> > caches entire tracks to eliminate latency, then a larger block size
> > would
> > not improve performance one bit--in fact, it would tend to reduce it
> > because
> > of the vastly increased volume of data that must be transferred from the
> > cache to RAM. In the example above, you would have to move more than 16
> > times as much data from the cache to RAM to answer the same query.
> I agree that if the disk subsystem caches a track then a smaller unit
> of I/O can be desirable.
> I think weíre talking past each other because youíre associating block
> with unit of I/O whereas Iím associating it more with the unit of
> allocation. As I see it, in the above youíre only comparing
> relatively small and uninteresting differences in read performance for
> a given allocation of data to sectors and tracks.
> Changing the unit of allocation has an enormous impact on how data is
> allocated by the DBMS. This in turn can have a huge impact on the
> number of seeks.
> With a very small unit of allocation, on a small contiguous area of
> the disk there can be allocations for many unrelated tables written by
> many unrelated transactions. In theory the sweet spot for the unit of
> allocation occurs when the time to seek is similar in magnitude to the
> time to transfer.

Interestingly, on a high-end 15k drive, revolutions take 4ms, and the average seek time is 2.9ms. Reading an entire track, therefore, should take at most 4ms. Based on comparisons of the specs for different size drives of the same series, I don't think more than one head is being read at the same time, but if they implemented that, it would be possible to read an entire cylinder in the time it takes to read a track, 4ms. But then, it might be cheaper to just add more drives.

> One can imagine more sophisticated allocators Ė eg that reserve large
> areas of the disk in ways that promote better clustering of related
> data. One can even imagine that multiple independent allocators (ie
> heaps) should be used within the DBMS. However the same effect can
> be achieved more simply and with less wastage of space by using a
> single allocator for the entire DBMS from which reasonably large
> blocks are always allocated. This can be compared to an in-memory
> programming environment that employs a single heap allocator for all
> threads, and wherever that would lead to high latency due to poor
> localisation of very small allocations, the programmer instead
> allocates a somewhat larger block and breaks it up into smaller pieces
> to meet the requirements of the smaller allocations.

It's a trade off: bigger allocation units would increase the throughput for scans but would reduce the throughput for seeks.

> I cannot see any good reason why a DBMS would skimp on the unit of
> allocation (eg 64k), particularly when it can be coarser than the unit
> of I/O (eg 8k). I wonder whether legacy of code base or backwards
> compatibility is rearing its ugly head?

Possibly. I would think, however, that given the competitiveness between Oracle and Microsoft, that if more speed could be achieved by increasing the allocation unit, then one of them would have done it. They may also be anticipating the development of lower cost solid-state drives, where seek times and latency disappear, and there a larger block size would be a disadvantage.

> > > The advantages of a larger block size are more apparent in a database
> > > storing data where there is a greater tendency for locality based on
> > > affinity to be useful. For example, it would be rather silly to use
> > > 64k blocks to store multi-resolution terra pixel images.
> >
> > It would be equally silly to physically store multi-resolution terra
> > pixel
> > images alongside scalar data that can be joined or restricted on. If
> > necessary, place the image heap on a separate disk subsystem with a
> > separate
> > stripe size and depth, but store the scalar data on disk subsystem with
> > a
> > stripe size optimal for computing joins and restrictions on it.
> I agree that that the optimal unit of I/O depends on the nature of the
> data and its usage patterns. I donít think multiple heaps is
> generally necessary.
Received on Mon Sep 01 2008 - 15:00:13 CEST

Original text of this message