Re: indexes vs sorted files in range and equality search

From: Jonathan Leffler <jleffler_at_earthlink.net>
Date: Thu, 06 Dec 2007 15:20:53 -0800
Message-ID: <13lh0uugcur4t24_at_corp.supernews.com>


xareon_at_gmail.com wrote:
> hi all, i got a question about the use of indexes and sorted files in
> range and equality searches. like you'll sure know
>
> B: number of data pages
> R: number of records per page
> D: (average) time to read or write disk page
>
> and let's suppose C (data matching time) and H (hashing time) as
> unimportant.
>
>
> theory shows us those simple values (let's take F>2):

At this point, F is undefined, so it is hard to know what the significance of F>2 is.

> sorted files: equality search: Dlog2(B) / range search: Dlog2B +
> matched pages

I'm not clear what your notation means. I think you've compressed two expressions onto one line:

Sorted files:
  Equality search time: D.log2(B)
  Range search time: D.log2(B) + matched pages

The slash is just a separator, not a division operator?

You seem to be hypothesizing random access by pages, and using a binary search algorithm? And what do you do with a data set that consists of one record with key 1, 2 million records with key 2, and one record with key 3, (with key values key 1 < key 2 < key 3) and you do an equality search for key 3? Are you looking for all records or just a representative record?

How is your range defined? If it is defined as KL (low key) to KH (high key), you could do a (binary) search for KL and then continue reading sequentially until you reach a value beyond KH (or reach KH if your range excluded the upper bound).

> unclustered tree index: equality search: D(1+logF(1.5B)) / range
> search: DlogF(1.5B) + matched *records*

We've got an undefined term in here - so the result is undefined, is it not?

In your terminology, what constitutes an unclustered tree index? One where the index is ordered (of course) but the data rows corresponding to the indexed values are not necessarily in the order of the index?

If so, then (give or take the caveats about duplicate versus unique keys mentioned before), the cost of the index searching depends on the depth of the index (number of index pages that must be read to find the leaf pages) plus the cost to read however many rows are matched by the equality or range. Depending on how many rows there are to read and on caching (or not) of the data pages, your cost is difficult to calculate; you need to give a lot more assumptions. An upper bound is one disk page read per row retrieved - but most systems would hope to reduce that cost by caching etc.

> clustered tree index: equality search: DogF(1.5B) / range search:
> DlogF(1.5B) + matched pages

Is your clustered index one where the rows corresponding to adjacent keys in index order are themselves on adjacent (or even the same) page?

> hash index: equality search: 2D / range search: BD

How is your range search defined again? If you are searching for values between KL and KH (which are distinct), then you have to do N = (KH - KL + 1) hash computations, one for each value that exists between KL and KH. For integers, that might not be too bad; for strings, it might be catastrophic. If, on the other hand, your 'range search' is for 'all values matching the same single key value', then you have a single access to get to the head of the list, plus a scan of some sort for the values that hash to the same value (not all of which need be the value you are looking for, of course -- hashes have collisions).

> so, i'd say, about a range search: clustered tree indexes are the
> best, then we have sorted files, then unclustered tree indexes
> (although they got a faster search in the tree because of the logF
> performance over the sorted log2 one, they require many I/O operations
> on different pages, whose time cost overwhelms the such logF
> benefits). the worst are hash indexes, totally useless for ranged
> searches.
>
> about an equality search, hash indexes are the best. then, we have
> clustered tree indexes (*STRICTLY* better than sorted files *only*
> because the logF trend over log2 one?), sorted files and unclustered
> tree indexes, being the worst because if an equality search has many
> relevant records, then its analysis becomes similar to the range
> search one, which shows that many I/O operations (one for record) for
> such index are required, since records could never be on the same
> page.
>
>
> is my analisys basically cannot-deny-that-so-oh-my-god-accomplished-
> correct? :)

No. Your analysis between sorted files and unclustered indexes has not taken into account size of record compared with size of key. Suppose you have a key size of 8 bytes and a record size of 256 bytes, and a page size of 1024 bytes. Suppose that pages are identified by an 8 byte quantity. And suppose the index is a unique index. You can store 64 key+value quantities on an index page, so even with an equality search, you'll read many fewer index pages with an unclustered index (and one random data page) compared with just 4 records per page in the sorted file. If your data set is any size at all, you'll need to read many more data pages in your sorted file to get to the relevant page than you will with an un-clustered index. Suppose there are 4096 records, and the index is perfectly constructed. Then you'll need 2 index page reads plus 1 data page read to get any row back. By contrast, on average, you'll need to read an average of 10 data pages in a sorted file. If I'd increased the number of records, the comparison would be more skewed; a quarter of a million records would require 3 or maybe 4 page reads via an index, compared with 16 reads in a sorted file. There are some sweeping approximations in that - like ignoring page overhead and identifying leaf nodes from interior nodes in the index and so on. But the calculation is indicative of the difference. And the difference between a clustered index and an unclustered index - under the assumption that the difference in definition is whether the rows are in the same order as the keys - is not that huge (especially when you consider the cost of updating a clustered index if a new value has to be inserted into the middle of the sorted order - or, worse, at the beginning of the sorted order; then you might have to rewrite a lot of data to keep everything in clustered - sorted - order!).

Also, with multi-level indexes and any caching, the upper levels of the tree are in cache (unless your cache is absurdly small). This gives indexes a bigger advantage.

If I had to characterize them, I'd say hash indexes are great for equality searches and lousy for anything else. Tree indexes are not bad for equality searches (though not as fast as a hash), but much better than hash for range searching. That's why B+Trees are so widely used in commercial DBMS.

-- 
Jonathan Leffler                   #include <disclaimer.h>
Email: jleffler_at_earthlink.net, jleffler_at_us.ibm.com
Guardian of DBD::Informix v2007.0914 -- http://dbi.perl.org/

publictimestamp.org/ptb/PTB-1963 ripemd128 2007-12-06 21:00:03
A7304ADE234A191551AE2987CF841EFD
Received on Fri Dec 07 2007 - 00:20:53 CET

Original text of this message