Re: B-Tree's and internal DBMS implementations

From: Bob M. Lee <rmleeusa_at_netscape.net>
Date: Wed, 1 Jan 2003 02:50:25 -0800
Message-ID: <v15hv1qkdft8c0_at_corp.supernews.com>


"John" <JonJones(NOSPAM)_at_hotmail.com> wrote in message news:auqvhv$3ng$1_at_news8.svr.pol.co.uk...
> Hi all.
>
> I'am somewhat curious about the B-tree algorithms and how RDBMS us them
> successfully. My question is, what is stored at the leaf level? is it
memory
> addresses for the set that matches a point query?

Do not confuse a B-tree (as an in-memory data structure) with what an RDBMS is doing with paged disk blocks mapped to a SQL SORT INDEX.

To an RDBMS engine, a query result-set will contain a set of symbolic database keys - not memory addresses.

The symbolic database keys are then handled by the RDBMS page IO routines to retrieve data into memory buffers,
when it is appropriate (depending on size of query cursor, available memory, size of result-set, available process pages, etc.). The symbolic database keys allow the actual data to be opportunistically scattered on the available disk volumes, by providing a level of indirection between the logical location of the data rows and the physical location of the data rows on-disk.

A SQL "SORT INDEX" is an on-disk structure that approximates a B-tree - but it will do this by using symbolic database keys to point to data rows, instead of memory address pointers, or actual block numbers, to locate data rows on-disk.

> I assume a relation that has 3 indexes would have 3 corresponding B-trees
for each of the indexes?
>

There are many kinds of indices: An on-disk SQL "SORT INDEX" would closely approximate a B-tree data structure.
If the table had (3) sort indices - then (3) distinct B-tree-like on-disk structures are being navigated by the RDBMS engine.

> How does the B-Tree deal with the LIKE clause, does that then cause a
linear
> search through the tree?
>

Interesting question: The SQL "LIKE" clause can cause full sort index scans if the data column being queried isn't indexed efficiently, or isn't indexed at all.

If the leaf nodes of the sort index contain specific sets of data values for the indexed data columns, and each leaf node has a database key to a specific data row, then the upper level sort index nodes can be said to span ranges of data values for the indexed data columns.

Depending on the datatype of the column in the "LIKE" clause, the query engine can opportunistically decide to use natural datatype range ordering to limit the range of the index scan. Since "LIKE" is most frequently used to perform string pattern matches, as a built-in GREP engine, it has been reported that some academic query engines actually implement GREP-like code to improve the speed of matching string data values by borrowing from the PERL source code.

The good news is that the allowed SQL datatypes are not infinite - so the datatype pattern matching rule implementation is simply a matter of taking time to worry the details.

The bad news is that SQL does not limit the number of rows in a table. As an RDBMS designer, a single in-memory B-tree can not be expected to be sufficient when constructing a SQL "SORT INDEX" on a table with over a thousand billion rows of data.
Careful, repeatable, subdivision of the SORT INDEX build is required to meet the constraints of available virtual memory and IO bandwidth.

Remember: A SQL "LIKE" verb in a SELECT statement - does not, in and of itself, require the return of a query result-set that is in any data-value sort order whatsover. An "ORDER BY" or "GROUP BY" clause would be required to force ordering by data column values, and the query engine may use this as a strong hint to look for symbolic database keys in a specific index based on the available indexed data columns.

> Thanks, Jon
>
>

Hope this helps,

    /b Received on Wed Jan 01 2003 - 11:50:25 CET

Original text of this message