Re: B+ Tree and its concept

From: Larry Coon <lcnospam_at_assist.org>
Date: Mon, 09 Oct 2006 10:46:32 -0700
Message-ID: <452A8AF8.5A2_at_assist.org>


accpactec_at_hotmail.com wrote:

I'm going from memory here (and experience with certain implementations that may or may not be universal), so please don't take this as gospel.

> Hi Everyone,
> Here is a question that I came up while reading up on b+ trees. For
> some reason I am having a problem understanding some of the terms and
> the mechanical part of the tree.
>
> So from what I understand, b+ tree is a representation of the index
> that was created from the data (data record).

Well, just to be pedantic, a b+ tree is often used as an index, but you can use b+ trees for applications other than indexing.

Also, I'm not sure if by "is a representation of the index" you mean "is the index" (i.e., the index itself is implemented as a b-tree), but this is usually (almost universally?) the case.

> The tree has a root
> followed by several levels,

"Several" being zero or more. In fact, the root level is just another level.

> each made up of NODES and then the leafs
> (are leafs also called nodes?). The data stored in the leaf is known as
> the DATA ENTRY which includes a search key and a RID (memory page #,
> slot #). The leafs are basically the last level of the tree which point
> to the actual data.

You mention clustered indexes later, but this is the right place to mention this. If the data are already ordered in index order, then there's
no need to maintain the leaf level of the index separate from the data. A typical optimization is to use the data pages AS the leaf level of the index.

> An INDEX ENTRY is a combination <data, pointer>. A
> NODE is a combination of one or more Index Entries.
>
> If the data record is in the same sorted order as the index entry, the
> b+ tree is clustered, which means less I/Os to find a retrieve data
> records.

See above. An index being clustered is actually somewhat of a misnomer, since the implication is on table ordering. But as I say above, using the data pages as the leaf level, your statement is mostly correct.

> Each node on the b+ tree including the root is at least 50% full.
> (order of the tree)

There's no hard requirement for the threshold -- trees can split or combine at any percentage (66%, 75%, etc.). Nor is there a requirement for the splitting/combining algorithm. For instance, there's a b-tree variant that always splits two nodes into three, rather than one node into two.

> The b+ tree representation of the index is stored in a file.

This is definitely not a requirement nor a characteristic of b-trees. Implementations are free to store b-trees entirely in memory. However, practical considerations (amount of memory vs. amount of data) often dictate that the b-tree is stored on secondary storage (note I didn't say "file" or even "disk").

One thing to keep in mind is that with efficient caching algorithms, you'll often find that the first few levels of a big and frequently accessed index never age out of cache. For example, with a four-level index you can often navigate to the correct page with three memory I/Os and one disk I/O.

> The internal nodes of a b+tree have n data and n+1 pointers.

Here's where we need to talk about the difference between a b-tree and a b+ tree. In a b+ tree, the leaf pages are linked together in a linked list (or doubly-linked list), which optimizes traversals (for range searches, sorting, etc.). So the number of pointers is a function of the specific algorithm that is used.

> So interms of I/O, since the b+ tree is stored in a file, if we want to
> access multiple data record, are we going to have multiple reads of the
> b+ tree file or is it that once you read the file, everything is held
> in memory and therefore, reducing the number of IOs.

Multiple random reads, or a range of reads?

For multiple random reads you need to start at the root of the index each time, but as I said above, with proper caching the root (and likely a couple more levels) will be in memory already, so a disk I/O won't be required.

For a range of reads, once it navigates to the first record it simply walks the linked list until it gets to the end of the range.

> also how is the order determined?

Typically the best order is a function of disk I/O size and record size.  

Larry Coon
University of California Received on Mon Oct 09 2006 - 19:46:32 CEST

Original text of this message