Re: How do indexes work on non-unique values ?

From: Pablo Sanchez <pablo_at_dev.null>
Date: Fri, 24 Jan 2003 10:25:33 -0600
Message-ID: <Xns930D5FE1E539pingottpingottbah_at_216.166.71.233>


williams.rhian_at_freenet.de (rhian) wrote in news:a0310e5c.0301231242.1a8c1512_at_posting.google.com:

> i have been told that creating an index on a field which can only
> contain 2 possible values (i.e male/female) is a bad idea be because
> order(n/2) is still order(n) and you get no benefit from doing a
> binary search on the index file.

The issue is what type of index is built to support the column. Typically, you either use a dense or a non-dense index -- or a bit-indexe as the other poster mentioned when the data doesn't change much.

A dense index is one where the leaf node of the index tree has an entry for every single row in the table. The values are sorted within the b-tree for quick access. The downfall is that the data the pointers point to may not be in any order.

Recall that the DBMS stores rows in a 'data page' - the atomic unit of storage. Given leaf-node values like the following:

    value | rowid


      m    |   333
      m    |   111
      m    |   123
      m    |   999
      ...
      m    |   334

Let's say that the first two values of the rowid is the data page:

      rowid 333 => data page 33, row #3
      rowid 111 => data page 11, row #1
      ...

If we pretend we're the DBMS and we're looking for all entries that are 'm', we would need to fetch data page 33, read row 3, then data page 11, read row 1, etc etc.

It may not be possible to cache the entire table in the DBMS cache therefore the optimizer will decide that instead of walking the index tree to satisfy the query, it'll elect to pursue a full table scan. This will be more efficient than possibly having to re-fetch the same data page later on. Let me explain.

In the above example, let's say our table had 4 gigabytes of information but our DBMS only had a 1.4G cache. No way to cache the entire table -- nevermind that other queries may be competing for the cache as well!

If we walked the index, in order to fetch 'm/333' we'd read data page 33. (Much ?) later, we fetch 'm/334' and it's possible that 'data page 33' is no longer in cache, so we're forced to perform another physical I/O.

If we would have performed a table scan, we read the data page once and fetch the qualifying rows.

Cool?

> I think i understand how an index works and is beneficial when created
> on a unique value, such as primary key. e.g
>
> [ snipped ]
>
> I don't understand how this works for indexes on foreign keys, which
> i understand to be beneficial or how it would work for my initial
> example

Works the same way ... as long as you don't have data skew: too many rows of one type. Perhaps I'm misunderstanding your question.

Thx!

-- 
Pablo Sanchez, High-Performance Database Engineering
http://www.hpdbe.com
Received on Fri Jan 24 2003 - 17:25:33 CET

Original text of this message