Re: index usage

From: Richard Foote <richard.foote_at_bigpond.nospam.com>
Date: Thu, 20 Jan 2005 11:36:43 GMT
Message-ID: <fJMHd.126019$K7.84069@news-server.bigpond.net.au>

"hastenthunder" <hastenthunder_at_hotmail.com> wrote in message news:56uHd.2452$Ny6.4229_at_mencken.net.nih.gov...

> Hello,
>
> I've read many documentations online stating to only create an index if
> queries against this table frequently retrieve less than 15% of the rows.
> However, if the query returns, say, 40% of the rows, wouldn't indexing the
> column still help by cutting the work by roughly half?
>
>
> hastenthunder
>

A much *simplified* example on how I teach this stuff...

Let's say we have a table that has 10,000,000 rows which are stored in 1,000,000 data blocks meaning we have approximately 10 rows per block on average.

Let's say we have an index on this table that has 100,000 leaf blocks meaning we have on average approximately 100 leaf entries per leaf block the index has 3 levels.

Let's also say we have an "effective" multi-block read capability of 10 blocks per I/O (meaning Oracle will read 10 "consecutive" blocks at a time on average during a full table scan multiblock read).

Finally, let's say we're interested in accessing *just* 10% of the data (or 1,000,000 of the total 10,000,000 rows). Will Oracle use the index or won't it ? Hopefully, I've picked an easy set of numbers to help illustrate the answer ...

Firstly, to calculate the "cost" of using the index access path.

We need to read the root block + a branch block in order to get to the first leaf block of interest. That's 2 logical I/Os (LIOs). We then need to read approximately 10% of the leaf blocks in order to get our 1,000,000 leaf entries required to directly access our 1,000,000 rows of interest, that's 10% of the 100,000 leaf blocks = 10,000 leaf blocks. Because we're reading an index via a range scan and because the leaf blocks are not (necessarily) physically co-related, Oracle must read each leaf block via a single I/O. So that's 10,000 LIOs. So, just to read the index alone, we require 2 + 10,000 = 10,002 LIOs.

Note by default, Oracle assumes the above "cost" to be physical I/Os (PIOs). Now assuming this index is heavily accessed, a good number of these index blocks may already be cached in memory. The optimizer_index_caching parameter can be used to adjust the above cost by suggesting that x% are actually already cached and so are "cheaper" to access. To keep things simple, we'll assume the default value of 0% or that no index blocks are actually likely to be cached (generally not a wise assumption but let's keep the arithmetic simple).

To access the corresponding table blocks, again Oracle can only perform these reads via a single block read as each index entry points to a table block that contains it's specific table row . Now we're after 1,000,000 rows which means we require 1,000,000 LIOs in order to access the required rows. Question is, how many *different* table blocks do we need to access ? Well, this is entirely dependent on the Clustering Factor (CF) of the index, or how closely aligned are the corresponding rows in the table in relation to the order of the index (which must be in the order of the index values). In the "best" possible case, all the required rows are all ordered and grouped together in the same "collection" of table blocks meaning we only have to access 10% of the 1,000,000 table blocks or 100,000 table blocks in a roughly *consecutively* manner.

However, as is more common, if the required rows are randomly and evenly distributed among the table blocks, then on average we need to read 1 row (10%) from *each and every table block*. Note in your case of wanting to access 40% of the data, we might depending on a poor CF need to visit on average *each and every* data block *4 times*. This is the key point (no pun intended).

The greater the number of differing blocks we access, then the less likely we will find the block in memory from it being previously read and the more likely that the block will need to be read from disk (PIO). Oracle considers this and uses the CF in it's costing calculations. Assuming a randomly distributed set of required rows, note we will need to visit *all* the table blocks on average because on average we are interested in 1 in 10 of the rows that each block contains (yes, some blocks may not actually be visited and some may be visited a number of times but with such volume of blocks, it conceivably might be a significant duration between reads to the same block meaning it could easily have been aged and be physically re-read anyways).

The point though is that it's 1,000,000 LIOs regardless, of which a very significant number *could* be *actual distinct* (or differing) blocks. So that's 10,002 for the index + 1,000,000 for the table = 1,010,002 LIOs to read *just* 10% of the data via an index.

Now to calculate the "cost" of a FTS. A FTS has a number of advantages over an index access path. Firstly, because we read each block "consecutively" (kinda) Oracle can investigate the appropriate selectiveness of each row within the block ensuring that each table block is read just *once* (special blocks such as extent maps withstanding). Secondly, again because each block is read consecutively, Oracle can perform a multi-block read and read multiple blocks within the one LIO. This is based on factors such as db_file_multiblock_read_count, system statistics, OS I/O characteristics, the caching characteristics of the table and the "fudge-factor" that the Oracle CBO applies in it's calculations.

For simplicity (and to keep the numbers really simple), assuming an effective multi-block read of 10, we can read the entire table in approximately 1,000,000 table blocks / 10 = 100,000 LIOs. Note that although these are larger and potentially more "costly" I/Os than the single block I/Os used by the index, Oracle assumes by default that the actual cost of each type of I/O to be the same. The optimizer_index_cost_adj parameter can be used to more accurately estimate (if necessary) the relative cost of a single block I/O to that of a FTS multi-block I/O. Again for simplicity, we'll assume the default of 100 meaning that the cost of a single block I/O is 100% (or the same) as a FTS I/O.

So, we now have our two comparative costings. The index access has a rough cost of 1,010,002 and the FTS has a rough cost of just 100,000. The FTS wins hands down.... Note for 40% of the data, the relative costs would have been roughly 4,040,002 vs. 100,000. Even more hands down ...

The break-even point can now be calculated based on the above criteria, some of which include:

the selectivity of the query
number of leaf blocks
average number of leaf entries per leaf block
height of index
caching characteristics of index
clustering factor of index
number of table blocks (below HWM)
average number of rows per block
effective (or calculated) multi-block read
caching characteristics of the table (which can influence the effective multi-block read)
relative cost of a single block I/O vs. a multi-block I/O
amount of row migration / row chaining (although the CBO is not so good with this)
parallelism (potentially a major factor)

So your assumption that reading 40% of rows would cut the work by roughly half is not correct. In the example above, it would actually cost about 40 times as much. In my long-winded manner, I hope this makes some kinda sense and goes some way to explaining why.

One final piece of advice. Ignore any writings or suggestions that there is a magical break even point is x% (where x could be 2% or 10% or 50% or whatever). Hopefully the above will hint that there is *no* such percentage as it all depends on too many factors. I can easily give you an example where an index is most efficient when reading 0% of data and I can easily give you an example where an index is most efficient when reading *100%* of data (and *any* value in between). When one understands how the CBO functions, one understands why such so-called rules of thumb are a nonsense.

Cheers

Richard Received on Thu Jan 20 2005 - 05:36:43 CST