Re: index usage

From: DA Morgan <damorgan_at_x.washington.edu>
Date: Fri, 21 Jan 2005 11:35:41 -0800
Message-ID: <41f15980$1_2@127.0.0.1>

Richard Foote wrote:

> "hastenthunder" <hastenthunder_at_hotmail.com> wrote in message
> news:56uHd.2452$Ny6.4229_at_mencken.net.nih.gov...
>

>>Hello,
>>
>>I've read many documentations online stating to only create an index if
>>queries against this table frequently retrieve less than 15% of the rows.
>>However, if the query returns, say, 40% of the rows, wouldn't indexing the
>>column still help by cutting the work by roughly half?
>>
>>
>>hastenthunder
>>

>
>
> A much *simplified* example on how I teach this stuff...
>
> Let's say we have a table that has 10,000,000 rows which are stored in
> 1,000,000 data blocks meaning we have approximately 10 rows per block on
> average.
>
> Let's say we have an index on this table that has 100,000 leaf blocks
> meaning we have on average approximately 100 leaf entries per leaf block the
> index has 3 levels.
>
> Let's also say we have an "effective" multi-block read capability of 10
> blocks per I/O (meaning Oracle will read 10 "consecutive" blocks at a time
> on average during a full table scan multiblock read).
>
> Finally, let's say we're interested in accessing *just* 10% of the data (or
> 1,000,000 of the total 10,000,000 rows). Will Oracle use the index or won't
> it ? Hopefully, I've picked an easy set of numbers to help illustrate the
> answer ...
>
> Firstly, to calculate the "cost" of using the index access path.
>
> We need to read the root block + a branch block in order to get to the first
> leaf block of interest. That's 2 logical I/Os (LIOs). We then need to read
> approximately 10% of the leaf blocks in order to get our 1,000,000 leaf
> entries required to directly access our 1,000,000 rows of interest, that's
> 10% of the 100,000 leaf blocks = 10,000 leaf blocks. Because we're reading
> an index via a range scan and because the leaf blocks are not (necessarily)
> physically co-related, Oracle must read each leaf block via a single I/O. So
> that's 10,000 LIOs. So, just to read the index alone, we require 2 + 10,000
> = 10,002 LIOs.
>
> Note by default, Oracle assumes the above "cost" to be physical I/Os (PIOs).
> Now assuming this index is heavily accessed, a good number of these index
> blocks may already be cached in memory. The optimizer_index_caching
> parameter can be used to adjust the above cost by suggesting that x% are
> actually already cached and so are "cheaper" to access. To keep things
> simple, we'll assume the default value of 0% or that no index blocks are
> actually likely to be cached (generally not a wise assumption but let's keep
> the arithmetic simple).
>
> To access the corresponding table blocks, again Oracle can only perform
> these reads via a single block read as each index entry points to a table
> block that contains it's specific table row . Now we're after 1,000,000 rows
> which means we require 1,000,000 LIOs in order to access the required rows.
> Question is, how many *different* table blocks do we need to access ? Well,
> this is entirely dependent on the Clustering Factor (CF) of the index, or
> how closely aligned are the corresponding rows in the table in relation to
> the order of the index (which must be in the order of the index values). In
> the "best" possible case, all the required rows are all ordered and grouped
> together in the same "collection" of table blocks meaning we only have to
> access 10% of the 1,000,000 table blocks or 100,000 table blocks in a
> roughly *consecutively* manner.
>
> However, as is more common, if the required rows are randomly and evenly
> distributed among the table blocks, then on average we need to read 1 row
> (10%) from *each and every table block*. Note in your case of wanting to
> access 40% of the data, we might depending on a poor CF need to visit on
> average *each and every* data block *4 times*. This is the key point (no pun
> intended).
>
> The greater the number of differing blocks we access, then the less likely
> we will find the block in memory from it being previously read and the more
> likely that the block will need to be read from disk (PIO). Oracle considers
> this and uses the CF in it's costing calculations. Assuming a randomly
> distributed set of required rows, note we will need to visit *all* the table
> blocks on average because on average we are interested in 1 in 10 of the
> rows that each block contains (yes, some blocks may not actually be visited
> and some may be visited a number of times but with such volume of blocks, it
> conceivably might be a significant duration between reads to the same block
> meaning it could easily have been aged and be physically re-read anyways).
>
> The point though is that it's 1,000,000 LIOs regardless, of which a very
> significant number *could* be *actual distinct* (or differing) blocks. So
> that's 10,002 for the index + 1,000,000 for the table = 1,010,002 LIOs to
> read *just* 10% of the data via an index.
>
> Now to calculate the "cost" of a FTS. A FTS has a number of advantages over
> an index access path. Firstly, because we read each block "consecutively"
> (kinda) Oracle can investigate the appropriate selectiveness of each row
> within the block ensuring that each table block is read just *once* (special
> blocks such as extent maps withstanding). Secondly, again because each block
> is read consecutively, Oracle can perform a multi-block read and read
> multiple blocks within the one LIO. This is based on factors such as
> db_file_multiblock_read_count, system statistics, OS I/O characteristics,
> the caching characteristics of the table and the "fudge-factor" that the
> Oracle CBO applies in it's calculations.
>
> For simplicity (and to keep the numbers really simple), assuming an
> effective multi-block read of 10, we can read the entire table in
> approximately 1,000,000 table blocks / 10 = 100,000 LIOs. Note that
> although these are larger and potentially more "costly" I/Os than the single
> block I/Os used by the index, Oracle assumes by default that the actual cost
> of each type of I/O to be the same. The optimizer_index_cost_adj parameter
> can be used to more accurately estimate (if necessary) the relative cost of
> a single block I/O to that of a FTS multi-block I/O. Again for simplicity,
> we'll assume the default of 100 meaning that the cost of a single block I/O
> is 100% (or the same) as a FTS I/O.
>
> So, we now have our two comparative costings. The index access has a rough
> cost of 1,010,002 and the FTS has a rough cost of just 100,000. The FTS wins
> hands down.... Note for 40% of the data, the relative costs would have been
> roughly 4,040,002 vs. 100,000. Even more hands down ...
>
> The break-even point can now be calculated based on the above criteria, some
> of which include:
>
> - the selectivity of the query
> - number of leaf blocks
> - average number of leaf entries per leaf block
> - height of index
> - caching characteristics of index
> - clustering factor of index
> - number of table blocks (below HWM)
> - average number of rows per block
> - effective (or calculated) multi-block read
> - caching characteristics of the table (which can influence the effective
> multi-block read)
> - relative cost of a single block I/O vs. a multi-block I/O
> - amount of row migration / row chaining (although the CBO is not so good
> with this)
> - parallelism (potentially a major factor)
>
> So your assumption that reading 40% of rows would cut the work by roughly
> half is not correct. In the example above, it would actually cost about 40
> times as much. In my long-winded manner, I hope this makes some kinda sense
> and goes some way to explaining why.
>
> One final piece of advice. Ignore any writings or suggestions that there is
> a magical break even point is x% (where x could be 2% or 10% or 50% or
> whatever). Hopefully the above will hint that there is *no* such percentage
> as it all depends on too many factors. I can easily give you an example
> where an index is most efficient when reading 0% of data and I can easily
> give you an example where an index is most efficient when reading *100%* of
> data (and *any* value in between). When one understands how the CBO
> functions, one understands why such so-called rules of thumb are a nonsense.
>
> Cheers
>
> Richard

A well written and remarkable piece of writing. I have copied this post to my indexes page with full attribution for its source.

-- 
Daniel A. Morgan
University of Washington
damorgan_at_x.washington.edu
(replace 'x' with 'u' to respond)


----== Posted via Newsfeeds.Com - Unlimited-Uncensored-Secure Usenet News==----
http://www.newsfeeds.com The #1 Newsgroup Service in the World! >100,000 Newsgroups
---= East/West-Coast Server Farms - Total Privacy via Encryption =---

Received on Fri Jan 21 2005 - 13:35:41 CST