Re: the 20% rule

From: Tim X <timx_at_nospam.dev.null>
Date: Sat, 26 Jul 2008 15:54:59 +1000
Message-ID: <87zlo5p4uk.fsf_at_lion.rapttech.com.au>


aarklon_at_gmail.com writes:

> Hi all,
>
> I recently read the following in a book(oracle 9i for dummies by
> Carol McCullough Dieter)
>
> The fastest way to retrieve rows from a table is to access the row
> with exact row id. An
> index is the second fastest way, but it decreases in performance as
> the proprotion of the rows retrieved increases. if you are retrieving
> approximately 20 % of the rows in a table, using a index is just as
> fast. But beyond that magic 20 %, not using 20% is faster.
>
> keep this rule in mind when you create indexes intended to help speed
> up a query.
> Queries vary in the rows that they select from a table. if you have a
> query that you use often,
> determine the number of rows that it selects from the table. if this
> number is more than 20%
> of the total no: of rows in the table , an index on the table may not
> improve the performance
> of the query. you may just want to try both methods. if the number of
> rows is less than 20%, an index will almost certainly help
> performance.
>
> my question is to what extent this rule is true ?

This is what I would call a heuristic rather than a formal/solid rule - a rule of thumb if you like.

The extent to which it is true depends on a number of factors. To understand to what extent it is true, you need to udnerstand the rationale underlying it.

consider an overly simplistic example.

You have a table with 100 records. Each record is 100k in size.

If you had a query you knew would return 10 records, consider the case with and without an index.

In the case without the index, you would need to retrieve all rows and determine if they match your selection criteria. Thats 100 disk reads of 100k each. Lets say its a very slow computer and each disk read takes 1 second. You would have 100 x 1 second to retrieve all the rows (100 seconds) and 100 x 100k - 10000k.

Now if you had an index, lets say the field being indexed is 10k. Assume for simplicity, it a really good index and you can identify all 10 rows with just 10 reads of the index. Again assuming 1 second per read. You will have

10 x 1 sec for the index = 10 secs and 100k data 10 x 1 sec for the 10 rows = 10 sec and 1000k = total of 20sec and 1100k.

Now consider a query where you know 60 of the records will be returned. Same index.

60 x 1 secs = 60 seconds + 600k index retrievals 60 x 1 secs = 60 seconds + 6000 table retrievals giving 120 sec and 6600k data. This means it took 20 seconds longer with the index and you only retrieved 60% of the records compared to 100% retrieval without in 100 seconds. So, in this case, it would be faster without the index (ignoring data transfer overheads and possible impact on memory caches due to retrieving more data etc)

I said this was simplistic. A lot of other factors come into play - particularly with databases like Oracle where you have lots of configuration variables, such as shared global memory areas, cached queries, various different index schemes, different tablespace configurations that can affect how efficient table scans/reads are etc.

In general, the larger the percentage of the rows that will be returned by a query, the less benefit an index can be. At some point, there is a cross-over where the index actually decreases efficiency rather than improves it. The extreme case is when you are going to need to read all rows - in this situation, using an index to identify the row is gaining you nothing - you would have read that row anyway (though even this is a rule of thumb - indexes can still be useful when you are talking about table joins or correlated subqueries etc.

With Oracle, it is generally best to rely on tools like explain plan. Rules like the one you mentioned can be a good guide and a helpful tool in diagnosing performance issues, but you should only use them as a guide. Nothing beats hard imperical facts. Consider rules like that one to be something that can give you some direction on what to look at when trying to improve performance. Consider other operations as well. Does it really matter if queries become a bit slow if the index speeds up inserts/updates (which would be unusual, but not unknown) and you have a lot of inserts/updates compared to selects? Would indexing a different field be better?

Of course, other factors should always be taken into consideration. For example, what are the users expectations? How often is that query made compared to other queries on the table? Will your expectations of percentage of row returned change over time or with changing business requirements or profiles/operations? Is the hardware profile one where there is lots of memory and large cases with slow disk IO or is it a system with less memory but really fast disk IO or some combination?

The figure of 20% seems somewhat arbitrary to me. I would expect this figure would vary depending on the size of the data being indexed, the type of index, the size of the rows in the table, the type of the data in the index e.g. VARCHAR2 compared to integer etc.

Finally, note that I'm just a run of the mill developer that works a lot with Oracle. I'm not an expert in Oracle tuning and know enough to know I don't know enough. However, I have learnt to take many of the "rules" I've seen with a grain of salt and not to consider them as hard and fast rules that should never be broken.

regards,

Tim

-- 
tcross (at) rapttech dot com dot au
Received on Sat Jul 26 2008 - 07:54:59 CEST

Original text of this message