Re: Index management

From: Richard Foote <richard.foote_at_tbigpond.nospam.com>
Date: Sat, 01 May 2004 14:18:29 GMT
Message-ID: <VkOkc.7206$TT.2330@news-server.bigpond.net.au>

"Noons" <wizofoz2k_at_yahoo.com.au> wrote in message news:40936644$0$12740$afc38c87_at_news.optusnet.com.au...
> "Richard Foote" <richard.foote_at_tbigpond.nospam.com> wrote in message
> news:yUIkc.6619$TT.2282_at_news-server.bigpond.net.au...
>
> After reading it, I've got a question for you if I may:
>
> Let's assume a situation where we have a root block, three branch blocks
> and a bucketload of leaf blocks, like in your slide #22.
> Assume as well that this index was created on a sequence-generated
> column (regular but not necessarily consecutive increase). As such in the
> pictorial representation the index leaf blocks would be organised left to
> right in increase value of the key.
>
> Now let's look at the physical (in the disk) distribution of this thing.
> Presumably the root block and the first branch and a few leaf blocks
> would be on consecutive blocks on the disk partition - assuming a raw
> disk, an "intelligent" file system might do other things...
> Followed by another branch block and more leaf blocks,then another branch
> block and more leaf blocks. Correct assumption?
>
> If so, then let's assume that later interspersed insertions of new keys
> cause the third block from the left to split. Like you say in your
slides,
> no problemo: the new block is allocated from index freelist, the initial
> block gets "emptied" 50% into the new one and we do not get an increase in
> index level. Fine and dandy.
>
> But now consider the physical distribution. Isn't that new block coming
> from the freelist? Which might be pointing to a first free block on "the
> other side of the disk", so to speak. This block will now LOGICALLY be
> between the original third and fourth leaf blocks, but physically very
> remote from these two original blocks.
>
> You know where I'm getting at now, don't you? If someone is now doing an
> index range scan - which for the sake of example will involve the third,
> new fourth and old fourth leaf blocks using the block-to-block link
> pointers, wouldn't it involve a heap of I/O wait while the arm got
> repositioned to get the new fourth block from "the other side of the
disk"?
>
> Wouldn't that be a case for an index rebuild, based purely on physical
> distribution rather than hazy "broken b-tree" concepts? What would be
> a way of finding out if this was indeed the case, apart from the obvious
> system I/O wait queue increasing in this device?
>

Hi Nuno

Just to add to Jonathan's comments. I think the issue you raise is not entirely dissimilar to the question of whether it's beneficial to have fewer, larger extents in a segment so that physically it might prove more efficient to read consecutive data blocks. The answer is almost certainly no for similar reasons.

Firstly, having consecutive blocks in an extent doesn't necessarily mean that they have to be consecutive on disk. There may be many reasons why this may not be the case.

However, assuming a raw device as you suggest, we next have the issue of what happens when Oracle performs an index range scan. When Oracle reads the leaf block, it then needs to read the corresponding data blocks in the corresponding table via the obtained rowids. Now there could very well be, and generally are, hundreds of index entries per leaf block, meaning we have to wait the relatively very very long time for all these LIOs to be completed before we're interested in moving onto the next leaf block. Remember, we read one leaf block, then read hundreds of table blocks, read the next leaf block, read hundreds of tables blocks, etc.

Now if the leaf blocks are likely to be physical I/Os (hence your concern), then there is a very very good chance that the corresponding reads to the table are also likely to be physical I/Os as indexes by nature of their size generally have a greater likelihood of having cached blocks than their generally larger tables. So it's likely then that we have to wait for a good proportion of hundreds of PIOs before we concern ourselves with the next leaf block. As I mention in my presentation, the clustering factor of the table has a big influence in how many PIOs might eventuate.

And this then brings us round to the third issue. Most databases are multi user environments and while we wait for the current index leaf block to be processed and for all the corresponding table blocks to be accessed, there is a very probable likelihood of other users or processes needing to perform some other PIO on the device. So while we wait for all this, some other bugger has requested read (or write) access somewhere else on the device, resulting in that good old, slow, mechanical arm being moved anyways from the ideal location we might like. Note this other bugger could very well be ourselves if we store index, or table, or rollback, etc. data on the same device.

This is "contention" is highly probable while processing one leaf block and waiting to move onto the next. The chances that an entire index range scan can be performed uninterrupted from contention on the device by any other processes is almost certain zero on any remotely multi-user environment. This is the key point. We are not the only user process on the database, nor the only process wanting to use that same physical device.

So although yes, index splits can indeed result in the physical result you describe, the chances that it will actually impede and adversely affect the performance of index range scans in any measurable manner in multi-user database environments are practically zero. Having all your index leaf blocks consecutively on disk will likely take exactly the same time to read compared with having an index with leaf blocks that are not so consecutive for this very reason.

Again, it would be an easyish thing to test. Rebuild such an index and see what effect it has on response times. My predication is that within the boundaries of index rebuilds as I describe in the white paper, the effect would be negligible.

Hope it makes sense.

Cheers

Richard Received on Sat May 01 2004 - 09:18:29 CDT