RE: Hot index block split on a very busy table--what is the impact

From: Tanel Poder <tanel.poder.003_at_mail.ee>
Date: Thu, 31 Aug 2006 00:16:08 +0800
Message-id: <000901c6cc4f$9a431c90$6501a8c0@windows01>

You're on 9i or higher, right?

One possibility is that this is caused by shared CBC latching peculiarities:

during normal selects your index root block can be examined under a shared cache buffers chains latch. So as long as everybody is only reading the index root block, everybody can do it concurrently (without pinning the block). The "current holder count" in the CBC latch structure is just increased by one for every read only latch get and decreased by one on every release. 0 value means that nobody has this latch taken currently.

Nobody has to wait for others for reading index root block in all read only case. That greatly helps to combat hot index root issues.

2) Now if a branch block split happens a level below the root block, the root block has to be pinned in exclusive mode for reflecting this change in it. In order to pin a block you need to get the corresponding CBC latch in exclusive mode.
If there are already a bunch of readers on the latch, then the exclusive latch getter will just flip a bit in the CBC latch structure - stating it's interest for exclusive get.

Every read only latch get will check for this bit, if it's set, then the getters will just spin instead, waiting this bit to be cleared (they may yield or sleep immediately as well, I haven't checked). Now the exclusive getter has to spin/wait until all the shared getters have released the latch and the "current holder count" drops to zero. Once it's zero (and the getter manager to get on to CPU) it can get the latch, do its work and release the latch.

During all that time starting from when the "exclusive interest" bit was set, nobody could access this indexes root block except the processes which already had the latch in shared mode. Depending on latch spin/sleep strategy for this particular case and OSD implementation, this could mean that all those "4000 readers per second" start just spinning on that latch, causing heavy spike in CPU usage and they all queue up.

How do diagnose that:

You could sample v$latch_misses to see whether the number of "kcbgtcr: kslbegin shared" nowaitfails/sleeps counter takes an exceptional jump up once you observe this hiccup.

How to fix that once diagnosed:

The usual stuff, like partitioning if possible or creating a single table hash cluster instead.

If you see that the problem comes from excessive spinning, think about reducing the spinning overhead (by reducing _spin_count for example). This could affect your other database functions though..

If you can't do the above - then if you have off-peak time, then analyse indexes (using treedump for start) and if you see a block split coming in a branch below root block, then force the branch block to split during off-peak time by inserting carefully picked values into the index tree, which go exactly in the range which cause the proper block to split. Then you can just roll back your transaction - the block splits are not rolled back nor coalesced somehow, as this is done in a separate recursive transaction.

Tanel.

From: oracle-l-bounce_at_freelists.org [mailto:oracle-l-bounce_at_freelists.org] On Behalf Of Zhu,Chao
Sent: Wednesday, August 30, 2006 21:53
To: oracle-l_at_freelists.org
Subject: Hot index block split on a very busy table--what is the impact

hi, all,

We have a few database with very high concurrent access busy tables, that some indexes of the busy tables could be accessed 1.5K-4K times per second. (PK lookup)

I am wondering , when such index got block split (at the root, or branch level), what will be the impact on the system.

Index block split is said to be an expensive operation, during the block split on branch/root block, what the other sessions that is doing select based on this index, be waiting on? There is a wait event named: Index Block Split with p1,p2,p3 pointing to block address, level.

Document "Description of oracle 7 Wait event and enqueue" says it will yield CPU.

We have a few production incident when load suddenly jump from 10, to several hundred, and during that time nothing can be done. From our monitoring tools, it is mainly "latch free" contention inside oracle, and it is cache buffer chains. The load spike matches the oracle description of block split, but the wait event does not match. And because the load spike time is very short, we can't capture the actual wait event/p1,p2,p3 during the exact time of the load spike time.
Anyone have similar problem/insight into this issue? Is there some v$ view to track, (v$segstat does not) have log about which index go through how many time block split? Or any other effecitve way?

--

Regards
Zhu Chao
www.cnoug.org

--

http://www.freelists.org/webpage/oracle-l Received on Wed Aug 30 2006 - 11:16:08 CDT