How to isolate root cause of increasing libcache latch waits? (Subtitle: Hoping to upstage Oracle Support)

From: Charles Schultz <>
Date: Wed, 14 Oct 2009 08:42:58 -0500
Message-ID: <>

Good day, list,
*The Symptoms*

We have observed a gradual rise in library cache latch waits in our Production database. Oracle Enterprise Manager shows these as small, short-lived "spikes" in the Top Activity Monitor. Our experience is that these spikes generally grow in intensity and duration (impacting more users for a longer period of time). We have observed that turning off OEM and the agent helps to alleviate these latch waits.

*The System*

Oracle EE on Solaris 10. Upgrading to this weekend. Hardware is Sun F15k with 44 CPUs and 96GB of RAM.

*The Support*

In our first SR with Oracle, the "resolution" was to stop using OEM. This is a very poor response in my opinion, for it does not address the root problem at all; it offers no insight as to what is causing the latch waits in the first place, nor is it a very effective way to treat a problem as it merely assuages the situation, not eliminating it.

In our second SR, I have an analyst who wants me to turn on event 10046 level 12. Good and dandy, I would have done that myself if not for the fact that we often get hit with mutex hangs triggered by rowsource execution statistics. Besides, the analyst is simply fishing for something, he has given no indication that he really understands the situation. He has identified several large, cpu-intensive queries; how does cpu-intensive relate to latch waits, especially considering the fact that our box is only about 12% cpu utilized. I admit, perhaps I am especially dense today and he is spot on. Perhaps.

*The Synopsis*

Since it is library cache latches that we are talking about, I looked at the shared pool fragmentation. Since fragmentation generally gets worse as time progresses, it is my theory that perhaps the libcache latch requests are taking longer and longer to fulfill as the there are more (and smaller) chunks of memory to search. I am currently running a couple diagnostic scripts (forgot which Metalink note I got them from, but it is a memory diagnostic suite). We know from past experience that flushing the shared pool has a marginal affect; which makes sense, as x$kssmp shows us that a small fraction of the shared pool is flushable. We also know from past experience that if we let the latch waits continue to get worse, the system approaches a critical mass and we have to bounce it. And nobody likes to bounce the Production system. =)

So, one big (obvious) caveat. We are upgrading to this weekend which means we will inherit a ton of bug fixes, as well as those wonderful new bugs that come along with it. Unfortunately, the PSU ( bundle) came out too late for our regression tests, but I have scheduled that for future regression tests. If memory fragmentation actually is the root problem, I have my doubts as to whether is actually going to help us or not. If anyone has any insights as to whether or not this might be true, I am all ears.

Charles Schultz

Received on Wed Oct 14 2009 - 08:42:58 CDT

Original text of this message