Spotting the real cause of a Production slowdown (10g)

From: Schultz, Charles <sac_at_uillinois.edu>
Date: Thu, 20 Apr 2006 08:14:36 -0500
Message-ID: <565F609E6D736D439837F1A1A797F34171D218@ADMINMAIL1.ui.uillinois.edu>

> We have observed several critical slowdowns of our Production system
> since upgrading to 10g. The symptoms include a rapid rise in library
> cache latch waits (with associated, but less notable, rises in shared
> pool and library cache pin waits), CPU pegged at 100% and CPU load in
> triple-digits, and massive shared pool resizing evidenced by
> v$sga_resize_ops (1.7gb added in 15 seconds). Obviously, trying to run
> queries against the normal v$ views is rather slow (only adds to the
> latching problem). We sent a hanganalyze and systemstate dumps to
> Oracle Support, but after a couple months of trying to figure things
> out, we have word that the initial Support Engineer may have
> misdiagnosed our problem.
>
> Based on what little I know and researched (reading a number of notes
> here, Metalinks, google, oraFAQ, etc), a large number of literal sql
> statements might be able to cause such symptoms. It is just hard for
> me to imagine the number of literal statements that would consume
> 1.7gb of shared pool in 15 seconds. Perhaps it is not quantity, but
> quality as well, each statement consuming a large chunk of real
> estate. The only problem with this conjecture is that I do not find
> evidence of this in v$sql (looking at number of executions, number of
> literals, sharable_mem).
>
> But the puzzle is even more elusive. EM reports extremely (abnormally)
> large number of executions for seemingly random sql statements. In one
> case, we had 3 statements that were each executed 35 million times in
> one hour (2 of those statements had bind variables, the other was a
> "simple" select). I can never catch these large numbers in v$sql, so I
> have to assume they age out by the time we start to look. I have asked
> the application folks to see if these queries are related, and why the
> execution count is a couple orders of magnitude greater than normal.
> Still waiting for resolution at that end. It seems the general
> consensus that these larger numbers are symptomatic as well, that they
> only start to escalate when the system is severely slow and the
> middle-tier starts to timeout. NOTE: that is the working hypothesis,
> we are trying to prove it.
>
> So, my question is two-fold. Do you experts out there have advice as
> to how we can reactively figure out what already happened? And how to
> proactively avoid this from happening again? Because of the automatic
> AWR snapshots only happening each hour, it is hard for us to get more
> granular when looking at a historical view. Perhaps we should
> temporarily increase that?
>
> Any and all advice is gladly welcome. I am hoping to learn some of the
> tricks and glean from your experience.
>
> Thanks in advance,
>
> PS - after reading about Kyle Hailey and Roger Sanders work, anyone
> have access to a DMA for dummies? =)
>
> charles schultz
> oracle dba
> aits - adsd
> university of illinois
>
>

--
http://www.freelists.org/webpage/oracle-l

Received on Thu Apr 20 2006 - 08:14:36 CDT