Oracle FAQ Your Portal to the Oracle Knowledge Grid
HOME | ASK QUESTION | ADD INFO | SEARCH | E-MAIL US
 

Home -> Community -> Mailing Lists -> Oracle-L -> Re: Wait event problems

Re: Wait event problems

From: Mohammad Rafiq <rafiq9857_at_hotmail.com>
Date: Thu, 13 Jun 2002 09:03:36 -0800
Message-ID: <F001.0047D4BA.20020613090336@fatcity.com>


I think Rachel is referring following email from Carry Millsap.. It is a really nice explanation of this issue.... Regards
Rafiq

I have an example for you (Anjo, I hope you won't mind). A prospect we visited once upon a time had been fighting a performance problem with an Oracle Payroll program. They "knew" what their problem was: very clearly, v$system_event was telling them that their overwhelmingly dominant system performance problem was waits for "latch free." (Well, no it wasn't. It was waits for the Oracle Payroll program. ;) )

Actually, the v$system_event view indicated clearly that the top wait event was "SQL*Net message from client" (and some other "SQL*Net %" stuff), but of course every good DBA knows that you have to discard all the "SQL*Net" events because they are "idle" events.

Three months of trying and failing to fix the problem led finally to a last-resort, desperation hardware upgrade. Twelve 700MHz CPUs became twelve new 1,024MHz CPUs. (Don't ask why "waits for latch free" was translated as "need to upgrade CPUs.") And unfortunately, the Payroll program got *slower*. Not just "seemed slower"--*was* slower.

After the upgrade failed to help, these guys finally let us come see them. We traced the Payroll process (10046, level 8), very carefully initiating the trace immediately after the initiation of the Payroll run, and very carefully terminating the trace immediately after the completion of the Payroll run. The key here is that every second of data in the trace file was a second of somebody's response time.

The result was a really fun story. The thing ran in about 33 minutes. Here's a summary of what was in the session's trace file:

Oracle Kernel Event                      Duration    Calls        Avg

------------------------------ ------------------ -------- ----------
SQL*Net message from client       984.01s   49.6%   95,161  0.010340s
SQL*Net more data from client     418.82s   21.1%    3,345  0.125208s
db file sequential read           279.54s   14.1%   45,085  0.006200s
CPU service                       248.69s   12.5%  222,760  0.001116s
unaccounted-for                    27.73s    1.4%
latch free                         23.69s    1.2%   34,695  0.000683s
log file sync                       1.09s    0.1%      506  0.002154s
SQL*Net more data to client         0.83s    0.0%   15,982  0.000052s
log file switch completion          0.28s    0.0%        3  0.093333s
enqueue                             0.25s    0.0%      106  0.002358s
SQL*Net message to client           0.24s    0.0%   95,161  0.000003s
buffer busy waits                   0.22s    0.0%       67  0.003284s
db file scattered read              0.01s    0.0%        2  0.005000s
SQL*Net break/reset to client       0.00s    0.0%        2  0.000000s
------------------------------ ------------------ -------- ----------
Total                           1,985.40s  100.0%

Well, waits for "latch free" accounted for only 1.2% of the total response time. If the DBAs had been completely successful in eliminating "latch free" waits from their system, it would have made only a 1.2% difference in the runtime of this program. Point 1: <emphasis>This program's bottleneck is not the same as its system's system-wide average bottleneck.</emphasis> This happens all the time. You can NOT see this type of problem when you look only at system-wide average statistics.

You probably didn't pay much heed to the first thing you should have noticed: slightly more than 70% of this program's runtime was consumed by so-called "server idle" time. ??! You can't ignore this time, even if people do call it "idle." Idle or not, this time was part of someone's response time, so we need to deal with it. Now if we hadn't collected our statistics so carefully (with proper time scope and proper program scope), then we would have seen probably much more "SQL*Net message from client" time in our data. If you make that particular collection error, then you *have* to disregard the so-called "idle" wait time.

What was the Payroll program's problem? A couple of things. First, we knew that the Payroll program was running on the same host as Oracle. But the latency per call on the "SQL*Net message from client" event looked suspiciously LAN-like (order of 10ms), not IPC-like (order of 1ms or less). So we checked the tnsnames.ora file. Turns out that to conserve administration effort, the DBAs had decided to use a single tnsnames.ora file system-wide. So the Payroll program was using the same TCP/IP protocol that the system's PC clients were using. Just by switching the protocol to BEQ instead of TCP in the db server's tnsnames.ora file (a change that took about 15 minutes to test, 20 seconds to implement, and about a week to navigate through change control), we eliminated about 50% of the program's runtime.

Next, what's with the "SQL*Net more data from client"? The developers of this particular Payroll program did a pretty bad thing in their code: they passed several really long (10KB) SQL texts from the client to the server in their parse calls (oh yeh, and with no bind variables either). Anyway, each time the Oracle kernel received a packet containing only part of a SQL statement, it tallied to "SQL*Net more data from client" while it waited for the next piece. Of course, communicating via the network interface card instead of IPC didn't help, but passing huge SQL statements from the app to the db is bad form to begin with (see Ixora for more detail on just how bad this is).

Finally, realize that Oracle tallies to "SQL*Net message from client" pretty much every time it's "between" working on db call fulfillment. Therefore, there were on the order of 95,161 db calls during this 33-minute period. Looking at the SQL, we discovered that of course, since the app developers didn't use bind variables, there was a parse call for almost every execute. This accounted for thousands of unnecessary parse calls. Next, each execute call manipulated only zero or one row, which added up to dozens of execute calls for each single SQL statement with a multi-row result source. The program could have performed the business function with a fraction of the db calls.

Oh yes... Why did the Payroll program get *slower* after the upgrade? Little Payroll program time was spent using CPU, so the upgrade had very little direct positive effect upon the program. Most of the program's time was spent waiting for network latency. Other programs ran at the same time as this Payroll job. The CPU upgrade made *them* faster, which intensified *their* constant number of network calls into a smaller time window. The result was increased competition for the network during the Payroll run. Therefore, every network I/O call the Payroll call made was a little slower than before the CPU upgrade. The increase in network response time overwhelmed the tiny direct improvement of the CPU time reduction, resulting in a net degradation of Payroll performance. ...Not a good thing, because this Payroll program had a higher business priority than everything else on the system at the same time.

The End.

So, some wrap-up:

  1. Looking at v$session_event is better than looking at v$system_event for these things. When you collect system-wide data to solve a single program's performance problem, your irrelevant data can bury your relevant data. We see this all the time. Any time you query v$system_event and ask about the top-line events you see, your vulnerable to this problem.
  2. Furthermore, looking even at v$session_event tells you NOTHING about the user's experience. When a v$ query tells you that there's "SQL*Net message from client" time, you can't know without more data whether this stuff showed up in someone's response time, or if it corresponded to time when the USER was actually idle.
  3. Another important reason that looking at v$session_event data can mislead you is that there's no record of CPU time in there. To find that, you have to go to something like v$sesstat. The "CPU used by this session" statistic is supposed to give you the CPU time you need, but various bugs prevent you from getting what you need in many cases.
  4. Response time = CPU from sesstat + waits from session_event + an unaccounted-for gap in time. Lots of people believe that the "gap" is a deficiency of 10046 data: it's not. The irony is that the same deficiency is even worse in v$ table methods; you just can't tell that there even *is* a gap: you can't compute gap=R-(cpu+waits), because with v$-based methods you don't know what R is. This "gap" is one of those things that we've discovered how to interpret reliably after some work. For more information, see some of our Hotsos Profiler documentation at www.hotsos.com/dnloads/1/kevents and click "unaccounted-for" (the document is free, but you have to register). (Bear with us, most of the event documentation on our kevents page is pre-beta, but the "unaccounted-for" document is pretty much complete.)
  5. How Precise gets around all those problems is by polling Oracle shared memory (the things underneath the v$ tables) with extraordinary frequency in a "lightweight way" (directly from memory). How Hotsos gets around them is by using 10046 level-8 trace data. But to use 10046, you must be able to predict that a performance problem will occur in the near future (for example, because it's repeatable). We teach both data collection approaches in our Clinic, but we have relied upon 10046-based analysis exclusively (using our own Hotsos Profiler tool) since 2000.
  6. No matter how you collect the data, as long as you do it correctly, the response time-based optimization method works with extraordinary efficiency. Since about 1995, Anjo (whom I regard to be the Father of the response-time based Oracle optimization method) has been able within a few minutes to find and fix performance problems that have evaded detection for months or even years. It's rare anymore for any of us to spend more than about an hour diagnosing and repairing a performance problem once we have the right data.

...But the hit ratio stuff lives on, I think for two reasons: (1) Not enough people understand the response-time based diagnostic methods yet (but we're working on that!); and (2) Ratio-based techniques, when combined with time & materials-based billing contracts, are extremely effective at maximizing consulting revenue.

Cary Millsap
Hotsos Enterprises, Ltd.
cary.millsap_at_hotsos.com
http://www.hotsos.com

Reply-To: ORACLE-L_at_fatcity.com
To: Multiple recipients of list ORACLE-L <ORACLE-L_at_fatcity.com> Date: Thu, 13 Jun 2002 08:44:04 -0800

You may want to search the archives of the list as well, I seem to recall Cary Millsap posting something not too long ago where they found that this "idle" wait event was a real problem.

If you can't find it, I can try to dig it out of my files

Rachel


Do You Yahoo!?
Yahoo! - Official partner of 2002 FIFA World Cup http://fifaworldcup.yahoo.com
--
Please see the official ORACLE-L FAQ: http://www.orafaq.com
--
Author: Rachel Carmichael
   INET: wisernet100_at_yahoo.com

Fat City Network Services    -- (858) 538-5051  FAX: (858) 538-5051
San Diego, California        -- Public Internet access / Mailing Lists
--------------------------------------------------------------------
To REMOVE yourself from this mailing list, send an E-Mail message
to: ListGuru_at_fatcity.com (note EXACT spelling of 'ListGuru') and in
the message BODY, include a line containing: UNSUB ORACLE-L
(or the name of mailing list you want to be removed from).  You may
also send the HELP command for other information (like subscribing).




_________________________________________________________________
Chat with friends online, try MSN Messenger: http://messenger.msn.com

-- 
Please see the official ORACLE-L FAQ: http://www.orafaq.com
-- 
Author: Mohammad Rafiq
  INET: rafiq9857_at_hotmail.com

Fat City Network Services    -- (858) 538-5051  FAX: (858) 538-5051
San Diego, California        -- Public Internet access / Mailing Lists
--------------------------------------------------------------------
To REMOVE yourself from this mailing list, send an E-Mail message
to: ListGuru_at_fatcity.com (note EXACT spelling of 'ListGuru') and in
the message BODY, include a line containing: UNSUB ORACLE-L
(or the name of mailing list you want to be removed from).  You may
also send the HELP command for other information (like subscribing).
Received on Thu Jun 13 2002 - 12:03:36 CDT

Original text of this message

HOME | ASK QUESTION | ADD INFO | SEARCH | E-MAIL US