RE: mostly latch-free SCN?

From: Cary Millsap <cary.millsap_at_hotsos.com>
Date: Fri, 10 Dec 2004 10:02:05 -0600
Message-ID: <00ad01c4ded1$9b2b3dd0$6501a8c0@CVMLAP02>

John,

You're describing the too-many-LIOs problem in more and more detail. = Don't
be confused; just because the Oracle kernel's not emitting data every = few
microseconds doesn't mean it's not doing tons of work.

Two factors about this problem often mess people up. When your system is killing itself on LIO calls (that is, on accesses to the database buffer cache), there are two things you might notice: (1)=A0The Oracle kernel = emits
lines to the trace file only once per few minutes, and (2)=A0A single = "wait"
row lingers in V$SESSION_WAIT for several minutes. The combination of = these
two observations confuses a lot of people, because it looks like a = single
timed event is to blame for the problem (that event that is the last on = in
the trace file before the huge gap in time, or the one that's apparently "stuck" in V$SESSION_WAIT for the session). But it's not.

There are two things you should realize. First, when the Oracle kernel = emits
the record of a timed event to the trace file, that event is FINISHED. = Its
elapsed duration is the value of the 'ela' field (not the duration of = the
apparent "gap" in time that occurred after the event completed). If its = ela
number is small, then that event's execution is not to blame for the response time problem. Second, with event 10046, the Oracle kernel = doesn't
emit anything for individual LIO processing. You'll see a record of how = many
LIOs were executed by a dbcall in a PARSE, EXEC, or FETCH line (in the = cr+cu
total), but never in a WAIT line.=20

However, it's easy to create a record of LIO work while the slow program = is
running. Just do this:

	select block_gets, consistent_gets
	from v$sess_io
	where sid=3D:sid

Execute this once in SQL*Plus, and then just type "/<enter>" every few seconds, you'll be able to see the LIO count accumulate by the = thousands.
For a little more detail, see "Optimizing Oracle Performance" on pages = 189
and 192-194. (There's also event 10200, which shows a 4-line x 80-byte record of each LIO if you're interested. Unlike 10046, 10200 is far too expensive to use in a production environment.)

Again, John, the bottom line here is that you need to fix some SQL. By = the
time you're finished doing that, you should have SQL that consumes no = more
than about 10 LIO calls per row returned per table in your FROM clause. = For
example, a query that's supposed to return 7 rows from an 8-way join = should
consume no more than about 560 LIOs. (For those of you who have read the book, you know the caveat about aggregation functions, but the rule of = thumb
is still accurate for the top-level row source operation that's beneath = the
aggregation.)

Instead of the number "10", right now, John, you're probably enduring billions.

If you have no idea how to fix your SQL (whether it's your own custom = code,
or part of an application you've purchased), a thread on this list that occurred earlier in the week revealed a couple of good options for = learning
how to do the job.

Good luck to you.

Cary Millsap
Hotsos Enterprises, Ltd.
http://www.hotsos.com
* Nullius in verba *

Upcoming events:

- Performance Diagnosis 101: 1/4 Calgary
- SQL Optimization 101: 2/7 Dallas
- Hotsos Symposium 2005: March 6-10 Dallas
- Visit www.hotsos.com for schedule details...

-----Original Message-----
From: oracle-l-bounce_at_freelists.org =
[mailto:oracle-l-bounce_at_freelists.org]
On Behalf Of John Clarke
Sent: Friday, December 10, 2004 6:38 AM
To: cary.millsap_at_hotsos.com; Oracle-L_at_freelists.org Subject: RE: mostly latch-free SCN?

Thanks for the description of the *** lines. I should have done more research before asking this.

Here's why I originally posted the question, just so you know my frame = of
mind. I run a problematic session that had been hurting this client's nightly batch load for months. I knew we had some SQL optimization = issues
and was going to go through the exercise of training their DBAs how to trace, analyze, etc. While running this thing I was showing them what's = in
V$SESSION_WAIT and noticed the session waiting on latch free waits, = p2=3D88
(which on 9.2.0.4 is the mostly latch-free SCN latch, as I learned). I = then
started examining the trace file and while tailing it, saw it "stuck" on this latch free wait. I expected the next line in the trace file to = have
either a large ela value on a named wait event or a high CPU/elapsed = value
for an LIO.

After waiting about 10 minutes, the trace file started dumping more data = and
the next 100 or so lines were mbr waits, all with relatively small = elapsed
times. I guess I expected a "big value" following this latch 88 wait. = I
suppose the sum of the consecutive db file scattered read waits that followed this latch wait will have a long elapsed time, but typically = don't
you see these waits "stream" to the trace file while the session is = waiting
on them? And if they're not streaming w/ explainable wait times, what = then?

I admit I didn't check the next fetch/exec/parse line for this - I just expected the next posted line to the trace file would have high values. = If
this next "cpu" line does have high elapsed/cpu times, as we would = expect,
what does that mean about all the db file scattered read waits between = the
time the trace file was not being written to (10 minutes) before the = next db
call? I think you've got something about this in your book but don't = recall
(and don't have it with me).

When I get back online at the office I'll double-check and hopefully = things
will become more clear.

Thanks

-----Original Message-----
From: oracle-l-bounce_at_freelists.org =
[mailto:oracle-l-bounce_at_freelists.org]
On Behalf Of Cary Millsap
Sent: Thursday, December 09, 2004 10:26 PM To: Oracle-L_at_freelists.org
Subject: RE: mostly latch-free SCN?

John,

Your 'latch free' excerpt is microscopic and definitely not a concern in =
=3D

the
excerpt. That particular type of 'latch free' event has contributed only 0.000049 seconds to the approximately 6.5 MINUTES represented by your =
=3D

trace
data. Pursuing a fix to your latch "problem" will improve your response =
=3D

time
by--at most!--about 0.000013%. So, you're asking the wrong question. :)

The *** lines you're seeing are emitted by Oracle kernel calls to dbms_system.ksdddt(). The Oracle kernel behaves as if there's a function called "write_to_trace()" in the kernel code that handles Oracle =3D extended
SQL trace file writing in the following manner:

{

	static int time_of_prior_write =3D3D 0;
	subroutine write_to_trace(TRACE, text) {
		if now() - time_of_prior_write > 10 seconds {
			ksdddt;
		}
		time_of_prior_write =3D3D now();
		write(TRACE, text);
	}

}

The *** line doesn't mean that your 'latch free' consumed a bunch of =3D time.
The ela value for each timed event shows you how much time that event =
=3D

has
consumed. The *** line is telling you that it's a long time between =3D calls to
Oracle timed events. This almost always indicates that the process being traced is accessing the database buffer cache way too many times. This =
=3D

is
common because so many books, courses, etc. teach you that making sure everything is in the buffer cache is golden goodness.

The dbcall (PARSE, EXEC, FETCH, ...) that immediately /follows/ this =3D excerpt
will have a huge c value (lots of CPU service consumption for the call), =
=3D

and
a huge e value (lots of elapsed time for the call). It will almost undoubtedly also have a huge cr+cu value as well, which will confirm =3D that
you're suffering from, I would estimate, on the order of tens of =3D millions of
unnecessary "consistent gets" and "db block gets."

Bottom line: You have some very inefficient SQL here that you need to =
=3D

fix.

The book "Optimizing Oracle Performance" explains further.

Cary Millsap
Hotsos Enterprises, Ltd.
http://www.hotsos.com
* Nullius in verba *

Upcoming events:

- Performance Diagnosis 101: 1/4 Calgary
- SQL Optimization 101: 12/13 Atlanta
- Hotsos Symposium 2005: March 6-10 Dallas
- Visit www.hotsos.com for schedule details...

-----Original Message-----
From: oracle-l-bounce_at_freelists.org =3D
[mailto:oracle-l-bounce_at_freelists.org]
On Behalf Of John Clarke
Sent: Thursday, December 09, 2004 4:51 PM To: oracle-l_at_freelists.org
Subject: mostly latch-free SCN?

Anyone have any idea what this is=3D3D3F

I was 10046 tracing a process and found it post a wait on mostly =3D latch-f=3D3D
ree SCN latch. Excerpt from trace file below.

WAIT #2: nam=3D3D3D'db file scattered read' ela=3D3D3D 1054 p1=3D3D3D17 =
=3D

p2=3D3D3D30500 p=3D3D
3=3D3D3D3
WAIT #2: nam=3D3D3D'db file sequential read' ela=3D3D3D 1377 p1=3D3D3D17 =
=3D

p2=3D3D3D30504 =3D3D
p3=3D3D3D1
*** 2004-12-09 17:27:55.695
WAIT #2: nam=3D3D3D'latch free' ela=3D3D3D 49 =

p1=3D3D3D-4611686009467115424 =3D
p2=3D3D3D88 =3D3D
p3=3D3D3D0

*** 2004-12-09 17:34:20.573
WAIT #2: nam=3D3D3D'db file sequential read' ela=3D3D3D 4850 p1=3D3D3D50 =
=3D

p2=3D3D3D119591=3D3D
p3=3D3D3D1
WAIT #2: nam=3D3D3D'db file sequential read' ela=3D3D3D 6175 p1=3D3D3D50 =
=3D

p2=3D3D3D119590=3D3D
p3=3D3D3D1

I'm not positive what the "^***" lines mean, but what I can tell is =3D that=3D3D
when hit the latch wait and displayed a wall time of 17:27, it didn't =
=3D

d=3D3D
o anything until the next "^***" line. Shouldn't the ela in either the =
=3D
=3D3D

current or next wait match the elasped time=3D3D3F

Thanks in advance.

--
http://www.freelists.org/webpage/oracle-l

--
http://www.freelists.org/webpage/oracle-l

--
http://www.freelists.org/webpage/oracle-l

--
http://www.freelists.org/webpage/oracle-l

Received on Fri Dec 10 2004 - 10:05:30 CST