
Re: resolving buffer busy waits

From: Jonathan Lewis <jonathan_at_jlcomp.demon.co.uk>
Date: Mon, 15 Sep 2003 20:23:52 +0100
Message-ID: <bk53ig$bk1$1$8300dec7@news.demon.co.uk>

I am fairly sure you have a filesystem problem, and your BBWs are a side-effect. Your comments about changes in throughput after moving files tend to confirm this, and they made me go back to the original trace:

191,000 reads at 9,353 ms/read
plus
44,400 reads at 2,675 ms/read

totals roughly 192,000,000 cs (1 cs = 10 ms) - which is what your v$system_event reports as waits for disk reads.

You always have to assume that such things are coincidences, and not trust them too much - but it's a very suspicious-looking coincidence.
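If you want a quick cross-check, something like the two queries below should do it (a rough sketch against the 9i dynamic views - the event and column names are the standard ones, but treat it as a starting point rather than gospel). readtim and time_waited are both reported in centiseconds, hence the *10 to turn the per-file figure into milliseconds:

  -- sketch: per-datafile average read times (readtim is in cs, so *10 = ms)
  select df.file_id, df.file_name, fs.phyrds,
         round(fs.readtim * 10 / greatest(fs.phyrds, 1), 1) avg_read_ms
  from   v$filestat fs, dba_data_files df
  where  fs.file# = df.file_id
  order by avg_read_ms desc;

  -- sketch: total waits on single- and multi-block reads (time_waited in cs)
  select event, total_waits, time_waited
  from   v$system_event
  where  event in ('db file sequential read', 'db file scattered read');

The second result should tie back (roughly) to the 192,000,000 cs above.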

(And your BBWs are due to sessions waiting for other sessions to finish reading the required blocks - if a block read takes 4.5 seconds, BBWs are very likely to queue up behind it in a busy system.)
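If you want to watch that queueing happen, here is a minimal sketch against v$session_wait - for both of these events P1 is the file# and P2 is the block#, so the waiters line up in the output right next to the session doing the read:

  -- sketch: sessions queued on a buffer vs. the session reading that block
  select sid, event, p1 file#, p2 block#, seconds_in_wait
  from   v$session_wait
  where  event in ('buffer busy waits', 'db file sequential read')
  order by p1, p2, event;

and v$waitstat will tell you which class of block (data block, segment header, undo, etc.) is taking the hits:

  select * from v$waitstat order by time desc;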

Get on to the Unix S/A and see if they can find errors on controllers or disks.

--
Regards

Jonathan Lewis
http://www.jlcomp.demon.co.uk

  The educated person is not the person
  who can answer the questions, but the
  person who can question the answers -- T. Schick Jr


One-day tutorials:
http://www.jlcomp.demon.co.uk/tutorial.html

____Finland__September 22nd - 24th
____Norway___September 25th - 26th
____UK_______December (UKOUG conference)

Three-day seminar:
see http://www.jlcomp.demon.co.uk/seminar.html
____USA__October
____UK___November


The Co-operative Oracle Users' FAQ
http://www.jlcomp.demon.co.uk/faq/ind_faq.html


"Casey" <cdyke_at_corp.home.nl> wrote in message
news:8bc6b8d7.0309150722.3e002b14_at_posting.google.com...

> thx for the response Jonathan, comments inline.
>
> >
> > b) Your average read times for these tablespaces
> > appear to be massive - normally I assume that
> > this is a bug where oracle is counting time in an
> > unsuitable way - but maybe you really do have
> > a peculiar I/O problem.
>
> yup, that's what i thought. hard to believe the numbers. these two
> tablespaces are busiest in the system. but others are busy and their
> average read times are very much normal. so looking at numbers only
> -- the "issue" seems to be entirely focused on these two datafiles.
>
> >
> > c) Your number and times on enqueues is huge -
> > which may be end-user code and UL locks, but
> > might be an issue with distributed transactions.
> >
>
> app is of questionable integrity, but we're stuck w/it!
>
> >
> > In your case, I would tend to assume that the pure
> > I/O load had to be addressed first, as it might fix the
> > BBW as a side-effect. I would also investigate why
> > you enqueues are so expensive because that might
> > be a totally separate problem that also needs to be
> > addressed. I would not, initially, spend much time
> > trying any of the 'hints and tips' fixes for buffer busy
> > waits.
> >
>
> i like that comment very much. am not eager to play w/the bbw problem
> until i have really identified it _as_ the problem. and that i
> haven't, yet.
>
> what i can add (or expand on my first post) is the following:
>
> - all file systems are UFS
> - this is due to usage of cluster software that was incompatible w/veritas
> - UFS at 2.8 can make use of forcedirectio option, but this caused
> issues early on in the project (last year) and was turned off
> - archive file systems are striped in w/the datafile file systems
>
> very early on we said this combination was a disaster waiting to
> happen (cluster included), but we lost, unfortunately.
>
> now, what i can also add is that this problem has sort of "crept" up.
> monitoring statspack reports has seen average read times creep up from
> low double-digit millisecond times up to the massive ones now. this
> has occurred rapidly in the past 3 wks and sort of "stabilised" at the
> silliness i see now. however, there were odd spikes early this year
> too. so if it was an underlying issue, it seems fair to assume these
> numbers should always be "odd". but maybe that's something about
> oracle i have yet to encounter!
>
> so on one hand i have data indicating it appears to be some sort of
> rapid creep associated with -- perhaps -- load. but on the other, it
> looks like the potential for "odd numbers" has always been there.
>
> and here's something to either laugh at or simply ponder: we have two
> datafile file systems - one had very recently jumped to 92% capacity
> after a normal growth extension. at a stretch, i decided to relocate
> a file, taking that file system down to 87%. these odd numbers w/in
> oracle did drop, but are still whacky -- however, we nearly doubled
> the number of checkpoints completed in an avg 24 hour period that
> evening after the outage. that throughput increase manifested itself
> in much higher application throughput and has been sustained since.
> IO problem you say?
>
> in summation: there are a lot of oddities here. but i have to tread
> carefully.
>
> thx again for your comments Jonathan. very interested to see if the
> extra info provided triggers more comments.
>
> ah - nuno - no NFS ... that i can say w/certainty!
>
> cheers,
>
> casey ...

