
Re: resolving buffer busy waits

From: Casey <cdyke_at_corp.home.nl>
Date: 15 Sep 2003 08:22:33 -0700
Message-ID: <8bc6b8d7.0309150722.3e002b14@posting.google.com>


thx for the response Jonathan, comments inline.

>
> b) Your average read times for these tablespaces
> appear to be massive - normally I assume that
> this is a bug where oracle is counting time in an
> unsuitable way - but maybe you really do have
> a peculiar I/O problem.

yup, that's what i thought. hard to believe the numbers. these two tablespaces are the busiest in the system, but others are busy too and their average read times are perfectly normal. so looking at the numbers alone -- the "issue" seems to be focused entirely on these two tablespaces' datafiles.
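
(for reference, and only a rough sketch against the standard v$filestat / dba_data_files views -- statspack derives the same figures as deltas between snapshots -- this is more or less how the per-file average read times can be pulled; readtim is in centiseconds, hence the * 10:)

-- per-datafile average read time in ms, cumulative since instance startup
select d.tablespace_name,
       d.file_name,
       f.phyrds,
       round(f.readtim * 10 / greatest(f.phyrds, 1), 2) as avg_read_ms
from   v$filestat     f,
       dba_data_files d
where  f.file# = d.file_id
order  by avg_read_ms desc;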

>
> c) Your number and times on enqueues are huge -
> which may be end-user code and UL locks, but
> might be an issue with distributed transactions.
>

app is of questionable integrity, but we're stuck w/it!

>
> In your case, I would tend to assume that the pure
> I/O load had to be addressed first, as it might fix the
> BBW as a side-effect. I would also investigate why
> your enqueues are so expensive because that might
> be a totally separate problem that also needs to be
> addressed. I would not, initially, spend much time
> trying any of the 'hints and tips' fixes for buffer busy
> waits.
>

i like that comment very much. am not eager to play w/the bbw problem until i have really identified it _as_ the problem. and that i haven't, yet.
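
(the first thing i intend to do before touching anything is pin down which block class and which sessions the bbw's actually hit. a minimal sketch, assuming the usual v$waitstat / v$session_wait views, where p1 and p2 of the wait are file# and block#:)

-- which class of block is accumulating buffer busy waits (cumulative since startup)
select class, count, time
from   v$waitstat
order  by time desc;

-- who is waiting on them right now; p1 = file#, p2 = block#
select sid, p1 as file_id, p2 as block_id, p3
from   v$session_wait
where  event = 'buffer busy waits';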

what i can add (or expand on my first post) is the following:

very early on we said this combination was a disaster waiting to happen (cluster included), but we lost that argument, unfortunately.

now, what i can also add is that this problem has sort of "crept" up on us. monitoring statspack reports, we have watched average read times creep from low double-digit milliseconds to the massive values we see now. this happened rapidly over the past three weeks and has sort of "stabilised" at the silliness i see now. however, there were odd spikes early this year too. so if it were an underlying issue that had always been there, it seems fair to assume these numbers would always have been "odd" -- but maybe that's something about oracle i have yet to encounter!

so on one hand i have data suggesting some sort of rapid creep associated with -- perhaps -- load. but on the other, it looks like the potential for "odd numbers" has always been there.

and here's something to either laugh at or simply ponder: we have two datafile file systems - one very recently jumped to 92% capacity after a normal growth extension. at a stretch, i decided to relocate a file, taking that file system down to 87%. the odd numbers within oracle did drop, but they are still wacky -- however, we nearly doubled the number of checkpoints completed in an average 24-hour period starting the evening after the outage. that increase manifested itself in much higher application throughput and has been sustained since. an IO problem, you say?
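
(for anyone wanting to eyeball the same thing: a quick sketch assuming the standard v$sysstat and v$log_history views -- statspack's "background checkpoints completed" figure is the snapshot delta of the first query:)

-- cumulative checkpoint counters since startup (statspack reports the delta)
select name, value
from   v$sysstat
where  name like 'background checkpoint%';

-- redo log switches per day, as a rough proxy for checkpoint activity over time
select trunc(first_time) as day, count(*) as log_switches
from   v$log_history
group  by trunc(first_time)
order  by day;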

in summation: there are a lot of oddities here. but i have to tread carefully.

thx again for your comments Jonathan. very interested to see whether the extra info provided triggers more comments.

ah - nuno - no NFS ... that i can say w/certainty!

cheers,

casey
