Re: intermittent very high waits in LGWR on Linux?

From: bugbear <bugbear_at_trim_papermule.co.uk_trim>
Date: Fri, 01 Jul 2005 14:34:27 +0100
Message-ID: <42c54663$0$2024$ed2e19e4@ptn-nntp-reader04.plus.net>

Noons wrote:
> bugbear apparently said,on my timestamp of 1/07/2005 9:14 PM:
>

>> But the spread...
>> 89% are under 29177 microseconds AKA 29 milliseconds
>> with a "reasonable" spread.
>>
>> But the remaining 11% are over 986468 microseconds,
>> which is extraordinarily close to 1 second.
>>
>> Indeed, there are only 3 times (out of 4922)
>> above 29177 but below 986468.
>>
>> It seems that I either get "correct" redo
>> log write out, with times varying from 53 to 29177
>> microseconds, or I "fallback" to some kind of quantized
>> timeout write behavior, driven by a 1 second clock.
>>
>> This is gettin' weird.

>
>
> Not really. I think it has to do with the fs cache
> flush and the remaining parameters in your spfile.
>
> If I were you I'd take the timings of the approx 80% and ignore
> the others. Once you decide on a spread for running a real
> live test with a more realistic config, you'll be able to
> get rid of those last 20% with a complete setup geared for
> Oracle 9i.
>
> One of the development targets for 10g was precisely to make
> it perform significantly better on default setup,
> resource-limited systems. This, so that first time users would
> get a "better" impression of the product.
>
> Previous releases (9i included) were notorious for default
> setups that were nothing short of moronic. This situation got
> aggravated with the SPFILE as it is now binary data and therefore
> not obvious what is inside it. Hence the CREATE PFILE FROM
> SPFILE incantation Holger referred to: it's the most expedient
> way to dump ALL parameters set to anything other than
> default.
>
> Or you can try to SELECT the NAME and VALUE columns from the
> view V$PARAMETER. You wouldn't believe some of the dumb values
> 9i defaults to! It could also well be in archivelog mode,
> which will slow you down periodically on a single disk system.
>
> You'll be able to get similar performance to 10g, it just
> needs a bit more attention to detail. Which is probably
> hidden at this stage behind the "everything in the same
> f/s, default install" syndrome.
>
> It's a common occurrence. Hence my recommendation you take
> your timings from the 80% as the typical results on a
> properly setup system. The purpose of making your redo logs
> larger was precisely to try and highlight bottlenecks on
> switching redos: one of the most common performance traps
> before 10g.
>
> Bottom line: take the bad 20% or so with a very large grain of
> salt and extrapolate based on the 80%. 9i can indeed be
> tuned for more even performance but you probably do not want
> to do that at this stage: not worth it.
>

<<other good stuff read, digested and snipped>>

I think I'm up against a bug. I finally took a step back, stopped looking at Oracle, and looked at the machine.

This is not (quite) as odd as it sounds, since the machine is over on a rack, quite a way from me.

A quick RPM install later gave me the Linux version of iostat.

Running iostat -k 1 whilst running my slow test gives (sample snapshot):

avg-cpu:  %user   %nice    %sys %iowait   %idle
            0.00    0.00    0.00    0.00  100.00

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
dev3-0            3.00         4.00         8.00          4          8

avg-cpu:  %user   %nice    %sys %iowait   %idle
            1.00    0.00    0.00    0.00   99.00

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
dev3-0           12.00         4.00       112.00          4        112

avg-cpu:  %user   %nice    %sys %iowait   %idle
            1.00    0.00    2.00    0.00   97.00

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
dev3-0            7.00         8.00        28.00          8         28

avg-cpu:  %user   %nice    %sys %iowait   %idle
            2.00    0.00    1.00    0.00   97.00

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
dev3-0           13.00        16.00        64.00         16         64

avg-cpu:  %user   %nice    %sys %iowait   %idle
            0.00    0.00    1.00    0.00   99.00

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
dev3-0            7.00         4.00        36.00          4         36
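
Rather than eyeballing samples like these, they can be summarised mechanically. What follows is a minimal Python sketch, not anything I actually ran on the box: the function name summarise is made up for illustration, and it assumes the old sysstat layout shown above, where the last avg-cpu column is %idle and the fourth field of a device line is kB_wrtn/s. The embedded text is two of the samples from this post.

```python
# Two of the iostat samples from this post, pasted verbatim.
SAMPLE = """\
avg-cpu:  %user   %nice    %sys %iowait   %idle
            1.00    0.00    0.00    0.00   99.00

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
dev3-0           12.00         4.00       112.00          4        112

avg-cpu:  %user   %nice    %sys %iowait   %idle
            1.00    0.00    2.00    0.00   97.00

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
dev3-0            7.00         8.00        28.00          8         28
"""


def summarise(text, device="dev3-0"):
    """Average %idle and kB_wrtn/s over all samples in iostat output.

    Assumes the column layout shown in this post (hypothetical helper).
    """
    idle = []
    wrtn = []
    expect_cpu_values = False
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "avg-cpu:":
            # The next non-empty line holds the CPU percentages.
            expect_cpu_values = True
        elif expect_cpu_values:
            idle.append(float(fields[-1]))   # last column is %idle
            expect_cpu_values = False
        elif fields[0] == device:
            wrtn.append(float(fields[3]))    # fourth field is kB_wrtn/s
    return {
        "samples": len(wrtn),
        "mean_idle_pct": sum(idle) / len(idle),
        "mean_kb_wrtn_per_s": sum(wrtn) / len(wrtn),
        "max_kb_wrtn_per_s": max(wrtn),
    }


if __name__ == "__main__":
    print(summarise(SAMPLE))
```

In practice the text would come from a file holding the captured iostat output instead of the embedded string.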

Oracle is making heavy use of neither CPU nor I/O!!!! (and neither is anything else...) The CPU is never less than 97% idle, and the disk never writes more than 112 kB/s.
It appears that the "log file sync" waits I'm seeing are more like sleep() calls. It ain't even tryin'.
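
The "either fast or about one second" split quoted earlier is easy to check for in any list of wait times. This is only a sketch: the two thresholds are the figures from my earlier post (29177 and 986468 microseconds), while the function name bucket and the sample list are invented for illustration and are not my measured data.

```python
# Thresholds taken from the timings quoted earlier in this thread.
FAST_MAX_US = 29177     # slowest "normal" redo write seen
SLOW_MIN_US = 986468    # fastest of the roughly-one-second waits


def bucket(latencies_us):
    """Count waits that are fast, slow, or in the gap between the two."""
    fast = sum(1 for t in latencies_us if t <= FAST_MAX_US)
    slow = sum(1 for t in latencies_us if t >= SLOW_MIN_US)
    middle = len(latencies_us) - fast - slow
    return {"fast": fast, "middle": middle, "slow": slow}


if __name__ == "__main__":
    # Made-up values, for illustration only.
    sample = [53, 410, 2900, 29177, 150000, 986468, 1001203]
    print(bucket(sample))
```

A nearly empty "middle" bucket is what points at a timeout rather than at a slow disk: a disk that was merely struggling would be expected to fill in the gap.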

Since LGWR is a separate process, I am (again) starting to suspect Linux scheduling.
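
One crude way to probe that suspicion is to ask for short sleeps and see how late the process is actually woken. The sketch below is an assumption-laden test, not a diagnosis: sleep_overshoot_us is a made-up name, and it only measures wake-up delay for this Python process, not for LGWR or for Oracle's own posting mechanism. It can show whether the box has coarse timer or scheduler behaviour, nothing more.

```python
import time


def sleep_overshoot_us(requested_s=0.001, rounds=200):
    """Return how late each wake-up was, in microseconds.

    Hypothetical helper: requests a short sleep and compares the
    elapsed time against what was requested.
    """
    overshoots = []
    for _ in range(rounds):
        start = time.perf_counter()
        time.sleep(requested_s)
        elapsed = time.perf_counter() - start
        overshoots.append((elapsed - requested_s) * 1e6)
    return overshoots


if __name__ == "__main__":
    results = sorted(sleep_overshoot_us())
    print("min    %.0f us" % results[0])
    print("median %.0f us" % results[len(results) // 2])
    print("max    %.0f us" % results[-1])
```

Overshoots clustered around a fixed value would suggest a timer tick; a handful of very large outliers on an idle machine would be more interesting.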

BugBear

Received on Fri Jul 01 2005 - 08:34:27 CDT