Re: intermittent long "log file sync" waits

From: Chris Stephens <cstephens16_at_gmail.com>
Date: Tue, 28 Jan 2020 15:02:59 -0600
Message-ID: <CAEFL0szrecc2cV3FoQMULngy-374vpa+odGVr-_hVC2jiXQZ8A_at_mail.gmail.com>



I figured it would. :) This application is going to be the death of me... but I guess I'll be learning all the way to the grave. Ha.

Maybe I should enable a 10046 trace for LGWR? Luckily this isn't production, so I'm not too worried about unintended consequences of tracing LGWR.
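
Something like this should do it on the affected instance (the OS pid below is just a placeholder; substitute whatever gv$process reports for LGWR):

  SQL> select inst_id, spid from gv$process where pname = 'LGWR';

  -- as sysdba on the instance running the slow test:
  SQL> oradebug setospid 12345
  SQL> oradebug unlimit
  SQL> oradebug event 10046 trace name context forever, level 8
  -- reproduce a slow run, then:
  SQL> oradebug event 10046 trace name context off
  SQL> oradebug tracefile_name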

On Tue, Jan 28, 2020 at 2:57 PM Cary Millsap <cary.millsap_at_method-r.com> wrote:

> Chris, I recognize your Method R Workbench output there :-).
>
> Recently, Jeff and I saw the same kind of behavior. Client had
> intermittent problems with a little program that usually runs in a second
> or so, but every once in a while—about six times per day—it would hang for
> 25+ seconds. Client traced instance-wide for about six hours. We ended up
> with several thousand trace files, but with Method R Workbench 9 (beta, but
> coming soon), we were able to find what we were looking for in just a few
> minutes. We found that each of the 25+ second executions was spending 24+
> seconds of "log file sync" that the short-duration programs didn't have.
>
> Happily, we had the trace file for LGWR, and we could line up the long
> "log file sync" calls in the user programs *precisely* on the
> timeline with "log file parallel write" calls in the LGWR trace, which were
> taking—you guessed it—25+ seconds apiece. Weird thing was, some of these
> super long-latency writes were processing only 2KB of data apiece. This was
> on IBM AIX. The AIX sysadmins couldn't see anything wrong in their
> aggregations of all write call latencies (which you wouldn't—it's easy to
> hide half a dozen 25.000's in a list of millions of 0.003's). I'm not
> sure they ever really tackled the problem of why their hardware writes were
> taking so long.
>
> One hypothesis that Jeff and I were left with at the end was that perhaps
> the application we were trying to diagnose was having a noisy neighbor
> problem with some other system on some other VM on the same box.
>
>
> Cary Millsap
> Method R Corporation
> Author of *Optimizing Oracle Performance <http://amzn.to/OM0q75>*
> and *The Method R Guide to Mastering Oracle Trace Data, 3rd edition
> <https://amzn.to/2IhhCG6>*
>
>
>
> On Tue, Jan 28, 2020 at 2:31 PM Stefan Koehler <contact_at_soocs.de> wrote:
>
>> Hello Chris,
>> these are the most fun problems - I love to troubleshoot sporadic LGWR
>> issues :-)
>>
>> I just read through the whole thread quickly and here are my two cents:
>>
>> 1) Can you profile the extended SQL trace file with an appropriate
>> profiler (e.g. Method R Profiler or TVD$XTAT; the latter is free) to get an
>> exact wait event histogram of the "log file sync" waits for the slow run?
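>>
>> For a quick instance-wide cross-check (not a substitute for profiling the
>> trace file itself), something along these lines against
>> GV$EVENT_HISTOGRAM_MICRO (available since 12.2) also shows how the "log
>> file sync" latencies are distributed on each node:
>>
>>   select inst_id, wait_time_micro, wait_count
>>   from   gv$event_histogram_micro
>>   where  event = 'log file sync'
>>   order  by inst_id, wait_time_micro;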
>>
>> 2) You don't see any ASH data for the LGWR process for this particular
>> time frame, right? Might it be possible that the scalable log writer mode
>> kicked in and the redo work was done by the worker processes? Just check
>> the LGWR trace file and you should see which mode (single log writer
>> mode/scalable log writer mode) was used. If it is the scalable one, you
>> have to take a look at the LG<nn> processes.
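>>
>> For example, something like this lists LGWR and its LG<nn> workers on each
>> instance, so you know which trace files to pull:
>>
>>   select inst_id, pname, spid
>>   from   gv$process
>>   where  pname = 'LGWR' or pname like 'LG0%'
>>   order  by inst_id, pname;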
>>
>> 3) You see "log file parallel write" for your loop workload, which might
>> not be a surprise as you need to write redo. However, if you see such large
>> latency spikes (basically in the same ballpark as the "log file sync" ones
>> - see point 1), it is very likely that you have an OS config problem (e.g.
>> Linux kernel AIO slots, scheduler, device queue, etc.) or a storage
>> problem. You can troubleshoot the syscalls behind these "log file parallel
>> write" waits, including the wchan, with Tanel Poder's psnapper (
>> https://github.com/tanelpoder/psnapper) and see where they are
>> stuck/waiting.
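>>
>> E.g. sampling everything and grouping by syscall and wchan looks roughly
>> like this (check the psnapper README for the exact flags):
>>
>>   sudo psn -G syscall,wchan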
>>
>> Have fun troubleshooting :-)
>>
>> Best Regards
>> Stefan Koehler
>>
>> Independent Oracle performance consultant and researcher
>> Website: http://www.soocs.de
>> Twitter: _at_OracleSK
>>
>> > Chris Stephens <cstephens16_at_gmail.com> wrote on 28 January 2020 at
>> 17:09:
>> >
>> >
>> > 3-node Oracle 19.3 RAC
>> > CentOS 7
>> >
>> > We have a SQLAlchemy/Python-based application workload that runs the
>> exact same steps with widely varying response times, which appear to be
>> related to varying "log file sync" wait times.
>> >
>> > Here is a profile of a "fast" run:
>> >
>> > CALL-NAME                       DURATION      %  CALLS     MEAN      MIN       MAX
>> > ------------------------------ --------- ------ ------ -------- -------- ---------
>> > SQL*Net message from client    53.197782  91.8% 10,092 0.005271 0.000177 28.568493
>> > EXEC                            3.759177   6.5%  9,816 0.000383 0.000000  0.239592
>> > row cache lock                  0.233153   0.4%    541 0.000431 0.000113  0.000941
>> > PARSE                           0.140399   0.2%  4,867 0.000029 0.000000  0.006620
>> > DLM cross inst call completion  0.137330   0.2%    956 0.000144 0.000004  0.000505
>> > library cache lock              0.100171   0.2%    215 0.000466 0.000151  0.002133
>> > library cache pin               0.079729   0.1%    216 0.000369 0.000056  0.000710
>> > FETCH                           0.058253   0.1%  1,062 0.000055 0.000000  0.004148
>> > log file sync                   0.048217   0.1%    149 0.000324 0.000259  0.000505
>> > CLOSE                           0.045416   0.1%  4,929 0.000009 0.000000  0.000073
>> > 20 others                       0.135624   0.2% 11,854 0.000011 0.000000  0.001700
>> > ------------------------------ --------- ------ ------ -------- -------- ---------
>> > TOTAL (30)                     57.935251 100.0% 44,697 0.001296 0.000000 28.568493
>> >
>> >
>> > Here is a profile of a "slow" run:
>> >
>> > CALL-NAME                        DURATION      %  CALLS     MEAN      MIN        MAX
>> > ------------------------------ ---------- ------ ------ -------- -------- ----------
>> > SQL*Net message from client    131.186118  61.0% 10,092 0.012999 0.000212 106.789360
>> > log file sync                   79.291166  36.8%    150 0.528608 0.000264   2.986575
>> > EXEC                             3.728402   1.7%  9,816 0.000380 0.000000   0.221403
>> > row cache lock                   0.248868   0.1%    542 0.000459 0.000111   0.001036
>> > PARSE                            0.164267   0.1%  4,867 0.000034 0.000000   0.004652
>> > DLM cross inst call completion   0.146981   0.1%    957 0.000154 0.000005   0.001188
>> > library cache lock               0.104354   0.0%    218 0.000479 0.000160   0.000728
>> > library cache pin                0.082504   0.0%    202 0.000408 0.000157   0.000672
>> > FETCH                            0.056687   0.0%  1,062 0.000053 0.000000   0.003969
>> > CLOSE                            0.043590   0.0%  4,929 0.000009 0.000000   0.000180
>> > 20 others                        0.142044   0.1% 11,866 0.000012 0.000000   0.001792
>> > ------------------------------ ---------- ------ ------ -------- -------- ----------
>> > TOTAL (30)                     215.194981 100.0% 44,701 0.004814 0.000000 106.789360
>> >
>> >
>> > The weird thing is that I don't see corresponding log I/O waits
>> (awaits) in iostat output.
>> >
>> > I have a ticket open with Oracle, but does anyone have suggestions for
>> finding the root cause and/or a solution?
>> --
>> http://www.freelists.org/webpage/oracle-l
>>
>>
>>

--
http://www.freelists.org/webpage/oracle-l