Re: intermittent long "log file sync" waits

From: Stefan Koehler <contact_at_soocs.de>
Date: Tue, 28 Jan 2020 21:30:13 +0100 (CET)
Message-ID: <937082961.116422.1580243414190_at_ox.hosteurope.de>


Hello Chris,
these are the most fun problems - I love to troubleshoot sporadic LGWR issues :-)

I just read through the whole thread quickly and here are my two cents:

  1. Can you profile the extended SQL trace file with an appropriate profiler (e.g. Method R Profiler or TVD$XTAT; the latter is free) to get an exact wait event histogram of the "log file sync" waits for the slow run?
  2. You don't see any ASH data for the LGWR process for this particular time frame, right? Might it be possible that the scalable log writer mode kicked in and the redo work was done by the worker processes? Just check the LGWR trace file and you should see which mode (single log writer mode/scalable log writer mode) was used. If it is the scalable one, you have to take a look at the lg<nn> worker processes instead.
  3. You see "log file parallel write" for your loop workload, which is no surprise as you need to write redo. However, if you see such large latency spikes (basically in the same ballpark as the "log file sync" ones - see point 1), it is very likely that you have an OS configuration problem (e.g. Linux kernel AIO slots, I/O scheduler, device queue depth, etc.) or a storage problem. You can troubleshoot these "log file parallel write" syscalls, including their wchan, with Tanel Poder's psnapper (https://github.com/tanelpoder/psnapper) and see where they are stuck/waiting.
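If no profiler is at hand for point 1, a rough wait event histogram can also be pulled straight out of the raw 10046 trace file, since every wait appears there as a "WAIT #...: nam='log file sync' ela= <microseconds>" line. A minimal sketch (the bucket boundaries are arbitrary choices for illustration):

```python
import re
from collections import Counter

# Matches extended SQL trace WAIT lines such as:
#   WAIT #139682: nam='log file sync' ela= 528608 buffer#=... tim=...
# where ela= is the elapsed time in microseconds.
WAIT_RE = re.compile(r"nam='log file sync' ela= (\d+)")

# Histogram bucket upper bounds in microseconds (arbitrary for illustration)
BUCKETS = [1000, 10000, 100000, 1000000]

def bucket_label(ela_us):
    """Return the histogram bucket a wait time (in microseconds) falls into."""
    for upper in BUCKETS:
        if ela_us < upper:
            return f"< {upper} us"
    return f">= {BUCKETS[-1]} us"

def histogram(lines):
    """Build a {bucket: count} histogram of 'log file sync' waits."""
    counts = Counter()
    for line in lines:
        m = WAIT_RE.search(line)
        if m:
            counts[bucket_label(int(m.group(1)))] += 1
    return counts
```

Feed it the slow run's trace file, e.g. histogram(open("ora_12345_mytrace.trc")), and compare against the fast run - with a 0.53 s mean you should see nearly all 150 waits land in the top buckets.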

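For point 3, what psnapper does under the hood is repeatedly sample Linux /proc pseudo-files such as /proc/<pid>/wchan (the kernel function the task is sleeping in) and /proc/<pid>/syscall. A stripped-down sketch of that sampling idea, in case you want to see the raw data yourself (the pid, sample count, and interval are placeholders; requires a Linux /proc filesystem):

```python
import time
from collections import Counter
from pathlib import Path

def read_proc(pid, name):
    """Read one /proc/<pid>/<name> pseudo-file; '-' if gone or unreadable."""
    try:
        return Path(f"/proc/{pid}/{name}").read_text().strip() or "-"
    except OSError:
        return "-"

def sample(pid, n=100, interval=0.05):
    """Collect n (wchan, syscall) samples for a process, psnapper-style."""
    samples = []
    for _ in range(n):
        samples.append((read_proc(pid, "wchan"), read_proc(pid, "syscall")))
        time.sleep(interval)
    return samples

def top_waits(samples):
    """Aggregate samples into (wchan, count) pairs sorted by frequency."""
    return Counter(s[0] for s in samples).most_common()
```

Run it against LGWR's (or the lg<nn> workers') pid during a slow run; if most samples show an I/O-related wchan, the time is being lost in the storage stack rather than in Oracle. Exact wchan function names vary by kernel version, so treat them as hints, not identifiers.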
Have fun troubleshooting :-)

Best Regards
Stefan Koehler

Independent Oracle performance consultant and researcher
Website: http://www.soocs.de
Twitter: _at_OracleSK

> Chris Stephens <cstephens16_at_gmail.com> hat am 28. Januar 2020 um 17:09 geschrieben:
>
>
> 3-node Oracle 19.3 RAC
> Centos 7
>  
> We have a SQLAlchemy/Python-based application workload that runs the exact same steps with widely varying response times, which appear to be related to varying "log file sync" wait times.
>  
> Here is a profile of a "fast" run:
>  
> CALL-NAME                        DURATION       %   CALLS      MEAN       MIN        MAX
> ------------------------------  ---------  ------  ------  --------  --------  ---------
> SQL*Net message from client     53.197782   91.8%  10,092  0.005271  0.000177  28.568493
> EXEC                             3.759177    6.5%   9,816  0.000383  0.000000   0.239592
> row cache lock                   0.233153    0.4%     541  0.000431  0.000113   0.000941
> PARSE                            0.140399    0.2%   4,867  0.000029  0.000000   0.006620
> DLM cross inst call completion   0.137330    0.2%     956  0.000144  0.000004   0.000505
> library cache lock               0.100171    0.2%     215  0.000466  0.000151   0.002133
> library cache pin                0.079729    0.1%     216  0.000369  0.000056   0.000710
> FETCH                            0.058253    0.1%   1,062  0.000055  0.000000   0.004148
> log file sync                    0.048217    0.1%     149  0.000324  0.000259   0.000505
> CLOSE                            0.045416    0.1%   4,929  0.000009  0.000000   0.000073
> 20 others                        0.135624    0.2%  11,854  0.000011  0.000000   0.001700
> ------------------------------  ---------  ------  ------  --------  --------  ---------
> TOTAL (30)                      57.935251  100.0%  44,697  0.001296  0.000000  28.568493
>
>  
> Here is a profile of a "slow" run:
>  
> CALL-NAME                         DURATION       %   CALLS      MEAN       MIN         MAX
> ------------------------------  ----------  ------  ------  --------  --------  ----------
> SQL*Net message from client     131.186118   61.0%  10,092  0.012999  0.000212  106.789360
> log file sync                    79.291166   36.8%     150  0.528608  0.000264    2.986575
> EXEC                              3.728402    1.7%   9,816  0.000380  0.000000    0.221403
> row cache lock                    0.248868    0.1%     542  0.000459  0.000111    0.001036
> PARSE                             0.164267    0.1%   4,867  0.000034  0.000000    0.004652
> DLM cross inst call completion    0.146981    0.1%     957  0.000154  0.000005    0.001188
> library cache lock                0.104354    0.0%     218  0.000479  0.000160    0.000728
> library cache pin                 0.082504    0.0%     202  0.000408  0.000157    0.000672
> FETCH                             0.056687    0.0%   1,062  0.000053  0.000000    0.003969
> CLOSE                             0.043590    0.0%   4,929  0.000009  0.000000    0.000180
> 20 others                         0.142044    0.1%  11,866  0.000012  0.000000    0.001792
> ------------------------------  ----------  ------  ------  --------  --------  ----------
> TOTAL (30)                      215.194981  100.0%  44,701  0.004814  0.000000  106.789360
>
>
> The weird thing is that I don't see corresponding log I/O waits (awaits) in iostat output.
>  
> I have a ticket open with Oracle, but does anyone have suggestions for finding the root cause and/or a solution?

--
http://www.freelists.org/webpage/oracle-l
Received on Tue Jan 28 2020 - 21:30:13 CET
