Re: intermittent long "log file sync" waits

From: Tanel Poder <tanel_at_tanelpoder.com>
Date: Wed, 29 Jan 2020 23:32:41 -0500
Message-ID: <CAMHX9J+FooJWWr896aFx7oqeMk=PKA__GaHjxjU_gsOyE1Tzjg_at_mail.gmail.com>



Comments:

  1. First, looks like you're using all the right tools for systematic troubleshooting ;-)
  2. As "log file sync" is not directly an I/O wait event (IO happens indirectly under the hood), but "wait for LGWR to ACK/increment redo persisting position" then there may be multiple reasons for delays. I'm repeating some of what others have said here too:
    • LGWR could be waiting for remote ACK (synchronous data guard) - that should show up as a LNS* wait or something like that though
    • Process slowness due to CPU overload/scheduling latency or swapping (in both cases, LGWR may do its work fast, but it takes a while for the foreground process to wake up and end the wait event - you could check if there are any reports regarding switching LGWR mode in alert.log around that time)
    • Bug/missed post due to switching between LGWR progress polling vs. waiting for LGWR post
    • ASM instance communication issues (ASM communication can be on the critical path when creating/changing files on ASM and even for reading, when the ASM metadata cache in your local instance doesn't have what it needs)
    • So, the physical redo log file disk write is not necessarily the biggest contributor to the log file sync wait
  3. You'll need to measure what LGWR itself was doing during the problem time
  4. Paste the full *Snapper* output (of one snapshot taken during the problem time) here if you can - we'll be able to see what the slaves were doing (a sketch of one possible invocation follows this list)
    • You'd need to include not only the LGWR SID but also any LG* slave SIDs
    • Also, we'll see better LGWR delay metrics from V$SESSTAT that go beyond just wait events
    • I'm thinking of metrics like these - Snapper already handles them (look at the rightmost column):
      • *redo synch time overhead (usec)* - "FG wakeup overhead per log file sync"
      • *redo write time* - accounts for the end-to-end redo write time (not just the log file parallel write syscall), "per redo write"
  5. Regarding the *ASH wait chains* error on 19c - I'm aware of this but forgot to look into it
    • It probably has something to do with GV$ views (or the PX slaves used for PDB V$ queries) and possibly the CONNECT BY I'm using
    • Does it work in the CDB (if multitenant)?
    • Does it work with lower optimizer features enabled?
    • As we are already manually looking into what LGWR and its slaves are doing, there's no need for the wait chains script here
  6. Linux Process Snapper (pSnapper) can show what the LGWR/LG* processes were doing at the OS level
  7. Were LGWR/LG* processes mostly idle during the problem time?
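
To make points 3 and 4 more concrete, here is a minimal sketch of one way to identify and snap the log writer sessions; the program-name LIKE patterns, the "10 seconds, 1 snapshot" arguments and the subquery-style SID list are illustrative assumptions, not the only way to call Snapper:

    -- Find the LGWR and LG* worker SIDs (the '(LG0' pattern assumes 12c+ style LG00/LG01... workers)
    SELECT sid, serial#, program
      FROM v$session
     WHERE program LIKE '%(LGWR)%'
        OR program LIKE '%(LG0%';

    -- Snap LGWR and its workers during the problem window (Snapper accepts a quoted
    -- subquery as the SID list; "all 10 1" = all output modes, one 10-second snapshot)
    @snapper all 10 1 "select sid from v$session where program like '%(LGWR)%' or program like '%(LG0%'"

    -- The two statistics mentioned above can also be eyeballed as cumulative values
    -- (Snapper shows their per-second deltas in its rightmost column)
    SELECT ss.sid, sn.name, ss.value
      FROM v$sesstat ss
      JOIN v$statname sn ON sn.statistic# = ss.statistic#
     WHERE sn.name IN ('redo synch time overhead (usec)', 'redo write time')
       AND ss.value > 0;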

Incidentally, I'm doing another hacking session tomorrow (Thursday 30 Jan at 12pm ET) about complex Oracle performance problems that span multiple layers and have multiple moving parts. I can add some LGWR troubleshooting magic at the end of that (there goes my plan to fit the session into 1.5h again!). I'll post the details on my blog in a few minutes.

--
Tanel Poder
https://blog.tanelpoder.com/seminar


On Tue, Jan 28, 2020 at 2:22 PM Chris Stephens <cstephens16_at_gmail.com>
wrote:


> doesn't look like any ash data for lgwr:
>
> SQL> @ashtop session_id,blocking_session,event2 1=1 "TIMESTAMP'2020-01-27 15:50:48'" "TIMESTAMP'2020-01-27 15:52:40'"
>
>  TOTALSECONDS   AAS  %This   SESSION_ID  BLOCKING_SESSION  EVENT2               FIRST_SEEN           LAST_SEEN            DIST_SQLEXEC_SEEN
>  ____________  ____  ______  __________  ________________  ___________________  ___________________  ___________________  _________________
>            81   0.7  82% |         1713              1710  log file sync        2020-01-27 15:50:54  2020-01-27 15:52:14                  1
>             5     0   5% |         1221                    ON CPU               2020-01-27 15:50:59  2020-01-27 15:52:14                  1
>             3     0   3% |          246                    ON CPU               2020-01-27 15:51:00  2020-01-27 15:52:00                  1
>             3     0   3% |         1713                    ON CPU               2020-01-27 15:50:51  2020-01-27 15:52:27                  2
>             1     0   1% |          123                    ON CPU               2020-01-27 15:51:35  2020-01-27 15:51:35                  1
>             1     0   1% |          127                    ON CPU               2020-01-27 15:51:50  2020-01-27 15:51:50                  1
>             1     0   1% |          252              2321  latch free           2020-01-27 15:52:28  2020-01-27 15:52:28                  1
>             1     0   1% |          978                    ges remote message   2020-01-27 15:50:48  2020-01-27 15:50:48                  1
>             1     0   1% |          983                    latch free           2020-01-27 15:52:28  2020-01-27 15:52:28                  1
>             1     0   1% |         1713                    library cache lock   2020-01-27 15:50:48  2020-01-27 15:50:48                  1
>             1     0   1% |         1831                    ON CPU               2020-01-27 15:50:49  2020-01-27 15:50:49                  1
>
> 11 rows selected.
>
> SQL> @ashtop session_id,blocking_session,event2 session_id=1710 "TIMESTAMP'2020-01-27 15:50:48'" "TIMESTAMP'2020-01-27 15:52:40'"
>
> no rows selected
>
> On Tue, Jan 28, 2020 at 1:09 PM Noveljic Nenad <nenad.noveljic_at_vontobel.com> wrote:
>
> >> You’re using v0.2, but v0.6 should be the latest version. 0.2 wouldn’t show idle blockers.
> >>
> >> I looked at the ashtop output once again. Strange that neither lgwr nor any other bg processes show up significantly there. The question is, what was lgwr doing while the fg processes were waiting on it.
> >>
> >> --
> >> http://www.freelists.org/webpage/oracle-l