Re: intermittent long "log file sync" waits

From: Tanel Poder <tanel_at_tanelpoder.com>
Date: Mon, 10 Feb 2020 20:02:25 -0500
Message-ID: <CAMHX9JLufHOp94gH2VmAXqAvVgr=QbWRn1orVzy0y7rWC=aeHA_at_mail.gmail.com>



X$TRACE is a set of ring buffers, so your process's buffer may wrap and get overwritten very quickly if some other component is doing high-frequency tracing.

The krcp.c@750 & krcput entries you see in the output are from the block change tracking code (writing the file/block addresses of changed blocks into the block change tracking buffer).

SQL> @oddc krc
ORADEBUG DOC COMPONENT | grep -i krc

  KRC                          Recovery Block Change Tracking (krc)
  KRC_CHANGES                  Recovery Block Change Tracking CHANGES
((null))

So it looks like transaction processing continued after the long log file sync, and the block change tracking tracing that's enabled by default had overwritten the KSL WAIT and POST entries by the time you saved the X$TRACE data. It's still worth checking in a test session whether enabling the KST 10005:4 tracing records waits and posts in your DB at all.
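A minimal sketch of such a check (run the query as SYS, since X$TRACE isn't exposed to regular users; sid 123 below is just a placeholder for your test session):

-- enable KST event 10005 at level 4 for the test session
ALTER TRACING ENABLE "10005:4:123";

-- run a small commit-heavy test in that session, then look for wait/post
-- entries before the ring buffer wraps
SELECT sid, data
  FROM x$trace
 WHERE sid = 123
   AND (data LIKE '%WAIT%' OR data LIKE '%POST%');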

If the KST bucket data gets overwritten really fast, you can bind a "kstdumpallprocs" dump to any log file sync wait that takes more than, say, 500000 us (in your session). This makes log file syncs slightly slower in the session where it's enabled, but the dump fires only when the wait took at least 0.5 seconds. Test first whether it actually kicks in; you can use DBMS_LOCK.SLEEP / the "PL/SQL lock timer" wait event for that:

ALTER SESSION SET EVENTS 'wait_event["log file sync"] {wait: minwait=500000} kstdumpallprocs(1)';
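For example, you could first bind the same dump to the "PL/SQL lock timer" wait event and trigger it with a sleep (just a sketch; it assumes you have execute rights on DBMS_LOCK, on 18c+ DBMS_SESSION.SLEEP works as well):

ALTER SESSION SET EVENTS 'wait_event["PL/SQL lock timer"] {wait: minwait=500000} kstdumpallprocs(1)';

EXEC DBMS_LOCK.SLEEP(1);   -- ~1 second wait, comfortably above the 0.5 s threshold

-- the dump should show up in this session's trace file:
SELECT value FROM v$diag_info WHERE name = 'Default Trace File';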

This dumps the KST trace buffers (X$TRACE) of all processes. You may have seen such trace entries in systemstate/errorstack dumps before; that section is the X$TRACE dump:



Trace Bucket Dump Begin: default bucket for process 2 (osid: 27409, PMON)
TIME(*=approx):SEQ:COMPONENT:FILE@LINE:FUNCTION:SECT/DUMP:SID:SERIAL#: [EVENT#:PID] DATA

I've described the *wait_event[]* tracing and conditional dumping here:

Note that I mention the *minwait* unit being milliseconds in my blog post, but it should be in microseconds instead (will fix the typo).

You might also want to increase *_trace_buffers* for the background processes, in case LGWR keeps on working while your foreground process is still waking up to finish the log file sync wait.
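Just as an illustration, the change could look like the line below; the 'PROCESS:size' value format is from memory, so treat it as an assumption and verify it for your version first (the parameter is static, hence SPFILE scope and a restart):

-- hedged sketch: grow the KST trace buffers for LGWR (value format is an assumption)
ALTER SYSTEM SET "_trace_buffers"='LGWR:4096' SCOPE=SPFILE;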

As you mentioned LGWR priorities: if you want to ensure a higher priority for LGWR on Oracle 19c (on Linux), you'd probably want to use the *_highest_priority_processes* parameter for LGWR (not recommending it for any random process, but LGWR is special). But I think we already checked that there was no CPU shortage (I have lost track too :-) ... of course there could be an occasional very short spike of CPU activity that doesn't show up in regular tools. And you need to use "LG*", not "LGWR", otherwise LGWR doesn't pick this setting up. You can verify your current priorities (there's also an ALTER SYSTEM sketch after the ps output below):

  • ps -eo pid,comm,cls,pri | grep _lg

In my non-RAC DB:

  • _highest_priority_processes = "VKTM|LG*"

I get this in my DB (have set LG* only for the LINPRD db):

$ ps -eo pid,comm,cls,pri | grep _lg

27448 ora_lgwr_linprd  RR  41
27454 ora_lg00_linprd  RR  41
27460 ora_lg01_linprd  RR  41
28062 ora_lgwr_xe      TS  19
28066 ora_lg00_xe      TS  19
28070 ora_lg01_xe      TS  19
28343 ora_lgwr_orclcd  TS  19
28347 ora_lg00_orclcd  TS  19
28351 ora_lg01_orclcd  TS  19
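
If you want to set it the same way, here's a minimal sketch (assuming the parameter is static, so it needs SCOPE=SPFILE and an instance restart; the VKTM|LG* value just mirrors my example above):

ALTER SYSTEM SET "_highest_priority_processes"='VKTM|LG*' SCOPE=SPFILE;
-- restart the instance, then re-check the scheduling class/priority with:
--   ps -eo pid,comm,cls,pri | grep _lg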


--
Tanel Poder
https://tanelpoder.com/seminar


On Mon, Feb 10, 2020 at 5:18 PM Chris Stephens <cstephens16_at_gmail.com>
wrote:

> I was able to capture x$trace output from session w/ 10005:4 for a session
> experiencing > 1 sec LFS wait.
>
> $ awk '$0 ~ /^*** SESSION ID/ || ($7 > 1000000 && $0 ~ /log file sync/)
> {print $0}' lsst2db3_ora_1100.trc
> *** SESSION ID:(8.36700) 2020-02-10T14:26:43.346885-06:00
> WAIT #0: nam='log file sync' ela= 1165317 buffer#=18314 sync
> scn=1014204133 p3=0 obj#=-1 tim=534257573876
>
> however i don't see rows with "data like '%log file%'":
>
> SQL> select count(*) from xt_snaps where sid = 8 and serial# = 36700;
>    COUNT(*)
> ___________
>          88
>
> SQL> select count(*) from xt_snaps where sid = 8 and serial# = 36700 and
> data like '%log file%';
>    COUNT(*)
> ___________
>           0
>
> Here is a gist of all data corresponding to that sid,serial# (i didn't see
> anything interesting):
>
> https://gist.github.com/cs2018ncsa/b2a4e235dc68f6fe240461f4d2035e96
>
> I ran the following (generated from a script executing every 5 seconds
> looking for new sessions running this workload) to enable that 10005:4
> event so i think i got things right:
>
> alter tracing enable "10005:4:8";
>
> Oracle support recommended we set the following and test:
>
> Here are additional steps I would like you to try.
>
> alter system set "_high_priority_processes"='LMS*|LGWR|PMON' scope=spfile;
> ALTER SYSTEM SET "_use_single_log_writer"=TRUE scope=spfile; <<-- restart required.
> ALTER SYSTEM SET "_use_adaptive_log_file_sync"= false;
>
>
> we did that and still experienced the long LFS.
>
> not sure where to go from here except to the fridge to grab a beer.
>

--
http://www.freelists.org/webpage/oracle-l