Re: intermittent long "log file sync" waits
Date: Mon, 10 Feb 2020 20:02:25 -0500
Message-ID: <CAMHX9JLufHOp94gH2VmAXqAvVgr=QbWRn1orVzy0y7rWC=aeHA@mail.gmail.com>
X$TRACE is a bunch of ring buffers, and your process's buffer may wrap (get overwritten) very quickly if there's high-frequency tracing going on in some other component.
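To picture why earlier entries vanish, here's a toy shell one-liner (nothing Oracle-specific about it): with a 5-slot "buffer", writing 12 entries leaves only the newest 5, the same way newer trace records overwrite older WAIT/POST ones in a KST bucket.

```shell
# Simulate a 5-slot ring buffer: only the 5 newest of 12 writes survive,
# analogous to KSL WAIT/POST entries being overwritten by later tracing.
for i in $(seq 1 12); do echo "entry $i"; done | tail -5
# prints entries 8 through 12
```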
The krcp.c@750 & krcput entries you see in the output are from the block change tracking code (writing changed block file/block addresses into the block change tracking buffer).
SQL> @oddc krc
ORADEBUG DOC COMPONENT | grep -i krc

  KRC           Recovery Block Change Tracking (krc)
    KRC_CHANGES Recovery Block Change Tracking CHANGES ((null))
So it looks like transaction processing proceeded after the long log file sync, and the block change tracking code that's enabled by default overwrote the KSL WAIT and POST entries by the time you got around to saving the X$TRACE data. It's worth checking in a test session, just in case, whether enabling the KST 10005:4 tracing actually records waits and posts in your DB.
If the KST bucket data gets overwritten that fast, you can bind a "kstdumpallprocs" dump to any log file sync wait that takes more than, say, 500000 us (in your session). This makes log file syncs slightly slower in the session where it's enabled, but the dump fires only when the wait took at least 0.5 seconds. Test first whether it kicks in at all; you could use DBMS_LOCK.SLEEP and the "PL/SQL lock timer" wait event for that:
ALTER SESSION SET EVENTS 'wait_event["log file sync"] {wait: minwait=500000} kstdumpallprocs(1)';
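A smoke-test sketch along those lines (run it in a scratch session first; the "PL/SQL lock timer" binding is just my suggested stand-in so you can trigger the dump on demand rather than waiting for a slow sync):

```sql
-- Bind the dump action to a wait you can trigger at will:
ALTER SESSION SET EVENTS 'wait_event["PL/SQL lock timer"] {wait: minwait=500000} kstdumpallprocs(1)';

-- A 1-second sleep exceeds the 0.5 s minwait threshold, so the dump should fire:
EXEC DBMS_LOCK.SLEEP(1)

-- Once you've confirmed the dump appears, switch the trigger to the real event:
ALTER SESSION SET EVENTS 'wait_event["log file sync"] {wait: minwait=500000} kstdumpallprocs(1)';
```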
This would dump the KST trace buffers (X$TRACE) for all processes. You may have seen such trace entries in systemstate/errorstack dumps before; this is what an X$TRACE dump looks like:
Trace Bucket Dump Begin: default bucket for process 2 (osid: 27409, PMON)
TIME(*=approx):SEQ:COMPONENT:FILE@LINE:FUNCTION:SECT/DUMP:SID:SERIAL#: [EVENT#:PID] DATA
I've described the *wait_event[]* tracing and conditional dumping here:
Note that my blog post says the *minwait* unit is milliseconds, but it should be microseconds instead (I will fix the typo).
You might also want to increase *_trace_buffers* for the background processes, in case LGWR keeps working while your foreground process is waking up to finish the log file sync wait.
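A sketch of what I mean, with a big caveat: this is an underscore parameter, and I'm assuming the "<target>:<KB>" value syntax here, so verify it against your version (and with support) before touching production.

```sql
-- Assumed syntax: grow the KST trace buffers (here to 256 KB for all
-- processes) so LGWR's entries survive longer before the ring buffer wraps.
ALTER SYSTEM SET "_trace_buffers"='ALL:256' SCOPE=SPFILE;
```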
As you mentioned LGWR priorities: if you want to ensure a higher priority for LGWR on Oracle 19c (on Linux), you'd probably want to use the *_highest_priority_processes* parameter for LGWR (not recommending it for any random process, but LGWR is special). That said, I think we already checked that there was no CPU shortage (I've lost track too :-) ... of course there could be an occasional very short spike of CPU activity that doesn't show up with regular tools. Note that you need to use "LG*", not LGWR, otherwise LGWR doesn't pick this setting up. You can verify your current priorities:
- ps -eo pid,comm,cls,pri | grep _lg
In my non-RAC DB:
- _highest_priority_processes = "VKTM|LG*"
I get this in my DB (have set LG* only for the LINPRD db):
$ ps -eo pid,comm,cls,pri | grep _lg
27448 ora_lgwr_linprd RR 41
27454 ora_lg00_linprd RR 41
27460 ora_lg01_linprd RR 41
28062 ora_lgwr_xe     TS 19
28066 ora_lg00_xe     TS 19
28070 ora_lg01_xe     TS 19
28343 ora_lgwr_orclcd TS 19
28347 ora_lg00_orclcd TS 19
28351 ora_lg01_orclcd TS 19
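If you want to eyeball that quickly, a small awk filter (a sketch; the field positions assume the exact `ps -eo pid,comm,cls,pri` column order, and the sample data is inlined here for illustration) flags any log writer process that is NOT in the realtime (RR) scheduling class:

```shell
# Sample "ps -eo pid,comm,cls,pri" output; on a live host you'd pipe
# ps straight into the awk instead of using a canned variable.
ps_output='27448 ora_lgwr_linprd RR 41
28062 ora_lgwr_xe TS 19
28066 ora_lg00_xe TS 19'

# Field 2 is the command name, field 3 the scheduling class:
# print any _lg process whose class is not RR (realtime round-robin).
echo "$ps_output" | awk '$2 ~ /_lg/ && $3 != "RR" {print $2 " not realtime (cls=" $3 ")"}'
```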
--
Tanel Poder
https://tanelpoder.com/seminar

On Mon, Feb 10, 2020 at 5:18 PM Chris Stephens <cstephens16@gmail.com> wrote:

> I was able to capture x$trace output from a session w/ 10005:4 for a session
> experiencing a > 1 sec LFS wait.
>
> $ awk '$0 ~ /^*** SESSION ID/ || ($7 > 1000000 && $0 ~ /log file sync/) {print $0}' lsst2db3_ora_1100.trc
> *** SESSION ID:(8.36700) 2020-02-10T14:26:43.346885-06:00
> WAIT #0: nam='log file sync' ela= 1165317 buffer#=18314 sync scn=1014204133 p3=0 obj#=-1 tim=534257573876
>
> however i don't see rows with data like '%log file%':
>
> SQL> select count(*) from xt_snaps where sid = 8 and serial# = 36700;
>    COUNT(*)
> ___________
>          88
>
> SQL> select count(*) from xt_snaps where sid = 8 and serial# = 36700 and data like '%log file%';
>    COUNT(*)
> ___________
>           0
>
> Here is a gist of all data corresponding to that sid,serial# (i didn't see anything interesting):
>
> https://gist.github.com/cs2018ncsa/b2a4e235dc68f6fe240461f4d2035e96
>
> I ran the following (generated from a script executing every 5 seconds looking
> for new sessions running this workload) to enable that 10005:4 event, so i
> think i got things right:
>
> alter tracing enable "10005:4:8";
>
> Oracle support recommended we set the following and test:
>
> alter system set "_high_priority_processes"='LMS*|LGWR|PMON' scope=spfile;
> ALTER SYSTEM SET "_use_single_log_writer"=TRUE scope=spfile;  <<-- restart required.
> ALTER SYSTEM SET "_use_adaptive_log_file_sync"=false;
>
> we did that and still experienced the long LFS.
>
> not sure where to go from here except to the fridge to grab a beer.

--
http://www.freelists.org/webpage/oracle-l
Received on Tue Feb 11 2020 - 02:02:25 CET