Re: intermittent long "log file sync" waits

From: Frits Hoogland <frits.hoogland_at_gmail.com>
Date: Sat, 1 Feb 2020 22:48:34 +0100
Message-Id: <700AA74F-94E6-49A0-960D-E1D20902C8A3_at_gmail.com>



Hi Chris,

Pardon my ignorance, but I can't find an answer to this question in the thread: do you have measurements of the foreground session encountering the high log file sync waits and of the log writer and its workers at the same time, so you can see whether the log file parallel write waits are approximately as high? Are you absolutely sure this is not simply IO latency spiking because the IO got stuck for whatever reason? The backup running would fit that scenario too.
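
A quick way to get a rough comparison of the two is the event histogram. Just a sketch: v$event_histogram is cumulative since instance startup and bucketed in milliseconds, so it only gives an indication, not a per-incident measurement:

select event, wait_time_milli, wait_count
  from v$event_histogram
 where event in ('log file sync', 'log file parallel write')
 order by event, wait_time_milli;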

Are you also absolutely sure there isn't a log switch happening at the same time? From what I saw when I skimmed through the thread, you weren't sure. In all honesty, I do not know (I haven't tested it) what happens when a log switch occurs during the waiting part of a commit.
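
You could simply look at v$log_history around the problem window. A sketch; the timestamps below are just an example window around your 13:04 trace, and v$log_history only records completed switches per thread:

select thread#, sequence#, first_time
  from v$log_history
 where first_time between to_date('2020-01-29 13:00:00', 'YYYY-MM-DD HH24:MI:SS')
                      and to_date('2020-01-29 13:10:00', 'YYYY-MM-DD HH24:MI:SS')
 order by first_time;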

Are you running in multi-tenant mode? I recently found out that the IO wait events are timed differently between multi-tenant and non-multi-tenant; I have no idea why. The difference is that with non-multi-tenant both (asynchronous) calls are timed in the event, while with multi-tenant the submission of the IO (io_submit) and the non-blocking reap (io_getevents) are not timed; only when the IO is not found ready by the non-blocking call is a wait event started and a blocking reap call done. The reason for mentioning this is that when you run in multi-tenant mode, it's perfectly possible to see no log file parallel write wait at all.
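
To see which situation applies to you, you could check whether the database is a CDB and whether the log writer session reports any log file waits at all. Just a sketch; the join on the program column is a simple way to find the log writer's session:

select cdb from v$database;

select se.sid, se.event, se.total_waits, se.time_waited_micro
  from v$session_event se, v$session s
 where se.sid = s.sid
   and s.program like '%(LGWR)%'
   and se.event like 'log file%';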

The logwriter is a process that sleeps on a semaphore; it either times out (every 3 seconds) or gets posted, then performs some housekeeping and enters the redo write driver. In the driver it reads the public strands, and if it finds unwritten redo it performs a redo write; if it doesn't find unwritten redo, it performs a "null write". The housekeeping part also contains RAC housekeeping, which I haven't looked into yet.

The redo write itself consists of making sure the redo in the public strands is complete, SCN adjustments, redo preparation in the public strands, then the actual write, and finally adjusting the on-disk SCN and posting the processes waiting in log file sync if the instance is in post-wait mode.

If the instance is using the logwriter worker processes (the so-called “adaptive scalable logwriter” mode), the redo write part is executed by the worker processes. It’s that simple.
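
Whether the worker processes are present on your instance is easy to check. A sketch: the workers show up as LG00, LG01 and so on, and the alert log reports when the log writer switches between single and scalable mode:

select pname, spid
  from v$process
 where pname = 'LGWR' or pname like 'LG0%'
 order by pname;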

The reason for detailing this process is that Oracle, probably since version 12, keeps detailed timing statistics about the different stages of the logwriter write process. It's important to realise that these statistics describe only what happens inside the redo write driver, NOT outside of it. So if the logwriter spends a lot of time outside the driver, for example during housekeeping, you will not see that in these statistics. This is how to query them:

select b.name, round(b.value/a.value, 0) "time (us)"
  from v$sysstat a, v$sysstat b
 where a.name = 'redo writes'
   and b.name = 'redo write gather time'
union all
select c.name, round((c.value-b.value)/a.value, 0)
  from v$sysstat a, v$sysstat b, v$sysstat c
 where a.name = 'redo writes'
   and b.name = 'redo write gather time'
   and c.name = 'redo write schedule time'
union all
select d.name, round((d.value-c.value)/a.value, 0)
  from v$sysstat a, v$sysstat c, v$sysstat d
 where a.name = 'redo writes'
   and c.name = 'redo write schedule time'
   and d.name = 'redo write finish time'
union all
select e.name, round((e.value-d.value)/a.value, 0)
  from v$sysstat a, v$sysstat d, v$sysstat e
 where a.name = 'redo writes'
   and d.name = 'redo write finish time'
   and e.name = 'redo write total time';

The different stages are:

1/ gather time: making sure the writing of redo into the public strands is done, SCN adjustment and redo preparation.
2/ schedule time: includes the gather time plus all the write preparation up to, but excluding, the actual write.
3/ finish time: includes the previous times plus the actual write calls.
4/ total time: includes the previous times plus posting the foregrounds.

This way you can actually see whether the logwriter or its workers get pushed off CPU while posting the foregrounds.
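
As a rough cross-check you can put these per-write figures next to the average log file parallel write time. Again just a sketch; this is an average since instance startup, and keep the multi-tenant timing caveat above in mind:

select event, round(time_waited_micro / total_waits, 0) "avg (us)"
  from v$system_event
 where event = 'log file parallel write';

If the redo write total time per write is much higher than this average, the time is being spent around the IO (gathering, scheduling or posting the foregrounds) rather than in the write call itself.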

Frits Hoogland

http://fritshoogland.wordpress.com
frits.hoogland_at_gmail.com
Mobile: +31 6 14180860

> On 30 Jan 2020, at 21:43, Chris Stephens <cstephens16_at_gmail.com> wrote:
>
> Here's some additional ash output from the trace data posted in last message including event_hist.sql:
>
> SQL> _at_ashtop username,event2,program2,blocking_session,blocking_inst_id 1=1 "TIMESTAMP'2020-01-29 13:04:20'" "TIMESTAMP'2020-01-29 13:04:27'"
> TOTALSECONDS AAS %This USERNAME EVENT2 PROGRAM2 BLOCKING_SESSION BLOCKING_INST_ID FIRST_SEEN LAST_SEEN DIST_SQLEXEC_SEEN
> _______________ ______ __________ ________________ ________________ _____________ ___________________ ___________________ ______________________ ______________________ ____________________
> 2 0.3 50% | SYS ON CPU (sqlplus) 2020-01-29 13:04:24 2020-01-29 13:04:25 1
> 1 0.1 25% | GEN3_MGOWER_3 ON CPU (python) 2020-01-29 13:04:22 2020-01-29 13:04:22 1
> 1 0.1 25% | GEN3_MGOWER_3 log file sync (python) 1710 2 2020-01-29 13:04:20 2020-01-29 13:04:20 1
>
>
> SQL> _at_bg lgwr
> NAME DESCRIPTION SID OPID SPID PADDR SADDR
> _______ ______________ _______ _______ ________ ___________________ ___________________
> LGWR Redo etc. 1710 34 26552 00000001E8718860 00000001D8BFF4A0
>
> SQL> _at_ashtop username,event2,program2,blocking_session,blocking_inst_id session_id=1710 "TIMESTAMP'2020-01-29 13:04:20'" "TIMESTAMP'2020-01-29 13:04:27'"
>
> no rows selected
>
> SQL> _at_ashtop username,event2,program2,blocking_session,blocking_inst_id session_id=1710 "TIMESTAMP'2020-01-29 13:04:00'" "TIMESTAMP'2020-01-29 13:05:00'"
>
> no rows selected
>
> SQL> _at_ash/event_hist "log file" 1=1 "TIMESTAMP'2020-01-29 13:04:00'" "TIMESTAMP'2020-01-29 13:05:00'"
> EVH_EVENT EVH_MILLISEC EVH_SAMPLE_COUNT EVH_EST_EVENT_COUNT EVH_EST_TOTAL_SEC PCT_EVT_TIME EVH_GRAPH FIRST_SEEN LAST_SEEN
> ________________ __________________ ___________________ ______________________ ____________________ _______________ _______________ ______________________ ______________________
> log file sync < 1024 1 1.8 1.8 12.8 |# | 2020-01-29 13:04:13 2020-01-29 13:04:13
> log file sync < 4096 1 1 4.1 29.1 |### | 2020-01-29 13:04:12 2020-01-29 13:04:12
> log file sync < 8192 1 1 8.2 58.2 |###### | 2020-01-29 13:04:20 2020-01-29 13:04:20
>
> SQL> _at_ash/event_hist "write" 1=1 "TIMESTAMP'2020-01-29 13:04:00'" "TIMESTAMP'2020-01-29 13:05:00'"
>
> no rows selected
>
> SQL> _at_ash/event_hist "*" 1=1 "TIMESTAMP'2020-01-29 13:04:00'" "TIMESTAMP'2020-01-29 13:05:00'"
> EVH_EVENT EVH_MILLISEC EVH_SAMPLE_COUNT EVH_EST_EVENT_COUNT EVH_EST_TOTAL_SEC PCT_EVT_TIME EVH_GRAPH FIRST_SEEN LAST_SEEN
> _________________ __________________ ___________________ ______________________ ____________________ _______________ _______________ ______________________ ______________________
> log file sync < 1024 1 1.8 1.8 12.8 |# | 2020-01-29 13:04:13 2020-01-29 13:04:13
> log file sync < 4096 1 1 4.1 29.1 |### | 2020-01-29 13:04:12 2020-01-29 13:04:12
> log file sync < 8192 1 1 8.2 58.2 |###### | 2020-01-29 13:04:20 2020-01-29 13:04:20
> row cache lock < 1 1 2469.1 2.5 100 |##########| 2020-01-29 13:04:01 2020-01-29 13:04:01
>
> SQL> _at_ashtop username,event2,program2,blocking_session,blocking_inst_id "lower(program) like '%lg%'" "TIMESTAMP'2020-01-29 13:04:00'" "TIMESTAMP'2020-01-29 13:05:00'"
>
> no rows selected
>
> i'm going to copy over gv$active_session_history to a single instance database and take a look at ash_wait_chains but i don't expect that to reveal anything either.
>
> i need to confirm whether adding redo resolved issue (will have developer fire up workload) but I think next step is to investigate x$trace and possible process communication issues?
>
> On Thu, Jan 30, 2020 at 10:27 AM Chris Stephens <cstephens16_at_gmail.com> wrote:
> Hey Tanel,
>
> Thanks for taking time to write this up! Will definitely be attending that webinar.
>
> comments inline
>
> On Wed, Jan 29, 2020 at 10:32 PM Tanel Poder <tanel_at_tanelpoder.com> wrote:
> Comments:
>
> 1) First, looks like you're using all the right tools for systematic troubleshooting ;-)
>
> ***i try!
>
> 2) As "log file sync" is not directly an I/O wait event (IO happens indirectly under the hood), but "wait for LGWR to ACK/increment redo persisting position" then there may be multiple reasons for delays. I'm repeating some of what others have said here too:
> - LGWR could be waiting for remote ACK (synchronous data guard) - that should show up as a LNS* wait or something like that though
> - Process slowness due to CPU overload/scheduling latency or swapping (in both cases, LGWR may do its work fast, but it takes a while for the foreground process to wake up and end the wait event - you could check if there are any reports regarding switching LGWR mode in alert.log around that time)
> - Bug/missed post due to switching between LGWR progress polling vs. waiting for LGWR post
> - ASM instance communication issues (ASM communication can be on the critical path when creating/changing files on ASM and even for reading, when the ASM metadata cache in your local instance doesn't have what it needs)
> - So, the physical redo log file disk write is not necessarily the biggest contributor to the log file sync wait
>
> *** no data guard in the env.
> *** we have excess CPU. (20 cores / instance and usually only 1-5 active foreground processes but i will confirm from TFA collections)
> *** yesterday i was ready to blame a bug. today, not so sure. i checked log file switches with a script google found me a few days ago. now thinking that was inaccurate as a script from metastink shows tons of switching so that could very well be related.
> ***asm communication is something i need to look into. TFA has reported some ASM issues but there's no details and i never saw anything in asm alert logs indicating an issue.
>
> 3) You'll need to measure what LGWR itself was doing during the problem time
> - Since you're on 19c (12c or later) the LGWR has slaves too and they may be doing most of the work
> - You'll need to look into both the LGWR process and any other LG* ones too
> - Since you have (Oracle) Snapper running already and writing to trace you should have the data already
> - The lgwr in the following _at_snapper ash,stats,trace 10 999999 lgwr syntax includes LGWR slaves too
> - As LGWR slaves have their own sessions (and SIDs), you should run ashtop etc and report if there's any BLOCKING_SESSION value listed for LGWR and slaves
> - If LGWR and the slaves were listed idle during the problem time (while others waited for log file sync event at the same time) and if there was no huge OS level CPU/memory shortage at the time, this looks like a background <-> foreground communication problem.
> - This (and any potential ASM issues) could probably be troubleshot via KST tracing, KSR channels and X$TRACE (some of it is enabled by default), like I showed in a previous hacking session:
> https://www.youtube.com/watch?v=mkmvZv58W6w
>
>
> here is all the output i have corresponding to a 6 sec "log file sync" wait:
>
> Experience duration: R = t1 − t0 = 41.336137 seconds
> Experience begin time: t0 = 2020-01-29T13:04:07.560478
> Experience end time: t1 = 2020-01-29T13:04:48.896615
> CALL-NAME DURATION % CALLS MEAN MIN MAX
> ------------------------------ --------- ------ ------ -------- -------- --------
> SQL*Net message from client 26.017105 66.5% 9,886 0.002632 0.000209 7.858051
> log file sync 10.582409 27.1% 150 0.070549 0.000295 6.909711
> EXEC 2.088757 5.3% 9,702 0.000215 0.000000 0.209199
> DLM cross inst call completion 0.109617 0.3% 212 0.000517 0.000005 0.080575
> row cache lock 0.070118 0.2% 125 0.000561 0.000208 0.000972
> FETCH 0.047586 0.1% 1,029 0.000046 0.000000 0.003056
> PARSE 0.045293 0.1% 4,778 0.000009 0.000000 0.002692
> CLOSE 0.033453 0.1% 4,784 0.000007 0.000000 0.000237
> library cache lock 0.023614 0.1% 55 0.000429 0.000152 0.000901
> library cache pin 0.022311 0.1% 55 0.000406 0.000153 0.000689
> 17 others 0.071430 0.2% 10,772 0.000007 0.000000 0.001487
> ------------------------------ --------- ------ ------ -------- -------- --------
> TOTAL (27) 39.111693 100.0% 41,548 0.000941 0.000000 7.858051
>
> 6 sec wait from raw data:
>
> *** 2020-01-29T13:04:20.724666-06:00
> WAIT #0: nam='log file sync' ela= 6909711 buffer#=58996 sync scn=976089279 p3=0 obj#=-1 tim=1025228531958
> WAIT #0: nam='SQL*Net message to client' ela= 2 driver id=1413697536 #bytes=1 p3=0 obj#=-1 tim=1025228532118
> WAIT #0: nam='SQL*Net message from client' ela= 1711 driver id=1413697536 #bytes=1 p3=0 obj#=-1 tim=1025228533852
> BINDS #140204587734864:
>
> i enabled 10046 on lgwr:
>
> *** 2020-01-29T13:04:20.725994-06:00
> WAIT #0: nam='rdbms ipc message' ela= 2918443 timeout=300 p2=0 p3=0 obj#=-1 tim=1025228533313
> WAIT #0: nam='log file parallel write' ela= 279 files=1 blocks=17 requests=2 obj#=-1 tim=1025228533777
> WAIT #0: nam='log file parallel write' ela= 171 files=1 blocks=3 requests=2 obj#=-1 tim=1025228534104
> WAIT #0: nam='LGWR wait for redo copy' ela= 22 copy latch #=28 p2=0 p3=0 obj#=-1 tim=1025228534182
> WAIT #0: nam='log file parallel write' ela= 183 files=1 blocks=2 requests=2 obj#=-1 tim=1025228534437
> WAIT #0: nam='log file parallel write' ela= 174 files=1 blocks=2 requests=2 obj#=-1 tim=1025228534721
> WAIT #0: nam='rdbms ipc message' ela= 31 timeout=8 p2=0 p3=0 obj#=-1 tim=1025228534786
> WAIT #0: nam='rdbms ipc message' ela= 3233 timeout=8 p2=0 p3=0 obj#=-1 tim=1025228538038
> WAIT #0: nam='log file parallel write' ela= 170 files=1 blocks=2 requests=1 obj#=-1 tim=1025228538297
> WAIT #0: nam='rdbms ipc message' ela= 25 timeout=8 p2=0 p3=0 obj#=-1 tim=1025228538358
> WAIT #0: nam='rdbms ipc message' ela= 11829 timeout=8 p2=0 p3=0 obj#=-1 tim=1025228550204
> WAIT #0: nam='LGWR wait for redo copy' ela= 52 copy latch #=8 p2=0 p3=0 obj#=-1 tim=1025228550294
> WAIT #0: nam='log file parallel write' ela= 239 files=1 blocks=7 requests=1 obj#=-1 tim=1025228550624
> WAIT #0: nam='rdbms ipc message' ela= 23 timeout=6 p2=0 p3=0 obj#=-1 tim=1025228550699
> WAIT #0: nam='rdbms ipc message' ela= 16750 timeout=6 p2=0 p3=0 obj#=-1 tim=1025228567467
> WAIT #0: nam='log file parallel write' ela= 207 files=1 blocks=9 requests=1 obj#=-1 tim=1025228567762
> WAIT #0: nam='rdbms ipc message' ela= 26 timeout=5 p2=0 p3=0 obj#=-1 tim=1025228567825
> WAIT #0: nam='rdbms ipc message' ela= 16249 timeout=5 p2=0 p3=0 obj#=-1 tim=1025228584091
> WAIT #0: nam='log file parallel write' ela= 213 files=1 blocks=9 requests=1 obj#=-1 tim=1025228584400
> WAIT #0: nam='rdbms ipc message' ela= 48 timeout=3 p2=0 p3=0 obj#=-1 tim=1025228584502
> WAIT #0: nam='rdbms ipc message' ela= 16466 timeout=3 p2=0 p3=0 obj#=-1 tim=1025228600993
> WAIT #0: nam='log file parallel write' ela= 228 files=1 blocks=9 requests=1 obj#=-1 tim=1025228601393
> WAIT #0: nam='rdbms ipc message' ela= 29 timeout=1 p2=0 p3=0 obj#=-1 tim=1025228601482
> WAIT #0: nam='rdbms ipc message' ela= 9548 timeout=1 p2=0 p3=0 obj#=-1 tim=1025228611053
> WAIT #0: nam='rdbms ipc message' ela= 7467 timeout=300 p2=0 p3=0 obj#=-1 tim=1025228618574
> WAIT #0: nam='LGWR wait for redo copy' ela= 13 copy latch #=8 p2=0 p3=0 obj#=-1 tim=1025228618636
> WAIT #0: nam='log file parallel write' ela= 222 files=1 blocks=9 requests=1 obj#=-1 tim=1025228618960
> WAIT #0: nam='rdbms ipc message' ela= 31 timeout=300 p2=0 p3=0 obj#=-1 tim=1025228619066
> WAIT #0: nam='rdbms ipc message' ela= 15946 timeout=300 p2=0 p3=0 obj#=-1 tim=1025228635045
> WAIT #0: nam='log file parallel write' ela= 240 files=1 blocks=9 requests=1 obj#=-1 tim=1025228635427
> WAIT #0: nam='rdbms ipc message' ela= 33 timeout=298 p2=0 p3=0 obj#=-1 tim=1025228635525
> WAIT #0: nam='rdbms ipc message' ela= 15987 timeout=298 p2=0 p3=0 obj#=-1 tim=1025228651537
> WAIT #0: nam='log file parallel write' ela= 255 files=1 blocks=9 requests=1 obj#=-1 tim=1025228651912
> WAIT #0: nam='rdbms ipc message' ela= 51 timeout=296 p2=0 p3=0 obj#=-1 tim=1025228652034
> WAIT #0: nam='rdbms ipc message' ela= 16332 timeout=296 p2=0 p3=0 obj#=-1 tim=1025228668405
> WAIT #0: nam='log file parallel write' ela= 256 files=1 blocks=9 requests=1 obj#=-1 tim=1025228668804
> WAIT #0: nam='rdbms ipc message' ela= 37 timeout=295 p2=0 p3=0 obj#=-1 tim=1025228668911
> WAIT #0: nam='rdbms ipc message' ela= 16969 timeout=295 p2=0 p3=0 obj#=-1 tim=1025228685907
> WAIT #0: nam='log file parallel write' ela= 273 files=1 blocks=9 requests=1 obj#=-1 tim=1025228686327
> WAIT #0: nam='rdbms ipc message' ela= 38 timeout=293 p2=0 p3=0 obj#=-1 tim=1025228686453
> WAIT #0: nam='rdbms ipc message' ela= 6908 timeout=293 p2=0 p3=0 obj#=-1 tim=1025228693390
> WAIT #0: nam='log file parallel write' ela= 247 files=1 blocks=2 requests=1 obj#=-1 tim=1025228693768
> WAIT #0: nam='rdbms ipc message' ela= 38 timeout=292 p2=0 p3=0 obj#=-1 tim=1025228693862
> ...more of the same
> 264396 WAIT #0: nam='rdbms ipc message' ela= 32 timeout=52 p2=0 p3=0 obj#=-1 tim=1025231090742
> 264397 WAIT #0: nam='rdbms ipc message' ela= 19460 timeout=52 p2=0 p3=0 obj#=-1 tim=1025231110227
> 264398 WAIT #0: nam='log file parallel write' ela= 272 files=1 blocks=9 requests=1 obj#=-1 tim=1025231110633
> 264399 WAIT #0: nam='rdbms ipc message' ela= 37 timeout=50 p2=0 p3=0 obj#=-1 tim=1025231110748
> 264400 WAIT #0: nam='rdbms ipc message' ela= 15484 timeout=50 p2=0 p3=0 obj#=-1 tim=1025231126261
> 264401 WAIT #0: nam='log file parallel write' ela= 247 files=1 blocks=4 requests=1 obj#=-1 tim=1025231126641
> 264402 WAIT #0: nam='rdbms ipc message' ela= 36 timeout=49 p2=0 p3=0 obj#=-1 tim=1025231126753
> 264403 WAIT #0: nam='rdbms ipc message' ela= 16732 timeout=49 p2=0 p3=0 obj#=-1 tim=1025231143514
> 264404 WAIT #0: nam='LGWR wait for redo copy' ela= 31 copy latch #=8 p2=0 p3=0 obj#=-1 tim=1025231143603
> 264405 WAIT #0: nam='log file parallel write' ela= 279 files=1 blocks=9 requests=1 obj#=-1 tim=1025231143976
> 264406 WAIT #0: nam='rdbms ipc message' ela= 35 timeout=47 p2=0 p3=0 obj#=-1 tim=1025231144093
> 264407 WAIT #0: nam='rdbms ipc message' ela= 5384 timeout=47 p2=0 p3=0 obj#=-1 tim=1025231149515
> 264408 WAIT #0: nam='log file parallel write' ela= 255 files=1 blocks=4 requests=1 obj#=-1 tim=1025231149889
> 264409 WAIT #0: nam='rdbms ipc message' ela= 32 timeout=47 p2=0 p3=0 obj#=-1 tim=1025231149989
> 264410
> 264411 *** 2020-01-29T13:04:23.802774-06:00
> 264412 WAIT #0: nam='rdbms ipc message' ela= 460056 timeout=46 p2=0 p3=0 obj#=-1 tim=1025231610086
> 264413
> 264414 *** 2020-01-29T13:04:26.797925-06:00
> 264415 WAIT #0: nam='rdbms ipc message' ela= 2995027 timeout=300 p2=0 p3=0 obj#=-1 tim=1025234605238
> 264416 WAIT #0: nam='rdbms ipc message' ela= 10649 timeout=1 p2=0 p3=0 obj#=-1 tim=1025234616054
> 264417
>
> i will get you snapper lgwr output associated w/ this interval as soon as i can find it (can't remember where i stashed it and lost trace file name .. fml)
>
> 4) Paste the full Snapper output (of one time snapshot from the problem time) here if you can - we'll be able to see what the slaves were doing
> - You'd need to include not only LGWR SID but also any LG* slave SIDs too
> - Also we'll see better LGWR delay metrics from V$SESSTAT that go beyond just wait events
> - Thinking of metrics like these - snapper already handles them (if you look into the rightmost column)
> - redo synch time overhead (usec) - "FG wakeup overhead per log file sync"
> - redo write time - accounting end to end redo write time (not just the log file parallel write syscall) "per redo write"
>
> 5) Regarding the ASH wait chains error on 19c - I'm aware of this but forgot to look into it
> - It has (probably) something to do with GV$ views (or PX slaves used for PDB V$ queries) and possibly the connect by I'm using
> - Does it work in the CDB (if multitenant)?
> - Does it work with lower optimizer features enabled?
> - As we are already manually looking into what LGWR and its slaves are doing, there's no need for the wait chains script here
>
> ***oracle has filed a bug. altering optimizer_features_enabled doesn't help.
>
> 6) Linux Process Snapper - pSnapper
> - If you want to see the full "wait" states of Linux processes/threads, use the -a option.
> - It's somewhat like always showing "idle" threads too
> - But for linux apps it's harder to determine what's idle and what's some network RPC wait, so I currently just do what Linux "load" metric does
> - It just shows threads in Runnable and Disk/Uninterruptible sleep states by default (-a shows all states, including Sleep and kernel-Idle threads etc)
> - So the Linux pSnapper "idle" means a somewhat different thing than Oracle "idle" - say LGWR was waiting for a Data Guard ACK, that should not be shown as Idle in Oracle, but at the Linux process level it's "just some network socket wait" and you'd need to use -a option to show all of them
>
> *** will add that option to future collections (and watch the video/read help)
>
> 7) Were LGWR/LG* processes mostly idle during the problem time?
> - As I kept reading the thread, it looks like ASH reported LGWR being mostly idle during the problem time
> - This is not exactly the same pattern, but I've seen cases in the past where everyone's waiting for LGWR and LGWR doesn't seem to be busy and doesn't show any significant waits
> - It turned out that LGWR was actually waiting for CKPT to make progress, but it wasn't synchronously waiting, but just going to sleep and "checking again later" in a loop
> - I wrote about this here: https://blog.tanelpoder.com/posts/log-file-switch-checkpoint-incomplete-and-lgwr-waiting-for-checkpoint-progress/
> - So it may still make sense to check what LGWR/LG* processes are doing even if they don't seem to be active much (like do you see an occasional controlfile read wait by LGWR etc)
> - SQL*Trace would capture all activity of LGWR, but may slow things down further, so you could also sample & spool V$SESSION_WAIT_HISTORY (that keeps a log of last 10 waits of every session in memory)
>
> *** i think i've addressed most of this.
>
> Incidentally I'm doing another hacking session tomorrow (Thursday 30 Jan _at_ 12pm ET) about complex Oracle performance problems that span multiple layers and have multiple moving parts. I can add some LGWR troubleshooting magic in the end of that (here goes my plan to fit the session in 1.5h again!). I'll post the details in my blog in a few minutes.
>
> *** can't wait!!
> --
> Tanel Poder
> https://blog.tanelpoder.com/seminar
>
>
> On Tue, Jan 28, 2020 at 2:22 PM Chris Stephens <cstephens16_at_gmail.com> wrote:
> doesn't look like any ash data for lgwr:
>
> SQL> _at_ashtop session_id,blocking_session,event2 1=1 "TIMESTAMP'2020-01-27 15:50:48'" "TIMESTAMP'2020-01-27 15:52:40'"
> TOTALSECONDS AAS %This SESSION_ID BLOCKING_SESSION EVENT2 FIRST_SEEN LAST_SEEN DIST_SQLEXEC_SEEN
> _______________ ______ __________ _____________ ___________________ _____________________ ______________________ ______________________ ____________________
> 81 0.7 82% | 1713 1710 log file sync 2020-01-27 15:50:54 2020-01-27 15:52:14 1
> 5 0 5% | 1221 ON CPU 2020-01-27 15:50:59 2020-01-27 15:52:14 1
> 3 0 3% | 246 ON CPU 2020-01-27 15:51:00 2020-01-27 15:52:00 1
> 3 0 3% | 1713 ON CPU 2020-01-27 15:50:51 2020-01-27 15:52:27 2
> 1 0 1% | 123 ON CPU 2020-01-27 15:51:35 2020-01-27 15:51:35 1
> 1 0 1% | 127 ON CPU 2020-01-27 15:51:50 2020-01-27 15:51:50 1
> 1 0 1% | 252 2321 latch free 2020-01-27 15:52:28 2020-01-27 15:52:28 1
> 1 0 1% | 978 ges remote message 2020-01-27 15:50:48 2020-01-27 15:50:48 1
> 1 0 1% | 983 latch free 2020-01-27 15:52:28 2020-01-27 15:52:28 1
> 1 0 1% | 1713 library cache lock 2020-01-27 15:50:48 2020-01-27 15:50:48 1
> 1 0 1% | 1831 ON CPU 2020-01-27 15:50:49 2020-01-27 15:50:49 1
>
>
> 11 rows selected.
>
>
> SQL> _at_ashtop session_id,blocking_session,event2 session_id=1710 "TIMESTAMP'2020-01-27 15:50:48'" "TIMESTAMP'2020-01-27 15:52:40'"
>
> no rows selected
>
> On Tue, Jan 28, 2020 at 1:09 PM Noveljic Nenad <nenad.noveljic_at_vontobel.com> wrote:
> You’re using v0.2, but v0.6 should be the latest version. 0.2 wouldn’t show idle blockers.
>
> I looked at the ashtop output once again. Strange that neither the lgwr process nor any other bg processes show up significantly there. The question is, what was lgwr doing while the fg processes were waiting on it.
>

--
http://www.freelists.org/webpage/oracle-l
Received on Sat Feb 01 2020 - 22:48:34 CET
