Re: intermittent long "log file sync" waits

From: Frits Hoogland <frits.hoogland_at_gmail.com>
Date: Sun, 2 Feb 2020 13:51:14 +0100
Message-Id: <5571F2DD-2C37-44EB-AAF6-A692E0779AD1_at_gmail.com>



I didn’t look that deep; I just tried to get an idea of the situation and the current state.

This, as others have concluded already, looks like the bug that surfaces every once in a while where sessions using post-wait to wait for the logwriter get “forgotten” when the instance switches to polling. The switching is visible, or at least was visible when I looked at it in versions 11.2 and 12.1.0.2, in the logwriter trace file: it mentions when the logwriter switches between post-wait and polling and back. In the cases I saw, however, the process would wait much longer, until the logwriter decided to switch back to post-wait and found the waiting sessions again.

If this is indeed the bug described above, it of course needs to be fixed. However, there is a way around it:

- In post-wait mode, a process waiting in ‘log file sync’ can only end the wait if two conditions are met: it needs to be posted, and the on-disk SCN must be beyond the process’ commit SCN. Based on what is shown, I assume the logwriter has happily gone on and updated the on-disk SCN (it does that every 3 seconds anyway), probably beyond the process’ commit SCN. So the most logical thing the process is waiting for is to be posted, which did not happen.
- In polling mode, a process will wait for a pre-computed time, during which it cannot be notified, and then check the on-disk SCN. The pre-computed time in normal circumstances is really low, probably around a millisecond, because it is computed based on the latency of your system. The important bit is that there is no need for the logwriter to do anything extra: the wait is based on the progress the logwriter makes, visible via the on-disk SCN, and does not require additional work such as posting the waiter.
- Therefore, if you fix your instances to use polling only, they cannot end up in the situation of a post-wait process being forgotten. This can be achieved by setting “_use_adaptive_log_file_sync” to polling_only, as sketched below. This is an underscore parameter, so you have to ask Oracle Support for their blessing to use it, but since you already have an SR open, that should be simple to do.
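
For reference, a minimal sketch of that change (only after support agrees; the x$ query needs SYS, and I would double-check the current value first):

SELECT i.ksppinm, v.ksppstvl
FROM   x$ksppi i, x$ksppcv v
WHERE  v.indx = i.indx
AND    i.ksppinm = '_use_adaptive_log_file_sync';

ALTER SYSTEM SET "_use_adaptive_log_file_sync" = 'POLLING_ONLY' SCOPE=BOTH SID='*';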

I can’t think of any downsides of using polling_only. Maybe someone can? Its use is widespread anyway, since all supported versions already switch to it dynamically.

Frits Hoogland

http://fritshoogland.wordpress.com frits.hoogland_at_gmail.com Mobile: +31 6 14180860

> On 2 Feb 2020, at 04:50, Tanel Poder <tanel_at_tanelpoder.com> wrote:
>
> But your trace file here already indicates that there's some glitch regarding LGWR wakeups/posting:
>
> Foreground:
>
> *** 2020-01-29T13:04:20.724666-06:00
> WAIT #0: nam='log file sync' ela= 6909711 buffer#=58996 sync scn=976089279 p3=0 obj#=-1 tim=1025228531958
>
> LGWR at the same time:
>
> *** 2020-01-29T13:04:20.725994-06:00
> WAIT #0: nam='rdbms ipc message' ela= 2918443 timeout=300 p2=0 p3=0 obj#=-1 tim=1025228533313
>
> The WAIT lines are printed out when the wait ends (and the *** timestamp line just before the WAIT line is printed out, if enough time has passed since the last *** line). So this shows that LGWR had happily been idle, waiting for its 3-second idle timeout to expire or for someone to wake it up (that's what "rdbms ipc message" means) - while someone had issued a commit and was waiting for LGWR acknowledgment. In other words, LGWR didn't even get the post from your session, nor from any other session (if you had concurrent transactional activity). Back in 11g, these problems typically happened when the adaptive log file sync switch between the polling and post/wait wakeup methods happened and you hit a bug. But in 12c there are more moving parts.
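
As a back-of-the-envelope check on that, here is the same arithmetic on the ela/tim values from the two trace lines quoted above (all in microseconds, runnable from any SQL prompt):

SELECT 1025228531958 - 6909711 AS fg_lfs_wait_start_tim,      -- 1025221622247
       1025228533313 - 2918443 AS lgwr_idle_wait_start_tim,   -- 1025225614870
       (1025228533313 - 2918443)
     - (1025228531958 - 6909711) AS gap_usec                  -- ~3992623, about 4 seconds
FROM   dual;

So LGWR settled into that ~2.9 second idle sleep roughly 4 seconds after the foreground had already started waiting on 'log file sync', which fits the missed-post picture.
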
>
> --
> Tanel Poder
> https://blog.tanelpoder.com/seminar
> On Sat, Feb 1, 2020 at 10:34 PM Tanel Poder <tanel_at_tanelpoder.com> wrote:
> Hi Chris,
>
> As 12c+ has log writer slaves and Oracle can adaptively switch from a single LGWR to LGWR + LGxx slaves (_use_single_log_writer=adaptive), can you check ASH data for the LGxx slaves at this time too? Not all process-to-process interactions/waits are instrumented with a blocking_session in Oracle...
>
> Should be this command:
>
> _at_ashtop username,event2,program2,blocking_session,blocking_inst_id "program like '%(LG%)'" "TIMESTAMP'2020-01-29 13:04:00'" "TIMESTAMP'2020-01-29 13:05:00'"
>
> --
> Tanel Poder
> https://blog.tanelpoder.com/seminar
>
>
> On Thu, Jan 30, 2020 at 3:44 PM Chris Stephens <cstephens16_at_gmail.com> wrote:
> Here's some additional ash output from the trace data posted in last message including event_hist.sql:
>
> SQL> _at_ashtop username,event2,program2,blocking_session,blocking_inst_id 1=1 "TIMESTAMP'2020-01-29 13:04:20'" "TIMESTAMP'2020-01-29 13:04:27'"
> TOTALSECONDS AAS %This USERNAME EVENT2 PROGRAM2 BLOCKING_SESSION BLOCKING_INST_ID FIRST_SEEN LAST_SEEN DIST_SQLEXEC_SEEN
> _______________ ______ __________ ________________ ________________ _____________ ___________________ ___________________ ______________________ ______________________ ____________________
> 2 0.3 50% | SYS ON CPU (sqlplus) 2020-01-29 13:04:24 2020-01-29 13:04:25 1
> 1 0.1 25% | GEN3_MGOWER_3 ON CPU (python) 2020-01-29 13:04:22 2020-01-29 13:04:22 1
> 1 0.1 25% | GEN3_MGOWER_3 log file sync (python) 1710 2 2020-01-29 13:04:20 2020-01-29 13:04:20 1
>
>
> SQL> _at_bg lgwr
> NAME DESCRIPTION SID OPID SPID PADDR SADDR
> _______ ______________ _______ _______ ________ ___________________ ___________________
> LGWR Redo etc. 1710 34 26552 00000001E8718860 00000001D8BFF4A0
>
> SQL> _at_ashtop username,event2,program2,blocking_session,blocking_inst_id session_id=1710 "TIMESTAMP'2020-01-29 13:04:20'" "TIMESTAMP'2020-01-29 13:04:27'"
>
> no rows selected
>
> SQL> _at_ashtop username,event2,program2,blocking_session,blocking_inst_id session_id=1710 "TIMESTAMP'2020-01-29 13:04:00'" "TIMESTAMP'2020-01-29 13:05:00'"
>
> no rows selected
>
> SQL> _at_ash/event_hist "log file" 1=1 "TIMESTAMP'2020-01-29 13:04:00'" "TIMESTAMP'2020-01-29 13:05:00'"
> EVH_EVENT EVH_MILLISEC EVH_SAMPLE_COUNT EVH_EST_EVENT_COUNT EVH_EST_TOTAL_SEC PCT_EVT_TIME EVH_GRAPH FIRST_SEEN LAST_SEEN
> ________________ __________________ ___________________ ______________________ ____________________ _______________ _______________ ______________________ ______________________
> log file sync < 1024 1 1.8 1.8 12.8 |# | 2020-01-29 13:04:13 2020-01-29 13:04:13
> log file sync < 4096 1 1 4.1 29.1 |### | 2020-01-29 13:04:12 2020-01-29 13:04:12
> log file sync < 8192 1 1 8.2 58.2 |###### | 2020-01-29 13:04:20 2020-01-29 13:04:20
>
> SQL> _at_ash/event_hist "write" 1=1 "TIMESTAMP'2020-01-29 13:04:00'" "TIMESTAMP'2020-01-29 13:05:00'"
>
> no rows selected
>
> SQL> _at_ash/event_hist "*" 1=1 "TIMESTAMP'2020-01-29 13:04:00'" "TIMESTAMP'2020-01-29 13:05:00'"
> EVH_EVENT EVH_MILLISEC EVH_SAMPLE_COUNT EVH_EST_EVENT_COUNT EVH_EST_TOTAL_SEC PCT_EVT_TIME EVH_GRAPH FIRST_SEEN LAST_SEEN
> _________________ __________________ ___________________ ______________________ ____________________ _______________ _______________ ______________________ ______________________
> log file sync < 1024 1 1.8 1.8 12.8 |# | 2020-01-29 13:04:13 2020-01-29 13:04:13
> log file sync < 4096 1 1 4.1 29.1 |### | 2020-01-29 13:04:12 2020-01-29 13:04:12
> log file sync < 8192 1 1 8.2 58.2 |###### | 2020-01-29 13:04:20 2020-01-29 13:04:20
> row cache lock < 1 1 2469.1 2.5 100 |##########| 2020-01-29 13:04:01 2020-01-29 13:04:01
>
> SQL> _at_ashtop username,event2,program2,blocking_session,blocking_inst_id "lower(program) like '%lg%'" "TIMESTAMP'2020-01-29 13:04:00'" "TIMESTAMP'2020-01-29 13:05:00'"
>
> no rows selected
>
> i'm going to copy over gv$active_session_history to a single instance database and take a look at ash_wait_chains but i don't expect that to reveal anything either.
>
> i need to confirm whether adding redo resolved the issue (will have developer fire up workload) but I think the next step is to investigate x$trace and possible process communication issues?
>
> On Thu, Jan 30, 2020 at 10:27 AM Chris Stephens <cstephens16_at_gmail.com> wrote:
> Hey Tanel,
>
> Thanks for taking time to write this up! Will definitely be attending that webinar.
>
> comments inline
>
> On Wed, Jan 29, 2020 at 10:32 PM Tanel Poder <tanel_at_tanelpoder.com> wrote:
> Comments:
>
> 1) First, looks like you're using all the right tools for systematic troubleshooting ;-)
>
> ***i try!
>
> 2) As "log file sync" is not directly an I/O wait event (IO happens indirectly under the hood), but "wait for LGWR to ACK/increment redo persisting position" then there may be multiple reasons for delays. I'm repeating some of what others have said here too:
> - LGWR could be waiting for remote ACK (synchronous data guard) - that should show up as a LNS* wait or something like that though
> - Process slowness due to CPU overload/scheduling latency or swapping (in both cases, LGWR may do its work fast, but it takes a while for the foreground process to wake up and end the wait event - you could check if there are any reports regarding switching LGWR mode in alert.log around that time)
> - Bug/missed post due to switching between LGWR progress polling vs. waiting for LGWR post
> - ASM instance communication issues (ASM communication can be on the critical path when creating/changing files on ASM and even for reading, when the ASM metadata cache in your local instance doesn't have what it needs)
> - So, the physical redo log file disk write is not necessarily the biggest contributor to the log file sync wait
>
> *** no data guard in the env.
> *** we have excess CPU. (20 cores / instance and usually only 1-5 active foreground processes but i will confirm from TFA collections)
> *** yesterday i was ready to blame a bug. today, not so sure. i checked log file switches with a script google found me a few days ago. now thinking that was inaccurate, as a script from metastink shows tons of switching, so that could very well be related.
> *** asm communication is something i need to look into. TFA has reported some ASM issues but there are no details and i never saw anything in the asm alert logs indicating an issue.
>
> 3) You'll need to measure what LGWR itself was doing during the problem time
> - Since you're on 19c (12c or later) the LGWR has slaves too and they may be doing most of the work
> - You'll need to look into both the LGWR process and any other LG* ones too
> - Since you have (Oracle) Snapper running already and writing to trace you should have the data already
> - The lgwr in the following _at_snapper ash,stats,trace 10 999999 lgwr syntax includes LGWR slaves too
> - As LGWR slaves have their own sessions (and SIDs), you should run ashtop etc and report if there's any BLOCKING_SESSION value listed for LGWR and slaves
> - If LGWR and the slaves were listed idle during the problem time (while others waited for log file sync event at the same time) and if there was no huge OS level CPU/memory shortage at the time, this looks like a background <-> foreground communication problem.
> - This (and any potential ASM issues) could probably be troubleshot via KST tracing, KSR channels and X$TRACE; some of that tracing is enabled by default, like I showed in a previous hacking session: https://www.youtube.com/watch?v=mkmvZv58W6w
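
As a first peek at that, something like this could work (purely a sketch: X$TRACE needs SYS, its column list varies by version, and I'm assuming it still exposes a SID column, so describe it first; 1710 is the LGWR SID from the _at_bg output earlier in the thread):

desc x$trace

SELECT *
FROM   x$trace
WHERE  sid = 1710;
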
>
>
> here is all the output i have corresponding to a 6 sec "log file sync" wait:
>
> Experience duration: R = t1 − t0 = 41.336137 seconds
> Experience begin time: t0 = 2020-01-29T13:04:07.560478
> Experience end time: t1 = 2020-01-29T13:04:48.896615
> CALL-NAME DURATION % CALLS MEAN MIN MAX
> ------------------------------ --------- ------ ------ -------- -------- --------
> SQL*Net message from client 26.017105 66.5% 9,886 0.002632 0.000209 7.858051
> log file sync 10.582409 27.1% 150 0.070549 0.000295 6.909711
> EXEC 2.088757 5.3% 9,702 0.000215 0.000000 0.209199
> DLM cross inst call completion 0.109617 0.3% 212 0.000517 0.000005 0.080575
> row cache lock 0.070118 0.2% 125 0.000561 0.000208 0.000972
> FETCH 0.047586 0.1% 1,029 0.000046 0.000000 0.003056
> PARSE 0.045293 0.1% 4,778 0.000009 0.000000 0.002692
> CLOSE 0.033453 0.1% 4,784 0.000007 0.000000 0.000237
> library cache lock 0.023614 0.1% 55 0.000429 0.000152 0.000901
> library cache pin 0.022311 0.1% 55 0.000406 0.000153 0.000689
> 17 others 0.071430 0.2% 10,772 0.000007 0.000000 0.001487
> ------------------------------ --------- ------ ------ -------- -------- --------
> TOTAL (27) 39.111693 100.0% 41,548 0.000941 0.000000 7.858051
>
> 6 sec wait from raw data:
>
> *** 2020-01-29T13:04:20.724666-06:00
> WAIT #0: nam='log file sync' ela= 6909711 buffer#=58996 sync scn=976089279 p3=0 obj#=-1 tim=1025228531958
> WAIT #0: nam='SQL*Net message to client' ela= 2 driver id=1413697536 #bytes=1 p3=0 obj#=-1 tim=1025228532118
> WAIT #0: nam='SQL*Net message from client' ela= 1711 driver id=1413697536 #bytes=1 p3=0 obj#=-1 tim=1025228533852
> BINDS #140204587734864:
>
> i enabled 10046 on lgwr:
>
> *** 2020-01-29T13:04:20.725994-06:00
> WAIT #0: nam='rdbms ipc message' ela= 2918443 timeout=300 p2=0 p3=0 obj#=-1 tim=1025228533313
> WAIT #0: nam='log file parallel write' ela= 279 files=1 blocks=17 requests=2 obj#=-1 tim=1025228533777
> WAIT #0: nam='log file parallel write' ela= 171 files=1 blocks=3 requests=2 obj#=-1 tim=1025228534104
> WAIT #0: nam='LGWR wait for redo copy' ela= 22 copy latch #=28 p2=0 p3=0 obj#=-1 tim=1025228534182
> WAIT #0: nam='log file parallel write' ela= 183 files=1 blocks=2 requests=2 obj#=-1 tim=1025228534437
> WAIT #0: nam='log file parallel write' ela= 174 files=1 blocks=2 requests=2 obj#=-1 tim=1025228534721
> WAIT #0: nam='rdbms ipc message' ela= 31 timeout=8 p2=0 p3=0 obj#=-1 tim=1025228534786
> WAIT #0: nam='rdbms ipc message' ela= 3233 timeout=8 p2=0 p3=0 obj#=-1 tim=1025228538038
> WAIT #0: nam='log file parallel write' ela= 170 files=1 blocks=2 requests=1 obj#=-1 tim=1025228538297
> WAIT #0: nam='rdbms ipc message' ela= 25 timeout=8 p2=0 p3=0 obj#=-1 tim=1025228538358
> WAIT #0: nam='rdbms ipc message' ela= 11829 timeout=8 p2=0 p3=0 obj#=-1 tim=1025228550204
> WAIT #0: nam='LGWR wait for redo copy' ela= 52 copy latch #=8 p2=0 p3=0 obj#=-1 tim=1025228550294
> WAIT #0: nam='log file parallel write' ela= 239 files=1 blocks=7 requests=1 obj#=-1 tim=1025228550624
> WAIT #0: nam='rdbms ipc message' ela= 23 timeout=6 p2=0 p3=0 obj#=-1 tim=1025228550699
> WAIT #0: nam='rdbms ipc message' ela= 16750 timeout=6 p2=0 p3=0 obj#=-1 tim=1025228567467
> WAIT #0: nam='log file parallel write' ela= 207 files=1 blocks=9 requests=1 obj#=-1 tim=1025228567762
> WAIT #0: nam='rdbms ipc message' ela= 26 timeout=5 p2=0 p3=0 obj#=-1 tim=1025228567825
> WAIT #0: nam='rdbms ipc message' ela= 16249 timeout=5 p2=0 p3=0 obj#=-1 tim=1025228584091
> WAIT #0: nam='log file parallel write' ela= 213 files=1 blocks=9 requests=1 obj#=-1 tim=1025228584400
> WAIT #0: nam='rdbms ipc message' ela= 48 timeout=3 p2=0 p3=0 obj#=-1 tim=1025228584502
> WAIT #0: nam='rdbms ipc message' ela= 16466 timeout=3 p2=0 p3=0 obj#=-1 tim=1025228600993
> WAIT #0: nam='log file parallel write' ela= 228 files=1 blocks=9 requests=1 obj#=-1 tim=1025228601393
> WAIT #0: nam='rdbms ipc message' ela= 29 timeout=1 p2=0 p3=0 obj#=-1 tim=1025228601482
> WAIT #0: nam='rdbms ipc message' ela= 9548 timeout=1 p2=0 p3=0 obj#=-1 tim=1025228611053
> WAIT #0: nam='rdbms ipc message' ela= 7467 timeout=300 p2=0 p3=0 obj#=-1 tim=1025228618574
> WAIT #0: nam='LGWR wait for redo copy' ela= 13 copy latch #=8 p2=0 p3=0 obj#=-1 tim=1025228618636
> WAIT #0: nam='log file parallel write' ela= 222 files=1 blocks=9 requests=1 obj#=-1 tim=1025228618960
> WAIT #0: nam='rdbms ipc message' ela= 31 timeout=300 p2=0 p3=0 obj#=-1 tim=1025228619066
> WAIT #0: nam='rdbms ipc message' ela= 15946 timeout=300 p2=0 p3=0 obj#=-1 tim=1025228635045
> WAIT #0: nam='log file parallel write' ela= 240 files=1 blocks=9 requests=1 obj#=-1 tim=1025228635427
> WAIT #0: nam='rdbms ipc message' ela= 33 timeout=298 p2=0 p3=0 obj#=-1 tim=1025228635525
> WAIT #0: nam='rdbms ipc message' ela= 15987 timeout=298 p2=0 p3=0 obj#=-1 tim=1025228651537
> WAIT #0: nam='log file parallel write' ela= 255 files=1 blocks=9 requests=1 obj#=-1 tim=1025228651912
> WAIT #0: nam='rdbms ipc message' ela= 51 timeout=296 p2=0 p3=0 obj#=-1 tim=1025228652034
> WAIT #0: nam='rdbms ipc message' ela= 16332 timeout=296 p2=0 p3=0 obj#=-1 tim=1025228668405
> WAIT #0: nam='log file parallel write' ela= 256 files=1 blocks=9 requests=1 obj#=-1 tim=1025228668804
> WAIT #0: nam='rdbms ipc message' ela= 37 timeout=295 p2=0 p3=0 obj#=-1 tim=1025228668911
> WAIT #0: nam='rdbms ipc message' ela= 16969 timeout=295 p2=0 p3=0 obj#=-1 tim=1025228685907
> WAIT #0: nam='log file parallel write' ela= 273 files=1 blocks=9 requests=1 obj#=-1 tim=1025228686327
> WAIT #0: nam='rdbms ipc message' ela= 38 timeout=293 p2=0 p3=0 obj#=-1 tim=1025228686453
> WAIT #0: nam='rdbms ipc message' ela= 6908 timeout=293 p2=0 p3=0 obj#=-1 tim=1025228693390
> WAIT #0: nam='log file parallel write' ela= 247 files=1 blocks=2 requests=1 obj#=-1 tim=1025228693768
> WAIT #0: nam='rdbms ipc message' ela= 38 timeout=292 p2=0 p3=0 obj#=-1 tim=1025228693862
> ...more of the same
> 264396 WAIT #0: nam='rdbms ipc message' ela= 32 timeout=52 p2=0 p3=0 obj#=-1 tim=1025231090742
> 264397 WAIT #0: nam='rdbms ipc message' ela= 19460 timeout=52 p2=0 p3=0 obj#=-1 tim=1025231110227
> 264398 WAIT #0: nam='log file parallel write' ela= 272 files=1 blocks=9 requests=1 obj#=-1 tim=1025231110633
> 264399 WAIT #0: nam='rdbms ipc message' ela= 37 timeout=50 p2=0 p3=0 obj#=-1 tim=1025231110748
> 264400 WAIT #0: nam='rdbms ipc message' ela= 15484 timeout=50 p2=0 p3=0 obj#=-1 tim=1025231126261
> 264401 WAIT #0: nam='log file parallel write' ela= 247 files=1 blocks=4 requests=1 obj#=-1 tim=1025231126641
> 264402 WAIT #0: nam='rdbms ipc message' ela= 36 timeout=49 p2=0 p3=0 obj#=-1 tim=1025231126753
> 264403 WAIT #0: nam='rdbms ipc message' ela= 16732 timeout=49 p2=0 p3=0 obj#=-1 tim=1025231143514
> 264404 WAIT #0: nam='LGWR wait for redo copy' ela= 31 copy latch #=8 p2=0 p3=0 obj#=-1 tim=1025231143603
> 264405 WAIT #0: nam='log file parallel write' ela= 279 files=1 blocks=9 requests=1 obj#=-1 tim=1025231143976
> 264406 WAIT #0: nam='rdbms ipc message' ela= 35 timeout=47 p2=0 p3=0 obj#=-1 tim=1025231144093
> 264407 WAIT #0: nam='rdbms ipc message' ela= 5384 timeout=47 p2=0 p3=0 obj#=-1 tim=1025231149515
> 264408 WAIT #0: nam='log file parallel write' ela= 255 files=1 blocks=4 requests=1 obj#=-1 tim=1025231149889
> 264409 WAIT #0: nam='rdbms ipc message' ela= 32 timeout=47 p2=0 p3=0 obj#=-1 tim=1025231149989
> 264410
> 264411 *** 2020-01-29T13:04:23.802774-06:00
> 264412 WAIT #0: nam='rdbms ipc message' ela= 460056 timeout=46 p2=0 p3=0 obj#=-1 tim=1025231610086
> 264413
> 264414 *** 2020-01-29T13:04:26.797925-06:00
> 264415 WAIT #0: nam='rdbms ipc message' ela= 2995027 timeout=300 p2=0 p3=0 obj#=-1 tim=1025234605238
> 264416 WAIT #0: nam='rdbms ipc message' ela= 10649 timeout=1 p2=0 p3=0 obj#=-1 tim=1025234616054
> 264417
>
> i will get you snapper lgwr output associated w/ this interval as soon as i can find it (can't remember where i stashed it and lost trace file name .. fml)
>
> 4) Paste the full Snapper output (of one time snapshot from the problem time) here if you can - we'll be able to see what the slaves were doing
> - You'd need to include not only LGWR SID but also any LG* slave SIDs too
> - Also we'll see better LGWR delay metrics from V$SESSTAT that go beyond just wait events
> - Thinking of metrics like these - snapper already handles them (if you look into the rightmost column)
> - redo synch time overhead (usec) - "FG wakeup overhead per log file sync"
> - redo write time - accounting end to end redo write time (not just the log file parallel write syscall) "per redo write"
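
A rough way to pull just those two counters from V$SESSTAT, assuming the statistic names are spelled exactly as above in this version (1710 is the LGWR SID; substitute the committing foreground session's SID for &fg_sid):

SELECT ss.sid, sn.name, ss.value
FROM   v$sesstat ss, v$statname sn
WHERE  sn.statistic# = ss.statistic#
AND    ss.sid IN (1710, &fg_sid)
AND    sn.name IN ('redo synch time overhead (usec)', 'redo write time')
ORDER  BY ss.sid, sn.name;
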
>
> 5) Regarding the ASH wait chains error on 19c - I'm aware of this but forgot to look into it
> - It has (probably) something to do with GV$ views (or PX slaves used for PDB V$ queries) and possibly the connect by I'm using
> - Does it work in the CDB (if multitenant)?
> - Does it work with lower optimizer features enabled?
> - As we are already manually looking into what LGWR and its slaves are doing, there's no need for the wait chains script here
>
> ***oracle has filed a bug. altering optimizer_features_enabled doesn't help.
>
> 6) Linux Process Snapper - pSnapper
> - If you want to see the full "wait" states of Linux processes/threads, use the -a option.
> - It's somewhat like always showing "idle" threads too
> - But for linux apps it's harder to determine what's idle and what's some network RPC wait, so I currently just do what Linux "load" metric does
> - It just shows threads in Runnable and Disk/Uninterruptible sleep states by default (-a shows all states, including Sleep and kernel-Idle threads etc)
> - So the Linux pSnapper "idle" means a somewhat different thing than Oracle "idle" - say LGWR was waiting for a Data Guard ACK, that should not be shown as Idle in Oracle, but at the Linux process level it's "just some network socket wait" and you'd need to use -a option to show all of them
>
> *** will add that option to future collections (and watch the video/read help)
>
> 7) Were LGWR/LG* processes mostly idle during the problem time?
> - As I kept reading the thread, it looks like ASH reported LGWR being mostly idle during the problem time
> - This is not exactly the same pattern, but I've seen cases in the past where everyone's waiting for LGWR and LGWR doesn't seem to be busy and doesn't show any significant waits
> - It turned out that LGWR was actually waiting for CKPT to make progress, but it wasn't synchronously waiting, but just going to sleep and "checking again later" in a loop
> - I wrote about this here: https://blog.tanelpoder.com/posts/log-file-switch-checkpoint-incomplete-and-lgwr-waiting-for-checkpoint-progress/
> - So it may still make sense to check what LGWR/LG* processes are doing even if they don't seem to be active much (like do you see an occasional controlfile read wait by LGWR etc)
> - SQL*Trace would capture all activity of LGWR, but may slow things down further, so you could also sample & spool V$SESSION_WAIT_HISTORY (that keeps a log of last 10 waits of every session in memory)
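
Something along these lines, run in a loop and spooled from another session, would do for that sampling (just a sketch; 1710 is the LGWR SID, WAIT_TIME is in centiseconds as far as I remember, and recent versions also have a WAIT_TIME_MICRO column):

SELECT sid, seq#, event, p1, p2, p3, wait_time
FROM   v$session_wait_history
WHERE  sid = 1710
ORDER  BY seq#;
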
>
> *** i think i've addressed most of this.
>
> Incidentally I'm doing another hacking session tomorrow (Thursday 30 Jan _at_ 12pm ET) about complex Oracle performance problems that span multiple layers and have multiple moving parts. I can add some LGWR troubleshooting magic at the end of that (here goes my plan to fit the session in 1.5h again!). I'll post the details in my blog in a few minutes.
>
> *** can't wait!!
> --
> Tanel Poder
> https://blog.tanelpoder.com/seminar
>
>
> On Tue, Jan 28, 2020 at 2:22 PM Chris Stephens <cstephens16_at_gmail.com> wrote:
> doesn't look like any ash data for lgwr:
>
> SQL> _at_ashtop session_id,blocking_session,event2 1=1 "TIMESTAMP'2020-01-27 15:50:48'" "TIMESTAMP'2020-01-27 15:52:40'"
> TOTALSECONDS AAS %This SESSION_ID BLOCKING_SESSION EVENT2 FIRST_SEEN LAST_SEEN DIST_SQLEXEC_SEEN
> _______________ ______ __________ _____________ ___________________ _____________________ ______________________ ______________________ ____________________
> 81 0.7 82% | 1713 1710 log file sync 2020-01-27 15:50:54 2020-01-27 15:52:14 1
> 5 0 5% | 1221 ON CPU 2020-01-27 15:50:59 2020-01-27 15:52:14 1
> 3 0 3% | 246 ON CPU 2020-01-27 15:51:00 2020-01-27 15:52:00 1
> 3 0 3% | 1713 ON CPU 2020-01-27 15:50:51 2020-01-27 15:52:27 2
> 1 0 1% | 123 ON CPU 2020-01-27 15:51:35 2020-01-27 15:51:35 1
> 1 0 1% | 127 ON CPU 2020-01-27 15:51:50 2020-01-27 15:51:50 1
> 1 0 1% | 252 2321 latch free 2020-01-27 15:52:28 2020-01-27 15:52:28 1
> 1 0 1% | 978 ges remote message 2020-01-27 15:50:48 2020-01-27 15:50:48 1
> 1 0 1% | 983 latch free 2020-01-27 15:52:28 2020-01-27 15:52:28 1
> 1 0 1% | 1713 library cache lock 2020-01-27 15:50:48 2020-01-27 15:50:48 1
> 1 0 1% | 1831 ON CPU 2020-01-27 15:50:49 2020-01-27 15:50:49 1
>
>
> 11 rows selected.
>
>
> SQL> _at_ashtop session_id,blocking_session,event2 session_id=1710 "TIMESTAMP'2020-01-27 15:50:48'" "TIMESTAMP'2020-01-27 15:52:40'"
>
> no rows selected
>
> On Tue, Jan 28, 2020 at 1:09 PM Noveljic Nenad <nenad.noveljic_at_vontobel.com> wrote:
> You’re using v0.2, but v0.6 should be the latest version. 0.2 wouldn’t show idle blockers.
>
> I looked at the ashtop output once again. Strange that neither the lgwr process nor any other bg processes show up significantly there. The question is, what was lgwr doing while the fg processes were waiting on it.
>

--
http://www.freelists.org/webpage/oracle-l