Re: intermittent long "log file sync" waits

From: Tanel Poder <tanel_at_tanelpoder.com>
Date: Sun, 2 Feb 2020 17:09:10 -0500
Message-ID: <CAMHX9JKBmpR3JBOL21hFPGkL5J6bzVh86ucCkbAdTQNcNxCcUA_at_mail.gmail.com>



Btw, adaptive log file sync is a different thing from the adaptive switch between LGWR-only processing and LGWR+LG% slave processing (that's why I mentioned this parameter: _use_single_log_writer=adaptive).
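
One way to eyeball both settings together (a minimal sketch against the same x$ views that the @pd script below queries; requires SYS access, and hidden parameters should only be touched with Oracle Support's guidance):

SELECT n.ksppinm name, v.ksppstvl value
FROM   x$ksppi n JOIN x$ksppcv v ON n.indx = v.indx
WHERE  n.ksppinm IN ('_use_adaptive_log_file_sync', '_use_single_log_writer');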

On Sun, Feb 2, 2020 at 10:20 AM Chris Stephens <cstephens16_at_gmail.com> wrote:

> SQL> @ashtopLFS username,event2,program2,blocking_session,blocking_inst_id
> "program like '%(LG%)'" "TIMESTAMP'2020-01-29 13:04:00'"
> "TIMESTAMP'2020-01-29 13:05:00'"
> no rows selected
> SQL>
>
> i pulled over all the tables/views referenced in ashtop.sql and
> ash_wait_chains.sql to a single instance database. Hence, "ashtopLFS.sql"
>
> From the beginning we noticed slow writes to local disk (>40ms at times).
> On Friday, as I was running out of ideas, the other DBA on the project
> suggested we create a ramdisk for /tmp as we noticed lots of writes to
> /tmp/.oracle. After doing that and rebooting all nodes, we are now unable
> to reproduce the issue.
>
> I'd already turned off adaptive log file sync:
>
> SQL> @pd "_use_adaptive"
> Show all parameters and session values from x$ksppi/x$ksppcv...
>
>        NUM N_HEX NAME                        VALUE DESCRIPTION
> ---------- ----- --------------------------- ----- -----------------------------------------------
>       2049   801 _use_adaptive_log_file_sync FALSE Adaptively switch between post/wait and polling
>
> SQL>
>
> On Monday (or Tuesday if I end up taking tomorrow off), we plan to remove
> the ramdisk and see if the issue reproduces again.
>
>
>
> On Sat, Feb 1, 2020 at 9:34 PM Tanel Poder <tanel_at_tanelpoder.com> wrote:
>
>> Hi Chris,
>>
>> As 12c+ has log writer slaves and Oracle can adaptively switch from
>> single LGWR to LGWR + LGxx slave processing (_use_single_log_writer=adaptive),
>> could you check ASH data for the LGxx slaves at this time too? Not all
>> process-to-process interactions/waits are instrumented with a
>> blocking_session in Oracle...
>>
>> Should be this command:
>>
>> @ashtop username,event2,program2,blocking_session,blocking_inst_id
>> "program like '%(LG%)'" "TIMESTAMP'2020-01-29 13:04:00'"
>> "TIMESTAMP'2020-01-29 13:05:00'"
>>
>> --
>> Tanel Poder
>> https://blog.tanelpoder.com/seminar
>>
>>
>> On Thu, Jan 30, 2020 at 3:44 PM Chris Stephens <cstephens16_at_gmail.com>
>> wrote:
>>
>>> Here's some additional ash output from the trace data posted in the last
>>> message, including event_hist.sql:
>>>
>>> SQL> @ashtop username,event2,program2,blocking_session,blocking_inst_id
>>> 1=1 "TIMESTAMP'2020-01-29 13:04:20'" "TIMESTAMP'2020-01-29 13:04:27'"
>>>
>>> TOTALSECONDS AAS %This  USERNAME      EVENT2        PROGRAM2  BLOCKING_SESSION BLOCKING_INST_ID FIRST_SEEN          LAST_SEEN           DIST_SQLEXEC_SEEN
>>> ------------ --- ------ ------------- ------------- --------- ---------------- ---------------- ------------------- ------------------- -----------------
>>>            2 0.3  50% | SYS           ON CPU        (sqlplus)                                   2020-01-29 13:04:24 2020-01-29 13:04:25                 1
>>>            1 0.1  25% | GEN3_MGOWER_3 ON CPU        (python)                                    2020-01-29 13:04:22 2020-01-29 13:04:22                 1
>>>            1 0.1  25% | GEN3_MGOWER_3 log file sync (python)              1710                2 2020-01-29 13:04:20 2020-01-29 13:04:20                 1
>>>
>>>
>>> SQL> @bg lgwr
>>>
>>> NAME  DESCRIPTION   SID  OPID SPID  PADDR            SADDR
>>> ----- ------------ ----- ---- ----- ---------------- ----------------
>>> LGWR  Redo etc.     1710   34 26552 00000001E8718860 00000001D8BFF4A0
>>>
>>> SQL> @ashtop username,event2,program2,blocking_session,blocking_inst_id
>>> session_id=1710 "TIMESTAMP'2020-01-29 13:04:20'" "TIMESTAMP'2020-01-29 13:04:27'"
>>>
>>> no rows selected
>>>
>>> SQL> @ashtop username,event2,program2,blocking_session,blocking_inst_id
>>> session_id=1710 "TIMESTAMP'2020-01-29 13:04:00'" "TIMESTAMP'2020-01-29 13:05:00'"
>>>
>>> no rows selected
>>>
>>> SQL> @ash/event_hist "log file" 1=1 "TIMESTAMP'2020-01-29 13:04:00'"
>>> "TIMESTAMP'2020-01-29 13:05:00'"
>>>
>>> EVH_EVENT      EVH_MILLISEC EVH_SAMPLE_COUNT EVH_EST_EVENT_COUNT EVH_EST_TOTAL_SEC PCT_EVT_TIME EVH_GRAPH    FIRST_SEEN          LAST_SEEN
>>> -------------- ------------ ---------------- ------------------- ----------------- ------------ ------------ ------------------- -------------------
>>> log file sync        < 1024                1                 1.8               1.8         12.8 |#         | 2020-01-29 13:04:13 2020-01-29 13:04:13
>>> log file sync        < 4096                1                   1               4.1         29.1 |###       | 2020-01-29 13:04:12 2020-01-29 13:04:12
>>> log file sync        < 8192                1                   1               8.2         58.2 |######    | 2020-01-29 13:04:20 2020-01-29 13:04:20
>>>
>>> SQL> @ash/event_hist "write" 1=1 "TIMESTAMP'2020-01-29 13:04:00'"
>>> "TIMESTAMP'2020-01-29 13:05:00'"
>>>
>>> no rows selected
>>>
>>> SQL> @ash/event_hist "*" 1=1 "TIMESTAMP'2020-01-29 13:04:00'"
>>> "TIMESTAMP'2020-01-29 13:05:00'"
>>>
>>> EVH_EVENT      EVH_MILLISEC EVH_SAMPLE_COUNT EVH_EST_EVENT_COUNT EVH_EST_TOTAL_SEC PCT_EVT_TIME EVH_GRAPH    FIRST_SEEN          LAST_SEEN
>>> -------------- ------------ ---------------- ------------------- ----------------- ------------ ------------ ------------------- -------------------
>>> log file sync        < 1024                1                 1.8               1.8         12.8 |#         | 2020-01-29 13:04:13 2020-01-29 13:04:13
>>> log file sync        < 4096                1                   1               4.1         29.1 |###       | 2020-01-29 13:04:12 2020-01-29 13:04:12
>>> log file sync        < 8192                1                   1               8.2         58.2 |######    | 2020-01-29 13:04:20 2020-01-29 13:04:20
>>> row cache lock          < 1                1              2469.1               2.5          100 |##########| 2020-01-29 13:04:01 2020-01-29 13:04:01
>>>
>>> SQL> @ashtop username,event2,program2,blocking_session,blocking_inst_id
>>> "lower(program) like '%lg%'" "TIMESTAMP'2020-01-29 13:04:00'"
>>> "TIMESTAMP'2020-01-29 13:05:00'"
>>>
>>> no rows selected
>>>
>>> i'm going to copy over gv$active_session_history to a single instance
>>> database and take a look at ash_wait_chains, but i don't expect that to
>>> reveal anything either.
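>>>
>>> something roughly along these lines should do the copy (a sketch;
>>> "lfs_link" is a hypothetical database link name):
>>>
>>> create table ash_copy as
>>> select * from gv$active_session_history@lfs_link
>>> where  sample_time between timestamp'2020-01-29 13:00:00'
>>>                        and timestamp'2020-01-29 13:10:00';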
>>>
>>> i need to confirm whether adding redo resolved the issue (will have a
>>> developer fire up the workload) but i think the next step is to investigate
>>> x$trace and possible process communication issues.
>>>
>>> On Thu, Jan 30, 2020 at 10:27 AM Chris Stephens <cstephens16_at_gmail.com>
>>> wrote:
>>>
>>>> Hey Tanel,
>>>>
>>>> Thanks for taking time to write this up! Will definitely be attending
>>>> that webinar.
>>>>
>>>> comments inline
>>>>
>>>> On Wed, Jan 29, 2020 at 10:32 PM Tanel Poder <tanel_at_tanelpoder.com>
>>>> wrote:
>>>>
>>>>> Comments:
>>>>>
>>>>> 1) First, looks like you're using all the right tools for systematic
>>>>> troubleshooting ;-)
>>>>>
>>>>
>>>> ***i try!
>>>>
>>>>>
>>>>> 2) As "log file sync" is not directly an I/O wait event (the I/O happens
>>>>> indirectly under the hood) but rather a "wait for LGWR to ACK/increment the
>>>>> redo persisting position", there may be multiple reasons for delays. I'm
>>>>> repeating some of what others have said here too:
>>>>> - LGWR could be waiting for remote ACK (synchronous data guard) -
>>>>> that should show up as a LNS* wait or something like that though
>>>>> - Process slowness due to CPU overload/scheduling latency or
>>>>> swapping (in both cases, LGWR may do its work fast, but it takes a while
>>>>> for the foreground process to wake up and end the wait event - you could
>>>>> check if there are any reports regarding switching LGWR mode in alert.log
>>>>> around that time)
>>>>> - Bug/missed post due to switching between LGWR progress polling vs.
>>>>> waiting for LGWR post
>>>>> - ASM instance communication issues (ASM communication can be on the
>>>>> critical path when creating/changing files on ASM and even for reading,
>>>>> when the ASM metadata cache in your local instance doesn't have what it
>>>>> needs)
>>>>> - So, the physical redo log file disk write is not necessarily the
>>>>> biggest contributor to the log file sync wait
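>>>>>
>>>>> One quick way to test that last point (a sketch; v$event_histogram is a
>>>>> standard view holding instance-wide cumulative counters): compare the
>>>>> latency distribution of the foreground-side event against LGWR's write
>>>>> event - if "log file sync" runs much slower than "log file parallel
>>>>> write", the time is being lost somewhere other than the disk write:
>>>>>
>>>>> select event, wait_time_milli, wait_count
>>>>> from   v$event_histogram
>>>>> where  event in ('log file sync', 'log file parallel write')
>>>>> order  by event, wait_time_milli;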
>>>>>
>>>> *** no data guard in the env.
>>>> *** we have excess CPU (20 cores / instance and usually only 1-5
>>>> active foreground processes, but i will confirm from TFA collections).
>>>> *** yesterday i was ready to blame a bug. today, not so sure. i checked
>>>> log file switches with a script google found me a few days ago. now
>>>> thinking that was inaccurate, as a script from metastink shows tons of
>>>> switching, so that could very well be related.
>>>> *** asm communication is something i need to look into. TFA has reported
>>>> some ASM issues but there are no details and i never saw anything in the
>>>> asm alert logs indicating an issue.
>>>>
>>>>
>>>>> 3) You'll need to measure what LGWR itself was doing during the
>>>>> problem time
>>>>> - Since you're on 19c (12c or later) the LGWR has slaves too and
>>>>> they may be doing most of the work
>>>>> - You'll need to look into both the LGWR process and any other LG*
>>>>> ones too
>>>>> - Since you have (Oracle) Snapper running already and writing to
>>>>> trace you should have the data already
>>>>> - The *lgwr* in the following "@snapper ash,stats,trace 10 999999
>>>>> lgwr" syntax includes LGWR slaves too
>>>>> - As LGWR slaves have their own sessions (and SIDs), you should run
>>>>> ashtop etc and report if there's any BLOCKING_SESSION value listed for LGWR
>>>>> and slaves
>>>>> - If LGWR and the slaves were listed *idle* during the problem time
>>>>> (while others waited for the log file sync event at the same time) *and if*
>>>>> there was no huge OS level CPU/memory shortage at the time, this
>>>>> looks like a background <-> foreground communication problem.
>>>>> - This (and any potential ASM issues) could probably be
>>>>> troubleshot via KST tracing, KSR channels and X$TRACE (some of that
>>>>> tracing is enabled by default), like I showed in a previous
>>>>> hacking session: https://www.youtube.com/watch?v=mkmvZv58W6w
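>>>>>
>>>>> To find the SIDs to feed into ashtop/snapper, something like this should
>>>>> list LGWR and any LG* slaves (a sketch relying on v$process.pname):
>>>>>
>>>>> select s.sid, s.serial#, p.pname, p.spid
>>>>> from   v$session s join v$process p on s.paddr = p.addr
>>>>> where  p.pname like 'LG%';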
>>>>>
>>>>>
>>>> here is all the output i have corresponding to a 6 sec "log file sync"
>>>> wait:
>>>>
>>>> Experience duration:   R  = t1 - t0 = 41.336137 seconds
>>>> Experience begin time: t0 = 2020-01-29T13:04:07.560478
>>>> Experience end time:   t1 = 2020-01-29T13:04:48.896615
>>>>
>>>> CALL-NAME                       DURATION      %  CALLS     MEAN      MIN      MAX
>>>> ------------------------------ --------- ------ ------ -------- -------- --------
>>>> SQL*Net message from client    26.017105  66.5%  9,886 0.002632 0.000209 7.858051
>>>> log file sync                  10.582409  27.1%    150 0.070549 0.000295 6.909711
>>>> EXEC                            2.088757   5.3%  9,702 0.000215 0.000000 0.209199
>>>> DLM cross inst call completion  0.109617   0.3%    212 0.000517 0.000005 0.080575
>>>> row cache lock                  0.070118   0.2%    125 0.000561 0.000208 0.000972
>>>> FETCH                           0.047586   0.1%  1,029 0.000046 0.000000 0.003056
>>>> PARSE                           0.045293   0.1%  4,778 0.000009 0.000000 0.002692
>>>> CLOSE                           0.033453   0.1%  4,784 0.000007 0.000000 0.000237
>>>> library cache lock              0.023614   0.1%     55 0.000429 0.000152 0.000901
>>>> library cache pin               0.022311   0.1%     55 0.000406 0.000153 0.000689
>>>> 17 others                       0.071430   0.2% 10,772 0.000007 0.000000 0.001487
>>>> ------------------------------ --------- ------ ------ -------- -------- --------
>>>> TOTAL (27)                     39.111693 100.0% 41,548 0.000941 0.000000 7.858051
>>>>
>>>> 6 sec wait from raw data:
>>>>
>>>> *** 2020-01-29T13:04:20.724666-06:00
>>>> WAIT #0: nam='log file sync' ela= 6909711 buffer#=58996 sync scn=976089279 p3=0 obj#=-1 tim=1025228531958
>>>> WAIT #0: nam='SQL*Net message to client' ela= 2 driver id=1413697536 #bytes=1 p3=0 obj#=-1 tim=1025228532118
>>>> WAIT #0: nam='SQL*Net message from client' ela= 1711 driver id=1413697536 #bytes=1 p3=0 obj#=-1 tim=1025228533852
>>>> BINDS #140204587734864:
>>>>
>>>> i enabled 10046 on lgwr:
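>>>>
>>>> (roughly via oradebug - a sketch, not necessarily the exact commands used;
>>>> the ospid is LGWR's SPID from the @bg output above:
>>>>
>>>> oradebug setospid 26552
>>>> oradebug event 10046 trace name context forever, level 8
>>>> -- later, to turn it off: oradebug event 10046 trace name context off
>>>> )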
>>>>
>>>> *** 2020-01-29T13:04:20.725994-06:00
>>>> WAIT #0: nam='rdbms ipc message' ela= 2918443 timeout=300 p2=0 p3=0 obj#=-1 tim=1025228533313
>>>> WAIT #0: nam='log file parallel write' ela= 279 files=1 blocks=17 requests=2 obj#=-1 tim=1025228533777
>>>> WAIT #0: nam='log file parallel write' ela= 171 files=1 blocks=3 requests=2 obj#=-1 tim=1025228534104
>>>> WAIT #0: nam='LGWR wait for redo copy' ela= 22 copy latch #=28 p2=0 p3=0 obj#=-1 tim=1025228534182
>>>> WAIT #0: nam='log file parallel write' ela= 183 files=1 blocks=2 requests=2 obj#=-1 tim=1025228534437
>>>> WAIT #0: nam='log file parallel write' ela= 174 files=1 blocks=2 requests=2 obj#=-1 tim=1025228534721
>>>> WAIT #0: nam='rdbms ipc message' ela= 31 timeout=8 p2=0 p3=0 obj#=-1 tim=1025228534786
>>>> WAIT #0: nam='rdbms ipc message' ela= 3233 timeout=8 p2=0 p3=0 obj#=-1 tim=1025228538038
>>>> WAIT #0: nam='log file parallel write' ela= 170 files=1 blocks=2 requests=1 obj#=-1 tim=1025228538297
>>>> WAIT #0: nam='rdbms ipc message' ela= 25 timeout=8 p2=0 p3=0 obj#=-1 tim=1025228538358
>>>> WAIT #0: nam='rdbms ipc message' ela= 11829 timeout=8 p2=0 p3=0 obj#=-1 tim=1025228550204
>>>> WAIT #0: nam='LGWR wait for redo copy' ela= 52 copy latch #=8 p2=0 p3=0 obj#=-1 tim=1025228550294
>>>> WAIT #0: nam='log file parallel write' ela= 239 files=1 blocks=7 requests=1 obj#=-1 tim=1025228550624
>>>> WAIT #0: nam='rdbms ipc message' ela= 23 timeout=6 p2=0 p3=0 obj#=-1 tim=1025228550699
>>>> WAIT #0: nam='rdbms ipc message' ela= 16750 timeout=6 p2=0 p3=0 obj#=-1 tim=1025228567467
>>>> WAIT #0: nam='log file parallel write' ela= 207 files=1 blocks=9 requests=1 obj#=-1 tim=1025228567762
>>>> WAIT #0: nam='rdbms ipc message' ela= 26 timeout=5 p2=0 p3=0 obj#=-1 tim=1025228567825
>>>> WAIT #0: nam='rdbms ipc message' ela= 16249 timeout=5 p2=0 p3=0 obj#=-1 tim=1025228584091
>>>> WAIT #0: nam='log file parallel write' ela= 213 files=1 blocks=9 requests=1 obj#=-1 tim=1025228584400
>>>> WAIT #0: nam='rdbms ipc message' ela= 48 timeout=3 p2=0 p3=0 obj#=-1 tim=1025228584502
>>>> WAIT #0: nam='rdbms ipc message' ela= 16466 timeout=3 p2=0 p3=0 obj#=-1 tim=1025228600993
>>>> WAIT #0: nam='log file parallel write' ela= 228 files=1 blocks=9 requests=1 obj#=-1 tim=1025228601393
>>>> WAIT #0: nam='rdbms ipc message' ela= 29 timeout=1 p2=0 p3=0 obj#=-1 tim=1025228601482
>>>> WAIT #0: nam='rdbms ipc message' ela= 9548 timeout=1 p2=0 p3=0 obj#=-1 tim=1025228611053
>>>> WAIT #0: nam='rdbms ipc message' ela= 7467 timeout=300 p2=0 p3=0 obj#=-1 tim=1025228618574
>>>> WAIT #0: nam='LGWR wait for redo copy' ela= 13 copy latch #=8 p2=0 p3=0 obj#=-1 tim=1025228618636
>>>> WAIT #0: nam='log file parallel write' ela= 222 files=1 blocks=9 requests=1 obj#=-1 tim=1025228618960
>>>> WAIT #0: nam='rdbms ipc message' ela= 31 timeout=300 p2=0 p3=0 obj#=-1 tim=1025228619066
>>>> WAIT #0: nam='rdbms ipc message' ela= 15946 timeout=300 p2=0 p3=0 obj#=-1 tim=1025228635045
>>>> WAIT #0: nam='log file parallel write' ela= 240 files=1 blocks=9 requests=1 obj#=-1 tim=1025228635427
>>>> WAIT #0: nam='rdbms ipc message' ela= 33 timeout=298 p2=0 p3=0 obj#=-1 tim=1025228635525
>>>> WAIT #0: nam='rdbms ipc message' ela= 15987 timeout=298 p2=0 p3=0 obj#=-1 tim=1025228651537
>>>> WAIT #0: nam='log file parallel write' ela= 255 files=1 blocks=9 requests=1 obj#=-1 tim=1025228651912
>>>> WAIT #0: nam='rdbms ipc message' ela= 51 timeout=296 p2=0 p3=0 obj#=-1 tim=1025228652034
>>>> WAIT #0: nam='rdbms ipc message' ela= 16332 timeout=296 p2=0 p3=0 obj#=-1 tim=1025228668405
>>>> WAIT #0: nam='log file parallel write' ela= 256 files=1 blocks=9 requests=1 obj#=-1 tim=1025228668804
>>>> WAIT #0: nam='rdbms ipc message' ela= 37 timeout=295 p2=0 p3=0 obj#=-1 tim=1025228668911
>>>> WAIT #0: nam='rdbms ipc message' ela= 16969 timeout=295 p2=0 p3=0 obj#=-1 tim=1025228685907
>>>> WAIT #0: nam='log file parallel write' ela= 273 files=1 blocks=9 requests=1 obj#=-1 tim=1025228686327
>>>> WAIT #0: nam='rdbms ipc message' ela= 38 timeout=293 p2=0 p3=0 obj#=-1 tim=1025228686453
>>>> WAIT #0: nam='rdbms ipc message' ela= 6908 timeout=293 p2=0 p3=0 obj#=-1 tim=1025228693390
>>>> WAIT #0: nam='log file parallel write' ela= 247 files=1 blocks=2 requests=1 obj#=-1 tim=1025228693768
>>>> WAIT #0: nam='rdbms ipc message' ela= 38 timeout=292 p2=0 p3=0 obj#=-1 tim=1025228693862
>>>> ...more of the same
>>>> WAIT #0: nam='rdbms ipc message' ela= 32 timeout=52 p2=0 p3=0 obj#=-1 tim=1025231090742
>>>> WAIT #0: nam='rdbms ipc message' ela= 19460 timeout=52 p2=0 p3=0 obj#=-1 tim=1025231110227
>>>> WAIT #0: nam='log file parallel write' ela= 272 files=1 blocks=9 requests=1 obj#=-1 tim=1025231110633
>>>> WAIT #0: nam='rdbms ipc message' ela= 37 timeout=50 p2=0 p3=0 obj#=-1 tim=1025231110748
>>>> WAIT #0: nam='rdbms ipc message' ela= 15484 timeout=50 p2=0 p3=0 obj#=-1 tim=1025231126261
>>>> WAIT #0: nam='log file parallel write' ela= 247 files=1 blocks=4 requests=1 obj#=-1 tim=1025231126641
>>>> WAIT #0: nam='rdbms ipc message' ela= 36 timeout=49 p2=0 p3=0 obj#=-1 tim=1025231126753
>>>> WAIT #0: nam='rdbms ipc message' ela= 16732 timeout=49 p2=0 p3=0 obj#=-1 tim=1025231143514
>>>> WAIT #0: nam='LGWR wait for redo copy' ela= 31 copy latch #=8 p2=0 p3=0 obj#=-1 tim=1025231143603
>>>> WAIT #0: nam='log file parallel write' ela= 279 files=1 blocks=9 requests=1 obj#=-1 tim=1025231143976
>>>> WAIT #0: nam='rdbms ipc message' ela= 35 timeout=47 p2=0 p3=0 obj#=-1 tim=1025231144093
>>>> WAIT #0: nam='rdbms ipc message' ela= 5384 timeout=47 p2=0 p3=0 obj#=-1 tim=1025231149515
>>>> WAIT #0: nam='log file parallel write' ela= 255 files=1 blocks=4 requests=1 obj#=-1 tim=1025231149889
>>>> WAIT #0: nam='rdbms ipc message' ela= 32 timeout=47 p2=0 p3=0 obj#=-1 tim=1025231149989
>>>>
>>>> *** 2020-01-29T13:04:23.802774-06:00
>>>> WAIT #0: nam='rdbms ipc message' ela= 460056 timeout=46 p2=0 p3=0 obj#=-1 tim=1025231610086
>>>>
>>>> *** 2020-01-29T13:04:26.797925-06:00
>>>> WAIT #0: nam='rdbms ipc message' ela= 2995027 timeout=300 p2=0 p3=0 obj#=-1 tim=1025234605238
>>>> WAIT #0: nam='rdbms ipc message' ela= 10649 timeout=1 p2=0 p3=0 obj#=-1 tim=1025234616054
>>>>
>>>> i will get you snapper lgwr output associated w/ this interval as soon
>>>> as i can find it (can't remember where i stashed it and lost trace file
>>>> name .. fml)
>>>>
>>>>> 4) Paste the full *Snapper* output (of one snapshot from the
>>>>> problem time) here if you can - we'll be able to see what the slaves were
>>>>> doing
>>>>> - You'd need to include not only LGWR SID but also any LG* slave
>>>>> SIDs too
>>>>> - Also we'll see better LGWR delay metrics from V$SESSTAT that go
>>>>> beyond just wait events
>>>>> - Thinking of metrics like these - snapper already handles them (if
>>>>> you look into the rightmost column)
>>>>> - *redo synch time overhead (usec)* - "FG wakeup overhead per
>>>>> log file sync"
>>>>> - *redo write time* - accounting end to end redo write time (not
>>>>> just the log file parallel write syscall) "per redo write"
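>>>>>
>>>>> Those statistics can also be sampled straight from V$SESSTAT (a sketch;
>>>>> note that "redo synch time overhead (usec)" accumulates in the committing
>>>>> foreground sessions while "redo write time" accumulates in LGWR's session):
>>>>>
>>>>> select st.sid, sn.name, st.value
>>>>> from   v$sesstat st join v$statname sn on st.statistic# = sn.statistic#
>>>>> where  sn.name in ('redo synch time overhead (usec)', 'redo write time')
>>>>> and    st.value > 0;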
>>>>>
>>>>> 5) Regarding the *ASH wait chains* error on 19c - I'm aware of this
>>>>> but forgot to look into it
>>>>> - It has (probably) something to do with GV$ views (or PX slaves
>>>>> used for PDB V$ queries) and possibly the connect by I'm using
>>>>> - Does it work in the CDB (if multitenant)?
>>>>> - Does it work with lower optimizer features enabled?
>>>>> - As we are already manually looking into what LGWR and its slaves
>>>>> are doing, there's no need for the wait chains script here
>>>>>
>>>>> ***oracle has filed a bug. altering optimizer_features_enabled doesn't
>>>> help.
>>>>
>>>>
>>>>> 6) Linux Process Snapper - pSnapper
>>>>> - If you want to see the full "wait" states of Linux
>>>>> processes/threads, use the *-a* option.
>>>>> - It's somewhat like always showing "idle" threads too
>>>>> - But for Linux apps it's harder to determine what's idle and
>>>>> what's some network RPC wait, so I currently just do what the Linux "load"
>>>>> metric does
>>>>> - It just shows threads in Runnable and Disk/Uninterruptible sleep
>>>>> states by default (-a shows all states, including Sleep and kernel-idle
>>>>> threads etc)
>>>>> - So the Linux pSnapper "idle" means a somewhat different thing
>>>>> than Oracle "idle" - say LGWR was waiting for a Data Guard ACK: that should
>>>>> not be shown as idle in Oracle, but at the Linux process level it's "just
>>>>> some network socket wait" and you'd need to use the *-a* option to show
>>>>> all of them
>>>>>
>>>>
>>>> *** will add that option to future collections (and watch the
>>>> video/read help)
>>>>
>>>>>
>>>>> 7) Were LGWR/LG* processes mostly idle during the problem time?
>>>>> - As I kept reading the thread, it looks like ASH reported LGWR
>>>>> being mostly idle during the problem time
>>>>> - This is not exactly the same pattern, but I've seen cases in the
>>>>> past where everyone's waiting for LGWR and LGWR doesn't seem to be busy and
>>>>> doesn't show any significant waits
>>>>> - It turned out that LGWR was actually waiting for CKPT to make
>>>>> progress, but it wasn't synchronously waiting, but just going to sleep and
>>>>> "checking again later" in a loop
>>>>> - I wrote about this here:
>>>>> https://blog.tanelpoder.com/posts/log-file-switch-checkpoint-incomplete-and-lgwr-waiting-for-checkpoint-progress/
>>>>> - So it may still make sense to check what LGWR/LG* processes are
>>>>> doing even if they don't seem to be active much (like do you see an
>>>>> occasional controlfile read wait by LGWR etc)
>>>>> - SQL*Trace would capture all activity of LGWR, but may slow things
>>>>> down further, so you could also sample & spool V$SESSION_WAIT_HISTORY (which
>>>>> keeps a log of the last 10 waits of every session in memory)
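>>>>>
>>>>> The sampling could be as simple as spooling this from sqlplus in a loop
>>>>> (a sketch; the LG% pname match is an assumption about the slave names):
>>>>>
>>>>> select w.sid, w.seq#, w.event, w.wait_time
>>>>> from   v$session_wait_history w
>>>>> where  w.sid in (select s.sid
>>>>>                  from   v$session s join v$process p on s.paddr = p.addr
>>>>>                  where  p.pname like 'LG%')
>>>>> order  by w.sid, w.seq#;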
>>>>>
>>>>> *** i think i've addressed most of this.
>>>>
>>>>
>>>>> Incidentally I'm doing another hacking session tomorrow (Thursday 30
>>>>> Jan at 12pm ET) about complex Oracle performance problems that span multiple
>>>>> layers and have multiple moving parts. I can add some LGWR troubleshooting
>>>>> magic at the end of that (here goes my plan to fit the session in 1.5h
>>>>> again!). I'll post the details on my blog in a few minutes.
>>>>>
>>>>> *** can't wait!!
>>>>
>>>>> --
>>>>> Tanel Poder
>>>>> https://blog.tanelpoder.com/seminar
>>>>>
>>>>>
>>>>> On Tue, Jan 28, 2020 at 2:22 PM Chris Stephens <cstephens16_at_gmail.com>
>>>>> wrote:
>>>>>
>>>>>> doesn't look like any ash data for lgwr:
>>>>>>
>>>>>> SQL> @ashtop session_id,blocking_session,event2 1=1
>>>>>> "TIMESTAMP'2020-01-27 15:50:48'" "TIMESTAMP'2020-01-27 15:52:40'"
>>>>>> TOTALSECONDS AAS %This  SESSION_ID BLOCKING_SESSION EVENT2             FIRST_SEEN          LAST_SEEN           DIST_SQLEXEC_SEEN
>>>>>> ------------ --- ------ ---------- ---------------- ------------------ ------------------- ------------------- -----------------
>>>>>>           81 0.7  82% |       1713             1710 log file sync      2020-01-27 15:50:54 2020-01-27 15:52:14                 1
>>>>>>            5   0   5% |       1221                  ON CPU             2020-01-27 15:50:59 2020-01-27 15:52:14                 1
>>>>>>            3   0   3% |        246                  ON CPU             2020-01-27 15:51:00 2020-01-27 15:52:00                 1
>>>>>>            3   0   3% |       1713                  ON CPU             2020-01-27 15:50:51 2020-01-27 15:52:27                 2
>>>>>>            1   0   1% |        123                  ON CPU             2020-01-27 15:51:35 2020-01-27 15:51:35                 1
>>>>>>            1   0   1% |        127                  ON CPU             2020-01-27 15:51:50 2020-01-27 15:51:50                 1
>>>>>>            1   0   1% |        252             2321 latch free         2020-01-27 15:52:28 2020-01-27 15:52:28                 1
>>>>>>            1   0   1% |        978                  ges remote message 2020-01-27 15:50:48 2020-01-27 15:50:48                 1
>>>>>>            1   0   1% |        983                  latch free         2020-01-27 15:52:28 2020-01-27 15:52:28                 1
>>>>>>            1   0   1% |       1713                  library cache lock 2020-01-27 15:50:48 2020-01-27 15:50:48                 1
>>>>>>            1   0   1% |       1831                  ON CPU             2020-01-27 15:50:49 2020-01-27 15:50:49                 1
>>>>>>
>>>>>> 11 rows selected.
>>>>>>
>>>>>>
>>>>>> SQL> @ashtop session_id,blocking_session,event2 session_id=1710
>>>>>> "TIMESTAMP'2020-01-27 15:50:48'" "TIMESTAMP'2020-01-27 15:52:40'"
>>>>>>
>>>>>> no rows selected
>>>>>>
>>>>>> On Tue, Jan 28, 2020 at 1:09 PM Noveljic Nenad <
>>>>>> nenad.noveljic_at_vontobel.com> wrote:
>>>>>>
>>>>>>> You’re using v0.2, but v0.6 should be the latest version. 0.2
>>>>>>> wouldn’t show idle blockers.
>>>>>>>
>>>>>>> I looked at the ashtop output once again. Strange that neither the lgwr
>>>>>>> process nor any other bg processes show up significantly there. The
>>>>>>> question is, what was lgwr doing while the fg processes were waiting on it.
>>>>>>>
>>>>>>>

--
http://www.freelists.org/webpage/oracle-l
Received on Sun Feb 02 2020 - 23:09:10 CET
