Re: LGWR, EMC or app cursors?

From: Andy Sayer <andysayer_at_gmail.com>
Date: Thu, 10 Oct 2019 18:50:43 +0100
Message-ID: <CACj1VR6T0QNfg8UouTZF=23RYTZ3_WMB1mAXGK+Od49YC3K3vA_at_mail.gmail.com>

Storage folks may not be trustworthy, or may not be checking the right things. Remember not every system is as well instrumented as your Oracle DB!

If you are seeing 30+ seconds for an individual log file parallel write then the IO is almost certainly going missing. You can “handle” this by decreasing your disk timeout (it sounds like it’s 30 seconds which is way too pessimistic for modern storage speeds). This just means that your 30 second IO requests will be retried sooner.
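
As a starting point for checking what the timeout currently is, here is a minimal Python sketch. It assumes the "disk timeout" in question is the per-device SCSI command timeout that Linux exposes under /sys/block/<dev>/device/timeout; that is an assumption, and multipath pseudo-devices such as emcpower* may not expose the file at all (the underlying sd* paths usually do). The function names are made up for illustration.

```python
from pathlib import Path


def read_disk_timeouts(sys_block="/sys/block"):
    """Return {device_name: timeout_seconds} for devices that expose
    a SCSI-style timeout file under <sys_block>/<dev>/device/timeout."""
    root = Path(sys_block)
    timeouts = {}
    if not root.is_dir():
        return timeouts  # not Linux, or sysfs not mounted
    for dev in sorted(root.iterdir()):
        timeout_file = dev / "device" / "timeout"
        try:
            timeouts[dev.name] = int(timeout_file.read_text().strip())
        except (OSError, ValueError):
            # No timeout file (loop devices, some pseudo-devices) or
            # unexpected content: skip rather than guess.
            continue
    return timeouts


def at_or_above(timeouts, limit=30):
    """Names of devices whose timeout is at or above `limit` seconds."""
    return sorted(dev for dev, secs in timeouts.items() if secs >= limit)
```

Calling at_or_above(read_disk_timeouts()) lists the devices still at 30 seconds or more. What value to lower it to is a question for the storage vendor's documentation, not something this sketch can answer.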

You will probably see this occurring more than once and on more than just your redo files (slow redo writes are more noticeable because everything will wait on that eventually).

While the application can probably do less work, it’s not the application’s fault it sometimes has to spend half a minute doing nothing.

Thanks,
Andy

On Thu, 10 Oct 2019 at 18:10, Herring, David <dmarc-noreply_at_freelists.org> wrote:

> The storage folks are assuring me that there are no issues at that level
> and that any shared resources are NOT seeing the same thing. I have to
> take their word for it.
>
>
>
> As for oswatcher details, yes, it's running and I've reviewed "ps" data
> looking for the LGWR process. For WCHAN around one of the times we had a
> problem, LGWR lists "SYSC_s" and "asm_ma". I'll see if I can track down
> what that means. OSW runs every 30 sec so obviously there's a lot of
> detail missing when the problem happens.
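
Since a 30 second OSW interval can miss the whole event, one option is to sample WCHAN for LGWR more often. A minimal Python sketch follows, assuming a Linux /proc filesystem, that the log writer's command line starts with "ora_lgwr_" (the usual naming for a database instance, but still an assumption here), and that the account running it is allowed to read /proc/<pid>/wchan (without that permission the kernel typically reports 0). The function names are illustrative only.

```python
import time
from pathlib import Path


def find_lgwr_pids(proc_root="/proc"):
    """PIDs whose command line starts with 'ora_lgwr_' (assumed naming)."""
    root = Path(proc_root)
    pids = []
    if not root.is_dir():
        return pids
    for entry in root.iterdir():
        if not entry.name.isdigit():
            continue
        try:
            cmdline = (entry / "cmdline").read_bytes()
        except OSError:
            continue  # process exited or is not readable
        name = cmdline.split(b"\0", 1)[0].decode(errors="replace")
        if name.startswith("ora_lgwr_"):
            pids.append(int(entry.name))
    return sorted(pids)


def read_wchan(pid, proc_root="/proc"):
    """Kernel wait channel for `pid`, or None if it cannot be read."""
    try:
        text = (Path(proc_root) / str(pid) / "wchan").read_text().strip()
    except OSError:
        return None
    return text or "0"


def sample(interval=1.0, count=60, proc_root="/proc"):
    """Print a timestamped WCHAN for each LGWR process, `count` times."""
    for _ in range(count):
        stamp = time.strftime("%H:%M:%S")
        for pid in find_lgwr_pids(proc_root):
            print(stamp, pid, read_wchan(pid, proc_root))
        time.sleep(interval)
```

Running sample(interval=1.0, count=3600) in the background and redirecting the output to a file gives one line per second to line up against the ASH timestamps.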
>
>
>
> Regards,
>
> Dave
>
> *From:* oracle-l-bounce_at_freelists.org <oracle-l-bounce_at_freelists.org> *On
> Behalf Of *Tiwari, Yogesh
> *Sent:* Wednesday, October 9, 2019 11:13 PM
> *To:* oracle-l_at_freelists.org
> *Cc:* Tiwari, Yogesh <Yogesh.Tiwari_at_fidelity.co.in>
> *Subject:* RE: LGWR, EMC or app cursors?
>
>
> Dave,
>
>
>
> If you are sharing the SAN with other database hosts, they should probably
> be complaining at the same time as well. Is that the case for you?
>
>
>
> Further, you might want to capture WCHAN for the LGWR process(es) when it
> goes into this hang. This can give clues as to whether it is really a SAN
> issue or an OS issue.
>
> Tools like oswatcher (TFA) do capture complete “ps” detail, which can help
> in this situation. Do you see that as an option?
>
>
>
> Thanks,
>
> *Yogi *
>
> *From:* oracle-l-bounce_at_freelists.org <oracle-l-bounce_at_freelists.org> *On
> Behalf Of *Herring, David
> *Sent:* 10 October 2019 08:11
> *To:* martin.a.berger_at_gmail.com; dmarc-noreply_at_freelists.org
> *Cc:* oracle-l_at_freelists.org
> *Subject:* RE: LGWR, EMC or app cursors?
>
>
>
> First, thanks to a number of you who replied. The problem still exists
> but at least I know a bit more about it. I read through Frits' post and
> it's quite interesting and informative, but didn't help. One potentially
> serious problem in debugging this is that "strace" or any type of system
> trace is not available on these servers; my guess is that the security team
> felt having access to that is a "no-no". Of course I seriously doubt it'd be
> realistic to strace LGWR and let it run for hours, waiting for the problem
> to occur (potentially large performance impact, let alone a giant
> tracefile).
>
>
>
> Unfortunately I'm back to trying to figure out details on exactly what
> LGWR is doing during its "log file parallel write". Per Andy's suggestion
> I validated that the column SEQ# in ASH doesn't change for the duration
> of the problem for LGWR, so it's one huge wait. In fact, seconds before the
> one example that I’m trying to tear apart I see LGWR waiting on the same
> event but with a different SEQ#, so it got some work done, then just spun
> for nearly 30 seconds while all other DML sat and waited on "log file
> sync". LGWR finally gets its work done and everything is back to normal.
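
The SEQ# check described above can be sketched in Python as follows. It assumes the LGWR rows have already been pulled out of ASH as time-ordered (sample_time, event, seq) tuples; how they are exported, the tuple layout and the function name are all illustrative assumptions. Consecutive samples that share the same event and SEQ# are treated as one wait, and only waits spanning at least min_seconds are reported.

```python
from itertools import groupby


def long_waits(samples, min_seconds=5):
    """Find waits that span many consecutive ASH samples.

    samples: time-ordered iterable of (sample_time, event, seq), where
    sample_time is a datetime. Returns a list of
    (event, seq, first_seen, last_seen, seconds_spanned).
    """
    runs = []
    # Same event and same SEQ# in adjacent samples means the session
    # never left that wait between the samples.
    for (event, seq), group in groupby(samples, key=lambda s: (s[1], s[2])):
        group = list(group)
        first_seen, last_seen = group[0][0], group[-1][0]
        seconds = (last_seen - first_seen).total_seconds()
        if seconds >= min_seconds:
            runs.append((event, seq, first_seen, last_seen, seconds))
    return runs
```

The span is measured between the first and last sample of a run, so it slightly understates the real wait, which started before the first sample and ended after the last one.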
>
>
>
> I'm going to go back to the full issue bridge list (we have calls on this
> daily with SMEs covering all areas) and see if I can get a 100%
> confirmation that they've validated all components in between LGWR and the
> physical disk.
>
>
>
> Regards,
>
> Dave
>
> *From:* oracle-l-bounce_at_freelists.org <oracle-l-bounce_at_freelists.org> *On
> Behalf Of *Martin Berger
> *Sent:* Tuesday, October 8, 2019 2:34 AM
> *To:* dmarc-noreply_at_freelists.org
> *Cc:* oracle-l_at_freelists.org
> *Subject:* Re: LGWR, EMC or app cursors?
>
> Hi Dave,
>
>
>
> as you asked for tracing, a "normal" 10046 trace can be enabled for
> logwriter
> <https://fritshoogland.files.wordpress.com/2014/04/profiling-the-logwriter-and-database-writer.pdf>
> .
>
> You will not get SQL statements, but normal trace information regarding
> WAITs.
>
>
>
> The event log file parallel write is somewhat tricky. Frits wrote a nice blog
> post
> <https://fritshoogland.wordpress.com/2013/08/30/oracle-io-on-linux-log-writer-io-and-wait-events/>
> about it.
>
> It's important to understand that it represents multiple IOs (that's the
> parallel).
>
>
>
> > "EMC and sysadmins have confirmed there are no disk errors and from
> their standpoint the disks are waiting on Oracle."
>
> I assume you have one (or two) Fibre Channel SANs connecting the EMC array
> and your DB host. Please ask them for measurements on those switches as well.
>
> The argument is simple: If the host claims it waits on the disks
> (according to iostat) and EMC claims it's waiting on Oracle, have a closer
> look at the components in between.
>
>
>
> hth,
>
> Martin
>
>
>
>
>
> On Mon, 7 Oct 2019 at 17:20, Herring, David <
> dmarc-noreply_at_freelists.org> wrote:
>
> Folks, I've got a bit of a mystery with a particular db where we're
> getting a periodic 25-30 second pause between user sessions and LGWR
> processes and can't clearly identify what the cause is.
>
>
>
> - The database is 11.2.0.4, RHEL 7.5, running ASM on EMC.
> - Sometimes once a day, sometimes more (never more than 5) times a day
> we see user processes start waiting on "log file sync". LGWR is waiting on
> "log file parallel write".
> - At the same time one of the emcpower* devices shows 100% busy and
> service time 200+ (from iostat via osw). mpstat shows 1 CPU at 100% on
> iowait. It's not always the same disk (emcpowere1, a1, h1, …), not always
> the same CPU. EMC and sysadmins have confirmed there are no disk errors
> and from their standpoint the disks are waiting on Oracle.
> - During this time LGWR stats in ASH are all 0 - TIME_WAITED, DELTA*
> columns. Only after the problem goes away (about 25 secs) these columns
> are populated again, obviously the DELTA* columns 1 row later. LGWR's
> session state is WAITING so I assume the column value observations are due
> to LGWR waiting, as it won't write stats until it can do something.
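
To pick the busy intervals out of the OSW iostat archives without reading them by eye, a minimal Python sketch along these lines may help. It assumes "iostat -x" style output with a header row starting with "Device" and a "%util" column, and it assumes that "service time" above means the svctm column; both are assumptions, since column names differ between sysstat versions, so the column name is a parameter. The function name and thresholds are illustrative.

```python
def flag_busy_devices(iostat_text, util_limit=99.0, time_limit=200.0,
                      time_column="svctm"):
    """Return [(device, util, time_value)] for rows where both %util and
    the chosen time column are at or above their limits."""
    flagged = []
    header = None
    for line in iostat_text.splitlines():
        fields = line.split()
        if not fields:
            header = None  # a blank line ends the current device table
            continue
        if fields[0].startswith("Device"):
            header = fields
            continue
        if header is None or len(fields) != len(header):
            continue  # CPU section, timestamps, or anything unexpected
        row = dict(zip(header[1:], fields[1:]))
        try:
            util = float(row["%util"])
            time_value = float(row[time_column])
        except (KeyError, ValueError):
            continue  # column missing or not numeric in this format
        if util >= util_limit and time_value >= time_limit:
            flagged.append((fields[0], util, time_value))
    return flagged
```

Feeding it one OSW iostat file at a time and printing the result shows which emcpower* devices hit both thresholds; the timestamps around each flagged sample still have to be matched to the ASH data by hand.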
>
>
>
> I am stuck trying to find out, really prove who is the culprit or what
> exactly the wait is on. Is LGWR waiting on user sessions and user sessions
> are waiting on LGWR and all that causes the disk to be 100%? Can I enable
> some sort of tracing on LGWR and would that point to exactly what he's
> waiting on to prove where the problem is?
>
>
>
> Regards,
>
>
>
> Dave
>
>

--
http://www.freelists.org/webpage/oracle-l


Received on Thu Oct 10 2019 - 19:50:43 CEST
