Re: LGWR, EMC or app cursors?

From: Jared Still <jkstill_at_gmail.com>
Date: Fri, 18 Oct 2019 08:02:32 -0700
Message-ID: <CAORjz=O_E=s2Hxazu9jwvMzXS57O+GVAgev6wzmjyj05U2s1kw_at_mail.gmail.com>

Hi Dave,

Re strace: it does not require installation; the binary can be copied from a like system and executed.

strace attaches as a debugger to a process, so as you mentioned, it's not a good idea to run it for long on a critical process.
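For illustration, a time-boxed strace of the log writer's write-path syscalls might look like this (the SID "ORCL", the syscall list, and the output path are placeholders/assumptions; try it on a test system first):

```shell
# Find the log writer PID (ORCL is a placeholder SID; pgrep may find nothing).
LGWR_PID=$(pgrep -f ora_lgwr_ORCL || true)

if [ -n "$LGWR_PID" ]; then
  # Attach for at most 30 seconds to bound the debugger overhead:
  #   -tt  wall-clock timestamps, -T time spent inside each syscall,
  #   -e   restrict output to the write-path syscalls LGWR typically issues.
  timeout 30 strace -p "$LGWR_PID" -tt -T \
    -e trace=pwrite64,io_submit,io_getevents \
    -o /tmp/lgwr_strace.out
fi
```

The `timeout` wrapper is the key safety valve: it guarantees the debugger detaches even if you walk away.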

perf, on the other hand, does not attach itself to a process. Like ASH, perf is a sampler, though it samples far more frequently than ASH.

Take a look here if you're unfamiliar with perf:

http://www.brendangregg.com/perf.html
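A sketch of sampling LGWR with perf (placeholder SID again; 99 Hz is a common choice of sampling rate to avoid lockstep with timer ticks):

```shell
# Locate the log writer (ORCL is a placeholder SID).
LGWR_PID=$(pgrep -f ora_lgwr_ORCL || true)

if [ -n "$LGWR_PID" ]; then
  # Sample on-CPU stacks at 99 Hz for 30 seconds; unlike strace,
  # this does not stop the target process at each event.
  perf record -F 99 -g -p "$LGWR_PID" -- sleep 30
  # Summarize where the sampled time went.
  perf report --stdio > /tmp/lgwr_perf.txt
fi
```

Note that for a process stuck off-CPU in an I/O wait, on-CPU samples will be sparse; Gregg's page above also covers off-CPU and tracepoint-based approaches.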

Root access is often required for perf or strace, even when examining Oracle processes while running as the oracle user.

I haven't attempted to discover why that is; I have just experienced it several times.
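As an aside, the SEQ# check Dave describes below can be scripted; this is a sketch only, with an assumed one-hour window and an assumed `%(LGWR)%` program match:

```shell
# Write the ASH query to a file; run it manually as a DBA user.
cat > /tmp/lgwr_ash.sql <<'EOF'
-- One row per ASH sample for LGWR in the last hour; a SEQ# that does
-- not change across consecutive samples indicates one long wait.
select sample_time, session_state, event, seq#, time_waited
from   v$active_session_history
where  program like '%(LGWR)%'
and    sample_time > sysdate - 1/24
order  by sample_time;
EOF

# Only attempt to run it where sqlplus actually exists.
if command -v sqlplus >/dev/null 2>&1; then
  sqlplus -S / as sysdba @/tmp/lgwr_ash.sql
fi
```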

On Wed, Oct 9, 2019 at 19:42 Herring, David <dmarc-noreply_at_freelists.org> wrote:

> First, thanks to a number of you who replied. The problem still exists
> but at least I know a bit more about it. I read through Frits' post and
> it's quite interesting and informative, but didn't help. One potentially
> serious problem in debugging this is that "strace", or any other system
> tracing tool, is not available on these servers; my guess is that the security team felt
> having access to that is a "no-no". Of course I seriously doubt it'd be
> realistic to strace LGWR and let it run for hours, waiting for the problem
> to occur (potentially large performance impact, let alone a giant
> tracefile).
>
>
>
> Unfortunately I'm back to trying to figure out details on exactly what
> LGWR is doing during its "log file parallel write". Per Andy's suggestion
> I validated that the column SEQ# in ASH doesn't change for the duration
> of the problem for LGWR, so it's one huge wait. In fact, seconds before the
> one example that I'm trying to tear apart, I see LGWR waiting on the same
> event but with a different SEQ#, so it got some work done, then just spun
> for nearly 30 seconds while all other DML sat and waited on "log file
> sync". LGWR finally gets its work done, and everything goes back to normal.
>
>
>
> I'm going to go back to the full issue bridge list (we have calls on this
> daily with SMEs covering all areas) and see if I can get a 100%
> confirmation that they've validated all components in between LGWR and the
> physical disk.
>
>
>
> Regards,
>
>
>
> Dave
>
>
>
> *Dave Herring*
>
> DBA
>
> 103 JFK Parkway
>
> Short Hills, New Jersey 07078
>
> Mobile 630.441.4404
>
>
>
> *dnb.com <http://www.dnb.com/>*
>
>
>
>
> *From:* oracle-l-bounce_at_freelists.org <oracle-l-bounce_at_freelists.org> *On
> Behalf Of *Martin Berger
> *Sent:* Tuesday, October 8, 2019 2:34 AM
> *To:* dmarc-noreply_at_freelists.org
> *Cc:* oracle-l_at_freelists.org
> *Subject:* Re: LGWR, EMC or app cursors?
>
>
>
> Hi Dave,
>
>
>
> as you asked for tracing, a "normal" 10046 trace can be enabled for the
> log writer
> <https://fritshoogland.files.wordpress.com/2014/04/profiling-the-logwriter-and-database-writer.pdf>.
>
> You will not get SQL statements, but normal trace information regarding
> WAITs.
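A minimal sketch of that oradebug sequence (level 8 captures waits; the pause between enable and disable is up to you, and the script path is a placeholder; review before running as SYSDBA):

```shell
# Script the oradebug steps; review before running.
cat > /tmp/lgwr_10046.sql <<'EOF'
oradebug setorapname lgwr
oradebug event 10046 trace name context forever, level 8
-- wait here until the stall has been captured, then:
oradebug event 10046 trace name context off
oradebug tracefile_name
EOF

# Only attempt to run it where sqlplus actually exists.
if command -v sqlplus >/dev/null 2>&1; then
  sqlplus / as sysdba @/tmp/lgwr_10046.sql
fi
```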
>
>
>
> The event "log file parallel write" is somewhat tricky. Frits wrote a nice
> blog post
> <https://fritshoogland.wordpress.com/2013/08/30/oracle-io-on-linux-log-writer-io-and-wait-events/>
> about it.
>
> It's important to understand that it represents multiple IOs (that's the
> parallel).
>
>
>
> > "EMC and sysadmins have confirmed there are no disk errors and from
> their standpoint the disks are waiting on Oracle."
>
> I assume you have one (or two) Fibre Channel SAN fabrics connecting the EMC
> array and your DB host. Please also ask for measurements on those switches.
>
> The argument is simple: if the host claims it is waiting on the disks
> (according to iostat) and EMC claims it is waiting on Oracle, take a closer
> look at the components in between.
>
>
>
> hth,
>
> Martin
>
>
>
>
>
> Am Mo., 7. Okt. 2019 um 17:20 Uhr schrieb Herring, David <
> dmarc-noreply_at_freelists.org>:
>
> Folks, I've got a bit of a mystery with a particular db where we're
> getting a periodic 25-30 second pause between user sessions and LGWR processes and
> can't clearly identify what's the cause.
>
>
>
> - The database is 11.2.0.4, RHEL 7.5, running ASM on EMC.
> - Sometimes once a day, sometimes more often (but never more than 5 times a
> day), we see user processes start waiting on "log file sync" while LGWR is
> waiting on "log file parallel write".
> - At the same time one of the emcpower* devices shows 100% busy and
> service times of 200+ ms (from iostat via OSWatcher). mpstat shows 1 CPU at
> 100% in iowait. It's not always the same disk (emcpowere1, a1, h1, …), not always
> the same CPU. EMC and sysadmins have confirmed there are no disk errors
> and from their standpoint the disks are waiting on Oracle.
> - During this time LGWR's stats in ASH are all 0 (TIME_WAITED and the DELTA*
> columns). Only after the problem goes away (about 25 secs) are these columns
> populated again, with the DELTA* columns obviously 1 row later. LGWR's
> session state is WAITING, so I assume these column values are due
> to LGWR waiting, as it won't write stats until it can do something.
>
>
>
> I am stuck trying to find out, and really prove, who the culprit is or what
> exactly the wait is on. Is LGWR waiting on user sessions while user sessions
> are waiting on LGWR, and is all of that driving the disk to 100% busy? Can I
> enable some sort of tracing on LGWR, and would that point to exactly what it's
> waiting on, to prove where the problem is?
>
>
>
> Regards,
>
>
>
> Dave
>
> --
Jared Still
Certifiable Oracle DBA and Part Time Perl Evangelist
Principal Consultant at Pythian
Pythian Blog: http://www.pythian.com/blog/author/still/
Github: https://github.com/jkstill

--
http://www.freelists.org/webpage/oracle-l



Received on Fri Oct 18 2019 - 17:02:32 CEST
