Re: LGWR, EMC or app cursors?

From: Ls Cheng <exriscer_at_gmail.com>
Date: Sat, 12 Oct 2019 16:24:57 +0200
Message-ID: <CAJ2-Qb8p6QnrqrvOfsWS4JpFA29RPaA0sqV3Jcg0pHCE4ZB9gA_at_mail.gmail.com>





Hi

Have you checked the number of IOPS? Perhaps you could snap v$sysmetric or v$sysstat every 30 seconds for a day and analyze when the problem happens.

BR

<https://www.avast.com/sig-email?utm_medium=email&utm_source=link&utm_campaign=sig-email&utm_content=webmail> Virus-free.
www.avast.com
<https://www.avast.com/sig-email?utm_medium=email&utm_source=link&utm_campaign=sig-email&utm_content=webmail> <#DAB4FAD8-2DD7-40BB-A1B8-4E2AA1F9FDF2>

On Thu, Oct 10, 2019 at 7:02 PM Herring, David <HerringD_at_dnb.com> wrote:

> The code in question is probably going to raise some eyebrows but here
> goes since I knew I'd have to explain at some point (and then give the
> excuse it's a 3rd party app):
>
>
>
> - The app can run from any number of app servers, in our case 3 are
> needed. Each app server issues heartbeats and when the heartbeat appears
> to be separated by 30 sec or more, it assumes a lost connection and
> restarts the app server. The check previously was 15 sec. but I got them
> to adjust to 30 sec.
> - The heartbeat sessions are persistent, running an update that
> uniquely identifies each app server (and hence session) with a timestamp,
> commit, wait 5 sec, the SELECT current time (within DB session) minus
> heartbeat timestamp, then repeat. I know the 2 SQL_IDs for the heartbeat
> and set tracing on those IDs. Parsing the tracefiles I see that the
> heartbeats repeat every 5 sec consistently, until the last one I the file
> that hits 25+ sec gap which causes an app restart and hence a restart of
> the DB session.
> - I don't have an extensive OLTP background but to me the heartbeat
> process is asking for trouble. The table has 1 row per app server, so in
> this case 3. I dumped the rowid details and confirmed all 3 rows are in
> the same block. This means every 5 seconds, potentially concurrent, 3
> different sessions will try to update the same block.
> - The DB is running on a 2-node RAC and heartbeat sessions may show up
> wherever (connect via scan) so we get plenty of GC waits. I would think
> they'd want a separate object per app server doing a heartbeat to (nearly)
> completely eliminate contention but again, the app isn't in my control.
> - This env was converted from MySQL and apparently worked fine there.
> There's a QA env where the heartbeat processing works fine too.
>
>
>
> I guess that's a VERY long answer. Just confirming that COMMITs, although
> quite frequent, are no more than 3 every 5 sec per app design.
>
>
>
> Regards,
>
>
>
> Dave
>
>
>
> [image: cid:image001.png_at_01D05044.5C2AEE60]
>
>
>
> *Dave Herring*
>
> DBA
>
> 103 JFK Parkway
>
> Short Hills, New Jersey 07078
>
> Mobile 630.441.4404
>
>
>
> *dnb.com <http://www.dnb.com/>*
>
>
>
> [image: cid:image002.png_at_01D05044.5C2AEE60]
> <http://www.facebook.com/DunBradstreet>[image:
> cid:image003.png_at_01D05044.5C2AEE60] <http://twitter.com/dnbus>[image:
> cid:image004.png_at_01D05044.5C2AEE60]
> <http://www.linkedin.com/company/dun-&-bradstreet>[image:
> cid:image005.png_at_01D05044.5C2AEE60]
> <http://www.youtube.com/user/DunandBrad>
>
>
>
> *From:* oracle-l-bounce_at_freelists.org <oracle-l-bounce_at_freelists.org> *On
> Behalf Of *Ls Cheng
> *Sent:* Thursday, October 10, 2019 7:24 AM
> *To:* dmarc-noreply_at_freelists.org
> *Cc:* oracle-l_at_freelists.org
> *Subject:* Re: LGWR, EMC or app cursors?
>
>
>
> *CAUTION:* This email originated from outside of D&B. Please do not click
> links or open attachments unless you recognize the sender and know the
> content is safe.
>
>
>
> Hi
>
>
>
> Can you try to identify the number of commits when that happens plus the
> IOPS?
>
>
>
> I had a similar problems years ago, it turns our in the code a loop ran
> 2500 commits per sec
>
>
>
>
> <https://nam03.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.avast.com%2Fsig-email%3Futm_medium%3Demail%26utm_source%3Dlink%26utm_campaign%3Dsig-email%26utm_content%3Dwebmail&data=02%7C01%7Cherringd%40dnb.com%7Ce06948109f034f7f372008d74d7cd85b%7C19e2b708bf12437597198dec42771b3e%7C0%7C0%7C637063070921709817&sdata=8Hfty12ixkrOqg9EfpPgu0kP9Kt7wrj6Fx4B2OM%2F8F0%3D&reserved=0>
>
> Virus-free. www.avast.com
> <https://nam03.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.avast.com%2Fsig-email%3Futm_medium%3Demail%26utm_source%3Dlink%26utm_campaign%3Dsig-email%26utm_content%3Dwebmail&data=02%7C01%7Cherringd%40dnb.com%7Ce06948109f034f7f372008d74d7cd85b%7C19e2b708bf12437597198dec42771b3e%7C0%7C0%7C637063070921709817&sdata=8Hfty12ixkrOqg9EfpPgu0kP9Kt7wrj6Fx4B2OM%2F8F0%3D&reserved=0>
>
>
>
> On Mon, Oct 7, 2019 at 5:20 PM Herring, David <dmarc-noreply_at_freelists.org>
> wrote:
>
> Folks, I've got a bit of a mystery with a particular db where we're
> getting a periodic 25-30 pause between user sessions and LGWR processes and
> can't clearly identify what's the cause.
>
>
>
> - The database is 11.2.0.4, RHEL 7.5, running ASM on EMC.
> - Sometimes once a day, sometimes more (never more than 5) times a day
> we see user processes start waiting on "log file sync". LGWR is waiting on
> "log file parallel write".
> - At the same time one of the emcpower* devices shows 100% busy and
> service time 200+ (from iostat via osw). mpstat shows 1 CPU at 100% on
> iowait. It's not always the same disk (emcpowere1, a1, h1, …), not always
> the same CPU. EMC and sysadmins have confirmed there are no disk errors
> and from their standpoint the disks are waiting on Oracle.
> - During this time LGWR stats in ASH are all 0 - TIME_WAITED, DELTA*
> columns. Only after the problem goes away (about 25 secs) these columns
> are populated again, obviously the DELTA* columns 1 row later. LGWR's
> session state is WAITING so I assume the column value observations are due
> to LGWR waiting, as it won't write stats until it can do something.
>
>
>
> I am stuck trying to find out, really prove who is the culprit or what
> exactly the wait is on. Is LGWR waiting on user sessions and user sessions
> are waiting on LGWR and all that causes the disk to be 100%? Can I enable
> some sort of tracing on LGWR and would that point to exactly what he's
> waiting on to prove where the problem is?
>
>
>
> Regards,
>
>
>
> Dave
>
>











--
http://www.freelists.org/webpage/oracle-l



Received on Sat Oct 12 2019 - 16:24:57 CEST

Original text of this message