RE: LGWR, EMC or app cursors?

From: Herring, David <HerringD_at_DNB.com>
Date: Thu, 17 Oct 2019 15:33:51 +0000
Message-ID: <DM6PR02MB55641DFA5D47EADE0F645AA3D46D0_at_DM6PR02MB5564.namprd02.prod.outlook.com>



Just wanted to follow up with everyone on this issue. During the last issue bridge (that I was invited to), only after I emphasized that LGWR went into a wait state for the duration of the issue did we get an admission from the storage team that there *might* be a compatibility issue with the HBAs. They also agreed to decrease the timeout parameter on the drives from 30 to 10 sec, as Andy suggested.

Thx to all who offered their help and insight! I wish I could have run more low-level tracing, but standard utilities like strace, dtrace, etc. are not available to me.

Regards,

Dave

From: Andy Sayer <andysayer_at_gmail.com>
Sent: Saturday, October 12, 2019 11:33 AM
To: Ls Cheng <exriscer_at_gmail.com>
Cc: Herring, David <HerringD_at_DNB.com>; dmarc-noreply_at_freelists.org; oracle-l_at_freelists.org
Subject: Re: LGWR, EMC or app cursors?

You definitely have a storage problem: your IOs are sometimes going missing, and your system is resubmitting them when it detects a timeout (30 seconds is a common default). I wouldn't expect other IOs to fail at the same time; it's unlikely that your storage is just not working for 30 seconds at a time (you would hope that would be super obvious). It's more likely that there's a one-in-a-billion chance for any given IO to go missing, and you hit that one-in-a-billion chance several times a day.

When this missing IO happens to LGWR, you immediately see the problem because every session that does a commit wait (log file sync) ends up waiting for the same 30 seconds LGWR is waiting for. You are almost certainly seeing the same issue with your data files; you can run queries against dba_hist_event_histogram to identify how often these long waits occur, just to prove this is happening. You can also use ASH to pinpoint the exact times and exact files where they occur, and your storage team might be able to do something with that information.
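Something along these lines, for example - the event names and thresholds are only illustrative, and the histogram counts are cumulative, so you want to watch the high buckets grow between snapshots:

    -- long waits per histogram bucket, from AWR
    select snap_id, event_name, wait_time_milli, wait_count
      from dba_hist_event_histogram
     where event_name in ('log file parallel write', 'db file sequential read')
       and wait_time_milli >= 8192   -- buckets of 8s and longer
     order by snap_id, event_name, wait_time_milli;

    -- ASH to pinpoint the times (time_waited is in microseconds;
    -- p1 is the file# for db file sequential read)
    select sample_time, session_id, event, p1, time_waited
      from v$active_session_history
     where event in ('log file parallel write', 'db file sequential read')
       and time_waited > 10000000
     order by sample_time;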

This is certainly a problem with your storage, but there are things you can do to help manage the pain:

You can reduce your disk timeout setting. Let's face it, 30 seconds is an eternity for an IO in this decade; EMC recommend 10 seconds if you are using multipathing (you are, right?) - I would say that's the maximum it should be in any circumstance. This means that when an IO gets lost, you only wait 10 seconds before it gets retried; 10 seconds of hanging might be acceptable enough - and it will probably happen exactly as often as before.

You can also reduce the effect on every other session when LGWR hits the issue, if you accept the risk of not writing the commit redo to disk before your application moves on (i.e. if your system crashes, you might lose your newest transactions). You can do this by setting commit_wait to nowait (https://docs.oracle.com/cd/E11882_01/server.112/e40402/initparams033.htm), but note there are still a few things that will always do a commit wait rather than honour this parameter (non-zero rollbacks, incrementing a sequence beyond the cached values, probably other examples). This will mean that when LGWR is hanging on a long IO, your other sessions will be unaffected (unless they do a commit wait).
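To make that concrete, commit_wait can be tried at session level before committing to it system-wide (a sketch, 11.2 syntax):

    -- per session, for testing:
    alter session set commit_wait = 'NOWAIT';
    -- or database-wide, accepting the durability trade-off:
    alter system set commit_wait = 'NOWAIT' scope = both;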

Funnily enough, I'm just trying to get to the EMC website to see if there's any documentation that would show you how to identify lost IOs - EMC.com returns ERR_CONNECTION_TIMED_OUT :)

Hope that gives you some ideas,
Andy

On Sat, 12 Oct 2019 at 15:25, Ls Cheng <exriscer_at_gmail.com> wrote:

Hi

Have you checked the number of IOPS? Perhaps you could snap v$sysmetric or v$sysstat every 30 seconds for a day and analyze what is happening when the problem hits.
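For example, something like this from a scheduler job or a shell loop (a rough sketch - the stat_snap table name is just an example and the statistic names are from memory):

    -- one-time setup: empty copy of the columns we want
    create table stat_snap as
    select systimestamp snap_time, name, value
      from v$sysstat
     where 1 = 0;

    -- run every 30 seconds:
    insert into stat_snap
    select systimestamp, name, value
      from v$sysstat
     where name in ('user commits',
                    'physical read total IO requests',
                    'physical write total IO requests');
    commit;

    -- afterwards, diff consecutive samples per statistic
    select snap_time, name,
           value - lag(value) over (partition by name order by snap_time) as delta
      from stat_snap
     order by name, snap_time;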

BR

On Thu, Oct 10, 2019 at 7:02 PM Herring, David <HerringD_at_dnb.com> wrote:

The code in question is probably going to raise some eyebrows, but here goes, since I knew I'd have to explain at some point (and then give the excuse that it's a 3rd-party app):

  • The app can run from any number of app servers; in our case 3 are needed. Each app server issues heartbeats, and when the heartbeats appear to be separated by 30 sec or more, it assumes a lost connection and restarts the app server. The check previously was 15 sec., but I got them to adjust it to 30 sec.
  • The heartbeat sessions are persistent, running an update that uniquely identifies each app server (and hence session) with a timestamp, then commit, wait 5 sec, then SELECT the current time (within the DB session) minus the heartbeat timestamp, then repeat. I know the 2 SQL_IDs for the heartbeat and set tracing on those IDs (along the lines of the sketch after this list). Parsing the tracefiles, I see that the heartbeats repeat every 5 sec consistently, until the last one in the file hits a 25+ sec gap, which causes an app restart and hence a restart of the DB session.
  • I don't have an extensive OLTP background, but to me the heartbeat process is asking for trouble. The table has 1 row per app server, so in this case 3. I dumped the rowid details and confirmed all 3 rows are in the same block. This means that every 5 seconds, potentially concurrently, 3 different sessions will try to update the same block.
  • The DB is running on a 2-node RAC, and heartbeat sessions may show up on either node (they connect via SCAN), so we get plenty of GC waits. I would think they'd want a separate object per app server doing a heartbeat to (nearly) completely eliminate contention, but again, the app isn't in my control.
  • This env was converted from MySQL and apparently worked fine there. There's a QA env where the heartbeat processing works fine too.
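The tracing mentioned above was along the lines of the 11g trace-by-SQL_ID syntax (from memory; the two sql_ids below are placeholders, not the real ones):

    alter system set events 'sql_trace [sql: 1a2b3c4d5e6f7|7f6e5d4c3b2a1] wait=true, bind=true';
    -- and to switch it off again:
    alter system set events 'sql_trace [sql: 1a2b3c4d5e6f7|7f6e5d4c3b2a1] off';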

I guess that's a VERY long answer. Just confirming that COMMITs, although quite frequent, are no more than 3 every 5 sec by app design.

Regards,

Dave

From: oracle-l-bounce_at_freelists.org <oracle-l-bounce_at_freelists.org> On Behalf Of Ls Cheng
Sent: Thursday, October 10, 2019 7:24 AM
To: dmarc-noreply_at_freelists.org
Cc: oracle-l_at_freelists.org
Subject: Re: LGWR, EMC or app cursors?

Hi

Can you try to identify the number of commits when that happens, plus the IOPS?

I had a similar problem years ago; it turned out a loop in the code ran 2500 commits per sec.
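If you are licensed for AWR, something like this shows the commit rate per snapshot interval (a sketch):

    select s.begin_interval_time,
           st.value - lag(st.value) over (order by st.snap_id) as commits
      from dba_hist_sysstat st
      join dba_hist_snapshot s
        on s.snap_id = st.snap_id
       and s.dbid = st.dbid
       and s.instance_number = st.instance_number
     where st.stat_name = 'user commits'
       and st.instance_number = 1
     order by st.snap_id;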

On Mon, Oct 7, 2019 at 5:20 PM Herring, David <dmarc-noreply_at_freelists.org> wrote:

Folks, I've got a bit of a mystery with a particular db where we're getting a periodic 25-30 sec pause affecting user sessions and the LGWR process, and I can't clearly identify the cause.

  • The database is 11.2.0.4, RHEL 7.5, running ASM on EMC.
  • Sometimes once a day, sometimes more often (never more than 5 times a day), we see user processes start waiting on "log file sync" while LGWR is waiting on "log file parallel write".
  • At the same time one of the emcpower* devices shows 100% busy with a service time of 200+ ms (from iostat via OSWatcher), and mpstat shows 1 CPU at 100% iowait. It's not always the same disk (emcpowere1, a1, h1, …) and not always the same CPU. EMC and the sysadmins have confirmed there are no disk errors and that, from their standpoint, the disks are waiting on Oracle.
  • During this time LGWR's stats in ASH - the TIME_WAITED and DELTA* columns - are all 0. Only after the problem goes away (about 25 secs) are these columns populated again, the DELTA* columns obviously 1 row later. LGWR's session state is WAITING, so I assume those zeroes are simply because LGWR is stuck waiting and won't update its stats until it can do something (the query I used is sketched below).
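For reference, the sort of ASH query I was looking at (a sketch - LGWR's SID found via v$session.program):

    select sample_time, event, session_state, time_waited,
           delta_write_io_requests, delta_write_io_bytes
      from v$active_session_history
     where session_id = (select sid from v$session where program like '%(LGWR)%')
     order by sample_time;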

I am stuck trying to find out - really prove - who the culprit is, or what exactly the wait is on. Is LGWR waiting on user sessions, are user sessions waiting on LGWR, and is all that causing the disk to be 100% busy? Can I enable some sort of tracing on LGWR, and would that point to exactly what it's waiting on, to prove where the problem is?
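For example, would attaching oradebug to LGWR be reasonable? Something like this (syntax from memory, as sysdba in sqlplus):

    -- find LGWR's OS pid
    select spid from v$process where pname = 'LGWR';
    -- then, as oradebug commands in the same sqlplus session:
    --   oradebug setospid <spid from above>
    --   oradebug event 10046 trace name context forever, level 8
    --   ... wait for the next incident ...
    --   oradebug event 10046 trace name context off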

Regards,

Dave

--
http://www.freelists.org/webpage/oracle-l
Received on Thu Oct 17 2019 - 17:33:51 CEST
