Re: Difference in writer response time

From: Tanel Poder <tanel_at_tanelpoder.com>
Date: Thu, 8 Apr 2021 20:02:26 -0400
Message-ID: <CAMHX9JLQsU0j6BmN1jOLQv1F7sjc4EYep5yQsWh9pyPXFJR_LQ_at_mail.gmail.com>

Ok, sorry I didn't realize this was on HP-UX (I guess I'm too spoiled nowadays with "servers" like this one
<https://tanelpoder.com/posts/11m-iops-with-10-ssds-on-amd-threadripper-pro-workstation/> under my desk ;-)

Some comments:

The *free buffer waits* show that DBWR falls behind with syncing dirty buffers to disk (~4000 blocks written per second). So it looks like a DBWR throughput problem (caused by some issue in the kernel or I/O stack)
Other than storage hardware latency itself and things like block device queue depths, the filesystem itself may be a bottleneck - are you using *VxFS *- and if yes, with or without *ODM*? VxFS without ODM doesn't use concurrent I/O by default, potentially causing write-write contention at file level (POSIX single writer locking)
Also, the normal recommendation for production systems is to use filesystemio_options=*setall*, but I don't remember if HP-UX was special in that sense (or you just forced direct I/O with a VxFS mount option)
Run Snapper on DBWRs and see how many of them you have in each instance and are they 100% busy waiting for one of the *db file parallel write*, *db file asynch I/O submit* waits (or show all CPU usage - no wait events whatsoever, can be due to an instrumentation bug)

The various "higher level contention" wait events, *buffer busy waits*, *enq: TX contention *and in rare cases even some latch contention may be just a symptom of the underlying I/O problems. You need to do wait chain analysis to see beyond the "top level symptoms".

The ash_wait_chains.sql output is pretty wide, Oracle-L doesn't allow me to send screenshots as images, so here are a couple of tweets with images of such problems (where the high level symptoms show different waits than the root cause):

Back when I used to deal with HP-UX/Solaris machines with VxFS more, a really common bottleneck in high-performance databases (of that era) was people not using concurrent I/O on that filesystem.

--
Tanel Poder
#vConf2021: Troubleshooting Very Complex Oracle Performance Problems
https://tanelpoder.com/conference/

On Thu, Apr 8, 2021 at 2:05 PM Pap <oracle.developer35_at_gmail.com> wrote:



> Thank you Tanel.


>

> Its HP machine and its on file system. No ASM. The filesystemsio_options

> in gv$paramer showing as "asynch" in both databases. and also

> disk_asynch_io is also set as TRUE. As I see from AWR, Its HP-UX IA(64

> bit), 64 core, 8- socket machine in both side.

>

> The SAN team confirms not much visible difference between the underlying

> storage performance between those two sites. So wondering as we don't use

> data guards could this difference be just because of just the structural

> difference in table/index structure? But we have all the same volume of

> data on both sides so not sure how that can play such a big difference. And

> contrary to that, I see we are doing more work on the Primary(faster) side

> during that time period as compared to DR(slow side), so how is that

> possible? I am seeing the direct write is higher on DR i.e. slow side as

> compared to primary, not sure if that is pointing to something suspicious.

>

> I have fetched the wait event section of the AWR from both databases and

> attaching here(in two tabs as primary and DR wait profile). Although I do

> see "free buffer waits" in one of the top lists but the major one is 'index

> contention'(top one is a composite PK index but local index, so less

> chances of structural difference causing such issues). But again it might

> be that the DBWR response trend is just a symptom of something else but not

> the cause as the SAN team is saying no difference there.

>

>

> Regards

> Pap

>

> On Thu, Apr 8, 2021 at 10:34 PM Tanel Poder <tanel_at_tanelpoder.com> wrote:

>

>> There's also one usual suspect, I don't see *filesystemio_options*

>> parameter set, are you running on ASM, not a filesystem, right?

>>

>> Thanks,

>> Tanel.

>>

>> On Thu, Apr 8, 2021 at 4:06 AM Pap <oracle.developer35_at_gmail.com> wrote:

>>

>>> Hello Lister, We have two databases which are the same in all the

>>> hardware configurations along with DB parameters and are in sync/replicated

>>> using golden gate. And those are treated as primary and DR for us and both

>>> are active. But we are seeing differences in behaviour when our application

>>> points to primary VS when it points to DR. We are seeing during specific

>>> times of the day when our system is at its peak, in one of the database i.e

>>> DR the DBWR response times spikes till ~200+ms slowing down the write

>>> queries while in another database the dbwr response time stays ~65ms for a

>>> similar amount of data and transaction volume. So wanted to understand what

>>> can cause such drastic differences? Note- Its version 11.2.0.4 of Oracle.

>>>

>>


--
http://www.freelists.org/webpage/oracle-l

Received on Fri Apr 09 2021 - 02:02:26 CEST

This message: [ Message body ]
Next message: Pap: "Re: Difference in writer response time"
Previous message: Tanel Poder: "Re: Difference in writer response time"
In reply to: Pap: "Re: Difference in writer response time"
In reply to Pap: "Re: Difference in writer response time"
Next in thread: Pap: "Re: Difference in writer response time"
Reply: Pap: "Re: Difference in writer response time"

Contemporary messages sorted: [ by date ] [ by thread ] [ by subject ] [ by author ]

Original text of this message