Re: Difference in writer response time

From: Tanel Poder <tanel_at_tanelpoder.com>
Date: Thu, 8 Apr 2021 20:02:26 -0400
Message-ID: <CAMHX9JLQsU0j6BmN1jOLQv1F7sjc4EYep5yQsWh9pyPXFJR_LQ_at_mail.gmail.com>



Ok, sorry I didn't realize this was on HP-UX (I guess I'm too spoiled nowadays with "servers" like this one
<
https://tanelpoder.com/posts/11m-iops-with-10-ssds-on-amd-threadripper-pro-workstation/> under my desk ;-)

Some comments:

  1. The *free buffer waits* show that DBWR falls behind with syncing dirty buffers to disk (~4000 blocks written per second). So it looks like a DBWR throughput problem (caused by some issue in the kernel or I/O stack)
  2. Other than storage hardware latency itself and things like block device queue depths, the filesystem itself may be a bottleneck - are you using *VxFS *- and if yes, with or without *ODM*? VxFS without ODM doesn't use concurrent I/O by default, potentially causing write-write contention at file level (POSIX single writer locking)
  3. Also, the normal recommendation for production systems is to use filesystemio_options=*setall*, but I don't remember if HP-UX was special in that sense (or you just forced direct I/O with a VxFS mount option)
  4. Run Snapper on DBWRs and see how many of them you have in each instance and are they 100% busy waiting for one of the *db file parallel write*, *db file asynch I/O submit* waits (or show all CPU usage - no wait events whatsoever, can be due to an instrumentation bug)

The various "higher level contention" wait events, *buffer busy waits*, *enq: TX contention *and in rare cases even some latch contention may be just a symptom of the underlying I/O problems. You need to do wait chain analysis to see beyond the "top level symptoms".

The ash_wait_chains.sql output is pretty wide, Oracle-L doesn't allow me to send screenshots as images, so here are a couple of tweets with images of such problems (where the high level symptoms show different waits than the root cause):

Back when I used to deal with HP-UX/Solaris machines with VxFS more, a really common bottleneck in high-performance databases (of that era) was people not using concurrent I/O on that filesystem.

--
Tanel Poder
#vConf2021: Troubleshooting Very Complex Oracle Performance Problems
https://tanelpoder.com/conference/

On Thu, Apr 8, 2021 at 2:05 PM Pap <oracle.developer35_at_gmail.com> wrote:


> Thank you Tanel.
>
> Its HP machine and its on file system. No ASM. The filesystemsio_options
> in gv$paramer showing as "asynch" in both databases. and also
> disk_asynch_io is also set as TRUE. As I see from AWR, Its HP-UX IA(64
> bit), 64 core, 8- socket machine in both side.
>
> The SAN team confirms not much visible difference between the underlying
> storage performance between those two sites. So wondering as we don't use
> data guards could this difference be just because of just the structural
> difference in table/index structure? But we have all the same volume of
> data on both sides so not sure how that can play such a big difference. And
> contrary to that, I see we are doing more work on the Primary(faster) side
> during that time period as compared to DR(slow side), so how is that
> possible? I am seeing the direct write is higher on DR i.e. slow side as
> compared to primary, not sure if that is pointing to something suspicious.
>
> I have fetched the wait event section of the AWR from both databases and
> attaching here(in two tabs as primary and DR wait profile). Although I do
> see "free buffer waits" in one of the top lists but the major one is 'index
> contention'(top one is a composite PK index but local index, so less
> chances of structural difference causing such issues). But again it might
> be that the DBWR response trend is just a symptom of something else but not
> the cause as the SAN team is saying no difference there.
>
>
> Regards
> Pap
>
> On Thu, Apr 8, 2021 at 10:34 PM Tanel Poder <tanel_at_tanelpoder.com> wrote:
>
>> There's also one usual suspect, I don't see *filesystemio_options*
>> parameter set, are you running on ASM, not a filesystem, right?
>>
>> Thanks,
>> Tanel.
>>
>> On Thu, Apr 8, 2021 at 4:06 AM Pap <oracle.developer35_at_gmail.com> wrote:
>>
>>> Hello Lister, We have two databases which are the same in all the
>>> hardware configurations along with DB parameters and are in sync/replicated
>>> using golden gate. And those are treated as primary and DR for us and both
>>> are active. But we are seeing differences in behaviour when our application
>>> points to primary VS when it points to DR. We are seeing during specific
>>> times of the day when our system is at its peak, in one of the database i.e
>>> DR the DBWR response times spikes till ~200+ms slowing down the write
>>> queries while in another database the dbwr response time stays ~65ms for a
>>> similar amount of data and transaction volume. So wanted to understand what
>>> can cause such drastic differences? Note- Its version 11.2.0.4 of Oracle.
>>>
>>
-- http://www.freelists.org/webpage/oracle-l
Received on Fri Apr 09 2021 - 02:02:26 CEST

Original text of this message