Re: Difference in writer response time

From: Pap <oracle.developer35_at_gmail.com>
Date: Fri, 9 Apr 2021 19:39:30 +0530
Message-ID: <CAEjw_fhVGeHp7rDGNJPL_yQDAfjvFz4LQY0krr=+BMnSJ7Kx=g_at_mail.gmail.com>



We have both application server and db server based out of primary and DR side each. So when we point to DR our application code/server also runs from DR only. So there should not be any network delay causing such difference.

Regards
Pap

On Fri, Apr 9, 2021 at 6:49 PM Mark W. Farnham <mwf_at_rsiz.com> wrote:

> In a network sense, where is the client application with respect to the
> two databases?
>
>
>
> When you are using the “primary” database is the client at the “primary”
> database location?
>
> When you are using the “DR” database is the client at the “DR” database
> location?
>
>
>
> Depending on how the application handshakes information to be written and
> how waits are rolled up, that can make all the difference.
>
>
>
> A very quick test is to actually place a client at the location of the
> “DR” database location on the LAN with the “DR” database if in fact
> currently you are considering performance of an application on the
> “primary” database location in both cases currently.
>
>
>
> Explaining how and why the performance difference can actually be network
> latency and speed yet appear to be differences in DBWR service rate is
> beyond the scope of this note.
>
>
>
> mwf
>
>
>
> *From:* oracle-l-bounce_at_freelists.org [mailto:
> oracle-l-bounce_at_freelists.org] *On Behalf Of *Pap
> *Sent:* Friday, April 09, 2021 8:29 AM
> *To:* Tanel Poder
> *Cc:* Oracle L
> *Subject:* Re: Difference in writer response time
>
>
>
> Thank You Tanel.
>
>
>
> Not having much understanding with respect to the infra side. But ,
> actually the first thing we want to understand is why there is a difference
> in performance during that peak period between the two sites. As those are
> primary and DR so ideally everything should be the same including SAN
> configurations. We have verified all the DB parameters are mostly the same
> and are not suspected such as to impact DBWR throughput one vs other.
>
>
>
> But as you said there may be some issue wrt kernel or IO stack, so wanted
> to understand what exact information we can ask from our storage/SAN team
> by which we can conclude is there is really a difference in performance of
> storage IO between both sides?
>
>
>
> And should we ask the IO information only during the issue period to
> understand the difference as because in normal time both the databases
> perform similar just that during those peak activity the DR is behaving
> badly as compared to primary?
>
>
>
> Regards
>
> Pap
>
>
>
> On Fri, Apr 9, 2021 at 5:32 AM Tanel Poder <tanel_at_tanelpoder.com> wrote:
>
> Ok, sorry I didn't realize this was on HP-UX (I guess I'm too spoiled
> nowadays with "servers" like this one
> <https://tanelpoder.com/posts/11m-iops-with-10-ssds-on-amd-threadripper-pro-workstation/>
> under my desk ;-)
>
>
>
> Some comments:
>
> 1. The *free buffer waits* show that DBWR falls behind with syncing
> dirty buffers to disk (~4000 blocks written per second). So it looks like a
> DBWR throughput problem (caused by some issue in the kernel or I/O stack)
> 2. Other than storage hardware latency itself and things like block
> device queue depths, the filesystem itself may be a bottleneck - are you
> using *VxFS *- and if yes, with or without *ODM*? VxFS without ODM
> doesn't use concurrent I/O by default, potentially causing write-write
> contention at file level (POSIX single writer locking)
> 3. Also, the normal recommendation for production systems is to use
> filesystemio_options=*setall*, but I don't remember if HP-UX was
> special in that sense (or you just forced direct I/O with a VxFS mount
> option)
> 4. Run Snapper on DBWRs and see how many of them you have in each
> instance and are they 100% busy waiting for one of the *db file
> parallel write*, *db file asynch I/O submit* waits (or show all CPU
> usage - no wait events whatsoever, can be due to an instrumentation bug)
>
> The various "higher level contention" wait events, *buffer busy waits*, *enq:
> TX contention *and in rare cases even some latch contention may be just a
> symptom of the underlying I/O problems. You need to do wait chain analysis
> to see beyond the "top level symptoms".
>
>
>
> The ash_wait_chains.sql output is pretty wide, Oracle-L doesn't allow me
> to send screenshots as images, so here are a couple of tweets with images
> of such problems (where the high level symptoms show different waits than
> the root cause):
>
> - https://twitter.com/TanelPoder/status/1318023393304825856
> - https://twitter.com/TanelPoder/status/1380304923678498820
>
>
>
> Back when I used to deal with HP-UX/Solaris machines with VxFS more, a
> really common bottleneck in high-performance databases (of that era) was
> people not using concurrent I/O on that filesystem.
>
>
>
> --
>
> Tanel Poder
>
> #vConf2021: Troubleshooting Very Complex Oracle Performance Problems
>
> https://tanelpoder.com/conference/
>
>
>
> On Thu, Apr 8, 2021 at 2:05 PM Pap <oracle.developer35_at_gmail.com> wrote:
>
> Thank you Tanel.
>
>
>
> Its HP machine and its on file system. No ASM. The filesystemsio_options
> in gv$paramer showing as "asynch" in both databases. and also
> disk_asynch_io is also set as TRUE. As I see from AWR, Its HP-UX IA(64
> bit), 64 core, 8- socket machine in both side.
>
>
>
> The SAN team confirms not much visible difference between the underlying
> storage performance between those two sites. So wondering as we don't use
> data guards could this difference be just because of just the structural
> difference in table/index structure? But we have all the same volume of
> data on both sides so not sure how that can play such a big difference. And
> contrary to that, I see we are doing more work on the Primary(faster) side
> during that time period as compared to DR(slow side), so how is that
> possible? I am seeing the direct write is higher on DR i.e. slow side as
> compared to primary, not sure if that is pointing to something suspicious.
>
>
>
> I have fetched the wait event section of the AWR from both databases and
> attaching here(in two tabs as primary and DR wait profile). Although I do
> see "free buffer waits" in one of the top lists but the major one is 'index
> contention'(top one is a composite PK index but local index, so less
> chances of structural difference causing such issues). But again it might
> be that the DBWR response trend is just a symptom of something else but not
> the cause as the SAN team is saying no difference there.
>
>
>
>
>
> Regards
>
> Pap
>
>
>
> On Thu, Apr 8, 2021 at 10:34 PM Tanel Poder <tanel_at_tanelpoder.com> wrote:
>
> There's also one usual suspect, I don't see *filesystemio_options*
> parameter set, are you running on ASM, not a filesystem, right?
>
>
>
> Thanks,
>
> Tanel.
>
>
>
> On Thu, Apr 8, 2021 at 4:06 AM Pap <oracle.developer35_at_gmail.com> wrote:
>
> Hello Lister, We have two databases which are the same in all the hardware
> configurations along with DB parameters and are in sync/replicated using
> golden gate. And those are treated as primary and DR for us and both are
> active. But we are seeing differences in behaviour when our application
> points to primary VS when it points to DR. We are seeing during specific
> times of the day when our system is at its peak, in one of the database i.e
> DR the DBWR response times spikes till ~200+ms slowing down the write
> queries while in another database the dbwr response time stays ~65ms for a
> similar amount of data and transaction volume. So wanted to understand what
> can cause such drastic differences? Note- Its version 11.2.0.4 of Oracle.
>
>

--
http://www.freelists.org/webpage/oracle-l
Received on Fri Apr 09 2021 - 16:09:30 CEST

Original text of this message