Re: Difference in writer response time

From: Pap <oracle.developer35_at_gmail.com>
Date: Fri, 9 Apr 2021 19:54:24 +0530
Message-ID: <CAEjw_fg5ko5CAXmp7tHQKh21SseXwUYjA_PwXF7R=Qnd3CSsDg_at_mail.gmail.com>



But isn't it correct to say that system stats or dictionary stats can only influence the optimizer to go for a different execution path? But in our case both of the databases are giving an equal execution path, we don't see any difference in execution path for any such sql which can cause this issue. It's mainly the single row inserts/updates are running longer on DR as compared to primary during peak period.

Regards
Pap

On Fri, Apr 9, 2021 at 7:50 PM Athil Meer <athildba_at_gmail.com> wrote:

> Hello Pap,
> Are there any difference in fixed, dictionary and system stats between
> these DBs
>
> Thank you
> Athil
>
> On Fri, 9 Apr 2021, 5:09 pm Pap, <oracle.developer35_at_gmail.com> wrote:
>
>> We have both application server and db server based out of primary and DR
>> side each. So when we point to DR our application code/server also runs
>> from DR only. So there should not be any network delay causing such
>> difference.
>>
>> Regards
>> Pap
>>
>> On Fri, Apr 9, 2021 at 6:49 PM Mark W. Farnham <mwf_at_rsiz.com> wrote:
>>
>>> In a network sense, where is the client application with respect to the
>>> two databases?
>>>
>>>
>>>
>>> When you are using the “primary” database is the client at the “primary”
>>> database location?
>>>
>>> When you are using the “DR” database is the client at the “DR” database
>>> location?
>>>
>>>
>>>
>>> Depending on how the application handshakes information to be written
>>> and how waits are rolled up, that can make all the difference.
>>>
>>>
>>>
>>> A very quick test is to actually place a client at the location of the
>>> “DR” database location on the LAN with the “DR” database if in fact
>>> currently you are considering performance of an application on the
>>> “primary” database location in both cases currently.
>>>
>>>
>>>
>>> Explaining how and why the performance difference can actually be
>>> network latency and speed yet appear to be differences in DBWR service rate
>>> is beyond the scope of this note.
>>>
>>>
>>>
>>> mwf
>>>
>>>
>>>
>>> *From:* oracle-l-bounce_at_freelists.org [mailto:
>>> oracle-l-bounce_at_freelists.org] *On Behalf Of *Pap
>>> *Sent:* Friday, April 09, 2021 8:29 AM
>>> *To:* Tanel Poder
>>> *Cc:* Oracle L
>>> *Subject:* Re: Difference in writer response time
>>>
>>>
>>>
>>> Thank You Tanel.
>>>
>>>
>>>
>>> Not having much understanding with respect to the infra side. But ,
>>> actually the first thing we want to understand is why there is a difference
>>> in performance during that peak period between the two sites. As those are
>>> primary and DR so ideally everything should be the same including SAN
>>> configurations. We have verified all the DB parameters are mostly the same
>>> and are not suspected such as to impact DBWR throughput one vs other.
>>>
>>>
>>>
>>> But as you said there may be some issue wrt kernel or IO stack, so
>>> wanted to understand what exact information we can ask from our storage/SAN
>>> team by which we can conclude is there is really a difference in
>>> performance of storage IO between both sides?
>>>
>>>
>>>
>>> And should we ask the IO information only during the issue period to
>>> understand the difference as because in normal time both the databases
>>> perform similar just that during those peak activity the DR is behaving
>>> badly as compared to primary?
>>>
>>>
>>>
>>> Regards
>>>
>>> Pap
>>>
>>>
>>>
>>> On Fri, Apr 9, 2021 at 5:32 AM Tanel Poder <tanel_at_tanelpoder.com> wrote:
>>>
>>> Ok, sorry I didn't realize this was on HP-UX (I guess I'm too spoiled
>>> nowadays with "servers" like this one
>>> <https://tanelpoder.com/posts/11m-iops-with-10-ssds-on-amd-threadripper-pro-workstation/>
>>> under my desk ;-)
>>>
>>>
>>>
>>> Some comments:
>>>
>>> 1. The *free buffer waits* show that DBWR falls behind with syncing
>>> dirty buffers to disk (~4000 blocks written per second). So it looks like a
>>> DBWR throughput problem (caused by some issue in the kernel or I/O stack)
>>> 2. Other than storage hardware latency itself and things like block
>>> device queue depths, the filesystem itself may be a bottleneck - are you
>>> using *VxFS *- and if yes, with or without *ODM*? VxFS without ODM
>>> doesn't use concurrent I/O by default, potentially causing write-write
>>> contention at file level (POSIX single writer locking)
>>> 3. Also, the normal recommendation for production systems is to use
>>> filesystemio_options=*setall*, but I don't remember if HP-UX was
>>> special in that sense (or you just forced direct I/O with a VxFS mount
>>> option)
>>> 4. Run Snapper on DBWRs and see how many of them you have in each
>>> instance and are they 100% busy waiting for one of the *db file
>>> parallel write*, *db file asynch I/O submit* waits (or show all CPU
>>> usage - no wait events whatsoever, can be due to an instrumentation bug)
>>>
>>> The various "higher level contention" wait events, *buffer busy waits*, *enq:
>>> TX contention *and in rare cases even some latch contention may be just
>>> a symptom of the underlying I/O problems. You need to do wait chain
>>> analysis to see beyond the "top level symptoms".
>>>
>>>
>>>
>>> The ash_wait_chains.sql output is pretty wide, Oracle-L doesn't allow
>>> me to send screenshots as images, so here are a couple of tweets with
>>> images of such problems (where the high level symptoms show different waits
>>> than the root cause):
>>>
>>> - https://twitter.com/TanelPoder/status/1318023393304825856
>>> - https://twitter.com/TanelPoder/status/1380304923678498820
>>>
>>>
>>>
>>> Back when I used to deal with HP-UX/Solaris machines with VxFS more, a
>>> really common bottleneck in high-performance databases (of that era) was
>>> people not using concurrent I/O on that filesystem.
>>>
>>>
>>>
>>> --
>>>
>>> Tanel Poder
>>>
>>> #vConf2021: Troubleshooting Very Complex Oracle Performance Problems
>>>
>>> https://tanelpoder.com/conference/
>>>
>>>
>>>
>>> On Thu, Apr 8, 2021 at 2:05 PM Pap <oracle.developer35_at_gmail.com> wrote:
>>>
>>> Thank you Tanel.
>>>
>>>
>>>
>>> Its HP machine and its on file system. No ASM. The filesystemsio_options
>>> in gv$paramer showing as "asynch" in both databases. and also
>>> disk_asynch_io is also set as TRUE. As I see from AWR, Its HP-UX IA(64
>>> bit), 64 core, 8- socket machine in both side.
>>>
>>>
>>>
>>> The SAN team confirms not much visible difference between the underlying
>>> storage performance between those two sites. So wondering as we don't use
>>> data guards could this difference be just because of just the structural
>>> difference in table/index structure? But we have all the same volume of
>>> data on both sides so not sure how that can play such a big difference. And
>>> contrary to that, I see we are doing more work on the Primary(faster) side
>>> during that time period as compared to DR(slow side), so how is that
>>> possible? I am seeing the direct write is higher on DR i.e. slow side as
>>> compared to primary, not sure if that is pointing to something suspicious.
>>>
>>>
>>>
>>> I have fetched the wait event section of the AWR from both databases and
>>> attaching here(in two tabs as primary and DR wait profile). Although I do
>>> see "free buffer waits" in one of the top lists but the major one is 'index
>>> contention'(top one is a composite PK index but local index, so less
>>> chances of structural difference causing such issues). But again it might
>>> be that the DBWR response trend is just a symptom of something else but not
>>> the cause as the SAN team is saying no difference there.
>>>
>>>
>>>
>>>
>>>
>>> Regards
>>>
>>> Pap
>>>
>>>
>>>
>>> On Thu, Apr 8, 2021 at 10:34 PM Tanel Poder <tanel_at_tanelpoder.com>
>>> wrote:
>>>
>>> There's also one usual suspect, I don't see *filesystemio_options*
>>> parameter set, are you running on ASM, not a filesystem, right?
>>>
>>>
>>>
>>> Thanks,
>>>
>>> Tanel.
>>>
>>>
>>>
>>> On Thu, Apr 8, 2021 at 4:06 AM Pap <oracle.developer35_at_gmail.com> wrote:
>>>
>>> Hello Lister, We have two databases which are the same in all the
>>> hardware configurations along with DB parameters and are in sync/replicated
>>> using golden gate. And those are treated as primary and DR for us and both
>>> are active. But we are seeing differences in behaviour when our application
>>> points to primary VS when it points to DR. We are seeing during specific
>>> times of the day when our system is at its peak, in one of the database i.e
>>> DR the DBWR response times spikes till ~200+ms slowing down the write
>>> queries while in another database the dbwr response time stays ~65ms for a
>>> similar amount of data and transaction volume. So wanted to understand what
>>> can cause such drastic differences? Note- Its version 11.2.0.4 of Oracle.
>>>
>>>

--
http://www.freelists.org/webpage/oracle-l
Received on Fri Apr 09 2021 - 16:24:24 CEST

Original text of this message