Re: Difference in writer response time

From: Tanel Poder <tanel_at_tanelpoder.com>
Date: Sat, 10 Apr 2021 22:06:04 -0400
Message-ID: <CAMHX9JK5+_SYJ3W00K54moPNx6F0jwTzbebPKBO4fVQcVE14wg_at_mail.gmail.com>



The primary and DR run on different storage arrays that may be configured and "sized" differently (sometimes DR/backup has slower disks or something like that). Or even if the storage arrays, HBAs & SAN network are identical across both environments, the DR storage array may be overloaded more than primary (as everyone's doing their backups from DR).

On HP-UX as a starting point, you could use some IO monitoring tool that (like sar) that shows you both the I/O rate this server is running & the average service time (avserv) and avwait/avque (queue lengths). So, if the DR has twice higher I/O service time than primary, while the I/O rate is roughly the same (as AWR metrics indicated), then this should be a good starting point for sysadmins/storage folks.

$ sar –d 1 10

HP-UX server01 B.11.23 U ia64 10/25/09

20:38:46 device %busy avque r+w/s blks/s avwait avserv

20:38:56   c0t0d0    1.30    0.50       3      50    0.00   11.07
          c31t2d2    0.80    0.50       1      14    0.00    9.22
          c31t2d0    3.30    0.50       5      96    0.00    6.14
          c31t2d1    3.80    0.50       5      85    0.00    7.30
          c31t2d4    1.20    0.50       2     102    0.00    6.04
          c31t2d6    0.10    0.50       0       5    0.00    3.83
          c31t2d3    2.80    0.50       6      96    0.00    4.67
          c31t2d5    1.10    0.50       1      22    0.00    8.44
          c31t2d7    0.40    0.50       1      13    0.00    4.90
          c31t3d1    0.50    0.50       1      11    0.00    7.95

--
Tanel Poder
#vConf2021: Troubleshooting Very Complex Oracle Performance Problems
https://tanelpoder.com/conference/

On Sat, Apr 10, 2021 at 3:54 PM Pap <oracle.developer35_at_gmail.com> wrote:


> As it almost seems like the DR side is performing badly because of some
> additional process/database might be there which is still running and
> adding overhead. So what kind of information/stats from the host or
> SAN should we ask which will help us in identifying the contribution of
> each of the processes in primary and DR and this would be able to find the
> exact culprit which is creating a difference during that peak period?
>
> Regards
> Pap
>
> On Sat, Apr 10, 2021 at 12:48 AM Pap <oracle.developer35_at_gmail.com> wrote:
>
>> Got to know there are other 3 databases on this box, however at exactly
>> same time we also see a small spike in top activity graph in another
>> database. And i see even when we were live on Primary that other database
>> on DR having that small spike exists , which suggests some problem
>> persisting in the box level rather in the impacted database which we are
>> looking for. So it may be either at host or storage(which the SAN team is
>> denying though).
>>
>> And another thing is even we saw in the AWR we are doing large amount of
>> work or the db block changes were on higher side on primary as compared to
>> DR , but the SAN team saying higher IOPS in DR as compared to primary
>> during that time observed without any queuing. So it sounds strange. So,
>> Just like database AWR report, Is there any such historical report which we
>> can generate out of the HP host or underlying SAN to understand what exact
>> process/database contributes how much during that peak time? And what is
>> the exact IO latency trend and segregation based on database , during that
>> time?
>>
>> Regards
>> Pap
>>
>> On Fri, Apr 9, 2021 at 7:54 PM Pap <oracle.developer35_at_gmail.com> wrote:
>>
>>> But isn't it correct to say that system stats or dictionary stats can
>>> only influence the optimizer to go for a different execution path? But in
>>> our case both of the databases are giving an equal execution path, we don't
>>> see any difference in execution path for any such sql which can cause this
>>> issue. It's mainly the single row inserts/updates are running longer on DR
>>> as compared to primary during peak period.
>>>
>>> Regards
>>> Pap
>>>
>>> On Fri, Apr 9, 2021 at 7:50 PM Athil Meer <athildba_at_gmail.com> wrote:
>>>
>>>> Hello Pap,
>>>> Are there any difference in fixed, dictionary and system stats between
>>>> these DBs
>>>>
>>>> Thank you
>>>> Athil
>>>>
>>>> On Fri, 9 Apr 2021, 5:09 pm Pap, <oracle.developer35_at_gmail.com> wrote:
>>>>
>>>>> We have both application server and db server based out of primary and
>>>>> DR side each. So when we point to DR our application code/server also runs
>>>>> from DR only. So there should not be any network delay causing such
>>>>> difference.
>>>>>
>>>>> Regards
>>>>> Pap
>>>>>
>>>>> On Fri, Apr 9, 2021 at 6:49 PM Mark W. Farnham <mwf_at_rsiz.com> wrote:
>>>>>
>>>>>> In a network sense, where is the client application with respect to
>>>>>> the two databases?
>>>>>>
>>>>>>
>>>>>>
>>>>>> When you are using the “primary” database is the client at the
>>>>>> “primary” database location?
>>>>>>
>>>>>> When you are using the “DR” database is the client at the “DR”
>>>>>> database location?
>>>>>>
>>>>>>
>>>>>>
>>>>>> Depending on how the application handshakes information to be written
>>>>>> and how waits are rolled up, that can make all the difference.
>>>>>>
>>>>>>
>>>>>>
>>>>>> A very quick test is to actually place a client at the location of
>>>>>> the “DR” database location on the LAN with the “DR” database if in fact
>>>>>> currently you are considering performance of an application on the
>>>>>> “primary” database location in both cases currently.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Explaining how and why the performance difference can actually be
>>>>>> network latency and speed yet appear to be differences in DBWR service rate
>>>>>> is beyond the scope of this note.
>>>>>>
>>>>>>
>>>>>>
>>>>>> mwf
>>>>>>
>>>>>>
>>>>>>
>>>>>> *From:* oracle-l-bounce_at_freelists.org [mailto:
>>>>>> oracle-l-bounce_at_freelists.org] *On Behalf Of *Pap
>>>>>> *Sent:* Friday, April 09, 2021 8:29 AM
>>>>>> *To:* Tanel Poder
>>>>>> *Cc:* Oracle L
>>>>>> *Subject:* Re: Difference in writer response time
>>>>>>
>>>>>>
>>>>>>
>>>>>> Thank You Tanel.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Not having much understanding with respect to the infra side. But ,
>>>>>> actually the first thing we want to understand is why there is a difference
>>>>>> in performance during that peak period between the two sites. As those are
>>>>>> primary and DR so ideally everything should be the same including SAN
>>>>>> configurations. We have verified all the DB parameters are mostly the same
>>>>>> and are not suspected such as to impact DBWR throughput one vs other.
>>>>>>
>>>>>>
>>>>>>
>>>>>> But as you said there may be some issue wrt kernel or IO stack, so
>>>>>> wanted to understand what exact information we can ask from our storage/SAN
>>>>>> team by which we can conclude is there is really a difference in
>>>>>> performance of storage IO between both sides?
>>>>>>
>>>>>>
>>>>>>
>>>>>> And should we ask the IO information only during the issue period to
>>>>>> understand the difference as because in normal time both the databases
>>>>>> perform similar just that during those peak activity the DR is behaving
>>>>>> badly as compared to primary?
>>>>>>
>>>>>>
>>>>>>
>>>>>> Regards
>>>>>>
>>>>>> Pap
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, Apr 9, 2021 at 5:32 AM Tanel Poder <tanel_at_tanelpoder.com>
>>>>>> wrote:
>>>>>>
>>>>>> Ok, sorry I didn't realize this was on HP-UX (I guess I'm too spoiled
>>>>>> nowadays with "servers" like this one
>>>>>> <https://tanelpoder.com/posts/11m-iops-with-10-ssds-on-amd-threadripper-pro-workstation/>
>>>>>> under my desk ;-)
>>>>>>
>>>>>>
>>>>>>
>>>>>> Some comments:
>>>>>>
>>>>>> 1. The *free buffer waits* show that DBWR falls behind with
>>>>>> syncing dirty buffers to disk (~4000 blocks written per second). So
>>>>>> it looks like a DBWR throughput problem (caused by some issue in the kernel
>>>>>> or I/O stack)
>>>>>> 2. Other than storage hardware latency itself and things like
>>>>>> block device queue depths, the filesystem itself may be a bottleneck - are
>>>>>> you using *VxFS *- and if yes, with or without *ODM*? VxFS
>>>>>> without ODM doesn't use concurrent I/O by default, potentially causing
>>>>>> write-write contention at file level (POSIX single writer locking)
>>>>>> 3. Also, the normal recommendation for production systems is to
>>>>>> use filesystemio_options=*setall*, but I don't remember if HP-UX
>>>>>> was special in that sense (or you just forced direct I/O with a VxFS mount
>>>>>> option)
>>>>>> 4. Run Snapper on DBWRs and see how many of them you have in each
>>>>>> instance and are they 100% busy waiting for one of the *db file
>>>>>> parallel write*, *db file asynch I/O submit* waits (or show all
>>>>>> CPU usage - no wait events whatsoever, can be due to an instrumentation bug)
>>>>>>
>>>>>> The various "higher level contention" wait events, *buffer busy
>>>>>> waits*, *enq: TX contention *and in rare cases even some latch
>>>>>> contention may be just a symptom of the underlying I/O problems. You need
>>>>>> to do wait chain analysis to see beyond the "top level symptoms".
>>>>>>
>>>>>>
>>>>>>
>>>>>> The ash_wait_chains.sql output is pretty wide, Oracle-L doesn't
>>>>>> allow me to send screenshots as images, so here are a couple of tweets with
>>>>>> images of such problems (where the high level symptoms show different waits
>>>>>> than the root cause):
>>>>>>
>>>>>> - https://twitter.com/TanelPoder/status/1318023393304825856
>>>>>> - https://twitter.com/TanelPoder/status/1380304923678498820
>>>>>>
>>>>>>
>>>>>>
>>>>>> Back when I used to deal with HP-UX/Solaris machines with VxFS more,
>>>>>> a really common bottleneck in high-performance databases (of that era) was
>>>>>> people not using concurrent I/O on that filesystem.
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>>
>>>>>> Tanel Poder
>>>>>>
>>>>>> #vConf2021: Troubleshooting Very Complex Oracle Performance Problems
>>>>>>
>>>>>> https://tanelpoder.com/conference/
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Apr 8, 2021 at 2:05 PM Pap <oracle.developer35_at_gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>> Thank you Tanel.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Its HP machine and its on file system. No ASM. The
>>>>>> filesystemsio_options in gv$paramer showing as "asynch" in both databases.
>>>>>> and also disk_asynch_io is also set as TRUE. As I see from AWR, Its HP-UX
>>>>>> IA(64 bit), 64 core, 8- socket machine in both side.
>>>>>>
>>>>>>
>>>>>>
>>>>>> The SAN team confirms not much visible difference between the
>>>>>> underlying storage performance between those two sites. So wondering as we
>>>>>> don't use data guards could this difference be just because of just the
>>>>>> structural difference in table/index structure? But we have all the same
>>>>>> volume of data on both sides so not sure how that can play such a big
>>>>>> difference. And contrary to that, I see we are doing more work on the
>>>>>> Primary(faster) side during that time period as compared to DR(slow side),
>>>>>> so how is that possible? I am seeing the direct write is higher on DR i.e.
>>>>>> slow side as compared to primary, not sure if that is pointing to something
>>>>>> suspicious.
>>>>>>
>>>>>>
>>>>>>
>>>>>> I have fetched the wait event section of the AWR from both databases
>>>>>> and attaching here(in two tabs as primary and DR wait profile). Although I
>>>>>> do see "free buffer waits" in one of the top lists but the major one is
>>>>>> 'index contention'(top one is a composite PK index but local index, so less
>>>>>> chances of structural difference causing such issues). But again it might
>>>>>> be that the DBWR response trend is just a symptom of something else but not
>>>>>> the cause as the SAN team is saying no difference there.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> Regards
>>>>>>
>>>>>> Pap
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Apr 8, 2021 at 10:34 PM Tanel Poder <tanel_at_tanelpoder.com>
>>>>>> wrote:
>>>>>>
>>>>>> There's also one usual suspect, I don't see *filesystemio_options*
>>>>>> parameter set, are you running on ASM, not a filesystem, right?
>>>>>>
>>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Tanel.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Apr 8, 2021 at 4:06 AM Pap <oracle.developer35_at_gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>> Hello Lister, We have two databases which are the same in all the
>>>>>> hardware configurations along with DB parameters and are in sync/replicated
>>>>>> using golden gate. And those are treated as primary and DR for us and both
>>>>>> are active. But we are seeing differences in behaviour when our application
>>>>>> points to primary VS when it points to DR. We are seeing during specific
>>>>>> times of the day when our system is at its peak, in one of the database i.e
>>>>>> DR the DBWR response times spikes till ~200+ms slowing down the write
>>>>>> queries while in another database the dbwr response time stays ~65ms for a
>>>>>> similar amount of data and transaction volume. So wanted to understand what
>>>>>> can cause such drastic differences? Note- Its version 11.2.0.4 of Oracle.
>>>>>>
>>>>>>
-- http://www.freelists.org/webpage/oracle-l
Received on Sun Apr 11 2021 - 04:06:04 CEST

Original text of this message