Re: Difference in writer response time

From: Tanel Poder <tanel_at_tanelpoder.com>
Date: Sat, 10 Apr 2021 22:06:04 -0400
Message-ID: <CAMHX9JK5+_SYJ3W00K54moPNx6F0jwTzbebPKBO4fVQcVE14wg_at_mail.gmail.com>

The primary and DR run on different storage arrays that may be configured and "sized" differently (sometimes DR/backup has slower disks or something like that). Or even if the storage arrays, HBAs & SAN network are identical across both environments, the DR storage array may be overloaded more than primary (as everyone's doing their backups from DR).

On HP-UX as a starting point, you could use some IO monitoring tool that (like sar) that shows you both the I/O rate this server is running & the average service time (avserv) and avwait/avque (queue lengths). So, if the DR has twice higher I/O service time than primary, while the I/O rate is roughly the same (as AWR metrics indicated), then this should be a good starting point for sysadmins/storage folks.

$ sar –d 1 10

HP-UX server01 B.11.23 U ia64 10/25/09

20:38:46 device %busy avque r+w/s blks/s avwait avserv

20:38:56   c0t0d0    1.30    0.50       3      50    0.00   11.07
          c31t2d2    0.80    0.50       1      14    0.00    9.22
          c31t2d0    3.30    0.50       5      96    0.00    6.14
          c31t2d1    3.80    0.50       5      85    0.00    7.30
          c31t2d4    1.20    0.50       2     102    0.00    6.04
          c31t2d6    0.10    0.50       0       5    0.00    3.83
          c31t2d3    2.80    0.50       6      96    0.00    4.67
          c31t2d5    1.10    0.50       1      22    0.00    8.44
          c31t2d7    0.40    0.50       1      13    0.00    4.90
          c31t3d1    0.50    0.50       1      11    0.00    7.95

--
Tanel Poder
#vConf2021: Troubleshooting Very Complex Oracle Performance Problems
https://tanelpoder.com/conference/

On Sat, Apr 10, 2021 at 3:54 PM Pap <oracle.developer35_at_gmail.com> wrote:



> As it almost seems like the DR side is performing badly because of some


> additional process/database might be there which is still running and

> adding overhead. So what kind of information/stats from the host or

> SAN should we ask which will help us in identifying the contribution of

> each of the processes in primary and DR and this would be able to find the

> exact culprit which is creating a difference during that peak period?

>

> Regards

> Pap

>

> On Sat, Apr 10, 2021 at 12:48 AM Pap <oracle.developer35_at_gmail.com> wrote:

>

>> Got to know there are other 3 databases on this box, however at exactly

>> same time we also see a small spike in top activity graph in another

>> database. And i see even when we were live on Primary that other database

>> on DR having that small spike exists , which suggests some problem

>> persisting in the box level rather in the impacted database which we are

>> looking for. So it may be either at host or storage(which the SAN team is

>> denying though).

>>

>> And another thing is even we saw in the AWR we are doing large amount of

>> work or the db block changes were on higher side on primary as compared to

>> DR , but the SAN team saying higher IOPS in DR as compared to primary

>> during that time observed without any queuing. So it sounds strange. So,

>> Just like database AWR report, Is there any such historical report which we

>> can generate out of the HP host or underlying SAN to understand what exact

>> process/database contributes how much during that peak time? And what is

>> the exact IO latency trend and segregation based on database , during that

>> time?

>>

>> Regards

>> Pap

>>

>> On Fri, Apr 9, 2021 at 7:54 PM Pap <oracle.developer35_at_gmail.com> wrote:

>>

>>> But isn't it correct to say that system stats or dictionary stats can

>>> only influence the optimizer to go for a different execution path? But in

>>> our case both of the databases are giving an equal execution path, we don't

>>> see any difference in execution path for any such sql which can cause this

>>> issue. It's mainly the single row inserts/updates are running longer on DR

>>> as compared to primary during peak period.

>>>

>>> Regards

>>> Pap

>>>

>>> On Fri, Apr 9, 2021 at 7:50 PM Athil Meer <athildba_at_gmail.com> wrote:

>>>

>>>> Hello Pap,

>>>> Are there any difference in fixed, dictionary and system stats between

>>>> these DBs

>>>>

>>>> Thank you

>>>> Athil

>>>>

>>>> On Fri, 9 Apr 2021, 5:09 pm Pap, <oracle.developer35_at_gmail.com> wrote:

>>>>

>>>>> We have both application server and db server based out of primary and

>>>>> DR side each. So when we point to DR our application code/server also runs

>>>>> from DR only. So there should not be any network delay causing such

>>>>> difference.

>>>>>

>>>>> Regards

>>>>> Pap

>>>>>

>>>>> On Fri, Apr 9, 2021 at 6:49 PM Mark W. Farnham <mwf_at_rsiz.com> wrote:

>>>>>

>>>>>> In a network sense, where is the client application with respect to

>>>>>> the two databases?

>>>>>>

>>>>>>

>>>>>>

>>>>>> When you are using the “primary” database is the client at the

>>>>>> “primary” database location?

>>>>>>

>>>>>> When you are using the “DR” database is the client at the “DR”

>>>>>> database location?

>>>>>>

>>>>>>

>>>>>>

>>>>>> Depending on how the application handshakes information to be written

>>>>>> and how waits are rolled up, that can make all the difference.

>>>>>>

>>>>>>

>>>>>>

>>>>>> A very quick test is to actually place a client at the location of

>>>>>> the “DR” database location on the LAN with the “DR” database if in fact

>>>>>> currently you are considering performance of an application on the

>>>>>> “primary” database location in both cases currently.

>>>>>>

>>>>>>

>>>>>>

>>>>>> Explaining how and why the performance difference can actually be

>>>>>> network latency and speed yet appear to be differences in DBWR service rate

>>>>>> is beyond the scope of this note.

>>>>>>

>>>>>>

>>>>>>

>>>>>> mwf

>>>>>>

>>>>>>

>>>>>>

>>>>>> *From:* oracle-l-bounce_at_freelists.org [mailto:

>>>>>> oracle-l-bounce_at_freelists.org] *On Behalf Of *Pap

>>>>>> *Sent:* Friday, April 09, 2021 8:29 AM

>>>>>> *To:* Tanel Poder

>>>>>> *Cc:* Oracle L

>>>>>> *Subject:* Re: Difference in writer response time

>>>>>>

>>>>>>

>>>>>>

>>>>>> Thank You Tanel.

>>>>>>

>>>>>>

>>>>>>

>>>>>> Not having much understanding with respect to the infra side. But ,

>>>>>> actually the first thing we want to understand is why there is a difference

>>>>>> in performance during that peak period between the two sites. As those are

>>>>>> primary and DR so ideally everything should be the same including SAN

>>>>>> configurations. We have verified all the DB parameters are mostly the same

>>>>>> and are not suspected such as to impact DBWR throughput one vs other.

>>>>>>

>>>>>>

>>>>>>

>>>>>> But as you said there may be some issue wrt kernel or IO stack, so

>>>>>> wanted to understand what exact information we can ask from our storage/SAN

>>>>>> team by which we can conclude is there is really a difference in

>>>>>> performance of storage IO between both sides?

>>>>>>

>>>>>>

>>>>>>

>>>>>> And should we ask the IO information only during the issue period to

>>>>>> understand the difference as because in normal time both the databases

>>>>>> perform similar just that during those peak activity the DR is behaving

>>>>>> badly as compared to primary?

>>>>>>

>>>>>>

>>>>>>

>>>>>> Regards

>>>>>>

>>>>>> Pap

>>>>>>

>>>>>>

>>>>>>

>>>>>> On Fri, Apr 9, 2021 at 5:32 AM Tanel Poder <tanel_at_tanelpoder.com>

>>>>>> wrote:

>>>>>>

>>>>>> Ok, sorry I didn't realize this was on HP-UX (I guess I'm too spoiled

>>>>>> nowadays with "servers" like this one

>>>>>> <https://tanelpoder.com/posts/11m-iops-with-10-ssds-on-amd-threadripper-pro-workstation/>

>>>>>> under my desk ;-)

>>>>>>

>>>>>>

>>>>>>

>>>>>> Some comments:

>>>>>>

>>>>>>    1. The *free buffer waits* show that DBWR falls behind with

>>>>>>    syncing dirty buffers to disk (~4000 blocks written per second). So

>>>>>>    it looks like a DBWR throughput problem (caused by some issue in the kernel

>>>>>>    or I/O stack)

>>>>>>    2. Other than storage hardware latency itself and things like

>>>>>>    block device queue depths, the filesystem itself may be a bottleneck - are

>>>>>>    you using *VxFS *- and if yes, with or without *ODM*? VxFS

>>>>>>    without ODM doesn't use concurrent I/O by default, potentially causing

>>>>>>    write-write contention at file level (POSIX single writer locking)

>>>>>>    3. Also, the normal recommendation for production systems is to

>>>>>>    use filesystemio_options=*setall*, but I don't remember if HP-UX

>>>>>>    was special in that sense (or you just forced direct I/O with a VxFS mount

>>>>>>    option)

>>>>>>    4. Run Snapper on DBWRs and see how many of them you have in each

>>>>>>    instance and are they 100% busy waiting for one of the *db file

>>>>>>    parallel write*, *db file asynch I/O submit* waits (or show all

>>>>>>    CPU usage - no wait events whatsoever, can be due to an instrumentation bug)

>>>>>>

>>>>>> The various "higher level contention" wait events, *buffer busy

>>>>>> waits*, *enq: TX contention *and in rare cases even some latch

>>>>>> contention may be just a symptom of the underlying I/O problems. You need

>>>>>> to do wait chain analysis to see beyond the "top level symptoms".

>>>>>>

>>>>>>

>>>>>>

>>>>>> The ash_wait_chains.sql output is pretty wide, Oracle-L doesn't

>>>>>> allow me to send screenshots as images, so here are a couple of tweets with

>>>>>> images of such problems (where the high level symptoms show different waits

>>>>>> than the root cause):

>>>>>>

>>>>>>    - https://twitter.com/TanelPoder/status/1318023393304825856

>>>>>>    - https://twitter.com/TanelPoder/status/1380304923678498820

>>>>>>

>>>>>>

>>>>>>

>>>>>> Back when I used to deal with HP-UX/Solaris machines with VxFS more,

>>>>>> a really common bottleneck in high-performance databases (of that era) was

>>>>>> people not using concurrent I/O on that filesystem.

>>>>>>

>>>>>>

>>>>>>

>>>>>> --

>>>>>>

>>>>>> Tanel Poder

>>>>>>

>>>>>> #vConf2021: Troubleshooting Very Complex Oracle Performance Problems

>>>>>>

>>>>>> https://tanelpoder.com/conference/

>>>>>>

>>>>>>

>>>>>>

>>>>>> On Thu, Apr 8, 2021 at 2:05 PM Pap <oracle.developer35_at_gmail.com>

>>>>>> wrote:

>>>>>>

>>>>>> Thank you Tanel.

>>>>>>

>>>>>>

>>>>>>

>>>>>> Its HP machine and its on file system. No ASM. The

>>>>>> filesystemsio_options in gv$paramer showing as "asynch" in both databases.

>>>>>> and also disk_asynch_io is also set as TRUE. As I see from AWR, Its HP-UX

>>>>>> IA(64 bit), 64 core, 8- socket machine in both side.

>>>>>>

>>>>>>

>>>>>>

>>>>>> The SAN team confirms not much visible difference between the

>>>>>> underlying storage performance between those two sites. So wondering as we

>>>>>> don't use data guards could this difference be just because of just the

>>>>>> structural difference in table/index structure? But we have all the same

>>>>>> volume of data on both sides so not sure how that can play such a big

>>>>>> difference. And contrary to that, I see we are doing more work on the

>>>>>> Primary(faster) side during that time period as compared to DR(slow side),

>>>>>> so how is that possible? I am seeing the direct write is higher on DR i.e.

>>>>>> slow side as compared to primary, not sure if that is pointing to something

>>>>>> suspicious.

>>>>>>

>>>>>>

>>>>>>

>>>>>> I have fetched the wait event section of the AWR from both databases

>>>>>> and attaching here(in two tabs as primary and DR wait profile). Although I

>>>>>> do see "free buffer waits" in one of the top lists but the major one is

>>>>>> 'index contention'(top one is a composite PK index but local index, so less

>>>>>> chances of structural difference causing such issues). But again it might

>>>>>> be that the DBWR response trend is just a symptom of something else but not

>>>>>> the cause as the SAN team is saying no difference there.

>>>>>>

>>>>>>

>>>>>>

>>>>>>

>>>>>>

>>>>>> Regards

>>>>>>

>>>>>> Pap

>>>>>>

>>>>>>

>>>>>>

>>>>>> On Thu, Apr 8, 2021 at 10:34 PM Tanel Poder <tanel_at_tanelpoder.com>

>>>>>> wrote:

>>>>>>

>>>>>> There's also one usual suspect, I don't see *filesystemio_options*

>>>>>> parameter set, are you running on ASM, not a filesystem, right?

>>>>>>

>>>>>>

>>>>>>

>>>>>> Thanks,

>>>>>>

>>>>>> Tanel.

>>>>>>

>>>>>>

>>>>>>

>>>>>> On Thu, Apr 8, 2021 at 4:06 AM Pap <oracle.developer35_at_gmail.com>

>>>>>> wrote:

>>>>>>

>>>>>> Hello Lister, We have two databases which are the same in all the

>>>>>> hardware configurations along with DB parameters and are in sync/replicated

>>>>>> using golden gate. And those are treated as primary and DR for us and both

>>>>>> are active. But we are seeing differences in behaviour when our application

>>>>>> points to primary VS when it points to DR. We are seeing during specific

>>>>>> times of the day when our system is at its peak, in one of the database i.e

>>>>>> DR the DBWR response times spikes till ~200+ms slowing down the write

>>>>>> queries while in another database the dbwr response time stays ~65ms for a

>>>>>> similar amount of data and transaction volume. So wanted to understand what

>>>>>> can cause such drastic differences? Note- Its version 11.2.0.4 of Oracle.

>>>>>>

>>>>>>



--
http://www.freelists.org/webpage/oracle-l

Received on Sun Apr 11 2021 - 04:06:04 CEST

This message: [ Message body ]
Next message: Lothar Flatz: "Re: sql start time"
Previous message: Pap: "Re: Difference in writer response time"
In reply to: Pap: "Re: Difference in writer response time"
In reply to Pap: "Re: Difference in writer response time"
Next in thread: Pap: "Re: Difference in writer response time"
Reply: Pap: "Re: Difference in writer response time"

Contemporary messages sorted: [ by date ] [ by thread ] [ by subject ] [ by author ]

Original text of this message