Re: Difference in writer response time

From: Pap <oracle.developer35_at_gmail.com>
Date: Thu, 15 Apr 2021 23:53:36 +0530
Message-ID: <CAEjw_fhi-B77oP6f_v+9+39nvvXr5Mo8O6V7e31CbxvsbsFSyQ_at_mail.gmail.com>



Even though filesystemio_options is set to ASYNCH, I see in v$iostat_file that the ASYNCH_IO column shows "ASYNC_OFF" for all the data files. Is that because we don't have a Data Guard configuration and it has been turned off to avoid any block corruption? And will setting it ON increase the DB writer (DBWR) speed?
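
For reference, a minimal sketch of the checks behind this observation (standard 11.2 views; adjust the filters as needed):

-- instance-level async I/O settings
select name, value
from   v$parameter
where  name in ('filesystemio_options', 'disk_asynch_io');

-- per-file async I/O status as reported by the database
select file_no, filetype_name, asynch_io
from   v$iostat_file
where  filetype_name = 'Data File'
order  by file_no;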

I saw one blog stating that keeping db_writer_processes=1 and dbwr_io_slaves=32 helped improve DBWR response time significantly.

https://www.bobbydurrettdba.com/2013/04/26/db_writer_processes-dbwr_io_slaves-with-no-asynch-io-on-hp-ux/
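
For completeness, a sketch of how the settings mentioned in that blog would be applied (the values are the blog's, not a recommendation; both parameters are static, so this needs an instance restart and careful testing first):

-- current values
select name, value, isdefault
from   v$parameter
where  name in ('db_writer_processes', 'dbwr_io_slaves');

-- the combination from the blog (spfile only, restart required)
alter system set db_writer_processes = 1  scope=spfile;
alter system set dbwr_io_slaves      = 32 scope=spfile;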

Regards
Pap

On Wed, Apr 14, 2021 at 9:42 PM Pap <oracle.developer35_at_gmail.com> wrote:

> Thanks a lot, Tanel.
>
> I got the SAN IO statistics report from the infra team for two of the
> days when the application was pointing to DR and experiencing issues,
> but not for primary. In the attached file (primary_DR_SAN_Stats.xlsx)
> the IO rate (reads/sec) seems to be on the higher side when the
> application was pointing to DR, but in both cases I don't see any
> increase in avserv, i.e. average service time. So should we conclude
> that, even if some additional processes are running on DR, it should
> not matter as long as the IO service time is almost zero and no queuing
> is observed from the SAN standpoint for DR? But this does not match the
> picture AWR shows, i.e. DBWR response time slowness in DR as compared
> to primary.
>
> Additionally, in the AWR report I had shared earlier in this thread, DR
> shows "index contention" as the top wait, and there are also "ITL"
> waits (in enqueu_activity_dr.xlsx). So should we increase the INITRANS
> value to help improve the situation on DR (and also on primary) without
> any negative impact?
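>
> A minimal sketch of how that could be checked and applied (view and
> statistic names as in 11.2; the index name below is only a placeholder):
>
> -- segments showing ITL waits since instance startup
> select owner, object_name, object_type, value itl_waits
> from   v$segment_statistics
> where  statistic_name = 'ITL waits'
> and    value > 0
> order  by value desc;
>
> -- raise INITRANS on the hot index (placeholder name); existing blocks
> -- keep the old setting until the index is rebuilt
> alter index app_owner.pk_hot_table initrans 16;
> alter index app_owner.pk_hot_table rebuild online;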
>
> Regards
> Pap
>
>
> On Sun, Apr 11, 2021 at 7:36 AM Tanel Poder <tanel_at_tanelpoder.com> wrote:
>
>> The primary and DR run on different storage arrays that may be configured
>> and "sized" differently (sometimes DR/backup has slower disks or something
>> like that). Or even if the storage arrays, HBAs & SAN network are identical
>> across both environments, the DR storage array may be overloaded more than
>> primary (as everyone's doing their backups from DR).
>>
>> On HP-UX, as a starting point, you could use an IO monitoring tool
>> (like sar) that shows you both the I/O rate this server is driving and
>> the average service time (avserv) and avwait/avque (queue lengths). So,
>> if DR has twice the I/O service time of primary, while the I/O rate is
>> roughly the same (as the AWR metrics indicated), then this should be a
>> good starting point for the sysadmins/storage folks.
>>
>> $ sar -d 1 10
>>
>> HP-UX server01 B.11.23 U ia64 10/25/09
>>
>> 20:38:46   device    %busy  avque  r+w/s  blks/s  avwait  avserv
>> 20:38:56   c0t0d0     1.30   0.50      3      50    0.00   11.07
>>            c31t2d2    0.80   0.50      1      14    0.00    9.22
>>            c31t2d0    3.30   0.50      5      96    0.00    6.14
>>            c31t2d1    3.80   0.50      5      85    0.00    7.30
>>            c31t2d4    1.20   0.50      2     102    0.00    6.04
>>            c31t2d6    0.10   0.50      0       5    0.00    3.83
>>            c31t2d3    2.80   0.50      6      96    0.00    4.67
>>            c31t2d5    1.10   0.50      1      22    0.00    8.44
>>            c31t2d7    0.40   0.50      1      13    0.00    4.90
>>            c31t3d1    0.50   0.50      1      11    0.00    7.95
>>
>> --
>> Tanel Poder
>> #vConf2021: Troubleshooting Very Complex Oracle Performance Problems
>> https://tanelpoder.com/conference/
>>
>> On Sat, Apr 10, 2021 at 3:54 PM Pap <oracle.developer35_at_gmail.com> wrote:
>>
>>> It almost seems like the DR side is performing badly because some
>>> additional process/database is still running there and adding
>>> overhead. So what kind of information/stats should we ask for from
>>> the host or SAN that would help us identify the contribution of each
>>> process on primary and DR, and so find the exact culprit creating the
>>> difference during that peak period?
>>>
>>> Regards
>>> Pap
>>>
>>> On Sat, Apr 10, 2021 at 12:48 AM Pap <oracle.developer35_at_gmail.com>
>>> wrote:
>>>
>>>> I got to know there are three other databases on this box, and at
>>>> exactly the same time we also see a small spike in the top activity
>>>> graph of another database. That small spike in the other database on
>>>> DR exists even when we were live on primary, which suggests a problem
>>>> at the box level rather than in the impacted database we are looking
>>>> at. So it may be either at the host or the storage (which the SAN
>>>> team is denying, though).
>>>>
>>>> Another thing: even though we saw in AWR that we were doing a larger
>>>> amount of work, i.e. db block changes were on the higher side, on
>>>> primary as compared to DR, the SAN team says higher IOPS were
>>>> observed on DR as compared to primary during that time, without any
>>>> queuing. So it sounds strange. So, just like a database AWR report,
>>>> is there any historical report we can generate from the HP host or
>>>> the underlying SAN to understand exactly which process/database
>>>> contributed how much during that peak time, and what the IO latency
>>>> trend and per-database breakdown were during that time?
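>>>>
>>>> From the database side at least, a minimal sketch of pulling the
>>>> historical data-file write figures out of AWR (standard 11.2
>>>> dba_hist views; the snapshot range is a placeholder):
>>>>
>>>> select s.begin_interval_time,
>>>>        sum(f.phywrts)  phys_writes,
>>>>        sum(f.writetim) write_time_cs
>>>> from   dba_hist_filestatxs f
>>>> join   dba_hist_snapshot   s
>>>>        on  s.snap_id         = f.snap_id
>>>>        and s.dbid            = f.dbid
>>>>        and s.instance_number = f.instance_number
>>>> where  f.snap_id between 1000 and 1100  -- placeholder range
>>>> group  by s.begin_interval_time
>>>> order  by s.begin_interval_time;
>>>> -- note: phywrts/writetim are cumulative since instance startup, so
>>>> -- diff consecutive rows for per-interval figures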
>>>>
>>>> Regards
>>>> Pap
>>>>
>>>> On Fri, Apr 9, 2021 at 7:54 PM Pap <oracle.developer35_at_gmail.com>
>>>> wrote:
>>>>
>>>>> But isn't it correct to say that system stats or dictionary stats
>>>>> can only influence the optimizer to choose a different execution
>>>>> path? In our case both databases produce the same execution plans;
>>>>> we don't see any difference in the execution path of any SQL that
>>>>> could cause this issue. It's mainly the single-row inserts/updates
>>>>> that run longer on DR as compared to primary during the peak period.
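>>>>>
>>>>> (Just to rule it out, a quick sketch of comparing system stats on
>>>>> the two databases; dictionary and fixed-object stats could be
>>>>> compared similarly via dba_tab_statistics:)
>>>>>
>>>>> select sname, pname, pval1
>>>>> from   sys.aux_stats$
>>>>> where  sname = 'SYSSTATS_MAIN'
>>>>> order  by pname;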
>>>>>
>>>>> Regards
>>>>> Pap
>>>>>
>>>>> On Fri, Apr 9, 2021 at 7:50 PM Athil Meer <athildba_at_gmail.com> wrote:
>>>>>
>>>>>> Hello Pap,
>>>>>> Are there any differences in fixed, dictionary and system stats
>>>>>> between these DBs?
>>>>>>
>>>>>> Thank you
>>>>>> Athil
>>>>>>
>>>>>> On Fri, 9 Apr 2021, 5:09 pm Pap, <oracle.developer35_at_gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> We have both an application server and a DB server on each of the
>>>>>>> primary and DR sides. So when we point to DR, our application
>>>>>>> code/server also runs from DR. So there should not be any network
>>>>>>> delay causing such a difference.
>>>>>>>
>>>>>>> Regards
>>>>>>> Pap
>>>>>>>
>>>>>>> On Fri, Apr 9, 2021 at 6:49 PM Mark W. Farnham <mwf_at_rsiz.com> wrote:
>>>>>>>
>>>>>>>> In a network sense, where is the client application with respect to
>>>>>>>> the two databases?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> When you are using the “primary” database is the client at the
>>>>>>>> “primary” database location?
>>>>>>>>
>>>>>>>> When you are using the “DR” database is the client at the “DR”
>>>>>>>> database location?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Depending on how the application handshakes information to be
>>>>>>>> written and how waits are rolled up, that can make all the difference.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> A very quick test is to actually place a client at the “DR”
>>>>>>>> database location, on the LAN with the “DR” database, if in fact
>>>>>>>> you are currently measuring performance from an application at
>>>>>>>> the “primary” database location in both cases.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Explaining how and why the performance difference can actually be
>>>>>>>> network latency and speed yet appear to be differences in DBWR service rate
>>>>>>>> is beyond the scope of this note.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> mwf
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> *From:* oracle-l-bounce_at_freelists.org [mailto:
>>>>>>>> oracle-l-bounce_at_freelists.org] *On Behalf Of *Pap
>>>>>>>> *Sent:* Friday, April 09, 2021 8:29 AM
>>>>>>>> *To:* Tanel Poder
>>>>>>>> *Cc:* Oracle L
>>>>>>>> *Subject:* Re: Difference in writer response time
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Thank You Tanel.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> I don't have much understanding of the infra side. But the first
>>>>>>>> thing we want to understand is why there is a difference in
>>>>>>>> performance between the two sites during that peak period. As
>>>>>>>> these are primary and DR, ideally everything should be the same,
>>>>>>>> including the SAN configuration. We have verified that the DB
>>>>>>>> parameters are mostly the same and none are suspected of
>>>>>>>> impacting DBWR throughput on one side versus the other.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> But as you said, there may be some issue with the kernel or IO
>>>>>>>> stack, so I wanted to understand exactly what information we can
>>>>>>>> ask for from our storage/SAN team in order to conclude whether
>>>>>>>> there really is a difference in storage IO performance between
>>>>>>>> the two sides.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> And should we ask for the IO information only for the issue
>>>>>>>> period to understand the difference, since at normal times both
>>>>>>>> databases perform similarly and it is only during that peak
>>>>>>>> activity that DR behaves badly as compared to primary?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Regards
>>>>>>>>
>>>>>>>> Pap
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Apr 9, 2021 at 5:32 AM Tanel Poder <tanel_at_tanelpoder.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Ok, sorry I didn't realize this was on HP-UX (I guess I'm too
>>>>>>>> spoiled nowadays with "servers" like this one
>>>>>>>> <https://tanelpoder.com/posts/11m-iops-with-10-ssds-on-amd-threadripper-pro-workstation/>
>>>>>>>> under my desk ;-)
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Some comments:
>>>>>>>>
>>>>>>>> 1. The *free buffer waits* show that DBWR falls behind with
>>>>>>>> syncing dirty buffers to disk (~4000 blocks written per second). So
>>>>>>>> it looks like a DBWR throughput problem (caused by some issue in the kernel
>>>>>>>> or I/O stack)
>>>>>>>> 2. Other than storage hardware latency itself and things like
>>>>>>>> block device queue depths, the filesystem itself may be a bottleneck - are
>>>>>>>> you using *VxFS *- and if yes, with or without *ODM*? VxFS
>>>>>>>> without ODM doesn't use concurrent I/O by default, potentially causing
>>>>>>>> write-write contention at file level (POSIX single writer locking)
>>>>>>>> 3. Also, the normal recommendation for production systems is to
>>>>>>>> use filesystemio_options=*setall*, but I don't remember if
>>>>>>>> HP-UX was special in that sense (or you just forced direct I/O with a VxFS
>>>>>>>> mount option)
>>>>>>>> 4. Run Snapper on DBWRs and see how many of them you have in
>>>>>>>> each instance and are they 100% busy waiting for one of the *db
>>>>>>>> file parallel write*, *db file asynch I/O submit* waits (or
>>>>>>>> show all CPU usage - no wait events whatsoever, can be due to an
>>>>>>>> instrumentation bug)
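>>>>>>>>
>>>>>>>> (Instead of Snapper, a bare-bones plain-SQL sketch of the same
>>>>>>>> check, polling what the DBWR processes are doing right now:)
>>>>>>>>
>>>>>>>> select s.inst_id, s.sid, b.name, s.status, s.state,
>>>>>>>>        s.event, s.seconds_in_wait
>>>>>>>> from   gv$session   s
>>>>>>>> join   gv$bgprocess b
>>>>>>>>        on  b.inst_id = s.inst_id
>>>>>>>>        and b.paddr   = s.paddr
>>>>>>>> where  b.name like 'DBW%';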
>>>>>>>>
>>>>>>>> The various "higher level contention" wait events, *buffer busy
>>>>>>>> waits*, *enq: TX contention* and in rare cases even some latch
>>>>>>>> contention may be just a symptom of the underlying I/O problems. You need
>>>>>>>> to do wait chain analysis to see beyond the "top level symptoms".
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> The ash_wait_chains.sql output is pretty wide, Oracle-L doesn't
>>>>>>>> allow me to send screenshots as images, so here are a couple of tweets with
>>>>>>>> images of such problems (where the high level symptoms show different waits
>>>>>>>> than the root cause):
>>>>>>>>
>>>>>>>> - https://twitter.com/TanelPoder/status/1318023393304825856
>>>>>>>> - https://twitter.com/TanelPoder/status/1380304923678498820
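>>>>>>>>
>>>>>>>> (A much simplified, one-level version of the ash_wait_chains idea
>>>>>>>> in plain ASH SQL, just to show the principle; the real
>>>>>>>> ash_wait_chains.sql walks the whole chain:)
>>>>>>>>
>>>>>>>> select w.event                 waiting_on,
>>>>>>>>        b.program               blocker_program,
>>>>>>>>        nvl(b.event, 'ON CPU')  blocker_state,
>>>>>>>>        count(*)                ash_samples
>>>>>>>> from   v$active_session_history w
>>>>>>>> left join v$active_session_history b
>>>>>>>>        on  b.sample_id       = w.sample_id
>>>>>>>>        and b.session_id      = w.blocking_session
>>>>>>>>        and b.session_serial# = w.blocking_session_serial#
>>>>>>>> where  w.sample_time > sysdate - 1/24
>>>>>>>> and    w.blocking_session is not null
>>>>>>>> group  by w.event, b.program, b.event
>>>>>>>> order  by ash_samples desc;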
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Back when I used to deal with HP-UX/Solaris machines with VxFS
>>>>>>>> more, a really common bottleneck in high-performance databases (of that
>>>>>>>> era) was people not using concurrent I/O on that filesystem.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>>
>>>>>>>> Tanel Poder
>>>>>>>>
>>>>>>>> #vConf2021: Troubleshooting Very Complex Oracle Performance Problems
>>>>>>>>
>>>>>>>> https://tanelpoder.com/conference/
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Apr 8, 2021 at 2:05 PM Pap <oracle.developer35_at_gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Thank you Tanel.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> It's an HP machine and it's on a file system, no ASM. The
>>>>>>>> filesystemio_options parameter in gv$parameter shows "asynch" in
>>>>>>>> both databases, and disk_asynch_io is also set to TRUE. As I see
>>>>>>>> from AWR, it's an HP-UX IA (64-bit), 64-core, 8-socket machine on
>>>>>>>> both sides.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> The SAN team confirms there is not much visible difference in the
>>>>>>>> underlying storage performance between the two sites. So I'm
>>>>>>>> wondering, since we don't use Data Guard, could this difference
>>>>>>>> be just because of structural differences in the tables/indexes?
>>>>>>>> But we have the same volume of data on both sides, so I'm not
>>>>>>>> sure how that could make such a big difference. And contrary to
>>>>>>>> that, I see we are doing more work on the primary (faster) side
>>>>>>>> during that time period as compared to DR (the slow side), so how
>>>>>>>> is that possible? I also see that direct writes are higher on DR,
>>>>>>>> i.e. the slow side, as compared to primary; not sure if that is
>>>>>>>> pointing to something suspicious.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> I have fetched the wait event section of the AWR from both
>>>>>>>> databases and am attaching it here (in two tabs, primary and DR
>>>>>>>> wait profile). I do see "free buffer waits" in one of the top
>>>>>>>> lists, but the major one is "index contention" (the top one is a
>>>>>>>> composite PK index, but a local index, so less chance of a
>>>>>>>> structural difference causing such issues). But again, it might
>>>>>>>> be that the DBWR response trend is just a symptom of something
>>>>>>>> else rather than the cause, as the SAN team says there is no
>>>>>>>> difference there.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Regards
>>>>>>>>
>>>>>>>> Pap
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Apr 8, 2021 at 10:34 PM Tanel Poder <tanel_at_tanelpoder.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> There's also one usual suspect: I don't see the
>>>>>>>> *filesystemio_options* parameter set. You are running on ASM, not
>>>>>>>> a filesystem, right?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Tanel.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Apr 8, 2021 at 4:06 AM Pap <oracle.developer35_at_gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Hello Listers, we have two databases that are the same in all
>>>>>>>> hardware configuration and DB parameters, and are kept in
>>>>>>>> sync/replicated using GoldenGate. They are treated as primary and
>>>>>>>> DR for us, and both are active. But we are seeing differences in
>>>>>>>> behaviour when our application points to primary versus when it
>>>>>>>> points to DR. During specific times of the day, when our system
>>>>>>>> is at its peak, in one of the databases (DR) the DBWR response
>>>>>>>> time spikes to ~200+ms, slowing down the write queries, while in
>>>>>>>> the other database the DBWR response time stays at ~65ms for a
>>>>>>>> similar amount of data and transaction volume. So I wanted to
>>>>>>>> understand what can cause such a drastic difference. Note: it's
>>>>>>>> Oracle version 11.2.0.4.
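>>>>>>>>
>>>>>>>> (For reference, a rough sketch of how the DBWR write latency
>>>>>>>> trend can be pulled from AWR history on each side; standard 11.2
>>>>>>>> dba_hist views, single instance/dbid assumed, and counter resets
>>>>>>>> at instance restart show up as negative deltas:)
>>>>>>>>
>>>>>>>> select snap_id,
>>>>>>>>        waits - lag(waits) over (order by snap_id)  delta_waits,
>>>>>>>>        round( (tmic  - lag(tmic)  over (order by snap_id))
>>>>>>>>             / nullif(waits - lag(waits)
>>>>>>>>                      over (order by snap_id), 0)
>>>>>>>>             / 1000, 1)                              avg_write_ms
>>>>>>>> from  (select snap_id, total_waits waits,
>>>>>>>>               time_waited_micro tmic
>>>>>>>>        from   dba_hist_system_event
>>>>>>>>        where  event_name = 'db file parallel write')
>>>>>>>> order by snap_id;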
>>>>>>>>
>>>>>>>>

--
http://www.freelists.org/webpage/oracle-l
Received on Thu Apr 15 2021 - 20:23:36 CEST
