Re: Difference in writer response time

From: Mark J. Bobak <mark_at_bobak.net>
Date: Thu, 15 Apr 2021 15:12:11 -0400
Message-ID: <CAFQ5ACJv3OCh0aAV9WCgMzY37RHUN9YuxAZwdxk6uX2bkkX9Fw_at_mail.gmail.com>



If I recall correctly, you must have db_writer_processes=1 when using I/O slaves: if your filesystem doesn't natively support asynchronous I/O, setting dbwr_io_slaves > 0 is a way to "simulate" async I/O for filesystems that don't offer it. Where async I/O is available, you should get superior performance from db_writer_processes > 1.
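
For reference, a minimal sketch of how to compare the relevant writer settings across both instances (assumes SELECT access to v$parameter; parameter names per the 11.2 reference):

```sql
-- Compare these across primary and DR; a nonzero dbwr_io_slaves
-- implies a single DBWR using simulated async I/O
SELECT name, value
FROM   v$parameter
WHERE  name IN ('db_writer_processes', 'dbwr_io_slaves',
                'disk_asynch_io', 'filesystemio_options')
ORDER  BY name;
```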

-Mark

On Thu, Apr 15, 2021, 14:23 Pap <oracle.developer35_at_gmail.com> wrote:

> Even though filesystemio_options is set to ASYNCH, I see in
> v$iostat_file that the ASYNCH_IO column shows "ASYNCH_OFF" for all the data
> files. Is it because we don't have a Data Guard configuration, and thus,
> to avoid any block corruption, it's been set to "ASYNCH_OFF"? And will
> setting this ON increase the DB writer speed?
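> A hedged sketch of the query used to check this (column names assumed from
> the 11.2 V$IOSTAT_FILE reference):

```sql
-- Per-file async I/O status as seen by the instance
SELECT file_no, filetype_name, asynch_io
FROM   v$iostat_file
WHERE  filetype_name = 'Data File';
```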
>
> Saw one blog stating that keeping db_writer_processes=1, dbwr_io_slaves=32
> helped improve the DB writer response significantly.
>
>
> https://www.bobbydurrettdba.com/2013/04/26/db_writer_processes-dbwr_io_slaves-with-no-asynch-io-on-hp-ux/
>
> Regards
> Pap
>
> On Wed, Apr 14, 2021 at 9:42 PM Pap <oracle.developer35_at_gmail.com> wrote:
>
>> Thanks a lot, Tanel.
>>
>> I got the SAN IO statistics report from the infra team for two of the
>> days when the application was pointing to DR and experiencing issues that
>> did not occur on primary. What I see in the attached file
>> (primary_DR_SAN_Stats.xlsx) is that the IO rate (reads/sec) seems to be on
>> the higher side when the application was pointing to DR, but in neither
>> case do I see any increase in avserv, i.e. average service time. So should
>> we conclude that, even if some additional processes are running on DR, it
>> should not matter as long as the IO service time is almost zero and no
>> queuing is observed from the SAN standpoint for DR? But this does not
>> match the picture AWR shows, i.e. DBWR response time slowness on DR as
>> compared to primary.
>>
>> Additionally, in the AWR I shared earlier in this thread for DR, "index
>> contention" shows as the top wait and there are also "ITL" waits (in
>> enqueu_activity_dr.xlsx). So should we increase the INITRANS value to
>> improve the situation on DR (and also on primary) without any negative
>> impact?
>>
>> Regards
>> Pap
>>
>>
>> On Sun, Apr 11, 2021 at 7:36 AM Tanel Poder <tanel_at_tanelpoder.com> wrote:
>>
>>> The primary and DR run on different storage arrays that may be
>>> configured and "sized" differently (sometimes DR/backup has slower disks or
>>> something like that). Or even if the storage arrays, HBAs & SAN network are
>>> identical across both environments, the DR storage array may be overloaded
>>> more than primary (as everyone's doing their backups from DR).
>>>
>>> On HP-UX, as a starting point, you could use an I/O monitoring tool
>>> (like sar) that shows you both the I/O rate this server is running & the
>>> average service time (avserv) and avwait/avque (queue lengths). So, if the
>>> DR has twice the I/O service time of primary while the I/O rate is
>>> roughly the same (as AWR metrics indicated), then this should be a good
>>> starting point for sysadmins/storage folks.
>>>
>>> $ sar –d 1 10
>>>
>>> HP-UX server01 B.11.23 U ia64 10/25/09
>>>
>>> 20:38:46 device %busy avque r+w/s blks/s avwait avserv
>>> 20:38:56 c0t0d0 1.30 0.50 3 50 0.00 11.07
>>> c31t2d2 0.80 0.50 1 14 0.00 9.22
>>> c31t2d0 3.30 0.50 5 96 0.00 6.14
>>> c31t2d1 3.80 0.50 5 85 0.00 7.30
>>> c31t2d4 1.20 0.50 2 102 0.00 6.04
>>> c31t2d6 0.10 0.50 0 5 0.00 3.83
>>> c31t2d3 2.80 0.50 6 96 0.00 4.67
>>> c31t2d5 1.10 0.50 1 22 0.00 8.44
>>> c31t2d7 0.40 0.50 1 13 0.00 4.90
>>> c31t3d1 0.50 0.50 1 11 0.00 7.95
>>>
>>> --
>>> Tanel Poder
>>> #vConf2021: Troubleshooting Very Complex Oracle Performance Problems
>>> https://tanelpoder.com/conference/
>>>
>>> On Sat, Apr 10, 2021 at 3:54 PM Pap <oracle.developer35_at_gmail.com>
>>> wrote:
>>>
>>>> It almost seems like the DR side is performing badly because some
>>>> additional process/database might still be running there and adding
>>>> overhead. So what kind of information/stats from the host or SAN should
>>>> we ask for that will help us identify the contribution of each process
>>>> on primary and DR, so we can find the exact culprit that is creating the
>>>> difference during that peak period?
>>>>
>>>> Regards
>>>> Pap
>>>>
>>>> On Sat, Apr 10, 2021 at 12:48 AM Pap <oracle.developer35_at_gmail.com>
>>>> wrote:
>>>>
>>>>> Got to know there are 3 other databases on this box, and at
>>>>> exactly the same time we also see a small spike in the top-activity
>>>>> graph of another database. That small spike on the other DR database
>>>>> exists even when we were live on primary, which suggests a problem at
>>>>> the box level rather than in the impacted database we are looking at.
>>>>> So it may be either the host or the storage (which the SAN team is
>>>>> denying, though).
>>>>>
>>>>> Another thing: even though we saw in AWR that we are doing a large
>>>>> amount of work, i.e. db block changes were on the higher side on primary
>>>>> compared to DR, the SAN team says higher IOPS were observed on DR
>>>>> compared to primary during that time, without any queuing. That sounds
>>>>> strange. So, just like the database AWR report, is there any historical
>>>>> report we can generate from the HP host or underlying SAN to understand
>>>>> which process/database contributed how much during that peak time? And
>>>>> what is the exact IO latency trend and breakdown per database during
>>>>> that time?
>>>>>
>>>>> Regards
>>>>> Pap
>>>>>
>>>>> On Fri, Apr 9, 2021 at 7:54 PM Pap <oracle.developer35_at_gmail.com>
>>>>> wrote:
>>>>>
>>>>>> But isn't it correct to say that system stats or dictionary stats can
>>>>>> only influence the optimizer to choose a different execution path? In
>>>>>> our case both databases produce the same execution plans; we don't see
>>>>>> any difference in execution path for any SQL that could cause this
>>>>>> issue. It's mainly single-row inserts/updates that run longer on DR
>>>>>> compared to primary during the peak period.
>>>>>>
>>>>>> Regards
>>>>>> Pap
>>>>>>
>>>>>> On Fri, Apr 9, 2021 at 7:50 PM Athil Meer <athildba_at_gmail.com> wrote:
>>>>>>
>>>>>>> Hello Pap,
>>>>>>> Are there any differences in fixed, dictionary and system stats
>>>>>>> between these DBs?
>>>>>>>
>>>>>>> Thank you
>>>>>>> Athil
>>>>>>>
>>>>>>> On Fri, 9 Apr 2021, 5:09 pm Pap, <oracle.developer35_at_gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> We have both application servers and DB servers at the primary
>>>>>>>> and DR sites. So when we point to DR, our application code/servers
>>>>>>>> also run from DR. So there should not be any network delay causing
>>>>>>>> such a difference.
>>>>>>>>
>>>>>>>> Regards
>>>>>>>> Pap
>>>>>>>>
>>>>>>>> On Fri, Apr 9, 2021 at 6:49 PM Mark W. Farnham <mwf_at_rsiz.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> In a network sense, where is the client application with respect
>>>>>>>>> to the two databases?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> When you are using the “primary” database is the client at the
>>>>>>>>> “primary” database location?
>>>>>>>>>
>>>>>>>>> When you are using the “DR” database is the client at the “DR”
>>>>>>>>> database location?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Depending on how the application handshakes information to be
>>>>>>>>> written and how waits are rolled up, that can make all the difference.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> A very quick test is to actually place a client on the LAN at the
>>>>>>>>> “DR” database location, if in fact you are currently measuring the
>>>>>>>>> performance of an application located at the “primary” database
>>>>>>>>> location in both cases.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Explaining how and why the performance difference can actually be
>>>>>>>>> network latency and speed yet appear to be differences in DBWR service rate
>>>>>>>>> is beyond the scope of this note.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> mwf
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> *From:* oracle-l-bounce_at_freelists.org [mailto:
>>>>>>>>> oracle-l-bounce_at_freelists.org] *On Behalf Of *Pap
>>>>>>>>> *Sent:* Friday, April 09, 2021 8:29 AM
>>>>>>>>> *To:* Tanel Poder
>>>>>>>>> *Cc:* Oracle L
>>>>>>>>> *Subject:* Re: Difference in writer response time
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thank You Tanel.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I don't have much understanding of the infra side. But the first
>>>>>>>>> thing we actually want to understand is why there is a difference
>>>>>>>>> in performance during that peak period between the two sites. As
>>>>>>>>> those are primary and DR, ideally everything should be the same,
>>>>>>>>> including SAN configuration. We have verified that the DB parameters
>>>>>>>>> are mostly the same and are not suspected of impacting DBWR
>>>>>>>>> throughput on one vs the other.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> But as you said there may be some issue wrt the kernel or I/O
>>>>>>>>> stack, so we wanted to understand what exact information we can ask
>>>>>>>>> our storage/SAN team for, by which we can conclude whether there
>>>>>>>>> really is a difference in storage I/O performance between the two
>>>>>>>>> sides.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> And should we ask for the IO information only during the issue
>>>>>>>>> period to understand the difference? Because at normal times both
>>>>>>>>> databases perform similarly; it's only during that peak activity
>>>>>>>>> that DR behaves badly compared to primary.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Regards
>>>>>>>>>
>>>>>>>>> Pap
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, Apr 9, 2021 at 5:32 AM Tanel Poder <tanel_at_tanelpoder.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> Ok, sorry I didn't realize this was on HP-UX (I guess I'm too
>>>>>>>>> spoiled nowadays with "servers" like this one
>>>>>>>>> <https://tanelpoder.com/posts/11m-iops-with-10-ssds-on-amd-threadripper-pro-workstation/>
>>>>>>>>> under my desk ;-)
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Some comments:
>>>>>>>>>
>>>>>>>>> 1. The *free buffer waits* show that DBWR falls behind with
>>>>>>>>> syncing dirty buffers to disk (~4000 blocks written per second). So
>>>>>>>>> it looks like a DBWR throughput problem (caused by some issue in the kernel
>>>>>>>>> or I/O stack)
>>>>>>>>> 2. Other than storage hardware latency itself and things like
>>>>>>>>> block device queue depths, the filesystem itself may be a bottleneck - are
>>>>>>>>> you using *VxFS *- and if yes, with or without *ODM*? VxFS
>>>>>>>>> without ODM doesn't use concurrent I/O by default, potentially causing
>>>>>>>>> write-write contention at file level (POSIX single writer locking)
>>>>>>>>> 3. Also, the normal recommendation for production systems is
>>>>>>>>> to use filesystemio_options=*setall*, but I don't remember if
>>>>>>>>> HP-UX was special in that sense (or you just forced direct I/O with a VxFS
>>>>>>>>> mount option)
>>>>>>>>> 4. Run Snapper on DBWRs and see how many of them you have in
>>>>>>>>> each instance and are they 100% busy waiting for one of the *db
>>>>>>>>> file parallel write*, *db file asynch I/O submit* waits (or
>>>>>>>>> show all CPU usage - no wait events whatsoever, can be due to an
>>>>>>>>> instrumentation bug)
>>>>>>>>>
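>>>>>>>>> A hedged sketch of the Snapper run from point 4 (snapper.sql from
>>>>>>>>> tanelpoder.com; the SID subquery targeting DBWR processes is an
>>>>>>>>> assumption about the v$session program name format):

```sql
-- Sample all DBWR sessions for 5 seconds, one snapshot,
-- showing their wait events, stats and ASH-style activity
@snapper ash,stats 5 1 "select sid from v$session where program like '%(DBW%'"
```

>>>>>>>>>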
>>>>>>>>> The various "higher level contention" wait events, *buffer busy
>>>>>>>>> waits*, *enq: TX contention *and in rare cases even some latch
>>>>>>>>> contention may be just a symptom of the underlying I/O problems. You need
>>>>>>>>> to do wait chain analysis to see beyond the "top level symptoms".
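>>>>>>>>> For example, a wait-chain query along these lines
>>>>>>>>> (ash_wait_chains.sql from tanelpoder.com; the argument syntax
>>>>>>>>> follows the examples on the blog and the one-hour window is
>>>>>>>>> illustrative):

```sql
-- Who's blocking whom, grouped by program and event, over the last hour of ASH
@ash_wait_chains program2||event2 session_type='FOREGROUND' sysdate-1/24 sysdate
```

>>>>>>>>>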
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> The ash_wait_chains.sql output is pretty wide, Oracle-L doesn't
>>>>>>>>> allow me to send screenshots as images, so here are a couple of tweets with
>>>>>>>>> images of such problems (where the high level symptoms show different waits
>>>>>>>>> than the root cause):
>>>>>>>>>
>>>>>>>>> - https://twitter.com/TanelPoder/status/1318023393304825856
>>>>>>>>> - https://twitter.com/TanelPoder/status/1380304923678498820
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Back when I used to deal with HP-UX/Solaris machines with VxFS
>>>>>>>>> more, a really common bottleneck in high-performance databases (of that
>>>>>>>>> era) was people not using concurrent I/O on that filesystem.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>>
>>>>>>>>> Tanel Poder
>>>>>>>>>
>>>>>>>>> #vConf2021: Troubleshooting Very Complex Oracle Performance
>>>>>>>>> Problems
>>>>>>>>>
>>>>>>>>> https://tanelpoder.com/conference/
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Apr 8, 2021 at 2:05 PM Pap <oracle.developer35_at_gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> Thank you Tanel.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> It's an HP machine and it's on a filesystem, no ASM. The
>>>>>>>>> filesystemio_options in gv$parameter shows "asynch" in both
>>>>>>>>> databases, and disk_asynch_io is also set to TRUE. As I see from
>>>>>>>>> AWR, it's an HP-UX IA (64-bit), 64-core, 8-socket machine on both
>>>>>>>>> sides.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> The SAN team confirms there is no visible difference between the
>>>>>>>>> underlying storage performance of the two sites. So, since we don't
>>>>>>>>> use Data Guard, could this difference be just because of structural
>>>>>>>>> differences in the tables/indexes? But we have the same volume of
>>>>>>>>> data on both sides, so I'm not sure how that could make such a big
>>>>>>>>> difference. And contrary to that, I see we are doing more work on
>>>>>>>>> the primary (faster) side during that time period compared to DR
>>>>>>>>> (slow side), so how is that possible? I am also seeing that direct
>>>>>>>>> writes are higher on DR, i.e. the slow side, compared to primary;
>>>>>>>>> not sure if that points to something suspicious.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I have fetched the wait event section of the AWR from both
>>>>>>>>> databases and attached it here (in two tabs: primary and DR wait
>>>>>>>>> profile). Although I do see "free buffer waits" in one of the top
>>>>>>>>> lists, the major one is "index contention" (the top one is a
>>>>>>>>> composite PK index, but a local index, so less chance of a
>>>>>>>>> structural difference causing such issues). But again, it might be
>>>>>>>>> that the DBWR response trend is just a symptom of something else
>>>>>>>>> and not the cause, as the SAN team says there is no difference
>>>>>>>>> there.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Regards
>>>>>>>>>
>>>>>>>>> Pap
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Apr 8, 2021 at 10:34 PM Tanel Poder <tanel_at_tanelpoder.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> There's also one usual suspect: I don't see the *filesystemio_options*
>>>>>>>>> parameter set. You are running on ASM, not a filesystem, right?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> Tanel.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Apr 8, 2021 at 4:06 AM Pap <oracle.developer35_at_gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> Hello Listers, we have two databases that are the same in all
>>>>>>>>> hardware configuration and DB parameters, and are kept in
>>>>>>>>> sync/replicated using GoldenGate. They are treated as primary and DR
>>>>>>>>> for us and both are active. But we are seeing differences in
>>>>>>>>> behaviour when our application points to primary vs when it points
>>>>>>>>> to DR. During specific times of the day, when our system is at its
>>>>>>>>> peak, in one of the databases, i.e. DR, the DBWR response time
>>>>>>>>> spikes to ~200+ms, slowing down the write queries, while in the
>>>>>>>>> other database the DBWR response time stays at ~65ms for a similar
>>>>>>>>> amount of data and transaction volume. So we wanted to understand
>>>>>>>>> what can cause such drastic differences? Note: it's Oracle version
>>>>>>>>> 11.2.0.4.
>>>>>>>>>
>>>>>>>>>

--
http://www.freelists.org/webpage/oracle-l
Received on Thu Apr 15 2021 - 21:12:11 CEST
