Re: Difference in writer response time

From: Pap <oracle.developer35_at_gmail.com>
Date: Fri, 16 Apr 2021 09:53:06 +0530
Message-ID: <CAEjw_fhqH-YRsEJOdCh_ARS9BE0NkuLdeLvJMFZEs9SkpOyogA_at_mail.gmail.com>



  Yes we have disk_asynch_io set as 'true' with filesystemio_options set as 'asynch' but as its HP-UX and this natively does not support asynchronous IO, so perhaps there is no meaning of having db_writer_processes>1, but i do see we have set it as -6. So thinking to go back to db_writer_processes=1 with db_io_slaves>1 (not sure if there is any formula to determine a proper value for this) should make us process things like asynchronous IO thus improving DB writer response time. And I hope as this system does not natively support asynchronous IO so no way the direct I/O or concurrent I/O can be set/achieved here which are normally recommended by Oracle. Correct me if wrong.

And another thought I had was, As we don't use data guards here , whether having the asynchronous IO option set will increase the risk of block corruption anyway?

On Fri, Apr 16, 2021 at 12:42 AM Mark J. Bobak <mark_at_bobak.net> wrote:

> If I recall correctly, you must have db_writer_processes=1 if you're
> limited by your filesystem not natively supporting asynch I/O, and setting
> db_io_slaves > 1 is a way to "simulate" asynch I/O for those filesystems
> that don't offer it. Where possible, you should get superior performance
> from db_writer_processes > 1.
>
> -Mark
>
> On Thu, Apr 15, 2021, 14:23 Pap <oracle.developer35_at_gmail.com> wrote:
>
>> Even though the filesystemio_options set as ASYNCH but i see in
>> v$iostat_file the column ASYNCH_IO is set as ''ASYNCH_OFF" for all the data
>> files. So is it because we are not having data guard configuration and thus
>> to avoid any block corruption, it's been set as "ASYNCH_OFF" but setting
>> this ON will increase the DB writer speed?
>>
>> Saw one blog stating keeping db_writer_processes=1, dbwr_io_slaves=32
>> helped improve the dbwriter response significantly.
>>
>>
>> https://www.bobbydurrettdba.com/2013/04/26/db_writer_processes-dbwr_io_slaves-with-no-asynch-io-on-hp-ux/
>>
>> Regards
>> Pap
>>
>> On Wed, Apr 14, 2021 at 9:42 PM Pap <oracle.developer35_at_gmail.com> wrote:
>>
>>> Thanks a lot, Tanel.
>>>
>>> I got the SAN IO statistics report from the infra team for two of the
>>> days where application was pointing and was experiencing issues in DR but
>>> not in primary. What i see is the IO rate(i.e. read/sec) in the attached
>>> file(primary_DR_SAN_Stats.xlsx) , seems to be on higher side when
>>> application was pointing to DR, but in both the cases i don't see any
>>> increase in avserv i.e. average service time , so should we conclude that
>>> , even we have some additional processes might be running on DR but should
>>> not matter as long as the IO service time is almost zero and also no
>>> queuing is observed from SAN standpoint for DR? But yes this does not match
>>> with the picture which AWR shows i.e. DBWR response time slowness in DR as
>>> compared to primary.
>>>
>>> Additionally in the AWR which i had shared here earlier in this thread,
>>> for DR as it shows "index contention" as the top most wait and there does
>>> exist "ITL '' waits(in enqueu_activity_dr.xlsx) , so should we increase the
>>> INITRANS value to help us improve the situation here on DR(and also in
>>> primary) without any negative impact?
>>>
>>> Regards
>>> Pap
>>>
>>>
>>> On Sun, Apr 11, 2021 at 7:36 AM Tanel Poder <tanel_at_tanelpoder.com>
>>> wrote:
>>>
>>>> The primary and DR run on different storage arrays that may be
>>>> configured and "sized" differently (sometimes DR/backup has slower disks or
>>>> something like that). Or even if the storage arrays, HBAs & SAN network are
>>>> identical across both environments, the DR storage array may be overloaded
>>>> more than primary (as everyone's doing their backups from DR).
>>>>
>>>> On HP-UX as a starting point, you could use some IO monitoring tool
>>>> that (like sar) that shows you both the I/O rate this server is running &
>>>> the average service time (avserv) and avwait/avque (queue lengths). So, if
>>>> the DR has twice higher I/O service time than primary, while the I/O rate
>>>> is roughly the same (as AWR metrics indicated), then this should be a good
>>>> starting point for sysadmins/storage folks.
>>>>
>>>> $ sar –d 1 10
>>>>
>>>> HP-UX server01 B.11.23 U ia64 10/25/09
>>>>
>>>> 20:38:46 device %busy avque r+w/s blks/s avwait avserv
>>>> 20:38:56 c0t0d0 1.30 0.50 3 50 0.00 11.07
>>>> c31t2d2 0.80 0.50 1 14 0.00 9.22
>>>> c31t2d0 3.30 0.50 5 96 0.00 6.14
>>>> c31t2d1 3.80 0.50 5 85 0.00 7.30
>>>> c31t2d4 1.20 0.50 2 102 0.00 6.04
>>>> c31t2d6 0.10 0.50 0 5 0.00 3.83
>>>> c31t2d3 2.80 0.50 6 96 0.00 4.67
>>>> c31t2d5 1.10 0.50 1 22 0.00 8.44
>>>> c31t2d7 0.40 0.50 1 13 0.00 4.90
>>>> c31t3d1 0.50 0.50 1 11 0.00 7.95
>>>>
>>>> --
>>>> Tanel Poder
>>>> #vConf2021: Troubleshooting Very Complex Oracle Performance Problems
>>>> https://tanelpoder.com/conference/
>>>>
>>>> On Sat, Apr 10, 2021 at 3:54 PM Pap <oracle.developer35_at_gmail.com>
>>>> wrote:
>>>>
>>>>> As it almost seems like the DR side is performing badly because of
>>>>> some additional process/database might be there which is still running and
>>>>> adding overhead. So what kind of information/stats from the host or
>>>>> SAN should we ask which will help us in identifying the contribution of
>>>>> each of the processes in primary and DR and this would be able to find the
>>>>> exact culprit which is creating a difference during that peak period?
>>>>>
>>>>> Regards
>>>>> Pap
>>>>>
>>>>> On Sat, Apr 10, 2021 at 12:48 AM Pap <oracle.developer35_at_gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Got to know there are other 3 databases on this box, however at
>>>>>> exactly same time we also see a small spike in top activity graph in
>>>>>> another database. And i see even when we were live on Primary that other
>>>>>> database on DR having that small spike exists , which suggests some problem
>>>>>> persisting in the box level rather in the impacted database which we are
>>>>>> looking for. So it may be either at host or storage(which the SAN team is
>>>>>> denying though).
>>>>>>
>>>>>> And another thing is even we saw in the AWR we are doing large amount
>>>>>> of work or the db block changes were on higher side on primary as compared
>>>>>> to DR , but the SAN team saying higher IOPS in DR as compared to primary
>>>>>> during that time observed without any queuing. So it sounds strange. So,
>>>>>> Just like database AWR report, Is there any such historical report which we
>>>>>> can generate out of the HP host or underlying SAN to understand what exact
>>>>>> process/database contributes how much during that peak time? And what is
>>>>>> the exact IO latency trend and segregation based on database , during that
>>>>>> time?
>>>>>>
>>>>>> Regards
>>>>>> Pap
>>>>>>
>>>>>> On Fri, Apr 9, 2021 at 7:54 PM Pap <oracle.developer35_at_gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> But isn't it correct to say that system stats or dictionary stats
>>>>>>> can only influence the optimizer to go for a different execution path? But
>>>>>>> in our case both of the databases are giving an equal execution path, we
>>>>>>> don't see any difference in execution path for any such sql which can cause
>>>>>>> this issue. It's mainly the single row inserts/updates are running longer
>>>>>>> on DR as compared to primary during peak period.
>>>>>>>
>>>>>>> Regards
>>>>>>> Pap
>>>>>>>
>>>>>>> On Fri, Apr 9, 2021 at 7:50 PM Athil Meer <athildba_at_gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hello Pap,
>>>>>>>> Are there any difference in fixed, dictionary and system stats
>>>>>>>> between these DBs
>>>>>>>>
>>>>>>>> Thank you
>>>>>>>> Athil
>>>>>>>>
>>>>>>>> On Fri, 9 Apr 2021, 5:09 pm Pap, <oracle.developer35_at_gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> We have both application server and db server based out of primary
>>>>>>>>> and DR side each. So when we point to DR our application code/server also
>>>>>>>>> runs from DR only. So there should not be any network delay causing such
>>>>>>>>> difference.
>>>>>>>>>
>>>>>>>>> Regards
>>>>>>>>> Pap
>>>>>>>>>
>>>>>>>>> On Fri, Apr 9, 2021 at 6:49 PM Mark W. Farnham <mwf_at_rsiz.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> In a network sense, where is the client application with respect
>>>>>>>>>> to the two databases?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> When you are using the “primary” database is the client at the
>>>>>>>>>> “primary” database location?
>>>>>>>>>>
>>>>>>>>>> When you are using the “DR” database is the client at the “DR”
>>>>>>>>>> database location?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Depending on how the application handshakes information to be
>>>>>>>>>> written and how waits are rolled up, that can make all the difference.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> A very quick test is to actually place a client at the location
>>>>>>>>>> of the “DR” database location on the LAN with the “DR” database if in fact
>>>>>>>>>> currently you are considering performance of an application on the
>>>>>>>>>> “primary” database location in both cases currently.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Explaining how and why the performance difference can actually be
>>>>>>>>>> network latency and speed yet appear to be differences in DBWR service rate
>>>>>>>>>> is beyond the scope of this note.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> mwf
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> *From:* oracle-l-bounce_at_freelists.org [mailto:
>>>>>>>>>> oracle-l-bounce_at_freelists.org] *On Behalf Of *Pap
>>>>>>>>>> *Sent:* Friday, April 09, 2021 8:29 AM
>>>>>>>>>> *To:* Tanel Poder
>>>>>>>>>> *Cc:* Oracle L
>>>>>>>>>> *Subject:* Re: Difference in writer response time
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Thank You Tanel.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Not having much understanding with respect to the infra side. But
>>>>>>>>>> , actually the first thing we want to understand is why there is a
>>>>>>>>>> difference in performance during that peak period between the two sites. As
>>>>>>>>>> those are primary and DR so ideally everything should be the same
>>>>>>>>>> including SAN configurations. We have verified all the DB parameters are
>>>>>>>>>> mostly the same and are not suspected such as to impact DBWR throughput one
>>>>>>>>>> vs other.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> But as you said there may be some issue wrt kernel or IO stack,
>>>>>>>>>> so wanted to understand what exact information we can ask from our
>>>>>>>>>> storage/SAN team by which we can conclude is there is really a difference
>>>>>>>>>> in performance of storage IO between both sides?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> And should we ask the IO information only during the issue period
>>>>>>>>>> to understand the difference as because in normal time both the databases
>>>>>>>>>> perform similar just that during those peak activity the DR is behaving
>>>>>>>>>> badly as compared to primary?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Regards
>>>>>>>>>>
>>>>>>>>>> Pap
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Fri, Apr 9, 2021 at 5:32 AM Tanel Poder <tanel_at_tanelpoder.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> Ok, sorry I didn't realize this was on HP-UX (I guess I'm too
>>>>>>>>>> spoiled nowadays with "servers" like this one
>>>>>>>>>> <https://tanelpoder.com/posts/11m-iops-with-10-ssds-on-amd-threadripper-pro-workstation/>
>>>>>>>>>> under my desk ;-)
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Some comments:
>>>>>>>>>>
>>>>>>>>>> 1. The *free buffer waits* show that DBWR falls behind with
>>>>>>>>>> syncing dirty buffers to disk (~4000 blocks written per second). So
>>>>>>>>>> it looks like a DBWR throughput problem (caused by some issue in the kernel
>>>>>>>>>> or I/O stack)
>>>>>>>>>> 2. Other than storage hardware latency itself and things like
>>>>>>>>>> block device queue depths, the filesystem itself may be a bottleneck - are
>>>>>>>>>> you using *VxFS *- and if yes, with or without *ODM*? VxFS
>>>>>>>>>> without ODM doesn't use concurrent I/O by default, potentially causing
>>>>>>>>>> write-write contention at file level (POSIX single writer locking)
>>>>>>>>>> 3. Also, the normal recommendation for production systems is
>>>>>>>>>> to use filesystemio_options=*setall*, but I don't remember if
>>>>>>>>>> HP-UX was special in that sense (or you just forced direct I/O with a VxFS
>>>>>>>>>> mount option)
>>>>>>>>>> 4. Run Snapper on DBWRs and see how many of them you have in
>>>>>>>>>> each instance and are they 100% busy waiting for one of the *db
>>>>>>>>>> file parallel write*, *db file asynch I/O submit* waits (or
>>>>>>>>>> show all CPU usage - no wait events whatsoever, can be due to an
>>>>>>>>>> instrumentation bug)
>>>>>>>>>>
>>>>>>>>>> The various "higher level contention" wait events, *buffer busy
>>>>>>>>>> waits*, *enq: TX contention *and in rare cases even some latch
>>>>>>>>>> contention may be just a symptom of the underlying I/O problems. You need
>>>>>>>>>> to do wait chain analysis to see beyond the "top level symptoms".
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> The ash_wait_chains.sql output is pretty wide, Oracle-L doesn't
>>>>>>>>>> allow me to send screenshots as images, so here are a couple of tweets with
>>>>>>>>>> images of such problems (where the high level symptoms show different waits
>>>>>>>>>> than the root cause):
>>>>>>>>>>
>>>>>>>>>> - https://twitter.com/TanelPoder/status/1318023393304825856
>>>>>>>>>> - https://twitter.com/TanelPoder/status/1380304923678498820
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Back when I used to deal with HP-UX/Solaris machines with VxFS
>>>>>>>>>> more, a really common bottleneck in high-performance databases (of that
>>>>>>>>>> era) was people not using concurrent I/O on that filesystem.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>>
>>>>>>>>>> Tanel Poder
>>>>>>>>>>
>>>>>>>>>> #vConf2021: Troubleshooting Very Complex Oracle Performance
>>>>>>>>>> Problems
>>>>>>>>>>
>>>>>>>>>> https://tanelpoder.com/conference/
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Thu, Apr 8, 2021 at 2:05 PM Pap <oracle.developer35_at_gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> Thank you Tanel.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Its HP machine and its on file system. No ASM. The
>>>>>>>>>> filesystemsio_options in gv$paramer showing as "asynch" in both databases.
>>>>>>>>>> and also disk_asynch_io is also set as TRUE. As I see from AWR, Its HP-UX
>>>>>>>>>> IA(64 bit), 64 core, 8- socket machine in both side.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> The SAN team confirms not much visible difference between the
>>>>>>>>>> underlying storage performance between those two sites. So wondering as we
>>>>>>>>>> don't use data guards could this difference be just because of just the
>>>>>>>>>> structural difference in table/index structure? But we have all the same
>>>>>>>>>> volume of data on both sides so not sure how that can play such a big
>>>>>>>>>> difference. And contrary to that, I see we are doing more work on the
>>>>>>>>>> Primary(faster) side during that time period as compared to DR(slow side),
>>>>>>>>>> so how is that possible? I am seeing the direct write is higher on DR i.e.
>>>>>>>>>> slow side as compared to primary, not sure if that is pointing to something
>>>>>>>>>> suspicious.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I have fetched the wait event section of the AWR from both
>>>>>>>>>> databases and attaching here(in two tabs as primary and DR wait profile).
>>>>>>>>>> Although I do see "free buffer waits" in one of the top lists but the major
>>>>>>>>>> one is 'index contention'(top one is a composite PK index but local index,
>>>>>>>>>> so less chances of structural difference causing such issues). But again it
>>>>>>>>>> might be that the DBWR response trend is just a symptom of something else
>>>>>>>>>> but not the cause as the SAN team is saying no difference there.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Regards
>>>>>>>>>>
>>>>>>>>>> Pap
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Thu, Apr 8, 2021 at 10:34 PM Tanel Poder <tanel_at_tanelpoder.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> There's also one usual suspect, I don't see
>>>>>>>>>> *filesystemio_options* parameter set, are you running on ASM,
>>>>>>>>>> not a filesystem, right?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>>
>>>>>>>>>> Tanel.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Thu, Apr 8, 2021 at 4:06 AM Pap <oracle.developer35_at_gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> Hello Lister, We have two databases which are the same in all the
>>>>>>>>>> hardware configurations along with DB parameters and are in sync/replicated
>>>>>>>>>> using golden gate. And those are treated as primary and DR for us and both
>>>>>>>>>> are active. But we are seeing differences in behaviour when our application
>>>>>>>>>> points to primary VS when it points to DR. We are seeing during specific
>>>>>>>>>> times of the day when our system is at its peak, in one of the database i.e
>>>>>>>>>> DR the DBWR response times spikes till ~200+ms slowing down the write
>>>>>>>>>> queries while in another database the dbwr response time stays ~65ms for a
>>>>>>>>>> similar amount of data and transaction volume. So wanted to understand what
>>>>>>>>>> can cause such drastic differences? Note- Its version 11.2.0.4 of Oracle.
>>>>>>>>>>
>>>>>>>>>>

--
http://www.freelists.org/webpage/oracle-l
Received on Fri Apr 16 2021 - 06:23:06 CEST

Original text of this message