Re: Performance issue - high Free buff wait

From: Pap <oracle.developer35_at_gmail.com>
Date: Thu, 14 Jan 2021 22:47:43 +0530
Message-ID: <CAEjw_fj_GkRsetQevfbhcGdTfHH2kwQsFrE4fzzmCUD54W_3hw_at_mail.gmail.com>





Thank You very much.

I sent a response but that bounced back. Sending it with attachment.

The host CPU section for the issue period is as I have attached in the sheet . I am not seeing though if it's 100% utilized during those ~15minutes.

From the application team we got to know we have ~4 app servers and each having a ~40 max session limit which means they can submit a total Max ~160 session at any point in time. The number of sessions in the AWR showing ~1000 during the issue period vs ~950 during normal period. But as below, the sum of Avg active sessions is showing ~60 during issue period vs ~15 during normal period. Does that mean we are saturating our ~64 cores. And thus we should rather ask the dev team to reduce the max session ,limit of the app server to 64 cores/4= ~16 Per app server. Is this the correct analogy? Note- Avg active session for both the period also i have mentioned in the attached sheet.

Regarding ROLLBACKS, i see during the issue period its 16k vs the normal period its 15k but the transaction rollback is 7K vs 3k. And "rollback changes undo record applied" is ~21k vs ~6k. As i had doubt on the keep alive type rollback may be impacting , so i see "select 1 from dual" query in the AWR report for that duration and its showing ~845k execution in issue period vs ~736k in normal period. So from these it appears to be a significant amount was because of the keep alive type query rollback.

"user commits" is ~329k during the issue period vs ~217K during normal period. But I do see few of the single row insert/updates happening ~1.2million times during issue period vs ~67K times during normal period. And from these figures it seems commits may not be happening per each row , but definitely the row by row insert/update should be made batch rather row by row. But how will this batching logically help to make DB writer/log writer response faster and improve the current situation?

As i mentioned earlier, we have db writer response time logged in awr as ~100ms+ during isseu period vs ~35ms during issue period. So is it an acceptable range? And whether making the "ASYNCH_IO" OFF for data file and temp file is a recommended setting and it will help us getting the db writer response faster? or is there any downside to this?

Regarding increasing the buffer cache, I see the db_cache_size is set as 6GB in v$parameter but I see in AWR it's showing ~14GB, so not sure from where it's fetching that?

Regards

Pap

On Thu, Jan 14, 2021 at 2:08 AM Laurentiu Oprea <laurentiu.oprea06_at_gmail.com> wrote:

> From AWR and ash wait chains I might assume that your system is I/O bound.
> You can simply divide DB_time/elapsed = 58. You have 64 cores this means
> you are getting close from CPU perspective. you didn't include the host cpu
> numbers for the issue period. If your CPU is saturated this might be a
> cause as well.
>
> Looks like DB writer cannot write fast enough to clear the buffer cache
> and LGWR cannot keep up with the commit rate.
> I`ll ask dev to check the commit strategy, committing less frequently
> should improve your situation.
>
> Another place where you can check is buffer_cache size. If you have
> available memory you can try to increase it in small increments and try to
> measure the success rate (either wait events or any metrics app guys have).
> I`ll personally not change number writer processes.
>
> Last, but the most important thing, try to understand from a business
> perspective what is happening in that interval, why you have this spike in
> volumes of data. Understanding this will actually help you make the best
> decisions.
>
> For technical details I might be wrong as I kind of lost experience in
> situations like this, the systems I administer are very powerful exadata
> machines and can be resource bound only if dev is doing something very
> wrong.
>
>
> În mie., 13 ian. 2021 la 21:53, Pap <oracle.developer35_at_gmail.com> a
> scris:
>
>> Thank You so much.
>>
>> Attached is the DASH results. I am not sure how I should interpret the
>> WAIT_CHAIN column. But it shows Top ~338 sessions showing to be on ON CPU
>> only and contributing to ~29% activity followed by free buffer waits which
>> is ~17% of the overall activity.
>>
>> Want to understand, from what figure you came to know this one *"you
>> have 64 cores most probably you reached almost peak CPU usage during that
>> period "?*
>>
>> Does it mean that we are actually overloading the system during those
>> ~15minutes duration and thus the contention and slowness is observed
>> considering current ~64 cpu capacity and thus only option here should be to
>> ask the application team to reduce the max connection at any point in time
>> to the database or say reducing the max session limit at DB side?
>>
>> Regarding the rollbacks , i need to check , if those are real "rollbacks"
>> or some keep alive query from application is logging those.
>>
>> Thanks And Regards
>> Pap
>>
>>
>> On Wed, Jan 13, 2021 at 11:53 PM Laurentiu Oprea <
>> laurentiu.oprea06_at_gmail.com> wrote:
>>
>>> Hello Pap,
>>>
>>> Is not just those wait events, there is clearly a significant
>>> difference in load between the 2 periods you mentioned.
>>>
>>> -> the moment you open the excel you can see DB time 237.29 (mins)
>>> vs 874.29 (mins)
>>> Based on the fact that you have 64 cores most probably you reached
>>> almost peak CPU usage during that period maybe you was in the position
>>> where processes competed to get time on cpu.
>>>
>>> going down
>>> -> db block changes 9720600 vs 53870404 looks like transactions do more
>>> work ( more than 5 times more)
>>> -> redo size 2448278000 vs 11624470400 correlated with above metric you
>>> generated 5 times more redo info
>>> -> user rollbacks / user commits -> both look high , you can check with
>>> dev guys why you commit so often and why you have the rollbacks (do you
>>> have any errors in alertlog, to they have any timeout at app level)
>>>
>>> As you might be cpu bound during this peak period you might be storage
>>> bound as well. It is possible for the whole system to be slowed down by
>>> LGWR.
>>> You can validate this using Tanel`s chain waits:
>>>
>>> https://github.com/tanelpoder/tpt-oracle/blob/master/ash/dash_wait_chains.sql
>>>
>>> you can use it like this:
>>> _at_dash_wait_chains event2 1=1 "TIMESTAMP'2021-01-12 20:00:00'"
>>> "TIMESTAMP'2021-01-12 20:15:00'"
>>> and post output
>>>
>>> Good luck.
>>>
>>>
>>> În mie., 13 ian. 2021 la 16:05, Pap <oracle.developer35_at_gmail.com> a
>>> scris:
>>>
>>>> Not sure why , but the first email I dropped on this question was
>>>> bounced back from oracle-l, so trying again.
>>>>
>>>> Hello experts,
>>>>
>>>> We have a database with Oracle version 11.2.0.4. We are getting
>>>> complaints of slowness in one of our jobs mainly for ~15minutes duration on
>>>> a daily basis. And there is no plan change or any Undo reads for the
>>>> underlying sqls and also there is not one sql which we can singled out for
>>>> that job duration and that job consists of many small queries(selects,
>>>> inserts running many thousand times in that duration). But while
>>>> seeing/comparing the AWR for that ~15minutes period VS another ~15minutes
>>>> just before that, we are seeing a few odd waits , they are a combination of
>>>> Configuration(free buffer waits) followed by concurrency(buffer busy, index
>>>> contention etc) foreground waits. We do see comparatively higher sessions
>>>> during this ~15minutes window as compared to normal time.
>>>>
>>>> In the AWR under section "IOStat by Function summary" the DB writer
>>>> avg response time is logged as ~103 ms vs in normal period it stays ~35ms.
>>>> And then checking v$iostat_file , we see it has ASYNCH_IO set as OFF for
>>>> both "data file" and "temp file" which I have seen in many databases
>>>> normally set to ON. Also in dba_hist_active_sess_history the top waits
>>>> events are showing as below. I have attached excel with specific sections
>>>> of AWR during the issue period and normal period in two different tabs.
>>>>
>>>> 1)So , is it correct that by turning this ASYNCH_IO "ON" for data/temp
>>>> file , will cater all these issues because these all seem to be triggering
>>>> from the slow DB writer performance?
>>>>
>>>> Or
>>>>
>>>> 2)Should we ask the application team to reduce the total number of
>>>> sessions(maybe by decreasing the max connection limit) which are submitting
>>>> to the Database at that point to reduce contention?
>>>>
>>>> *Below is count of waits from Dba_hist_active_sess_history for that
>>>> ~15minutes window:-*
>>>>
>>>> Top two(log file sync and db file async I/O submit) are not associated
>>>> with application user/session.
>>>>
>>>> EVENT COUNT(*)
>>>> log file sync 260
>>>> db file async I/O submit 188
>>>> free buffer waits 72
>>>> 66
>>>> 65
>>>> 63
>>>> 62
>>>> db file sequential read 56
>>>> write complete waits 54
>>>> 54
>>>> db file sequential read 51
>>>>
>>>> Regards
>>>>
>>>> Pap
>>>>
>>>>
>>>>
>>>>



--
http://www.freelists.org/webpage/oracle-l


Received on Thu Jan 14 2021 - 18:17:43 CET

Original text of this message