Re: Performance issue - high Free buff wait

From: Pap <oracle.developer35_at_gmail.com>
Date: Tue, 19 Jan 2021 14:01:55 +0530
Message-ID: <CAEjw_fjcL5XOOsCZpw5bO2wbFxej9aZheqAKM2vig3wtJq8QiQ_at_mail.gmail.com>



This is great information. Thank you very much Jonathan. It really helped me understand more about what is going on here.

As you rightly said, there are multiple things to look into here, starting with the slow write I/O for DBWR, making more memory available to this database, converting the DML to batches to minimize the chatter, reducing the number of concurrent processes doing the same thing, and checking for unnecessary indexes on the DML-intensive tables. I will try to implement these one by one.

Thanks And Regards
Pap

On Mon, Jan 18, 2021 at 8:41 PM Jonathan Lewis <jlewisoracle_at_gmail.com> wrote:

>
> From the latest AWR - SQL ordered by, Segments by, Buffer pool ...
>
> General guideline - you won't beat the LRU algorithms for the buffer
> pool, so assigning a KEEP pool is usually counter-productive. Sometimes
> it's worth creating a relatively small RECYCLE pool to stop some large
> object "polluting" the default buffer pool with blocks that are never
> re-used once read (in your case you don't seem to be using it at all in
> this interval, so if you find no objects assigned to it then you might as
> well get rid of that too and assign the memory to the default).
>
> For a guideline on how much bigger to make the buffer pool, there's the
> buffer cache advisory - check to see what Oracle thinks will make a big
> difference. It's not a perfect guideline, but it might give you some idea
> as to whether (e.g.) doubling it will help or whether you need to go to
> something much larger (like 10 times - and given that you've apparently
> got about 250GB of free memory in the machine, maybe even 10 times would
> be viable). The uncertainty is made greater because you have to consider
> the effect of getting rid of the KEEP and RECYCLE pools. You should be
> able to resize dynamically (within the 24GB total you've got), though, so
> at least you don't have to restart the instance to change the pools.
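>
> As a sketch of how you might read the advisory and then do the dynamic
> resize (this assumes the advisory is enabled and the common 8KB default
> block size; the 12G figure is purely illustrative):
>
> -- Estimated physical reads at each candidate DEFAULT cache size
> SELECT size_for_estimate AS cache_mb,
>        size_factor,
>        estd_physical_read_factor,
>        estd_physical_reads
> FROM   v$db_cache_advice
> WHERE  name = 'DEFAULT'
> AND    block_size = 8192
> ORDER  BY size_for_estimate;
>
> -- Then, for example, grow the default cache without a restart:
> -- ALTER SYSTEM SET db_cache_size = 12G SCOPE = MEMORY;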
>
> You might also look at v$sgastat (there's a section in the report about
> the library cache detail) for memory in the shared pool labelled something
> likeL "KGH: no access" - this is shared pool memory being used to hold data
> blocks - but I don't think you'll see any because I suspect you don't have
> the sga_target set and are depending on fixed memory sizes.
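>
> A minimal query for that check (the exact label can vary slightly between
> versions, so the LIKE pattern is deliberately loose):
>
> SELECT pool, name, ROUND(bytes/1024/1024) AS mb
> FROM   v$sgastat
> WHERE  pool = 'shared pool'
> AND    name LIKE 'KGH%'
> ORDER  BY bytes DESC;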
>
>
> The Executions tell you where the application problem is: it's not just
> that there are very big numbers of single row executions - the 3 "new"
> statements in the busy period point you to part of the application. So you
> can find out what (business) is happening there, and find out if there are
> multiple concurrent copies of the same (business) process running - you may
> find that if you can reduce the number of copies of that process then the
> database will survive the experience rather better. It does look as if
> whatever it is it's somehow connected with the streams processing -- but
> that may simply be "we do more X, and streams is used to copy X to
> somewhere else".
>
>
> You'll notice, of course, that two of the three big new statements are an
> insert and an update - they are presumably the source of all the extra db
> block changes, and the large number of writes (and reads). For a pure "fix
> the SQL not the problem" approach you could check the indexing on the
> target tables to see if there are any redundant indexes adding to the
> workload; similarly for the other two high-usage statements you could check
> that the select statements are using the most efficient path or whether the
> path could be made more efficient.
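>
> As a sketch of the redundant-index check (APP_OWNER and TAB1 are stand-ins
> for the real schema and target table), listing each index with its ordered
> column list makes it easy to spot an index whose columns are just a
> leading prefix of another's:
>
> SELECT index_name,
>        LISTAGG(column_name, ', ')
>          WITHIN GROUP (ORDER BY column_position) AS cols
> FROM   dba_ind_columns
> WHERE  table_owner = 'APP_OWNER'   -- assumption: real schema name
> AND    table_name  = 'TAB1'        -- assumption: real table name
> GROUP  BY index_name
> ORDER  BY index_name;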
>
>
> Bottom line - for this 15 minutes each day (if that's really all it is)
> your machine is simply overloaded by the amount of work it's being told to
> do. You seem to be protected on physical reads by a massive file system
> cache, but the cache doesn't protect you on writes so (even in the "normal"
> period) your writes are slow and when you go under heavy load the write
> response becomes catastrophic resulting in demands for free buffers that
> can't be met.
>
>
> Regards
> Jonathan Lewis
>
>
>
>
> On Mon, 18 Jan 2021 at 11:40, Pap <oracle.developer35_at_gmail.com> wrote:
>
>> Thank You Jonathan.
>>
>> My mistake, I didn't notice the KEEP and other pools allocation of buffer
>> cache were hardcoded.
>>
>> I have attached the AWR sections for buffer pool statistics and streams
>> activity. The buffer pool statistics do show that it's the default buffer
>> cache which is experiencing all the "free buffer waits"; zero "free buffer
>> waits" are logged against the KEEP and RECYCLE pools. Moving down to
>> "Segments by Block Changes", I see two of the table partitions plus an
>> index partition at the top. And "SQL ordered by Executions" does map to
>> the SQLs doing row-by-row insert and update involving those tables.
>>
>> I see no data under the sections "Memory Resize Ops" and "Memory Resize
>> Operations Summary" for both the issue and non-issue periods. So I hope
>> that means no memory movement happens in/out of the buffer cache.
>>
>> CPU and other resources noted under the "Streams" section of the AWR
>> report amount to under ~10 minutes of DB time. So I hope that may not be
>> a contributing factor here.
>>
>> With regard to memory: the host has ~500GB+ in total and 60% of it is
>> free. Currently this database has ~24GB allocated to the SGA and ~12GB to
>> PGA_AGGREGATE_TARGET. So by what amount of additional memory should we
>> bump SGA_MAX_SIZE here?
>>
>> I am now seeing there are ~19 tables assigned to the KEEP pool (which is
>> ~8GB in size), but the sum of the sizes of those ~19 tables comes to
>> around ~13GB. And the odd thing is that one of those tables exceeds ~12GB
>> in size by itself. So I need to check if it's really serving some purpose,
>> or whether we can remove the KEEP pool and merge it into the default
>> buffer cache.
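>>
>> The check I'm planning is something along these lines (dba_segments
>> records the buffer pool each segment is assigned to):
>>
>> SELECT owner, segment_name, segment_type,
>>        ROUND(SUM(bytes)/1024/1024/1024, 1) AS gb
>> FROM   dba_segments
>> WHERE  buffer_pool = 'KEEP'
>> GROUP  BY owner, segment_name, segment_type
>> ORDER  BY gb DESC;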
>>
>> Regards
>>
>> Pap
>>
>> On Mon, Jan 18, 2021 at 2:15 AM Jonathan Lewis <jlewisoracle_at_gmail.com>
>> wrote:
>>
>>> I made two mistakes reading the details about parsing -
>>> a) I forgot the "parse time" in the profile was in hundredths of
>>> seconds
>>> b) I took the library cache mutex X at 2,200 seconds to be reasonably
>>> consistent with the 3600 figure for parse time, but since that was actually
>>> 36 seconds, and since it wasn't CURSOR mutex X I was putting 2 and 2
>>> together to get 5.
>>> So the parsing thing is probably mostly irrelevant.
>>>
>>>
>>> Don't be fooled by the fact I typed MB instead of GB - you have 512GB of
>>> memory and a huge fraction of it is "free". Why isn't more given to the
>>> database?
>>> We're also beginning to see some detail of what you're up to: your
>>> title is High Free Buffer Waits, but it's taken about 10 posts before you
>>> mention that you've got a KEEP cache and a DEFAULT cache, and that there
>>> are several GB of SGA which won't be available to the shared_pool or buffer
>>> cache because they are allocated to the Streams Pool. (Find out what
>>> your application is supposed to be doing in that 15 minutes and ask
>>> yourself whether that might have some extra impact because of Streams;
>>> check the Streams section of the AWR report to see if there is any Streams
>>> activity going on at the time, or at all).
>>>
>>> Since you've got a KEEP cache and a DEFAULT cache, can you identify
>>> which cache is subject to the free buffer waits? (There's a very old
>>> article about the oddities that used to appear with multiple caches here:
>>> https://jonathanlewis.wordpress.com/2006/11/21/free-buffer-waits/ that
>>> might help, but there's a "Buffer Pool Statistics" section of the AWR
>>> report that gives you a simple summary.) Knowing which objects are
>>> associated with the KEEP pool may also be helpful. And the "Segments by
>>> ..." sections of the AWR will give you some information about which objects
>>> are subject to most changes in the period. Combine that with "SQL ordered
>>> by ..." (Executions is the obvious starting point) to get some idea of
>>> what's happening in the interval if you have no other way of finding out
>>> what the application is supposed to be doing. Bear in mind that "%
>>> captured" is reported for several of these reports, so it's important to
>>> check if you're looking at a meaningful percentage or whether some figure
>>> looks harmless because it's only a small percentage of the real total.
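>>>
>>> On the first point, a minimal sketch of the check - v$buffer_pool_statistics
>>> keeps a free_buffer_wait counter per pool:
>>>
>>> SELECT name, free_buffer_wait, write_complete_wait, buffer_busy_wait
>>> FROM   v$buffer_pool_statistics;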
>>>
>>>
>>> Another thing to check (which MAY be related to the library cache mutex
>>> X) is the section on memory resizing - do you have much activity of
>>> granules of memory being moved between the library cache, the streams
>>> pool, and the buffer cache?
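>>>
>>> A minimal sketch of that check (v$sga_resize_ops records recent granule
>>> movements between SGA components):
>>>
>>> SELECT component, oper_type, oper_mode,
>>>        initial_size, final_size, status, start_time
>>> FROM   v$sga_resize_ops
>>> ORDER  BY start_time;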
>>>
>>> Regards
>>> Jonathan Lewis
>>>
>>>
>>>
>>>
>>> On Sun, 17 Jan 2021 at 19:31, Pap <oracle.developer35_at_gmail.com> wrote:
>>>
>>>> Thanks much, Jonathan.
>>>>
>>>> With respect to parsing, I have attached the "Time Model Statistics"
>>>> section (for the issue period and normal period in two tabs), and it
>>>> shows the parse time is <1% of the overall DB time. Even the top SQL by
>>>> parsing - the one shown with ~2.7 million executions - does use bind
>>>> variables; it's something like this: (SELECT MAX (col1) FROM TAB1 WHERE
>>>> ID1 = :B2 AND date1 = :B1;). So then, can we safely say parsing (hard
>>>> parsing) is not one of the causes?
>>>>
>>>> To your point "are you "*wasting" a lot of memory (and CPU) to manage
>>>> a file-system. If not, how hard are the other applications / databases on
>>>> this machine working during this period*."
>>>>
>>>> I have not heard of any issue from the other databases at exactly the
>>>> same point in time, but they have very little activity on them, and I do
>>>> see a small spike in concurrency during that exact time on them too. But
>>>> if I look at the host memory utilization pattern over the last 7 days,
>>>> it always stays below ~40%. And the CPU utilization peaks at ~75%, but
>>>> it shows no spike during that ~15-minute window. I am checking this in
>>>> the OEM host utilization section, so I'm not sure if it's hiding
>>>> anything. And as I mentioned, the spike in AAS for this database is
>>>> clearly visible daily for that ~15-minute window in the top activity
>>>> section of OEM, where it touched ~80 (~64 being the limit).
>>>>
>>>> Also, Jonathan, to your point *"and your SGA size is only 24MB so you
>>>> may be trying to recycle your buffer cache incredibly quickly"*: the
>>>> SGA is actually set to ~24GB; however, some of the major components are
>>>> hard-coded in GBs: 1) KEEP buffer cache ~8GB, 2) default buffer cache
>>>> 6GB, 3) shared pool 4GB, and 4) streams pool 5GB. So does it mean that
>>>> we really need to bump up the size of the SGA so that the default buffer
>>>> cache can get more memory, which may help us reduce some contention
>>>> because the default buffer cache recycling frequency can be minimized?
>>>>
>>>> Regards
>>>>
>>>> Pap
>>>>
>>>> On Sun, Jan 17, 2021 at 9:31 PM Jonathan Lewis <jlewisoracle_at_gmail.com>
>>>> wrote:
>>>>
>>>>> 1) No.
>>>>> 2) No.
>>>>>
>>>>> Going back to first principles - typical Unix systems time-slice CPUs
>>>>> 1/100 of a second at a time, and OLTP systems do tiny amounts of work that
>>>>> are likely to take far less than 1/100 of a second if they're written
>>>>> efficiently. So a well-written OLTP system can probably support a few tens
>>>>> of user sessions per (real) CPU because most of the sessions will spend
>>>>> most of their time not doing anything.
>>>>>
>>>>> If, however, you have 64 CPUs and 64 sessions which are trying to
>>>>> model protein folding images from a chemical formula then you've allocated
>>>>> too many sessions because every single one of them will want 100% of a CPU
>>>>> 100% of the time and you've got no CPU left for anything else. More
>>>>> realistically, if you have a couple of sessions which are trying to do
>>>>> something resource intensive like running a query over the last 2 years of
>>>>> sales data to produce a one page summary of customer trends then they are
>>>>> probably taking out one CPU per session (or N CPUs each if you run the
>>>>> queries parallel N). So rule 1 is "know your application"
>>>>>
>>>>>
>>>>> Now take a look at the two bits of AWR you first posted.
>>>>> 1) The average single block read time in BOTH time periods is around
>>>>> 1ms - which means it's coming from cache or flash; so it's likely to be more
>>>>> CPU than I/O time.
>>>>>
>>>>> 2) The number of db block changes in the heavy period is about 60,000
>>>>> per SECOND, compared to about 11,000. (Halve these figures, since roughly
>>>>> half will be for undo records going to pinned blocks) - so in the busy
>>>>> time period you are potentially making 30,000 blocks dirty per SECOND and
>>>>> will need to write them out, which (at an 8KB block size: 30,000 x 8KB) is
>>>>> about 240MB per second; and your SGA size is only 24MB, so you may be
>>>>> trying to recycle your buffer cache incredibly quickly.
>>>>>
>>>>> 3) As a side effect of the db block changes you are generating 12.7MB
>>>>> of redo per second compared to 2.8MB per second. In a production system not
>>>>> only does that have to be written to at least 2 files, it has to be re-read
>>>>> and copied to an archived log destination.
>>>>>
>>>>> Combining 2 and 3 it's not surprising that the rate of data change
>>>>> means you have a huge amount of conflict in (a) writing data (b) finding
>>>>> free buffers to read more data and (c) waiting for other users to read
>>>>> the data you want to read.
>>>>>
>>>>> It's not easy to determine cause and effect after this - but the rate
>>>>> at which you work in that 15 minutes is simply overloading the system. So
>>>>> the first thing I'd do is try to find out what that "64 CPUs / 64 cores"
>>>>> really means. HP-UX is one of the less well-known operating systems, so
>>>>> any comment you hear about "unix systems" might be very misleading - but to
>>>>> me it looks as if you may not really have 64 CPUs. Then I'd ask about the
>>>>> I/O configuration - I know HP-UX is a bit unusual here, but what are the
>>>>> settings to avoid double caching? You have 512GB of RAM and assign 24GB to
>>>>> this SGA - are you "wasting" a lot of memory (and CPU) managing a
>>>>> file-system? If not, how hard are the other applications / databases on
>>>>> this machine working during this period?
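>>>>>
>>>>> As a first pass on the I/O configuration question, these standard
>>>>> instance parameters are worth checking (the right values depend on the
>>>>> platform and storage, so this is only a starting point):
>>>>>
>>>>> SELECT name, value
>>>>> FROM   v$parameter
>>>>> WHERE  name IN ('filesystemio_options',
>>>>>                 'disk_asynch_io',
>>>>>                 'db_writer_processes');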
>>>>>
>>>>> The other thing that stands out as a separate anomaly is the parsing -
>>>>> which ends up with lots of "library cache: mutex X". The counts are high
>>>>> in both cases and worth looking into, but very expensive in the worst
>>>>> case. How many of your "parse count" turn into "parse count (hard)", and
>>>>> what does the time model show as the distribution of parse time? At a guess
>>>>> I'd say that in this interval you kick off several processes (maybe around
>>>>> 10) that are all trying to optimize the same statements at the same time -
>>>>> which leads to the "mutex X" (the rough idea is that 9 out of 10 would be
>>>>> waiting while the 10th does the optimising) - and maybe there are a lot
>>>>> of statements which need a lot of parsing in an environment that is already
>>>>> generating a lot of CPU conflict. Note that your EXECUTE count jumps from
>>>>> 6,400 per second to 13,200 per second - you mentioned single-row
>>>>> processing; the numbers suggest that that might be the case, but also that
>>>>> a lot of them use literal values, and several sessions may be trying to
>>>>> parse and execute the same statement at the same time.
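>>>>>
>>>>> As a hedged sketch of those two checks: the parse counters live in
>>>>> v$sysstat, and grouping v$sql by force_matching_signature is one way to
>>>>> spot statements that differ only in their literals (the threshold of 10
>>>>> is arbitrary):
>>>>>
>>>>> SELECT name, value
>>>>> FROM   v$sysstat
>>>>> WHERE  name IN ('parse count (total)', 'parse count (hard)');
>>>>>
>>>>> -- Statements that collapse to one text once literals are replaced
>>>>> SELECT force_matching_signature, COUNT(*) AS versions
>>>>> FROM   v$sql
>>>>> WHERE  force_matching_signature <> 0
>>>>> GROUP  BY force_matching_signature
>>>>> HAVING COUNT(*) > 10
>>>>> ORDER  BY versions DESC;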
>>>>>
>>>>>
>>>>> Summary:
>>>>> Check your system and how it's configured.
>>>>> Understand what the application is doing in this 15 minutes
>>>>> Then worry about getting the same job done with less work and
>>>>> (particularly) less conflict.
>>>>>
>>>>>
>>>>> Regards
>>>>> Jonathan Lewis
>>>>>
>>>>>
>>>>>
>>>>> On Sun, 17 Jan 2021 at 11:18, Pap <oracle.developer35_at_gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Basically I have two doubts here,
>>>>>>
>>>>>> As it seems, for a 15-minute period we are pushing the system beyond its
>>>>>> capacity. So, considering we have 64 CPU cores here, should we only allow a
>>>>>> maximum of 64 active sessions at any given point from the application? And
>>>>>> then, considering we have 4 app servers, do we need to set the max session
>>>>>> limit to 16 on each app server?
>>>>>>
>>>>>> Another doubt I had was whether the DB writer response time of 100ms
>>>>>> during the issue period is the cause or the result of the slowness. If it's
>>>>>> the cause, can an async I/O setup for the data files help improve the
>>>>>> situation here? Or will just converting the row-by-row operations to
>>>>>> batches be enough?
>>>>>>
>>>>>> Regards
>>>>>> Pap
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>

--
http://www.freelists.org/webpage/oracle-l
Received on Tue Jan 19 2021 - 09:31:55 CET
