Re: Odd wait event freezing database

From: Lok P <loknath.73_at_gmail.com>
Date: Sat, 24 Apr 2021 18:05:26 +0530
Message-ID: <CAKna9VbKg1eEPEGM3HzpreUVWpi5JAnKCLSQosA=VEb3fdgdJg_at_mail.gmail.com>



One of the latest suggestions we have received is to set the additional parameters below, along with _bct_buffer_allocation_max (which we have already increased to ~1GB). What we are wondering is: in the commands below, _bct_public_dba_buffer_size is set to ~190MB, which is smaller than the current value (~256MB). Will that really help, or will it make the situation worse?

We are thinking of setting at least the first two parameters below, as those are online changes. The last one (_high_priority_processes) will need a database reboot, so we plan to apply it later, when we can afford the downtime. But we are still checking whether _bct_public_dba_buffer_size should be increased above the current ~256MB rather than decreased.

alter system set "_bct_public_dba_buffer_size"=200000000 scope=both sid='*';

alter system set "_bct_health_check_interval"=0 scope=both sid='*';

alter system set "_high_priority_processes"=CTWR scope=both sid='*';

On Sat, Apr 24, 2021 at 5:33 PM Lok P <loknath.73_at_gmail.com> wrote:

> Thank you, Jonathan. Yes, the BCT file size came down from 22GB to 9GB after
> we re-initialized it.
>
> Total size of dba_data_files = 148TB,
> Count of data files = 3834
> We have ZDLRA RMAN backups in place, which means we have only one L0 (full)
> backup, i.e. the very first one, and all subsequent backups are L1
> (incremental) only.
> It's a 4-node RAC and the database version is 11.2.0.4.
>
> One doubt though: I am not sure if it's related, but it seems the issue
> surfaced after we created a new bigfile tablespace. We previously had only
> normal tablespaces, but for one of them we reached the limit on the number
> of data files per tablespace, so we created a bigfile tablespace. I'm not
> sure if that has any role to play here.
>
> Another observation: last time, when we encountered the issue in February,
> we were seeing a lot of "bct buffer space" waits but very few "reliable
> message" waits, and DML sessions were experiencing those "bct buffer space"
> waits continuously, so we were stuck. This time we are seeing a lot of
> "reliable message" waits but minimal "bct buffer space" waits, and it is
> intermittent. When the "reliable message" waits appear, things just hang,
> and then after 30-45 minutes the database recovers automatically.
>
> Last time (in February), as suggested by Oracle, we increased
> _bct_buffer_allocation_max to 640MB and the large_pool size to 2GB, but
> that didn't help. We then went back to Oracle and they suggested increasing
> _bct_public_dba_buffer_size from ~128MB to ~256MB and re-initializing the
> BCT, and that mitigated the issue. But it has now resurfaced again after
> two months (i.e. in April this time).
>
> This time we have tried simply re-initializing the BCT multiple times as a
> workaround; the issue clears immediately but comes back about two days
> after re-initializing, even though the BCT file size stays below 10GB. So
> it's definitely not just the BCT file size causing this, but something
> else. Oracle insisted on bumping _bct_buffer_allocation_max to ~1GB again,
> but we had already done that and it hasn't fixed the issue so far. So we
> were wondering whether, just as last time, increasing
> _bct_public_dba_buffer_size further (maybe to ~1GB) would fix this, but we
> have not got the nod from Oracle on this yet. Do you think we should
> consider increasing _bct_public_dba_buffer_size ?
>
> Regards
> Lok
>
>
>
> On Sat, Apr 24, 2021 at 3:20 PM Jonathan Lewis <jlewisoracle_at_gmail.com>
> wrote:
>
>>
>> Just finding time to look at this - it's something I'd need to research
>> since I don't have anything but the barest of old notes in my library.
>> I know I sent you an indication of how large the BCT file probably ought
>> to be for your size database, but I can't find that message at present and
>> I may have a correction to it that explains the 22GB / 9GB.
>>
>> How big is the database - sum(bytes)/(2^9), count(*) from dba_data_files
>> How many nodes (4 ?)
>> How many backups are you keeping before obsoleting them.
>> What was the version of Oracle (I keep forgetting)
>>
>> I think I said that you need to estimate 1 bit per 32KB of data file,
>> multiplied by 8 as the total number of bitmaps - then allow rounding errors.
>> I've just found a note on MOS which says you also need to multiply by the
>> number of threads - and I don't think I knew that (though it makes sense
>> because you don't want the instances to have to tell each other about
>> changes they've made and you don't want CTWRs to collide while trying to
>> write the same bits of bitmap).
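>>
>> As a rough sanity check against the figures given above (148TB of data
>> files, 4 threads), and assuming that formula is right, something like this
>> gives the estimate:
>>
>> select round(sum(bytes)/32768/8/1048576)                one_bitmap_mb,
>>        round(sum(bytes)/32768/8/1048576 * 8 * 4/1024,1) est_bct_gb
>> from   dba_data_files;
>>
>> which works out to roughly 550-590MB per bitmap and somewhere around
>> 17-19GB in total, which is in the right ballpark for the 22GB file
>> reported.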
>>
>> -----
>>
>> The "checkpoint pattern" for waits for reliable message is one that could
>> be repeated here - process A calls process B to get something done and
>> waits on a reliable message, process B calls process C to perform an
>> action, takes an exclusive lock, then messages process A which tries to
>> take a shared lock; process B waits for process C to do its thing and
>> release its lock when process C completes, which allows process A to get
>> (and release) its lock.
>>
>> What we need to find out is (a) whether this is what's happening, then (b)
>> who processes B and C are, then (c) why it is taking so long for process B
>> to get acknowledgement from process C that it's working on the required
>> task. (If process A is waiting on reliable message, this means process B
>> hasn't heard from process C, or process B is unable to get its exclusive
>> lock.)
>>
>> A first step would be to get the wait chain analysis from
>> gv$active_session_history as soon as possible after one of these long waits
>> has happened.
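>>
>> Something along these lines (just a sketch; you would want to adjust the
>> time window and perhaps follow the blocking_session chain further) is a
>> reasonable starting point:
>>
>> select sample_time, inst_id, session_id, event,
>>        blocking_inst_id, blocking_session, sql_id, program
>> from   gv$active_session_history
>> where  sample_time > sysdate - 1/24
>> and    event in ('reliable message',
>>                  'block change tracking buffer space')
>> order by sample_time, inst_id, session_id;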
>>
>>
>> I'll be reading your previous couple of emails over the weekend and
>> hunting around MOS to see what I can find. I am a little curious that you
>> mention that CTWR writes on a checkpoint - this doesn't seem to be
>> necessary (or a good idea) - but that may be explained in one of the
>> notes/bugs you've found.
>>
>> A couple of speculative thoughts:
>> 1) possibly the problem appears when the BCT file has to be extended by
>> multiple instances at the same time and they have to queue.
>> 2) maybe the buffer space waits follow the reliable message wait because
>> a lot of change has happened in the interval and the bitmap changes have
>> been queued up so long that they keep filling the buffer and waiting for
>> space, so maybe increasing the public buffer size again would reduce those
>> waits (while we try to find out why the reliable message waits are
>> happening).
>>
>> Regards
>> Jonathan Lewis
>>
>>
>> On Fri, 16 Apr 2021 at 20:36, Lok P <loknath.73_at_gmail.com> wrote:
>>
>>> We re-initiated the BCT by disabling and then enabling it again, and the
>>> BCT file size decreased from 22GB to ~9GB; the wait events "bct buffer
>>> space" and "reliable message" went away and everyone was happy. But after
>>> 2 days we again started seeing the "reliable message" wait event popping
>>> up, along with a few "bct buffer space" waits. We are thinking of raising
>>> an SR with Oracle to see what is causing this issue: is it related to our
>>> current ZDLRA backup strategy, which has only one L0 and all L1s, or is
>>> something else causing it? Or do we really have to increase the BCT
>>> buffer (_bct_public_dba_buffer_size) up to ~1GB (which is its maximum
>>> size)?
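>>>
>>> For clarity, the re-initialization mentioned here is just the following
>>> (file name shown as a placeholder):
>>>
>>> alter database disable block change tracking;
>>> alter database enable block change tracking using file '<bct_file_path>' reuse;
>>>
>>> and the resulting file size can be checked from:
>>>
>>> select status, filename, round(bytes/1024/1024/1024, 1) size_gb
>>> from   v$block_change_tracking;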
>>>
>>> On Sun, Apr 11, 2021 at 12:53 AM Lok P <loknath.73_at_gmail.com> wrote:
>>>
>>>> As I read it, when a checkpoint occurs the change tracking writer (CTWR)
>>>> writes the data from the CTWR DBA buffer to the BCT file, so seeing this
>>>> BCT buffer space wait plus reliable message waits means the checkpoint
>>>> is not finishing and is getting stuck. Does that mean that, at the point
>>>> we issue a COMMIT, the changed data is struggling to get written to the
>>>> CTWR buffer because the buffer space is so small? But then, if the
>>>> problem is the small size of the CTWR buffer, how come re-initializing
>>>> the BCT file fixes such an issue?
>>>>
>>>>
>>>>
>>>> On Sun, Apr 11, 2021 at 12:25 AM Lok P <loknath.73_at_gmail.com> wrote:
>>>>
>>>>> Below is the bug that seems to be exactly what we are suffering from,
>>>>> as it shows how "checkpoint not completing" and "reliable message" are
>>>>> associated with this BCT issue. In the past we suffered mainly from a
>>>>> lot of "BCT buffer space" waits; we found the BCT file size was 22GB
>>>>> and, as suggested by Oracle, we ended up increasing
>>>>> _bct_public_dba_buffer_size from ~128MB to ~256MB and re-initialized
>>>>> the BCT (i.e. disabled + enabled it), and that fixed the issue at the
>>>>> time. The size of the BCT file after re-initialization was down to
>>>>> ~9GB.
>>>>>
>>>>> But it seems that within ~2 months the BCT file has grown back to the
>>>>> same ~22GB, and this time we are seeing a combination of waits like
>>>>> "reliable message" + "BCT buffer space", even with
>>>>> _bct_public_dba_buffer_size set to ~256MB. So it seems the doubling of
>>>>> the buffer size did not help; rather, it was the re-initialization of
>>>>> the BCT that did. So this time we are going to do the same BCT
>>>>> re-initialization as a temporary fix. But now we are wondering what a
>>>>> permanent fix for this issue would be.
>>>>>
>>>>> High Waits On 'block change tracking buffer space' - Checkpoint
>>>>> Contention With BLOCK CHANGE TRACKING or RMAN Incremental Backup (Doc ID
>>>>> 2094946.1)
>>>>>
>>>>> On Sat, Apr 10, 2021 at 12:53 AM Lok P <loknath.73_at_gmail.com> wrote:
>>>>>
>>>>>> We have verified many occurrences, and it looks like each time the
>>>>>> "reliable message" wait is followed by many "block change tracking
>>>>>> buffer space" wait events. We are thinking of re-initializing BCT by
>>>>>> disabling + enabling it, but we are still unable to see how the BCT
>>>>>> buffer space can cause "reliable message" wait events that make the
>>>>>> database freeze.
>>>>>>
>>>>>>

--
http://www.freelists.org/webpage/oracle-l
Received on Sat Apr 24 2021 - 14:35:26 CEST
