Re: Odd wait event freezing database

From: Lok P <loknath.73_at_gmail.com>
Date: Sat, 17 Apr 2021 01:06:33 +0530
Message-ID: <CAKna9Va4WDt+jTnpsqhfDsEWJhFqXK6xq=iZS=kBSqVxzFEwWw_at_mail.gmail.com>



We have re-initiated the BCT by disabling and then enabling it again and then we saw the BCT file size decreased from 22GB to ~9GB and the wait events "bct buffer space" and "reliable message" went away, everyone was happy. But again after 2 days we again started seeing the "reliable message" wait event popping up along with a few "bct buffer space" waits. We are thinking of reaching out to Oracle through SR to see what is causing this issue. If it's anything related to our current ZDLRA backup strategy which has only one L0 and all L1's or something else is causing this issue? Or we really have to increase the BCT buffer( _bct_public_dba_buffer_size ) till ~1GB(which is the max size for it).

On Sun, Apr 11, 2021 at 12:53 AM Lok P <loknath.73_at_gmail.com> wrote:

> As I read it , when a checkpoint occurs the change tracking writer i.e.
> CTWR will write the data from the CTWR DBA buffer to the BCT file, so we
> are seeing this BCT Buffer space wait + reliable message waits means the
> check point is not finishing up and getting stuck. So does it mean that
> when we issue COMMIT during that point itself the changed data
> is struggling to get written to CTWR buffer because of its so small space?
> But again if it's because of the small size of CTWR buffer being small then
> how come initiating BCT file fixing such an issue?
>
>
>
> On Sun, Apr 11, 2021 at 12:25 AM Lok P <loknath.73_at_gmail.com> wrote:
>
>> Below is the bug that seems to be exactly what we are suffering from as
>> it shows how "check point not completing" and "reliable message" are
>> associated with this BCT issue. In past we had suffered mainly from lot of
>> "BCT Buffer space waits" and then we found BCT file size was 22GB and as
>> suggested by Oracle, we endup increasing the size of
>> _bct_public_dba_buffer_size from ~128MB to ~256MB and re-initialized the
>> BCT i.e. disbaled+enabled and that fixed the issue then. The size of the
>> BCT file after reinitialization was down to ~9 GB.
>>
>> But it seems now within ~2months the BCT file has again grown back to
>> same ~22GB and this time we are seeing combination of waits like "reliable
>> message" +"BCT buffer space waits", even with that
>> _bct_public_dba_buffer_size being set as ~256MB, so it seems this doubling
>> of buffer size has not helped then rather re-initialization of BCT helped.
>> So this time we are going to do the same reinitializing of BCT to have a
>> temporary fix.But now wondering what will be a permanent fix for this issue?
>>
>> High Waits On 'block change tracking buffer space' - Checkpoint
>> Contention With BLOCK CHANGE TRACKING or RMAN Incremental Backup (Doc ID
>> 2094946.1)
>>
>> On Sat, Apr 10, 2021 at 12:53 AM Lok P <loknath.73_at_gmail.com> wrote:
>>
>>> We have verified many occurrences and it looks like each time the
>>> "reliable message" is followed by many "block change tracking buffer
>>> space" wait events. We are thinking of initiating BCT by disabling +
>>> enabling the BCT. but still unable to relate how come the BCT Buffer space
>>> can cause "reliable message" wait events making the database freeze?
>>>
>>> On Fri, Apr 9, 2021 at 5:27 PM Lok P <loknath.73_at_gmail.com> wrote:
>>>
>>>> Resending, as it seems bounced back from the list server.
>>>>
>>>> ---------- Forwarded message ---------
>>>> From: Lok P <loknath.73_at_gmail.com>
>>>> Date: Fri, Apr 9, 2021 at 5:08 PM
>>>> Subject: Re: Odd wait event freezing database
>>>> To: Jonathan Lewis <jlewisoracle_at_gmail.com>
>>>> Cc: Oracle L <oracle-l_at_freelists.org>
>>>>
>>>>
>>>> Thank you Sayan, Jonathan.
>>>>
>>>> I checked one occurrence of the issue from
>>>> dba_hist_active_sess_history. The first occurrence of "reliable message" is
>>>> logged at 4/8/2021 9:50:39.957 AM on NODE-4 against that sql along with a
>>>> few other SELECT queries. And then I fetch all the ASH data starting 9:49AM
>>>> and attached(dash_dump.xlsx) is the results from
>>>> dba_hist_active_sess_history starting 9:49 on 8th April for all the four
>>>> nodes. What I see is mainly a BCT related wait i.e. "block change tracking
>>>> buffer space" event and I am not sure how this can be related to the actual
>>>> issue. I am not seeing any such event related to DBWR or CKPT though just
>>>> before 9:50.
>>>>
>>>> I have attached the sql plan(sql_queryplan.txt) with the predicate
>>>> section for the one which i mentioned in my initial post and it seems that
>>>> is the one suffering heavily. And I am not sure if it's the cause or the
>>>> victim. I do see that BCT waits were logged against some key inserts to
>>>> transaction tables.
>>>>
>>>> Regards
>>>> Lok
>>>>
>>>> On Fri, Apr 9, 2021 at 1:01 AM Jonathan Lewis <jlewisoracle_at_gmail.com>
>>>> wrote:
>>>>
>>>>> I nearly agree with Sayan about the "long checkpoint".but I think it
>>>>> might be indirect.
>>>>> You're on 11g, so I THINK the sequence of events is something like the
>>>>> following (but Sayan may well correct me):
>>>>>
>>>>> Session wants to do direct path read (cell smart scan for Exadata).
>>>>> Session sends message to CKPT to do object checkpoint and waits for
>>>>> CKPT to return a reliable message
>>>>> CKPT sends DBWR an object checkpoint request and waits for DBWR to
>>>>> acknowledge
>>>>> Once DBWR has acknowledged, CKPT takes a KO enqueue in exclusive mode
>>>>> and sends a message to the session
>>>>> The session tries to acquire a KO enqueue in share mode (mode 4) and
>>>>> ends up waiting on CKPT.
>>>>> When CKPT finds the checkpoint it complete it releases its exclusive
>>>>> lock - which allows the session to get its lock, release it, and continue.
>>>>>
>>>>> If that's correct then it looks like CKPT has a problem getting a
>>>>> message to DBWR, or DBWR had a problem acknowledging CKPT.
>>>>> If might be revealing to check v$active_session_history for that 5
>>>>> minutes (or dba_hist_active_sess_history) to see what waits appear for CKPT
>>>>> and/or DBW% as that may give some clues. (Of course with RAC the object
>>>>> checkpoint has to propagate across all instances, so that may complicate
>>>>> the contents of ASH)
>>>>>
>>>>> Regards
>>>>> Jonathan Lewis
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Thu, 8 Apr 2021 at 19:30, Lok P <loknath.73_at_gmail.com> wrote:
>>>>>
>>>>>> Hi All, Its version 11.2.0.4 of Oracle exadata is a 4 node RAC
>>>>>> database. We are seeing one of the query runs normally finish in a few
>>>>>> seconds but sometimes it runs for 3-4 minutes with the wait event being
>>>>>> noted as "reliable message" and during that time period things seem to
>>>>>> freeze in the database almost all the nodes getting stuck. So I am not sure
>>>>>> if this query is the cause of the slowness or the victim, but it seems
>>>>>> whenever such an issue occurred this query was getting executed from
>>>>>> multiple sessions and was running longer than expected time. No change in
>>>>>> plan happened for this query and with the same plan it used to finish in
>>>>>> seconds during other times. So wanted to understand if we are hitting any
>>>>>> bug around this wait event as this looks a bit unusual? It seems happening
>>>>>> while scanning mostly table TSFS in FULL , want to understand what's wrong
>>>>>> with scanning table TSFS?
>>>>>> Below attached is the sql monitor for the same query which is showing
>>>>>> all time(~200+ seconds) being spent on event "reliable message" only.
>>>>>>
>>>>>>
>>>>>>

--
http://www.freelists.org/webpage/oracle-l
Received on Fri Apr 16 2021 - 21:36:33 CEST

Original text of this message