Re: Odd wait event freezing database

From: Jonathan Lewis <jlewisoracle_at_gmail.com>
Date: Thu, 8 Apr 2021 20:30:37 +0100
Message-ID: <CAGtsp8n_PWgLqx3KyC0XWMvT9ycBPaOJkC7+dR0iqtOXnmgoog_at_mail.gmail.com>



I nearly agree with Sayan about the "long checkpoint", but I think it might be indirect.
You're on 11g, so I THINK the sequence of events is something like the following (but Sayan may well correct me):

Session wants to do direct path read (cell smart scan for Exadata). Session sends message to CKPT to do object checkpoint and waits for CKPT to return a reliable message
CKPT sends DBWR an object checkpoint request and waits for DBWR to acknowledge
Once DBWR has acknowledged, CKPT takes a KO enqueue in exclusive mode and sends a message to the session
The session tries to acquire a KO enqueue in share mode (mode 4) and ends up waiting on CKPT.
When CKPT finds the checkpoint is complete it releases its exclusive lock - which allows the session to get its lock, release it, and continue.
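
If the sequence above is right, the KO enqueue traffic should be visible while the freeze is in progress. The query below is only a minimal sketch of how one might look for it: it assumes the KO locks are exposed through gv$lock on this version, and the choice of columns and ordering is illustrative rather than anything taken from the original thread.

```sql
-- Sketch: list sessions holding or requesting a KO (object checkpoint)
-- enqueue across all RAC instances.
-- lmode = 6 is exclusive, lmode = 4 is share; request > 0 marks a waiter.
-- Assumption: the KO enqueue shows up in gv$lock while the problem is live.
select l.inst_id,
       l.sid,
       s.program,
       l.id1,
       l.id2,
       l.lmode,
       l.request,
       l.ctime,      -- seconds in the current mode
       l.block
from   gv$lock    l
join   gv$session s
       on  s.inst_id = l.inst_id
       and s.sid     = l.sid
where  l.type = 'KO'
order by l.id1, l.id2, l.request, l.inst_id;
```

A row for CKPT with lmode = 6 alongside foreground sessions showing request = 4 on the same id1/id2 would be consistent with the sequence described; anything else would suggest the description needs adjusting.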

If that's correct then it looks like CKPT has a problem getting a message to DBWR, or DBWR had a problem acknowledging CKPT. It might be revealing to check v$active_session_history for that 5 minutes (or dba_hist_active_sess_history) to see what waits appear for CKPT and/or DBW%, as that may give some clues. (Of course with RAC the object checkpoint has to propagate across all instances, so that may complicate the contents of ASH.)
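
A minimal sketch of the ASH check suggested above. The bind variables :start_time and :end_time are placeholders for the problem window (they are not values from the thread), and the filter on the program column assumes background processes appear there with names like "(CKPT)" and "(DBW0)".

```sql
-- Sketch: what were CKPT and the DBWn processes doing, on every instance,
-- during the problem window?
-- Assumption: :start_time / :end_time are supplied by the reader.
select sample_time,
       inst_id,
       session_id,
       program,
       session_state,
       event,
       p1,
       p2,
       p3,
       blocking_session
from   gv$active_session_history
where  sample_time between :start_time and :end_time
and    (   program like '%(CKPT)%'
        or program like '%(DBW%')
order by sample_time, inst_id, program;
```

If the window has already aged out of memory, the same shape of query can be run against dba_hist_active_sess_history, which identifies the instance with instance_number rather than inst_id and keeps only one sample in ten.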

Regards
Jonathan Lewis

On Thu, 8 Apr 2021 at 19:30, Lok P <loknath.73_at_gmail.com> wrote:

> Hi All, Its version 11.2.0.4 of Oracle exadata is a 4 node RAC database.
> We are seeing one of the query runs normally finish in a few seconds but
> sometimes it runs for 3-4 minutes with the wait event being noted as
> "reliable message" and during that time period things seem to freeze in the
> database almost all the nodes getting stuck. So I am not sure if this query
> is the cause of the slowness or the victim, but it seems whenever such an
> issue occurred this query was getting executed from multiple sessions and
> was running longer than expected time. No change in plan happened for this
> query and with the same plan it used to finish in seconds during other
> times. So wanted to understand if we are hitting any bug around this wait
> event as this looks a bit unusual? It seems happening while scanning mostly
> table TSFS in FULL , want to understand what's wrong with scanning table
> TSFS?
> Below attached is the sql monitor for the same query which is showing all
> time(~200+ seconds) being spent on event "reliable message" only.
>
>
>

--
http://www.freelists.org/webpage/oracle-l
Received on Thu Apr 08 2021 - 21:30:37 CEST
