Re: DFS lock handle / CI enq.

From: Priit Piipuu <priit.piipuu_at_gmail.com>
Date: Fri, 6 Nov 2020 15:07:53 +0100
Message-ID: <CAJYY02g8C3dk_05QqzRrpLuCyrUFQjmekChtf=FaW2H4xvdFpA_at_mail.gmail.com>



Hi,

Did you check wait chain signatures from v$wait_chains or hanganalyze dumps? These include background processes being signalled.
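
For example, something along these lines is where I would start (a
minimal sketch from memory; v$wait_chains is populated cluster-wide by
DIA0 so it already covers all instances, but verify the column names on
your version):

  select chain_id, chain_signature, instance, osid, sid,
         blocker_instance, blocker_osid, blocker_sid,
         wait_event_text, in_wait_secs, num_waiters
  from   v$wait_chains
  order  by chain_id, num_waiters desc;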

On Thu, 5 Nov 2020 at 23:32, Henry Poras <henry.poras_at_gmail.com> wrote:

> I’m getting myself tied in knots trying to debug a sporadic, yet
> bothersome issue. The system slows/hangs with lots of ‘DFS lock handle’
> waits. My understanding (much of which comes from Riyaj’s article
> https://orainternals.wordpress.com/2011/11/08/troubleshooting-dfs-lock-handle-waits/)
> is that this wait event waits on the lock handle of a *global* lock
> request. There are a number of different global resources it might be
> trying to obtain (determined from the p1 parameter). In our case we are
> waiting for a CI enqueue (id1=10, id2=5), trying to obtain it in
> Exclusive mode. A CI enqueue, a Cross Instance call, is “used to invoke
> specific actions in background processes”. So, if our waiting process
> needs a background process (local or remote) to do work for it, it uses
> a CI enqueue: the background process holds the enqueue until the work
> is done, and our process knows the work is done when its ‘DFS lock
> handle’ wait no longer waits on the CI resource.
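>
> For reference, the decode I have been using to get the lock type and
> id1/id2 out of the wait parameters is something like this (the
> standard enqueue-style type/mode decode; treat the bit arithmetic as
> an assumption and cross-check against p1text/p2text/p3text on your
> version):
>
>   select sid, event,
>          chr(bitand(p1, -16777216)/16777215) ||
>          chr(bitand(p1, 16711680)/65535)  as lock_type,
>          bitand(p1, 65535)                as requested_mode,
>          p2                               as id1,
>          p3                               as id2
>   from   gv$session
>   where  event = 'DFS lock handle';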
>
> We can also find out more about the CI resource from id1 and id2. This
> data is fairly limited, and this is where I start to lose the thread.
> In our case id1=10, which the docs show is ‘Create Remote parallel
> query Server’, and id2=5, which is ‘Used to queue up interested
> clients’. The meaning of id2=5 makes it sound like I am queued up,
> waiting to do some real work under another id2 value. (I’ll post other
> id2 values below so you can see what I mean, and a query to watch this
> hopping is sketched after that list.) What is odd is that its behavior
> is very different from other enqueue waits with which I am familiar.
> Typically, three doubly linked lists (the owners, waiters, and
> converters queues) hang off the requested resource. For the CI enqueue
> it looks like we hop from resource to resource, from id2 to id2. So the
> process holding the CI enqueue (id1=10, id2=5) is not the final
> blocker. We might actually want the CI enqueue at id2=1, which is held
> by another process.
>
> [Some values for id2
>
> 1 used to pass in parameters
> 2 used to invoke the function in a background process
> 3 used to indicate the foreground has not returned
> 4 mounted excl, used to allocate mechanism
> 5 used to queue up interested clients]
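>
> As promised above, a sketch of the query I have in mind to watch the
> resource-to-resource hopping: list the CI resources and which node
> masters each one (the bracketed resource-name format is taken from the
> dumps; I have not verified how gv$ges_resource formats it on our
> version):
>
>   select inst_id, resource_name, master_node, on_convert_q, on_grant_q
>   from   gv$ges_resource
>   where  resource_name like '[0xa][0x%],[CI]%';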
>
>
>
> To summarize:
>
> · A local process, waiting on ‘DFS lock handle’, needs a background
> process to do work on local and/or remote nodes (e.g. a Query
> Coordinator needs parallel query slaves)
>
> · The background process is pinged (more on this below), telling
> it to grab the CI enqueue with id1=10 (Create Remote parallel query
> Server), id2=5 (Used to queue up interested clients). This tells a
> specific background process what work needs to be done. Additional
> information (including *lock pointer, resource pointer, and resource
> name*) can also be sent via the message (ping) in the resource’s value
> block.
>
> · While the background process works, it holds the CI enqueue, and
> the requesting (QC) process waits on ‘DFS lock handle’ until the
> background process completes its work and releases the lock. This is
> how our process (the QC in this case) knows the necessary work is
> complete and it can continue.
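>
> To see that holder/waiter relationship directly, gv$ges_blocking_enqueue
> should show just the GES enqueues that are blocked or blocking (a
> sketch; I have not verified the resource-name filter format):
>
>   select inst_id, resource_name1, grant_level, request_level,
>          which_queue,   -- 0 = null, 1 = granted, 2 = convert
>          state, owner_node, pid, blocked, blocker
>   from   gv$ges_blocking_enqueue
>   where  resource_name1 like '%[CI]%'
>   order  by resource_name1, blocker desc;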
>
> My understanding is that if a server process wants a lock on a CI
> resource, it contacts a local GES process (e.g. LMD0), which then
> messages the same process (LMD0) on the resource’s master node. The
> master then contacts the owner and the requester.
>
> Messaging between the GES processes is done using Asynchronous System
> Traps (ASTs), which act like an interrupt to the target process. These
> can be either acquisition ASTs (AAST), delivered when a request is
> granted, or blocking ASTs (BAST), delivered to a holder that is
> blocking another request.
>
> *Some more detail*. Oracle automatically produced some ProcessState
> dumps (I’m not sure what triggered them or which processes were
> dumped). Looking at some of these, we see that the ‘ges process’ is a
> child of the process, not of the session. Also, the ‘ges process’ has
> two ‘ges enqueue’ children. Excerpts from the queues shown for the
> ‘ges process’ and the two ‘ges enqueues’ are as follows:
>
>
>
> ges process:
>
> AST_Q:
>
> PENDING_Q:
>
> lp 0x1644b35440 gl KJUSERNL rl KJUSEREX rp 0x17bb7a82e0
> [0xa][0x1],[CI][ext 0x0,0x0]
>
> *master 3 pid 29614* bast 0 rseq 1 mseq 0
>
> history FREE > REF_RES > LOC_AST > FREE > REF_RES > GR2CVT
>
> convert opt KJUSERNODEADLOCKWAIT KJUSERNODEADLOCKBLOCK
>
> GRANTED_Q:
>
> lp 0x1644b39f08 gl KJUSEREX rp 0x17b7ab6490 [0xa][0x5],[CI][ext
> 0x0,0x0]
>
> *master 1 pid 29614* bast 0 rseq 19 mseq 0
>
> history LOC_AST_FG > CLOSE > FREE > REF_RES > GR2CVT > LOC_AST_FG
>
> open opt KJUSERPROCESS_OWNED
>
>
>
> ges enqueues:
>
> GRANTED_Q :
>
> lp 0x1663ca0418 gl KJUSERPR rp 0x17bb7a82e0 [0xa][0x1],[CI][ext
> 0x0,0x0]
>
> master 3 pid 3889 bast 0 rseq 1 mseq 0
>
> history LOC_AST > LOC_AST > LOC_AST > LOC_AST > LOC_AST > LOC_AST
>
> open opt KJUSERPROCESS_OWNED
>
> CONVERT_Q:
>
> lp 0x1644b35440 gl KJUSERNL rl KJUSEREX rp 0x17bb7a82e0
> [0xa][0x1],[CI][ext 0x0,0x0]
>
> master 3 pid 29614 bast 0 rseq 1 mseq 0
>
> history FREE > REF_RES > LOC_AST > FREE > REF_RES > GR2CVT
>
> convert opt KJUSERNODEADLOCKWAIT KJUSERNODEADLOCKBLOCK
>
> ----------------------------------------
>
>
>
> ges enqueues:
>
> GRANTED_Q :
>
> lp 0x1644b39f08 gl KJUSEREX rp 0x17b7ab6490 [0xa][0x5],[CI][ext
> 0x0,0x0]
>
> master 1 pid 29614 bast 0 rseq 19 mseq 0
>
> history LOC_AST_FG > CLOSE > FREE > REF_RES > GR2CVT > LOC_AST_FG
>
> open opt KJUSERPROCESS_OWNED
>
> CONVERT_Q:
>
> lp 0x1644b3c900 gl KJUSERNL rl KJUSEREX rp 0x17b7ab6490
> [0xa][0x5],[CI][ext 0x0,0x0]
>
> master 1 pid 38233 bast 0 rseq 19 mseq 0
>
> history REF_RES > LOC_AST > CLOSE > FREE > REF_RES > GR2CVT
>
> convert opt KJUSERNODEADLOCKWAIT KJUSERNODEADLOCKBLOCK
>
> lp 0x1644efbb60 gl KJUSERNL rl KJUSEREX rp 0x17b7ab6490
> [0xa][0x5],[CI][ext 0x0,0x0]
>
> master 1 pid 2898 bast 0 rseq 19 mseq 0
>
> history REF_RES > LOC_AST > CLOSE > FREE > REF_RES > GR2CVT
>
> convert opt KJUSERNODEADLOCKWAIT KJUSERNODEADLOCKBLOCK
>
>
>
> So it looks like there is an additional queue (pending_q) and we are
> trying to switch from one resource (id2=5) to another (id2=1). We are also
> going from a pid on node 1 to the same pid number on node 3. But if pid is
> ospid (as it seems to be from the State Dump header), how does that work?
> OS pids are fairly arbitrary. Also, is there any way to get more data about
> the apparent blocker (node 3; pid 3889) without a System State Dump?
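>
> The best I have come up with so far is to map the ospid back to a
> process via gv$process and then take a targeted processstate dump of
> just that process (this assumes the pid in the dump really is an
> ospid, which is exactly what I am unsure of):
>
>   select inst_id, pid, spid, pname, program
>   from   gv$process
>   where  inst_id = 3
>   and    (spid = '3889' or pid = 3889);
>
> and then, connected to node 3 as sysdba:
>
>   oradebug setospid 3889
>   oradebug unlimit
>   oradebug dump processstate 10
>   oradebug tracefile_name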
>
>
>
> One other question. Some of the ges enqueue convert_q entries have a
> slightly different format: instead of ‘pid’ they show ‘owner’, and
> ‘bast’ is 1. This looks like it is BAST (message) related, but I’m not
> sure.
>
> lp 0x12542ff100 gl KJUSERNL rl KJUSEREX rp 0x17b7ab6490
> [0xa][0x5],[CI][ext 0x0,0x0]
>
> master 1 *owner 3 bast 1* rseq 76 mseq 0x30004
>
> history MSGSENT > GR2CVT > REM_AST > REM_BAST > MSGSENT > GR2CVT
>
> convert opt KJUSERGETVALUE
>
>
>
> Finally, the number of ‘DFS lock handle’ waits is growing faster than
> I can explain. We go from tens to hundreds of thousands within an hour.
> No SQL statements execute at anywhere near that rate.
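>
> One cross-check I plan to run is counting ASH samples per sql_id for
> this event (a sketch; it assumes the Diagnostics Pack is licensed and
> that ASH sampling keeps up at these volumes):
>
>   select inst_id, sql_id, count(*) as samples,
>          count(distinct session_id) as sessions
>   from   gv$active_session_history
>   where  event = 'DFS lock handle'
>   and    sample_time > sysdate - 1/24
>   group  by inst_id, sql_id
>   order  by samples desc;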
>
>
>
> So now I have a few problems. It seems the issue most likely lies
> either with the background process at the head of the queues, or with
> the inter-process messaging (AST, BAST).
>
> The major places I am getting stuck are:
>
> 1. What is at the head of the queue? From the ProcessStateDump it
> looks like node 3; pid 3889.
>
> a. Is there a way to confirm this?
>
> b. Since this is an ospid, how can I track it? (Do I need a
> SystemState dump?)
>
> c. I assume it is a background process at the front of the line.
> How can I tell why it is stuck? What is it doing?
>
> 2. Could the issue be messaging (AST, BAST) and not process
> related? How do I check?
>
> 3. How do I get any of this information? (Using a SystemState dump?
> Is there another way? A sketch of the oradebug route follows below.)
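>
> For reference, the oradebug invocations I have in mind (cluster-wide
> hanganalyze first, then a lightweight systemstate; the levels are the
> commonly recommended ones, so treat them as assumptions):
>
>   connect / as sysdba
>   oradebug setmypid
>   oradebug unlimit
>   oradebug -g all hanganalyze 3
>   -- wait a minute or two, then repeat to see whether the chain moves
>   oradebug -g all hanganalyze 3
>   oradebug -g all dump systemstate 258
>   oradebug tracefile_name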
>
> Any suggestions as to what is going on and how best to continue
> investigating this?
>
>
>
> Thanks.
>
> Henry
>

--
http://www.freelists.org/webpage/oracle-l