Re: Understanding "cursor: pin S wait on X"

From: Thomas Kellerer <thomas.kellerer_at_mgm-tp.com>
Date: Tue, 09 Apr 2013 13:14:17 +0200
Message-ID: <5163F809.7070103_at_mgm-tp.com>



Tanel,

thanks for the quick reply.

> 1) The cursor got flushed out for some reason (or it's just the first
> time anyone is running it) and the other session is currently hard
> parsing/compiling/loading the contents of this cursor back into
> library cache. The longer the parsing/loading takes, the longer you
> wait. If the parsing session hits some (performance) issues, this pin
> may be held for a long time.

To make sure I understood this correctly: the pin is only held during parsing/loading, *not* while the cursor is executing (and fetching rows), correct?
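
To double-check that, I put together a quick query to see who actually holds the exclusive pin while others are waiting (just a sketch; I believe the blocker's SID is encoded in the high-order half of P2 on a 64-bit platform, and on 11g BLOCKING_SESSION should show it directly anyway):

  -- sessions currently stuck on the wait, plus the session holding
  -- the mutex in exclusive mode
  select sid,
         sql_id,
         p1                     as mutex_hash_value,
         trunc(p2 / 4294967296) as blocking_sid,   -- high 4 bytes of P2 (64-bit)
         blocking_session,
         seconds_in_wait
  from   v$session
  where  event = 'cursor: pin S wait on X';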

But if this is really caused by parsing, shouldn't I also see a high number of hard parses in my AWR report? There were only 865 hard parses and 155,291 total parses during that time.
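
To double-check that with more granularity than the single 1-hour report, I'll also look at the hard-parse counter per AWR snapshot (a rough sketch, assuming instance 1 is the overloaded node and that we are licensed to query the DBA_HIST views):

  -- hard-parse deltas per snapshot; DBA_HIST_SYSSTAT is cumulative,
  -- so subtract the previous snapshot's value
  select s.begin_interval_time,
         st.value - lag(st.value) over (partition by st.instance_number
                                        order by st.snap_id) as hard_parses
  from   dba_hist_sysstat st
         join dba_hist_snapshot s
           on  s.snap_id         = st.snap_id
           and s.dbid            = st.dbid
           and s.instance_number = st.instance_number
  where  st.stat_name       = 'parse count (hard)'
  and    st.instance_number = 1          -- the overloaded node (assumption)
  order  by s.snap_id;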

> 2) The child cursor is already loaded into cache, but someone else is
> currently flushing it out (shared pool shortage, fragmentation &
> loading of a big object, SGA target / ASMM reducing the SGA size).

There are 3 grow operations and one shrink operation recorded in the AWR for the node in question. Overall it grew from 3.2G to 3.4G during that hour (the SGA target is 30G, with AMM enabled, if I'm not mistaken).
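
To cross-check the AWR numbers directly on the node, something like this should list the individual resize operations (a sketch; with AMM on 11g, V$MEMORY_RESIZE_OPS should give a similar picture):

  -- recent grow/shrink operations against the shared pool
  select component,
         oper_type,                       -- GROW / SHRINK
         initial_size / 1048576 as initial_mb,
         final_size   / 1048576 as final_mb,
         start_time,
         status
  from   v$sga_resize_ops
  where  component = 'shared pool'
  order  by start_time;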

> The thing with "auto-extending" connection pools is that there's a
> chicken-and-egg situation with troubleshooting performance &
> utilization spikes - what caused what? Sometimes there's just a mild
> performance hiccup for whatever reason, which nobody would even
> notice, but a trigger-happy connection pool will greatly amplify it
> by starting up hundreds of new connections during the problem time.

One problem was that the pools on the application servers were configured to allow way too many connections (if all of them had extended to their maximum size, that would have been more than twice the number of connections allowed on the RAC system).

The first thing we did was to cap the number to a reasonable limit on each app server.
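
As an additional safety net on the database side (just an idea we have not rolled out; the account name and limit below are made up), one could also cap the number of sessions per application account with a profile:

  -- hypothetical cap on concurrent sessions per application account;
  -- requires RESOURCE_LIMIT = TRUE to be enforced
  create profile app_conn_cap limit
    sessions_per_user 200;                    -- placeholder value

  alter user shop_app profile app_conn_cap;   -- SHOP_APP is a placeholder name
  alter system set resource_limit = true;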

(Additionally: the application has a "reputation" for doing this sort of thing. The root cause is almost certainly inside the application or the underlying E-Commerce platform.)

> But if you are talking about 0.6 logons per second *on average in a 1-hour AWR report*

Yes, that's what I'm talking about.   

> then this could mean also (let's say) 40 new connections per second during your spike time (of 60 seconds for example)
> and then no more logins throughout the rest of the hour. 60 x 40 / 3600 = 0.66...

> So if your spike lasted for X seconds, then your performance data
> should also be from X seconds (or even more granular)

The spike lasted over 3 hours (if I'm not mistaken). The AWR was taken from one hour during that time.
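
Point taken on the granularity. I'll pull a minute-by-minute picture of the spike out of ASH instead of relying on the 1-hour report (a rough sketch; the time window and instance number are placeholders, DBA_HIST_ACTIVE_SESS_HISTORY only keeps every 10th sample, and the Diagnostics Pack is needed):

  -- how many sessions were stuck on the wait, per minute
  select trunc(sample_time, 'MI')   as sample_minute,
         count(*)                   as samples,
         count(distinct session_id) as sessions
  from   dba_hist_active_sess_history
  where  event = 'cursor: pin S wait on X'
  and    instance_number = 1                                 -- overloaded node
  and    sample_time between timestamp '2013-04-08 10:00:00'
                         and timestamp '2013-04-08 13:00:00' -- placeholder window
  group  by trunc(sample_time, 'MI')
  order  by 1;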

> If you want to reproduce "cursor: pin S for X" waits, you can use one of my seminar demo scripts
> (http://blog.tanelpoder.com/files/scripts/aot/demo4.sql ). Just run the entire script in one session
>(it will pin a child and end up hard parsing for a very long time) and then run the select part of that script
> only in a different session. But my demo very narrowly focuses on causing the wait event problem, I don't think it will
> reproduce anything like you saw.

I'm trying to understand the mechanics/reasons that lead to this situation. If I understand it correctly, the demo runs a statement that takes a very long time to parse and therefore creates the wait event, right?
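
Just so I can picture the mechanics, this is the kind of crude thing I was planning to try myself (not your demo, just a sketch assuming 11g, where EXECUTE IMMEDIATE accepts a CLOB): build a statement that is expensive to hard parse, run it in one session, and submit the identical text from a second session while the first is still parsing. The second session should then end up in "cursor: pin S wait on X" (or possibly "library cache lock", depending on how far the first parse has got).

  -- session 1: generate and hard parse an expensive statement;
  -- run the same block in session 2 while this is still parsing
  declare
    l_sql   clob := 'select 1 c from dual';
    l_dummy number;
  begin
    for i in 1 .. 5000 loop                    -- branch count is arbitrary
      l_sql := l_sql || ' union all select ' || i || ' from dual';
    end loop;
    execute immediate 'select count(*) from (' || l_sql || ')' into l_dummy;
  end;
  /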

> Another option would be to get 10s or hundreds of sessions running the same SQL statement (and overload the CPUs)
>and then purging the cursor with DBMS_SHARED_POOL.PURGE (not just flush shared pool as it will skip pinned chunks,
>but purge will wait until it gets the pin).

I'll try that as well. Could the "logon storm" cause flushing of the cursors from the shared pool, requiring each new session to re-parse the statements? (But then again: there were hardly any hard parses during that time.)
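
For reference, this is how I understand the targeted purge would work (a sketch; the SQL_ID is a placeholder, DBMS_SHARED_POOL may need to be installed via dbmspool.sql, and I believe purging cursors on 10.2.0.4 needs an extra event, while on 11g it works out of the box):

  -- purge one specific cursor instead of flushing the whole shared pool;
  -- the name is "address,hash_value" from V$SQLAREA
  declare
    l_name varchar2(100);
  begin
    select address || ',' || hash_value
    into   l_name
    from   v$sqlarea
    where  sql_id = 'abcd1234efgh5';            -- placeholder SQL_ID
    dbms_shared_pool.purge(l_name, 'C');        -- 'C' = cursor
  end;
  /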

Regards
Thomas

> On Tue, Apr 9, 2013 at 11:59 AM, Thomas Kellerer <thomas.kellerer_at_mgm-tp.com <mailto:thomas.kellerer_at_mgm-tp.com>> wrote:
>
> Hello all,
>
> we had a situation on a 2 node RAC system where for some reason the system more or less stopped working due to excessive "cursor: pin S wait on X" waits. (89% of the DB time on one node, 36% on the other node).
>
> This seems to be the result of the application servers opening an excessively high number of connections to the RAC (the reason for that is as yet unknown).
>
> We are not done investigating everything that happened on the application server yet, but it seems that all those sessions were running the same statements and for some reason they were all routed to the same RAC node.
>
> The top 5 Wait events from the overloaded node show up like this:
>
> Event                        Waits  Time(s)  Avg wait (ms)  % DB time  Wait Class
> cursor: pin S wait on X      4,411  429,949          97472      88.68  Concurrency
> library cache lock             246   30,608         124424       6.31  Concurrency
> db file sequential read  1,237,878    8,275              7       1.71  User I/O
> DB CPU                                4,728                      0.98
> db file scattered read   1,014,046    2,738              3       0.56  User I/O
>
> (1 hour AWR report)
>
> In order to better understand the background on when a "cursor: pin S wait on X" is requested, I tried to create a test environment to reproduce this, but so far without luck. The "best" I can do when simulating the high concurrency (in a test environment) is to produce a high number of "latch: shared pool" and "latch: row cache objects" waits.
>
> So my question is:
>
> Could somebody explain in more detail at which step(s) in query processing a "cursor: pin S wait on X" is requested?
> By understanding the sequence that leads to such a wait event, I hope to better understand what was going on on the Oracle side of the problem.
>
> Btw: is 0.6 logons per second considered a "logon storm"?
>
> Thanks in advance
>
> Kind regards
> Thomas

--
http://www.freelists.org/webpage/oracle-l
Received on Tue Apr 09 2013 - 13:14:17 CEST
