Re: cursor: mutex S + library cache lock + library cache: mutex X = disaster

From: natalka roshak <natalka.roshak_at_oracle.com>
Date: Mon, 03 Nov 2014 16:30:24 -0500
Message-ID: <5457F3F0.1090804_at_oracle.com>



Hi Nico, did you take a hanganalyze dump? "oradebug hanganalyze 3"

The hanganalyze dump shows wait chains from the hang analysis cache, also accessible via v$wait_chains. I haven't used snapper, but I googled it and it doesn't seem to query v$wait_chains.
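In case it helps next time, here's roughly the sequence I mean. This is a sketch from memory, so double-check it on your version. The -prelim connection skips the normal session allocation, which is what you need when sessions are maxed out (though as you found, even that can fail when the CPUs are saturated):

   sqlplus -prelim "/ as sysdba"
   SQL> oradebug setmypid
   SQL> oradebug hanganalyze 3

And from a normal (non-prelim) session, the same hang-analysis info is exposed through v$wait_chains:

   select chain_id, sid, blocker_sid, wait_event_text
   from v$wait_chains
   order by chain_id, sid;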

If you didn't happen to take a hanganalyze dump: I haven't tried this AT ALL, but I just ran across another script by snapper's author Tanel Poder, which seems to pull wait chains from the active session history: http://blog.tanelpoder.com/files/scripts/ash/ash_wait_chains.sql or http://blog.tanelpoder.com/files/scripts/ash/dash_wait_chains.sql
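From the examples on Tanel's blog (again, untested by me), the arguments appear to be a grouping expression, a filter predicate, and a time range, something like:

   SQL> @ash_wait_chains event2 1=1 sysdate-1/24 sysdate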

-Natalka

--
Natalka Roshak | RAC/Storage BDE
Oracle <http://www.oracle.com/index.html> Global Product Support
45 O'Connor St, Suite 400 | Ottawa, Canada
On 03/11/2014 2:24 PM, nico torres wrote:

> John:
>
> The machine has 16 cores, 2 sockets, it's a virtual machine on a
> server cluster, what should I look into?
>
> Thanks!
>
> Nico
>
> 2014-11-03 15:47 GMT-03:00 John Clarke <jcclarke00_at_gmail.com
> <mailto:jcclarke00_at_gmail.com>>:
>
> Nico,
>
> Sounds like it could be CPU oversubscription - while there are
> bugs related to cursor%mutex% waits, I would check this first.
>
> How many CPU cores are there on your machine?
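>
> (If it's easier to check from inside the database, v$osstat reports
> it, though NUM_CPU_CORES isn't populated on every platform:
>
> select stat_name, value from v$osstat
> where stat_name in ('NUM_CPUS', 'NUM_CPU_CORES', 'NUM_CPU_SOCKETS');
> )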
>
> On Mon, Nov 3, 2014 at 9:58 AM, nico torres <frasebook4_at_gmail.com
> <mailto:frasebook4_at_gmail.com>> wrote:
>
> Hi, list.
> I had this situation last Saturday. I got a call telling
> me that no application could get into the database; I checked the
> alert log: ORA-00018 "maximum number of sessions exceeded",
> everywhere.
> The DB host felt kind of clunky and slow, so I checked htop: 100%
> utilization on all CPUs, around *462 load*, wow.
> I tried to log in with / as sysdba; naturally I couldn't. Tried
> with the -prelim flag, no luck either.
> I finally got into the database and immediately ran
> snapper, and what I saw was all sessions waiting on "*cursor:
> mutex S*", with no running queries and no sql_id available. So I
> came to the conclusion that all sessions were hung on the same
> wait, and in the meantime the applications kept logging in more
> and more sessions, until max sessions was reached and nobody
> else could log in.
> All of those sessions had different blockers, blocking each
> other, and different object#, so I couldn't find a main culprit
> either (roughly the check I ran is sketched below).
> I tried killing those sessions; couldn't. Tried closing the
> instance with shutdown immediate and it hung, so I ended up
> issuing a shutdown abort. The instance closed, then started up
> pretty fast, and things got back to normal.
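>
> (For reference, a simplified version of the blocker check I mean,
> straight from v$session; I'm paraphrasing what snapper and my own
> queries showed, not pasting the exact SQL:
>
> select sid, serial#, event, blocking_session, seconds_in_wait
> from v$session
> where blocking_session is not null
> order by blocking_session;
> )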
> Looking at AWR snaps from that timestamp, I got:
>
>
> Top 5 Timed Foreground Events
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>                                                            Avg
>                                                           wait   % DB
> Event                                Waits     Time(s)    (ms)   time
> ------------------------------ ------------ ----------- ------ ------
> cursor: mutex S                   20,166,853   1,057,135     52   86.5
> library cache lock                    90,243      91,516   1014    7.5
> DB CPU                                            43,678           3.6
> library cache: mutex X                74,002      10,816    146     .9
> db file sequential read              276,847       1,417      5     .1
>
> As expected, lots of "cursor: mutex S", but also library cache
> lock and library cache: mutex X.
>
> Load Profile                Per Second   Per Transaction
> ~~~~~~~~~~~~             --------------- ---------------
>     DB Time(s):                   249.6             1.4
>     DB CPU(s):                      8.9             0.1
>     Redo size:                 77,451.6           438.3
>     Logical reads:             29,037.2           164.3
>     Block changes:                320.9             1.8
>     Physical reads:                70.3             0.4
>     Physical writes:               23.5             0.1
>     User calls:                   755.3             4.3
>     Parses:                       197.7             1.1
>     Hard parses:                    0.2             0.0
>     W/A MB processed:               0.1             0.0
>     Logons:                         1.9             0.0
>     Executes:                     203.8             1.2
>     Rollbacks:                    173.7             1.0
>     Transactions:                 176.7
>
> Hard parses not a concern here.
>
> Statistic Name                                    Time (s)  % of DB Time
> ------------------------------------------ -------------- -------------
> parse time elapsed                             1,217,693.8          99.6
> connection management call elapsed time         868,991.0          71.1
> DB CPU                                           43,677.7           3.6
>
> All sessions clearly hung at parse time.
>
> Foreground Wait Events
>                                                             Avg
>                                       %Time   Total Wait   wait  Waits   % DB
> Event                          Waits  -outs     Time (s)   (ms)   /txn   time
> ------------------------- ---------- ------ ------------ ------ ------ ------
> cursor: mutex S           20,166,853    100    1,057,135     52   23.3   86.5
> library cache lock            90,243      0       91,516   1014    0.1    7.5
> library cache: mutex X        74,002      0       10,816    146    0.1     .9
> db file sequential read      276,847      0        1,417      5    0.3     .1
> cursor: pin S wait on X           94      0          164   1749    0.0     .0
>
> I searched around MOS and several bugs came up, but I couldn't
> find one that matches entirely.
>
> Some information about the database:
> -------------------------
>
> Oracle *11.2.0.1*; it needs some patching urgently, and a bug
> is my main suspect.
>
> It uses ASSM.
>
>                    begin      end
> SGA use (MB):    6,656.0  6,656.0
> PGA use (MB):      561.1  1,213.3
>
> Cache sizes between the snaps:
> Buffer Cache: 3,840M
> Shared Pool Size: 2,496M
>
>
> Parameters:
>
> memory_target           13421772800
> open_cursors            300
> processes               600
> sessions                928
> session_cached_cursors  50
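>
> (In case it's useful, the session/process high-water marks are in
> v$resource_limit; something like:
>
> select resource_name, current_utilization, max_utilization, limit_value
> from v$resource_limit
> where resource_name in ('sessions', 'processes');
> )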
>
> Could anyone please give me some advice on how to further
> investigate this issue? Any other information needed?
>
> Thanks
>
-- http://www.freelists.org/webpage/oracle-l
Received on Mon Nov 03 2014 - 22:30:24 CET
