Re: Oracle 10g "Ghost" SID Eating up CPU

From: Tim Gorman <tim_at_evdbt.com>
Date: Wed, 02 Jun 2010 07:02:21 -0600
Message-ID: <4C06565D.6010005_at_evdbt.com>




  


Kellyn,

Rather than just bouncing the database instance, especially if there is no direct proof that application performance is impacted, then it is better to stick it out and research it into the ground.

If there is no process in the OS, then how is it possible that CPU is being consumed?  By what OS utility is that observation coming from?  What is the chain of evidence indicating massive CPU consumption from outside the Oracle database instance?

You can use "top" or a hand-rolled version (i.e. "ps -eaf | sort -n +3 | tail") next time this happens, to see if any process on the server is eating up CPU.  The nice thing about the hand-rolled "ps" command is that it doesn't just show you the executable name (i.e. "oracle") as "top" does, but shows you the functional command-name (i.e. "ora_pmon_XXX", etc) that "ps -eaf" shows, which can be helpful.  Whatever you find, you can backtrack the OS PID into the database using "select * from v$session where paddr in (select addr from v$process where spid = NNNN)" and see whether it is a "ghost" session or something more normal and expected.

Also, from within Oracle, what are the values for STATUS, EVENT, P1, P2, P3 for this session?  If the session seems to be waiting on something, then it is further unlikely that CPU is being consumed.  Particularly if the wait-event seems to be related to an OS system call, such as I/O.

If the Oracle instance is holding a session open when its underlying OS process has actually been killed, then certainly it is not chewing up CPU, but it may be holding locks, which could possibly be more devastating.  You may want to query V$LOCK for that session's SID value to see what, if any, locks are being held.  More importantly, you would want to look at the BLOCKS column in V$LOCK to see if the session is blocking other sessions (i.e. BLOCKS > 0) -- bear in mind that the column name (BLOCKS) is being used as a verb in the context of V$LOCK, not as a noun.  If BLOCKS > 0, you may want to look at what those other sessions are by querying V$LOCK again, this time by the same TYPE and ID1 value as the blocking lock.

-----

As far as why this is happening, there should be trace files left behind in the BACKGROUND_DUMP_DEST directory for the terminated Pnnn processes, and those trace files might give some general clue, in the first 50-100 lines of text.  If I had to guess off the top of my head, purely speculating, I would guess that you're encountering some OS resource limitation.  The reason for this guess is because the failures are occurring when the demand for OS resources are sky-high (i.e. 32 concurrent CTAS operations, each degree 4, etc).  Please verify that allocated swap space is 75% of physical RAM, that process limits are configured correctly.  There's more to check, and of course it varies by platform, so it can be fairly laborious when doing it the first time.

-----

For a faster way to do this, especially on a regular basis, the RDA HCVE (i.e. Health Check Validation Engine) for your RDBMS version and platform is pretty helpful and authoritative.  To download, go to support.oracle.com, search on keywords "rda hcve download", download the appropriate ".tar" file to your database server, unpack, and just run the command "./rda.sh -T hcve" and follow the shell-script prompts.  Be aware that HCVE is intended as a "pre-installation" validation, and that some of the validations will fail expectedly in an already-installed environment (i.e. "LD_LIBRARY_PATH is set", etc).  If you need help interpreting the report, feel free to post back.  I get the feeling not many people know about HCVE.  I like to run it periodically on existing environments because the OS configuration gets changed as often as the RDBMS configuration, and a server which was configured correctly when Oracle was installed may no longer be configured within recommended guidelines.

Hope this helps...
Tim Gorman
consultant -> Evergreen Database Technologies, Inc.
postal     => P.O. Box 630791, Highlands Ranch CO  80163-0791
website    => http://www.EvDBT.com/
email      => Tim_at_EvDBT.com
mobile     => +1-303-885-4526
fax        => +1-303-484-3608
Lost Data? => http://www.ora600.be/ for info about DUDE...


David Fitzjarrell wrote:
Unfortunately cycling the database is your only option since there is no parent session nor any O/S process to kill.
 
David Fitzjarrell
 



From: Kellyn Pedersen <kjped1313_at_yahoo.com>
To: oracle Freelists <oracle-l_at_freelists.org>
Sent: Tue, June 1, 2010 2:25:39 PM
Subject: Oracle 10g "Ghost" SID Eating up CPU

The last two weekends, due to some new code, my main datawarehouse/OLTP, (yes, I know it's an oxy-moron and it's 10.2.0.4 on Linux, 64bit with a number of one off patches for parallel bugs...)  has been overwhelmed by 32 CTAS concurrently running, all requesting 4 parallel on large table selects.  Parallel was downgraded a number of times 75% or more during this step in their package. 
This is the second time I've come back in after the occurrance to find one parallel coordinator session running-  on it's own, no other producers/consumers, no parent SID, just this one process eating up CPU-
 
 SID SERIAL# STATUS OSUSER PROCESS MACHINE PROGRAM ROW_WAIT_OBJ# PDDL_STATUS PQ_STATUS EVENT 
P1TEXT                              SECONDS_IN_WAIT
 540 20564 ACTIVE sdev_user 31988 appmachine prodmachine (P039) 2815532 ENABLED ENABLED PX Deq: Execution  sleeptime/senderid 141170
 
If you try to search for the OS Process, (31988), it doesn't exist, the SQL_ID is unknown but I can see it was sitting on the primary key for a particular table, (although different one than the last time this ghost was present last week!)  What I believe happened is that the parallel query died, but the coorindator is still out there. 
 
ERROR at line 1:
ORA-12805: parallel query server died unexpectedly
I found a couple of these errors, (12805) in trace files from the times that parallel was downgraded.  The process doesn't exist on the app server, I don't have an OS PID to kill and I can't kill it at the Oracle session level, (ORA-00030:  User session ID does not exist.)  Last time we had a maintenance window and solved the problem quickly with a database cycle, but here I am again-  HOW DO I GET RID of this thing!?!?  It's starting to eat up CPU and won't die... :(
 
 SID PID Coordinator SPID  Group Set Degree Req Degree Wait Event
 540 134 540                31988                                                        PX Deq: Execution Msg
 
Anybody have any ideas?  I actually have two P039 processes in my database right now!  This cannot be good! :(
 

Kellyn Pedersen

Sr. Database Administrator

I-Behavior Inc.

http://www.linkedin.com/in/kellynpedersen

www.dbakevlar.blogspot.com

 

"Go away before I replace you with a very small and efficient shell script..."



-- http://www.freelists.org/webpage/oracle-l Received on Wed Jun 02 2010 - 08:02:21 CDT

Original text of this message