Re: Question re racgmain processes running amok

From: Jeremy Schneider <jeremy.schneider_at_ardentperf.com>
Date: Fri, 21 Mar 2008 17:09:15 -0500
Message-ID: <611ad3510803211509m1ff2f02o357695d6a86e6e8a@mail.gmail.com>


That's the workhorse script called by CRS to start/stop/stat resources. Find out what the parameter is (start, stop or stat) with something like this:

cat /proc/[pid####]/cmdline|tr '\000' '\n'
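
If there are a lot of them, a loop saves some typing. This is only a rough sketch: it assumes Linux /proc, that pgrep is installed, and that the processes show up under the exact name racgmain. The helper name nul_to_lines is just something I made up for illustration.

```shell
#!/bin/sh
# Turn NUL-separated input into one item per line.
# /proc/<pid>/cmdline separates the arguments with NUL bytes.
nul_to_lines() {
    tr '\000' '\n'
}

# Assumption: the processes are named exactly "racgmain".
# If none are running (or pgrep is missing) the loop simply does nothing.
for pid in $(pgrep -x racgmain 2>/dev/null); do
    echo "== pid $pid =="
    # The process may exit before we read it, so ignore read errors.
    nul_to_lines < "/proc/$pid/cmdline" 2>/dev/null
done
```

The start/stop/stat parameter should show up as one of the lines printed under each pid.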

That'll tell us whether CRS is continually restarting ONS or just trying to "stat" it. (crs_stat can also tell you if there were failed restarts.) Then you might try to figure out what racgmain is waiting for. To start, I'd look at the process status (is it 'D', i.e. uninterruptible sleep? what's WCHAN from ps -l?) and the network connections (does netstat show any connections sitting in a wait state such as TIME_WAIT or CLOSE_WAIT?). You might also get a stack trace by attaching with gdb -p [pid####] and then running "backtrace".
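
The process-status check can be scripted along these lines. Again just a sketch under assumptions: Linux /proc, pgrep available, processes named exactly racgmain, and proc_state is a made-up helper name. /proc/<pid>/wchan may be empty or unreadable depending on the kernel and your privileges.

```shell
#!/bin/sh
# Print the one-letter state of a pid (R, S, D, Z, T, ...),
# taken from the "State:" line of /proc/<pid>/status.
proc_state() {
    awk '/^State:/ { print $2; exit }' "/proc/$1/status"
}

# Assumption: the processes are named exactly "racgmain".
for pid in $(pgrep -x racgmain 2>/dev/null); do
    state=$(proc_state "$pid" 2>/dev/null)
    # Kernel function the process is sleeping in, if the kernel exposes it.
    wchan=$(cat "/proc/$pid/wchan" 2>/dev/null)
    echo "pid=$pid state=${state:-gone} wchan=${wchan:-unknown}"
    if [ "$state" = "D" ]; then
        echo "  pid $pid is in uninterruptible sleep"
    fi
done
```

Anything stuck in 'D' is usually waiting on I/O, and the wchan value gives a hint as to where in the kernel it is waiting.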

Just a few ideas... I'm really interested to hear what you turn up. :)

-Jeremy

On Fri, Mar 21, 2008 at 3:21 PM, William Wagman <wjwagman_at_ucdavis.edu> wrote:

> Greetings,
>
> The question pertains to a two node RAC cluster running Oracle
> 10.2.0.3.0 SE on 32-bit Linux 2.6.9-67.ELsmp. CRS, ASM & RDBMS are each
> in a separate home. Yesterday on node 1 I started seeing messages in the
> /var/log/messages file of the form...
>
> Mar 20 07:5:34 spenser init: Id "h3" respawning too fast: disabled for 5
> minutes
>
> We did some looking around to try and determine the cause of this but
> didn't come up with anything immediately. There was a core dump
> generated in the $CRS_HOME/log/<node_name>/crsd directory at about the
> time we noticed this beginning. Various error messages indicating
> various failures (I can provide a segment) appeared at this time in the
> crsd.log also. At this point I didn't know what was occurring so opened
> an SR with Oracle.
>
> This morning, while gathering some additional information, I found that
> on node2 in this cluster there were a large number of racgmain processes
> running and the number of these processes running was increasing, all
> the swap space and virtually all of the memory on this node were in use.
> Some of the processes were running out of the CRS home and some out of
> the ASM home. I did some investigating to see if it would be possible to
> stop these processes gracefully and was unable to gather any
> information. Ultimately we rebooted node2 of the cluster and everything
> appears to be functioning as is expected at this point.
>
> My question is what would cause the racgmain process to run amok this
> way. Currently ps -ef|grep racgmain shows none running on either node.
> I'm puzzled by this and other than information indicating that this
> process is part of ONS I am not able to find any further information or
> details. Any suggestions would be greatly appreciated.
>
> Thanks.
>
> Bill Wagman
> Univ. of California at Davis
> IET Campus Data Center
> wjwagman_at_ucdavis.edu
> (530) 754-6208
>
> --
> http://www.freelists.org/webpage/oracle-l
>
>
>

-- 
Jeremy Schneider
Chicago, IL
http://www.ardentperf.com/category/technical

--
http://www.freelists.org/webpage/oracle-l
Received on Fri Mar 21 2008 - 17:09:15 CDT
