Re: Question re racgmain processes running amok

From: Jeremy Schneider <jeremy.schneider_at_ardentperf.com>
Date: Fri, 21 Mar 2008 17:10:58 -0500
Message-ID: <611ad3510803211510q3061fcf4ga840e9cb5931fdbc@mail.gmail.com>


I meant SYN_SENT, not TCP_WAIT. Slightly different.
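
A rough sketch of that check, assuming Linux-style `netstat -ant` output where the connection state is the sixth column. The function name count_syn_sent is made up for illustration:

```shell
# Count TCP connections stuck in SYN_SENT.
# Reads netstat-style lines on stdin (assumed layout:
# Proto Recv-Q Send-Q Local-Address Foreign-Address State).
count_syn_sent() {
    awk '$1 ~ /^tcp/ && $6 == "SYN_SENT" { n++ } END { print n + 0 }'
}
```

Typical use would be `netstat -ant | count_syn_sent`. A count that stays above zero suggests something is trying to connect and never getting an answer.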

On Fri, Mar 21, 2008 at 5:09 PM, Jeremy Schneider <jeremy.schneider_at_ardentperf.com> wrote:

> That's the workhorse script called by CRS to start/stop/stat resources.
> Find out what the parameter is (start, stop or stat) with something like
> this:
>
> cat /proc/[pid####]/cmdline|tr '\000' '\n'
>
> That'll tell us whether CRS is continually restarting ONS or just trying
> to "stat" it. (crs_stat can also tell you if there were failed restarts.)
> Then you might try to figure out what racgmain is waiting for. To start I'd
> look at the process status (is it 'D'? what's WCHAN from ps -l?) and the
> network connections (does netstat show any connections in TCP_WAIT state?).
> You might also get a stack trace with gdb -p and then "backtrace".
>
> Just a few ideas... I'm really interested to hear what you turn up. :)
>
> -Jeremy
>
>
>
> On Fri, Mar 21, 2008 at 3:21 PM, William Wagman <wjwagman_at_ucdavis.edu>
> wrote:
>
> > Greetings,
> >
> > The question pertains to a two node RAC cluster running Oracle
> > 10.2.0.3.0 SE on 32-bit Linux 2.6.9-67.ELsmp. CRS, ASM & RDBMS are each
> > in a separate home. Yesterday on node 1 I started seeing messages in the
> > /var/log/messages file of the form...
> >
> > Mar 20 07:5:34 spenser init: Id "h3" respawning too fast: disabled for 5
> > minutes
> >
> > We did some looking around to try and determine the cause of this but
> > didn't come up with anything immediately. There was a core dump
> > generated in the $CRS_HOME/log/<node_name>/crsd directory at about the
> > time we noticed this beginning. Various error messages indicating
> > various failures (I can provide a segment) appeared at this time in the
> > crsd.log also. At this point I didn't know what was occurring, so I opened
> > an SR with Oracle.
> >
> > This morning, while gathering some additional information I found that
> > on node2 in this cluster there were a large number of racgmain processes
> > running and the number of these processes running was increasing, all
> > the swap space and virtually all of the memory on this node were in use.
> > Some of the processes were running out of the CRS home and some out of
> > the ASM home. I did some investigating to see if it would be possible to
> > stop these processes gracefully and was unable to gather any
> > information. Ultimately we rebooted node2 of the cluster and everything
> > appears to be functioning as expected at this point.
> >
> > My question is what would cause the racgmain process to run amok this
> > way. Currently ps -ef|grep racgmain shows none running on either node.
> > I'm puzzled by this and other than information indicating that this
> > process is part of ONS I am not able to find any further information or
> > details. Any suggestions would be greatly appreciated.
> >
> > Thanks.
> >
> > Bill Wagman
> > Univ. of California at Davis
> > IET Campus Data Center
> > wjwagman_at_ucdavis.edu
> > (530) 754-6208
> >
> > --
> > http://www.freelists.org/webpage/oracle-l
> >
> >
> >
>
>
> --
> Jeremy Schneider
> Chicago, IL
> http://www.ardentperf.com/category/technical
>

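If the processes pile up again, the cmdline and process-status checks quoted above could be combined into one pass. This is only a sketch: the function names show_args and inspect_racgmain are made up, and it assumes procps-style `pgrep` and `ps` are available and that matching "racgmain" against the full command line finds the right processes.

```shell
# Print the NUL-separated arguments of a cmdline file, one per line.
# Takes the path (normally /proc/<pid>/cmdline) as its only argument.
show_args() {
    tr '\000' '\n' < "$1"
}

# For every process whose command line mentions racgmain, show its
# state and wait channel, then its arguments (start, stop or stat).
inspect_racgmain() {
    for pid in $(pgrep -f racgmain); do
        ps -o pid,stat,wchan -p "$pid"
        show_args "/proc/$pid/cmdline"
    done
}
```

Run `inspect_racgmain` as root or as the owner of the processes, since /proc/<pid>/cmdline may not be readable otherwise.
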
-- 
Jeremy Schneider
Chicago, IL
http://www.ardentperf.com/category/technical

--
http://www.freelists.org/webpage/oracle-l
Received on Fri Mar 21 2008 - 17:10:58 CDT
