Question re racgmain processes running amok
Date: Fri, 21 Mar 2008 13:21:11 -0700
The question pertains to a two-node RAC cluster running Oracle 10.2.0.3.0 SE on 32-bit Linux 2.6.9-67.ELsmp. CRS, ASM and RDBMS are each in a separate home. Yesterday on node 1 I started seeing messages in /var/log/messages of the form...
Mar 20 07:5:34 spenser init: Id "h3" respawning too fast: disabled for 5 minutes
We looked around to try to determine the cause but didn't come up with anything immediately. A core dump was generated in the $CRS_HOME/log/<node_name>/crsd directory at about the time we noticed this beginning. Error messages indicating various failures (I can provide a segment) also appeared in crsd.log at that time. At that point I didn't know what was happening, so I opened an SR with Oracle.
This morning, while gathering some additional information, I found that on node 2 of this cluster a large number of racgmain processes were running and that the number was increasing. All of the swap space and virtually all of the memory on this node were in use. Some of the processes were running out of the CRS home and some out of the ASM home. I investigated whether it would be possible to stop these processes gracefully but was unable to find any information. Ultimately we rebooted node 2, and everything appears to be functioning as expected at this point.
My question is: what would cause the racgmain process to run amok this way? Currently ps -ef|grep racgmain shows none running on either node. I'm puzzled by this, and other than information indicating that this process is part of ONS, I have not been able to find any further details. Any suggestions would be greatly appreciated.
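In case it is useful to anyone watching for a recurrence, here is a minimal sketch of how the racgmain count could be tracked per home instead of eyeballing ps -ef|grep racgmain. It assumes the usual eight-column Linux ps -ef layout (UID PID PPID C STIME TTY TIME CMD); the function names are my own and it is not an Oracle-supplied tool.

```python
import subprocess
from collections import Counter


def count_racgmain(ps_output):
    """Count racgmain processes in `ps -ef` text, keyed by executable path.

    Assumes the eight-column Linux layout: UID PID PPID C STIME TTY TIME CMD.
    """
    counts = Counter()
    for line in ps_output.splitlines():
        fields = line.split(None, 7)  # keep the whole command line in fields[7]
        if len(fields) < 8:
            continue
        exe = fields[7].split()[0]  # first word of CMD is the executable
        # Match on the executable's base name so "grep racgmain" is not counted.
        if exe.rsplit("/", 1)[-1] == "racgmain":
            counts[exe] += 1
    return counts


def current_counts():
    """Run `ps -ef` on this node and return the per-path racgmain counts."""
    out = subprocess.run(
        ["ps", "-ef"], capture_output=True, text=True, check=True
    ).stdout
    return count_racgmain(out)
```

Calling current_counts() from cron every few minutes and logging the result would show whether the count is climbing, and whether the growth is in the CRS home or the ASM home, before swap is exhausted.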
Univ. of California at Davis
IET Campus Data Center