Re: Oracle iAS - OC4J process "stop"'ped

From: Tony van Lingen <tony_vanlingen_at_technologyonecorp.com>
Date: Mon, 14 Sep 2009 09:29:10 +1000
Message-ID: <4AAD8046.2050002_at_technologyonecorp.com>



Hi Lyall,

Yes, that happens often. opmn keeps a state file for every process it manages. If a process exits and another process reuses its PID between polls, opmn carries on believing the process is running. Because the PID now belongs to a different program, the response is not what opmn expects, so the process is marked "Stop" rather than "Down". If you then run opmnctl stopall, opmn actively tries to kill the new process, which won't succeed unless that process runs under the same userid as opmn.
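To make the failure mode concrete, here is a toy Python model of a PID-based poller. It is a simplified sketch, not opmn's actual code: `StateEntry`, `poll` and `signal_status` are hypothetical names, the process table and the set of answering PIDs stand in for whatever opmn really checks, and `signal_status` assumes a Unix host.

```python
import os
from dataclasses import dataclass


@dataclass
class StateEntry:
    """Hypothetical stand-in for one entry in opmn's per-process state file."""
    component: str
    pid: int


def poll(entry: StateEntry, process_table: dict, answered: set) -> str:
    """Classify a managed process the way the post describes opmn behaving.

    process_table maps live PIDs to command names (what `ps` would show);
    answered holds the PIDs that gave the manager the reply it expects.
    """
    if entry.pid not in process_table:
        return "Down"   # the PID is gone, so the exit is detected cleanly
    if entry.pid in answered:
        return "Alive"  # the PID is ours and it answered as expected
    # The PID exists but belongs to another program (e.g. xinetd), so the
    # reply is wrong and the process is marked "Stop" instead of "Down".
    return "Stop"


def signal_status(pid: int) -> str:
    """Probe a PID with signal 0 (Unix only).

    Nothing is delivered, but the kernel still checks that the process
    exists and that we are allowed to signal it, which is what decides
    whether a later kill from this user could succeed.
    """
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return "gone"
    except PermissionError:
        return "other user"   # e.g. root's xinetd, when we run as oracle
    return "signalable"
```

In the situation from the thread, `poll(StateEntry("OC4J", 2047), {2047: "xinetd"}, set())` returns "Stop", and `signal_status(2047)` run as oracle against a root-owned xinetd would be expected to report "other user", which is why stopall cannot clear it.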

To get out of this, do the stopall, verify that all OAS processes really are down (e.g. using ps), and then delete all files in ${ORACLE_HOME}/opmn/logs/states.
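The clean-up step can be scripted so that it refuses to delete anything while OAS processes are still around. This Python sketch assumes a Unix `ps -eaf` listing and uses hypothetical command-line substrings (`opmn`, `httpd`, `java`) to recognise OAS processes; substitute whatever your processes actually look like. `remaining_oas_processes` and `clear_opmn_states` are illustrative names, not Oracle tools.

```python
from pathlib import Path

# Hypothetical substrings identifying OAS processes in `ps -eaf` output;
# adjust for your installation (OC4J runs in a JVM, HTTP_Server is Apache).
DEFAULT_PATTERNS = ("opmn", "httpd", "java")


def remaining_oas_processes(ps_output: str, patterns=DEFAULT_PATTERNS) -> list:
    """Return the `ps -eaf` lines that still look like OAS processes."""
    hits = []
    for line in ps_output.splitlines():
        words = line.split()
        if not words or words[0] == "UID":
            continue  # blank line or the ps header
        if "grep" in words:
            continue  # our own `ps | grep` pipeline, not a real process
        if any(p in line for p in patterns):
            hits.append(line)
    return hits


def clear_opmn_states(oracle_home: str, ps_output: str,
                      patterns=DEFAULT_PATTERNS) -> list:
    """Delete every file in $ORACLE_HOME/opmn/logs/states, but only when
    no OAS process is left running; return the names of removed files."""
    still_running = remaining_oas_processes(ps_output, patterns)
    if still_running:
        raise RuntimeError("OAS processes still running:\n"
                           + "\n".join(still_running))
    states = Path(oracle_home) / "opmn" / "logs" / "states"
    if not states.is_dir():
        return []
    removed = []
    for entry in sorted(states.iterdir()):
        if entry.is_file():  # only files, as the advice above says
            entry.unlink()
            removed.append(entry.name)
    return removed
```

On the server you might call it as `clear_opmn_states(os.environ["ORACLE_HOME"], subprocess.run(["ps", "-eaf"], capture_output=True, text=True).stdout)` once opmnctl stopall has returned. Substring matching is deliberately crude: a pattern such as `java` also catches unrelated JVMs, which errs on the side of not deleting.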

Be aware that emd also keeps track of process ports, in the file ${ORACLE_HOME}/sysman/emd/targets.xml. If they don't match the ports used by opmn, you'll find that opmn reports a process as "Alive" while Enterprise Manager shows it as down. To compound things further, dcm keeps a log of the states that processes should be in. If you manually kill and restart a process, say an OC4J process, you may find that dcm forces it down again, since the last command it logged was a stop. You may have to use dcmctl in combination with opmnctl to repair this.
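The dcm behaviour is easier to see as a reconciliation loop in which whatever was last logged wins over what you did by hand. The toy model below only illustrates that idea and is not dcm's implementation; `DesiredStateLog` and `reconcile` are hypothetical names, and process states are reduced to "up" and "down".

```python
class DesiredStateLog:
    """Hypothetical model of a log holding the last command per component."""

    def __init__(self) -> None:
        self.last_command = {}

    def record(self, component: str, command: str) -> None:
        # Only the most recent command ("start" or "stop") is remembered.
        self.last_command[component] = command


def reconcile(log: DesiredStateLog, actual: dict) -> dict:
    """Return the corrective action for every component whose actual state
    ("up"/"down") contradicts the last logged command."""
    actions = {}
    for component, command in log.last_command.items():
        state = actual.get(component, "down")
        if command == "stop" and state == "up":
            actions[component] = "stop"   # a manual restart gets undone
        elif command == "start" and state == "down":
            actions[component] = "start"
    return actions
```

With "stop" logged for OC4J, `reconcile(log, {"OC4J": "up"})` returns `{"OC4J": "stop"}`: the manual restart is reverted. In this model the only cure is for the log itself to record a start; with the real tools, that is presumably the role dcmctl plays alongside opmnctl.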

Killing system processes such as xinetd is not usually a good idea.

Hope this helps,
Tony

Around 12/09/2009 12:14 AM, Lyall Barbour said:
> Has anyone ever seen this? The iAS server console status is up and running, and in opmnctl status the HTTP_Server is "Alive", but the OC4J process is "Stop" and its pid is the pid of another running process.
>
> bss1.tri-c.edu: opmnctl status
> Processes in Instance: bss1.bss1.tri-c.edu
> -------------------+--------------------+---------+---------
> ias-component | process-type | pid | status
> -------------------+--------------------+---------+---------
> LogLoader | logloaderd | N/A | Down
> dcm-daemon | dcm-daemon | N/A | Down
> OC4J | home | 30867 | Alive
> HTTP_Server | HTTP_Server | 30868 | Alive
> DSA | DSA | N/A | Down
>
> I fixed the problem, but wanted to know if anybody has seen this, so that I can keep it from happening again. The status of OC4J was Stop and its pid was 2047, which was the same pid as the running xinetd process. I tried a stopall, waited, then did a startall, but OC4J really wanted to use pid 2047. So I logged in as root and ran killall xinetd, logged in as oracle and ran opmnctl stopall, waited, then started xinetd with xinetd -stayalive -pidfile /var/run/xinetd.pid, which is what our rc3.d script runs when the server boots. Xinetd didn't use 2047 anymore, but used:
>
> bss1.tri-c.edu: ps -eaf|grep xinetd
> oracle 591 356 0 10:13 pts/2 00:00:00 grep xinetd
> root 30802 1 0 09:51 ? 00:00:00 xinetd -stayalive -pidfile /var/run/xinetd.pid
>
> and, of course, the new opmnctl startall has OC4J using what's above.
>
> Anyone seen that?
>
> Thanks,
> Lyall

--
http://www.freelists.org/webpage/oracle-l
Received on Sun Sep 13 2009 - 18:29:10 CDT
