RE: Failover testing with 10g RAC

From: William Wagman <wjwagman_at_ucdavis.edu>
Date: Fri, 30 May 2008 08:57:20 -0700
Message-ID: <FE043305B38A0F448F3924429D650C2A07DE485F@VEXBE2.ex.ad3.ucdavis.edu>


Greetings,  

I don't know how or when CRS decides it is going to reboot the node, but if you kill the crsd.bin process the node will reboot. That is part of its job, I think.

Bill Wagman
Univ. of California at Davis
IET Campus Data Center
wjwagman_at_ucdavis.edu
(530) 754-6208

From: oracle-l-bounce_at_freelists.org [mailto:oracle-l-bounce_at_freelists.org] On Behalf Of Bradd Piontek
Sent: Friday, May 30, 2008 8:49 AM
To: jeffthomas24_at_gmail.com
Cc: oracle-l
Subject: Re: Failover testing with 10g RAC  

Jeff,
  Are the pieces you are failing redundant in nature? For example, multiple HBAs, switches, etc.? We had some issues in our fail-over testing that had to do with Service Processor fail-over; they were due to a Linux kernel issue and NMI watchdog processes (again, this was on Linux). Without redundancy in the components you mentioned, I would expect CRS to reboot the node. What are you using for OCR and Voting Disk?
--

Bradd Piontek
Twitter: http://www.twitter.com/piontekdd
Oracle Blog: http://piontekdd.blogspot.com
Linked In: http://www.linkedin.com/in/piontekdd
Last.fm: http://www.last.fm/user/piontekdd/

On Fri, May 30, 2008 at 10:21 AM, Jeffery Thomas <jeffthomas24_at_gmail.com> wrote:

Solaris 10, RAC 10.2.0.3. Using IPMP groups for NIC redundancy.

We've been conducting failover testing -- disabling an HBA port, powering off a switch,
yanking an IC link, etc.

In every single case, CRS rebooted the server where the dire deed was performed,
and when the server came back up, the repair was successful, e.g. failed over to
the secondary HBA port, or the physical IP for the IPMP group floated to the standby
NIC and so forth.

The other server stayed up and all Oracle components remained available. During
the switch power off test, the physical IP for the IC actually floated over to the
standby NIC with no outage on this server.

Is this to be expected? Will CRS always reboot a server to repair
itself when an underlying hardware failure is detected?

Thanks,
Jeff
--

http://www.freelists.org/webpage/oracle-l

Received on Fri May 30 2008 - 10:57:20 CDT