Re: Split-brain among HACMP cluster and Oracle9RAC

From: Arne S <arnsodal_at_broadpark.no>
Date: Thu, 21 Sep 2006 23:23:20 +0200
Message-ID: <45130291$1@news.broadpark.no>

Hajo Ehlers wrote:
> Arne S wrote:
>

>>Background:
>>Part of our production environment is based on RS/6000 technology, with
>>HACMP and Oracle9RAC as products on top. We have 4 p570's (4-ways),
>>running AIX 5.3ML03, HACMP version 5.2 and OracleRAC version 9.2.0.7.
>>These machines are spread across 2 server rooms (about 300 meters
>>apart). HACMP is configured with concurrent disk access for Oracle
>>db-files on raw devices. We have also configured HACMP with both IP and
>>NON-IP heartbeat (NON-IP heartbeat over SAN disks). Oracle's
>>interconnect is configured as part of the HACMP configuration. The total
>>number of databases/instances is about 20/80.
>>
>>My problem:
>>During a test failover (the network in one server room goes down) I
>>observed that all Oracle databases went into a "frozen" state. As far
>>as I know, this is not correct. I am having trouble finding out why, but
>>my guess is that Oracle is waiting for a "network down" or "node down"
>>event from HACMP before it takes any action. This will not happen,
>>because in such a situation HACMP is still talking to all 4 nodes over
>>the NON-IP network on the SAN disks. When I shut down the 2 "isolated"
>>machines, all Oracle databases went down (lmon died). I had to start all
>>databases manually on the 2 "surviving" nodes. After startup I could
>>access the databases as normal.

>
>

> From the HACMP redbook:

> ...
> The non-IP networks are direct connections (point-to-point) between
> nodes, and do not use IP for heartbeat messages exchange, and are
> therefore less prone to IP network elements failures. If these network
> types are used, in case of IP network failure, nodes will still be able
> to exchange messages, so the decision is to consider the network down
> and no resource group activity will take place.
> ...
>
> So the non-IP network is designed to prevent a split-brain situation.
>
> You say:
>

>>the network in one server room goes down

>
> The question: what do you mean by that sentence?
> Did you take all network devices connected to the HACMP cluster
> offline (in which case you would have a network down event and the
> cluster should go down), OR did you interrupt the connection between
> the two sites?
>
> In the latter case you have a site failure from each site's point of
> view.
> Meaning that HACMP sees that it still has a connection to its
> switches (so the physical layer is okay) but every IP communication
> path to the other site is lost.
>
> So the question arises: how shall HACMP behave? It does not know whether
> the other site still has a connection to the (user) network or not. So
> it's up to you to determine which site shall stay up.
>
> Just from my very rusty HACMP knowledge
> Hajo
>

Good point....
We turned off two switches and didn't do anything on the AIX servers... We just observed what went wrong...
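
For what it's worth, the decision rule in the redbook passage quoted above can be written down as a small sketch. This is only an illustration of the logic as described there, not HACMP code: the names Verdict, classify_peer and resource_group_activity are made up, and the "peer unreachable on every path means node down, so takeover" branch is my assumption about the usual behaviour rather than something the quoted passage states.

```python
from enum import Enum


class Verdict(Enum):
    """Hypothetical labels for what one node concludes about a peer."""
    PEER_ALIVE = "peer_alive"        # IP heartbeat still arriving
    NETWORK_DOWN = "network_down"    # IP path lost, non-IP heartbeat still arriving
    NODE_DOWN = "node_down"          # no heartbeat on any path (assumed meaning)


def classify_peer(ip_heartbeat_ok: bool, non_ip_heartbeat_ok: bool) -> Verdict:
    """Classify a peer from the state of the two heartbeat paths."""
    if ip_heartbeat_ok:
        # The peer answers over IP; the state of the non-IP path does not
        # change the verdict about the peer itself.
        return Verdict.PEER_ALIVE
    if non_ip_heartbeat_ok:
        # Redbook case: IP failed, but the nodes can still exchange messages,
        # so the network is considered down rather than the node.
        return Verdict.NETWORK_DOWN
    # Assumption: silence on every path is treated as a node failure.
    return Verdict.NODE_DOWN


def resource_group_activity(verdict: Verdict) -> bool:
    """Return True only when a takeover would be triggered.

    Per the redbook passage, a network-down verdict causes no resource
    group activity; takeover on node-down is an assumption of this sketch.
    """
    return verdict is Verdict.NODE_DOWN


if __name__ == "__main__":
    # Our test: switches off, SAN disk heartbeat still working.
    verdict = classify_peer(ip_heartbeat_ok=False, non_ip_heartbeat_ok=True)
    print(verdict.value, resource_group_activity(verdict))
```

With the switches off and the disk heartbeat intact, every node lands in the "network down" branch, so no "node down" ever fires. That would fit my guess that Oracle never gets the event it seems to be waiting for.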
ArneS
Received on Thu Sep 21 2006 - 16:23:20 CDT