Re: Split-brain among HACMP cluster and Oracle9RAC

From: Matthias Hoys <anti_at_spam.com>
Date: Thu, 21 Sep 2006 23:34:14 +0200
Message-ID: <45130558$0$10453$ba620e4c@news.skynet.be>

"Arne S" <arnsodal_at_broadpark.no> wrote in message news:45130291$1_at_news.broadpark.no...

> Hajo Ehlers wrote:

>> Arne S wrote:
>>
>>>Background:
>>>Part of our production environment is based on RS/6000 technology, with
>>>HACMP and Oracle9RAC as products on top. We have 4 p570's (4-ways),
>>>running AIX 5.3ML03, HACMP version 5.2 and OracleRAC version 9.2.0.7.
>>>These machines are spread across 2 server rooms (about 300 meters
>>>apart). HACMP is configured with concurrent disk access for Oracle
>>>db-files on raw devices. We have also configured HACMP with both IP and
>>>non-IP heartbeat (non-IP heartbeat over SAN disks). Oracle's
>>>interconnect is configured as part of the HACMP configuration. The total
>>>number of databases/instances is about 20/80.
>>>
>>>My problem:
>>>During a test failover (the network in one server room goes down) I
>>>observed that all Oracle databases went into a "frozen" state. As far
>>>as I know, this is not correct. I am having trouble finding out why, but
>>>my guess is that Oracle is waiting for a "network down" or "node down"
>>>event from HACMP before it takes any action. That will not happen,
>>>because in such a situation HACMP is still talking to all 4 nodes over
>>>the non-IP network on the SAN disks. When I shut down the 2 "isolated"
>>>machines, all Oracle databases went down (lmon died). I had to start
>>>all databases manually on the 2 "surviving" nodes. After startup I
>>>could access the databases as normal.
>>
>>
>> From the HACMP redbook:
>> ...
>> The non-IP networks are direct connections (point-to-point) between
>> nodes, and do not use IP for heartbeat messages exchange, and are
>> therefore less prone to IP network elements failures. If these network
>> types are used, in case of IP network failure, nodes will still be able
>> to exchange messages, so the decision is to consider the network down
>> and no resource group activity will take place.
>> ...
>>
>> So the non-IP network is designed to prevent a split-brain situation.
>>
>> You say:
>>
>>>the network in one server room goes down
>>
>> The question: what do you mean by that sentence?
>> Did you take all network devices connected to the HACMP cluster
>> offline? In that case you would have a network down event and the
>> cluster should go down. Or did you interrupt the connection between
>> the two sites?
>>
>> In the latter case you have a site failure from each site's point of
>> view.
>> Meaning that HACMP sees that it still has a connection to its
>> switches (so the physical link is okay) but every IP communication
>> path to the other site is lost.
>>
>> So the question arises: how shall HACMP behave? It does not know
>> whether the other site still has a connection to the (user) network or
>> not. So it's up to you to determine which site shall stay up.
>>
>> Just from my very rusty HACMP knowledge
>> Hajo
>>

> Good point....
> We turned off two switches, didn't do anything on the AIX servers...
> Just observed what went wrong...
> ArneS

Couldn't you just use 2 redundant IP heartbeats? Because if your IP network is down, your db server won't be reachable anyway, no?
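
To make the trade-off between the two heartbeat setups concrete, here is a toy Python sketch of how a node might interpret missing heartbeats from a peer. It is only an illustration of the reasoning in the redbook passage quoted above, not HACMP's actual decision logic; the names `PeerView` and `classify_peer` are made up for this example.

```python
from enum import Enum
from typing import Sequence


class PeerView(Enum):
    """How one node might interpret the heartbeats it gets from a peer."""
    HEALTHY = "every configured heartbeat type is answering"
    NETWORK_DOWN = "peer is alive, but the IP path to it is lost"
    NON_IP_PATH_DOWN = "peer is alive, but the non-IP path to it is lost"
    UNKNOWN = "no heartbeat at all: peer dead or network partitioned"


def classify_peer(ip_paths: Sequence[bool],
                  non_ip_paths: Sequence[bool]) -> PeerView:
    """Classify a peer from per-path heartbeat results (True = heartbeat seen).

    Hypothetical helper: an empty sequence means that heartbeat type
    is simply not configured.
    """
    ip_ok = any(ip_paths)
    non_ip_ok = any(non_ip_paths)

    if ip_ok:
        # A non-IP path only counts as failed if one is configured at all.
        if non_ip_paths and not non_ip_ok:
            return PeerView.NON_IP_PATH_DOWN
        return PeerView.HEALTHY
    if non_ip_ok:
        # The peer still answers over the disk heartbeat, so it is alive.
        return PeerView.NETWORK_DOWN
    # Nothing answers: a crashed peer and a partition look identical here.
    return PeerView.UNKNOWN


if __name__ == "__main__":
    # Two redundant IP heartbeats, both lost because the switches are off.
    print(classify_peer(ip_paths=[False, False], non_ip_paths=[]))
    # One IP heartbeat lost, disk (non-IP) heartbeat still answering.
    print(classify_peer(ip_paths=[False], non_ip_paths=[True]))
```

With only IP heartbeats, losing every IP path leaves the node unable to tell a dead peer from a partition, which is the split-brain risk. With a non-IP heartbeat the node can tell that the peer is alive, which matches the "consider the network down and no resource group activity will take place" behaviour in the redbook passage.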