Re: Split-brain among HACMP cluster and Oracle9RAC

From: ArneS <arsodal_at_NOSPAM.broadpark.no>
Date: Fri, 22 Sep 2006 17:19:17 +0200
Message-ID: <4513fef5$1@news.broadpark.no>

"Matthias Hoys" <anti_at_spam.com> wrote in message news:45130558$0$10453$ba620e4c_at_news.skynet.be...
>
> "Arne S" <arnsodal_at_broadpark.no> wrote in message
> news:45130291$1_at_news.broadpark.no...
>> Hajo Ehlers wrote:
>>> Arne S wrote:
>>>
>>>>Background:
>>>>Part of our production environment is based on RS/6000 technology, with
>>>>HACMP and Oracle9RAC as products on top. We have 4 p570s (4-way),
>>>>running AIX 5.3ML03, HACMP version 5.2 and Oracle RAC version 9.2.0.7.
>>>>These machines are spread across 2 server rooms (about 300 meters
>>>>apart). HACMP is configured with concurrent disk access for Oracle
>>>>db-files on raw devices. We have also configured HACMP with both IP and
>>>>NON-IP heartbeat (NON-IP heartbeat over SAN disks). Oracle's
>>>>interconnect is configured as part of the HACMP configuration. The total
>>>>number of databases/instances is about 20/80.
>>>>
>>>>My problem:
>>>>During a test failover (the network in one server room goes down) I
>>>>observed that all Oracle databases went into a "frozen" state. As far
>>>>as I know, this is not correct. I am having trouble finding out why, but
>>>>my guess is that Oracle is waiting for a "network down" or "node down"
>>>>event from HACMP before it takes any action. That will not happen,
>>>>because in this situation HACMP is still talking to all 4 nodes over the
>>>>NON-IP network on the SAN disks. When I shut down the 2 "isolated"
>>>>machines, all Oracle databases went down (lmon died). I had to start all
>>>>databases manually on the 2 "surviving" nodes. After startup I could
>>>>access the databases as normal.
>>>
>>>
>>> From the HACMP redbook:
>>> ...
>>> The non-IP networks are direct connections (point-to-point) between nodes,
>>> and do not use IP for heartbeat messages exchange, and are therefore less
>>> prone to IP network elements failures. If these network types are used, in
>>> case of IP network failure, nodes will still be able to exchange messages,
>>> so the decision is to consider the network down and no resource group
>>> activity will take place.
>>> ...
>>>
>>> So the non-IP network is designed to prevent a split-brain situation.
>>>
>>> You say:
>>>
>>>>the network in one server room goes down
>>>
>>> The question: what do you mean by that sentence?
>>> Did you take all network devices connected to the HACMP cluster
>>> offline? In that case you would have a network down event and the
>>> cluster should go down. Or did you interrupt the connection between the
>>> two sites?
>>>
>>> In the latter case you have a site failure from each site's point of
>>> view.
>>> This means that HACMP sees that it still has a connection to its
>>> switches (so the physical layer is okay), but every IP communication
>>> path to the other site is lost.
>>>
>>> So the question arises: how should HACMP behave? It does not know whether
>>> the other site still has a connection to the (user) network or not. So
>>> it's up to you to determine which site shall stay up.
>>>
>>> Just from my very rusty HACMP knowledge
>>> Hajo
>>>
>> Good point....
>> We turned off two switches, didn't do anything on the AIX servers.....
>> Just observed what went wrong...
>> ArneS
>
> Couldn't you just use 2 redundant IP heartbeats? Because if your IP network
> is down, your db server won't be reachable anyway, no?
>
>

That won't help, because we need the NON-IP network in addition anyway, in case the TCP/IP subsystem fails on any node.
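
To illustrate the reasoning, here is a minimal Python sketch of how a node might interpret its two heartbeat paths. This is not HACMP's actual logic; the names Verdict and classify_peer are made up for the example, and the case where only the non-IP path is lost is deliberately not modelled.

```python
from enum import Enum


class Verdict(Enum):
    """Hypothetical outcomes a node could reach about a peer."""
    HEALTHY = "healthy"
    NETWORK_DOWN = "network down"  # peer alive, IP path lost
    NODE_DOWN = "node down"        # no heartbeat on any path


def classify_peer(ip_heartbeat_ok: bool, non_ip_heartbeat_ok: bool) -> Verdict:
    """Classify a peer from the state of the two heartbeat paths.

    Illustrative only: a real cluster manager also uses timeouts,
    missed-heartbeat counters and multiple adapters per network.
    """
    if ip_heartbeat_ok:
        return Verdict.HEALTHY
    if non_ip_heartbeat_ok:
        # Peer still answers over the SAN disks, so only the IP
        # network is considered failed and no takeover is started.
        return Verdict.NETWORK_DOWN
    # Nothing heard on either path: the peer is presumed dead.
    return Verdict.NODE_DOWN


if __name__ == "__main__":
    for ip_ok, non_ip_ok in [(True, True), (False, True), (False, False)]:
        print(ip_ok, non_ip_ok, classify_peer(ip_ok, non_ip_ok).value)
```

With IP heartbeats only, the second argument is effectively always False, so a node whose TCP/IP subsystem has failed cannot be told apart from a node that is really down. That is the case the NON-IP network is there to cover.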
ArneS

Received on Fri Sep 22 2006 - 10:19:17 CDT