
Split-brain among HACMP cluster and Oracle9RAC

From: Arne S <arnsodal_at_broadpark.no>
Date: Thu, 21 Sep 2006 18:29:09 +0200
Message-ID: <4512bddf$1@news.broadpark.no>


Background:
Part of our production environment is based on RS/6000 technology, with HACMP and Oracle9RAC as products on top. We have 4 p570s (4-way), running AIX 5.3ML03, HACMP version 5.2 and OracleRAC version 9.2.0.7. These machines are spread across 2 server rooms (about 300 meters apart). HACMP is configured with concurrent disk access for Oracle db-files on raw devices. We have also configured HACMP with both IP and NON-IP heartbeat (NON-IP heartbeat over SAN disks). Oracle's interconnect is configured as part of the HACMP configuration. The total number of databases/instances is about 20/80.

My problem:
During a test failover (the network in one server room goes down) I observed that all Oracle databases went into a "frozen" state. As far as I know, this is not the correct behaviour. I have trouble finding out why, but my guess is that Oracle is waiting for a "network down" or "node down" event from HACMP before it takes any action. That event will never arrive, because in this situation HACMP is still talking to all 4 nodes over the NON-IP network on the SAN disks. When I shut down the 2 "isolated" machines, all Oracle databases went down (lmon died). I had to start all databases manually on the 2 "surviving" nodes. After startup I could access the databases as normal.

I have been in contact with Oracle Support, and they say: "The configuration is insane. The fix is to configure the clusterware heartbeat and the oracle heartbeat on the same network. HACMP and our clusterware must see the same view of the cluster."

But what about the NON-IP heartbeat? HACMP MUST be configured to heartbeat over both an IP and a NON-IP network to avoid a split in the cluster, and to avoid disk/data corruption.
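
To make the suspected mismatch concrete, here is a minimal sketch of the two membership views. It is only an illustration of the guess described above, not a confirmed diagnosis and not how HACMP or Oracle actually compute membership. The node names, the room layout, and all function names are hypothetical. It assumes that a network failure in one room cuts every IP path to and from the nodes in that room, while the SAN disk heartbeat keeps working.

```python
# Hypothetical layout: 4 nodes in 2 server rooms (names are illustrative only).
ROOMS = {"node1": "A", "node2": "A", "node3": "B", "node4": "B"}


def ip_reachable(a, b, down_rooms):
    """Assumption: a node always sees itself; any other IP path needs
    the network in both nodes' rooms to be up."""
    if a == b:
        return True
    return ROOMS[a] not in down_rooms and ROOMS[b] not in down_rooms


def disk_reachable(a, b, san_up=True):
    """Assumption: the NON-IP heartbeat over SAN disks does not depend
    on the IP network at all."""
    return san_up


def cluster_manager_view(node, down_rooms):
    """Peers considered alive if ANY heartbeat path (IP or disk) works."""
    return {p for p in ROOMS
            if ip_reachable(node, p, down_rooms) or disk_reachable(node, p)}


def interconnect_view(node, down_rooms):
    """Peers reachable over the IP interconnect only."""
    return {p for p in ROOMS if ip_reachable(node, p, down_rooms)}


def views_agree(down_rooms):
    """True if every node sees the same peers in both views."""
    return all(cluster_manager_view(n, down_rooms) == interconnect_view(n, down_rooms)
               for n in ROOMS)


if __name__ == "__main__":
    failed = {"B"}  # the network in server room B goes down
    for n in sorted(ROOMS):
        print(n,
              "cluster manager sees:", sorted(cluster_manager_view(n, failed)),
              "| interconnect sees:", sorted(interconnect_view(n, failed)))
    print("views agree:", views_agree(failed))
```

Under these assumptions the cluster manager view still contains all 4 nodes, so no "node down" event would be raised, while the interconnect view on each node has lost some peers. That is the kind of disagreement the support statement about "the same view of the cluster" appears to refer to.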

I don't think we are the only customer running AIX, HACMP, concurrent disk access on raw devices and Oracle9RAC. Therefore I hope that you or somebody else can help me resolve this issue.

I have opened a service request with both Oracle Support and IBM Support, and I hope that somebody can help solve this issue. But each side points to the other's product....

Any ideas? Should I add a custom action in HACMP that disables the NON-IP disk heartbeat network when this happens? That sounds like a lot of shampoo for a bald head... I would have expected this to work more or less "out of the box", since the combination is OK according to the product certification matrix. (Yes, I know HACMP is not an out-of-the-box product, and I think I have pretty good control of my HACMP.)

Any ideas?

Thanks for your time, and thanks in advance!

ArneS

Received on Thu Sep 21 2006 - 11:29:09 CDT
