RE: CSSD panics all nodes when single node loses cluster interconnect

From: Crisler, Jon
Date: Fri, 6 Feb 2009 18:55:08 -0500
Message-ID: <56211FD5795F8346A0719FEBC0DB067503CC5C46_at_mds3aex08.USIEXCHANGE.COM>

We have run into similar problems in the past. Adjusting the OCFS heartbeat setting helps, as you have found. A few things to note:

1) There are Oracle 10g RAC bundle patches available; make sure you have them installed. Some of these are just for the RDBMS, others are for CRS as well.
2) Make sure you are using a recent release of OCFS2.
3) There are known problems with certain glibc RPMs; make sure yours are up to date. If you change any glibc RPMs, don't forget to relink Oracle.
4) Make sure your OCFS2 versions match the Linux kernels. They are probably correct, otherwise they would not install, but triple-check for compatibility.
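
For the kernel/OCFS2 match, a quick sanity check can be scripted. The sketch below is only an illustration: the function name is made up, and it assumes the OCFS2 kernel-module version string carries the kernel release as a prefix, which is what the version strings quoted in the original message suggest.

```python
def ocfs2_matches_kernel(ocfs2_version: str, kernel_release: str) -> bool:
    """Return True if the OCFS2 module version looks built for kernel_release.

    Assumption: the module version string starts with the kernel release,
    followed by a dash and the OCFS2 version itself.
    """
    return ocfs2_version.startswith(kernel_release + "-")


# Version strings as quoted in the original message.
kernel = "2.6.18-92.1.18.el5"
ocfs2 = "2.6.18-92.1.18.el5-1.4.1-1.el5.x86_64"

print(ocfs2_matches_kernel(ocfs2, kernel))
```

On a live node the two strings would come from `uname -r` and the installed package list rather than being typed in by hand.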

I sort of wonder why you are testing by taking down the bond interface. Obviously it should not affect the other nodes, but a more realistic test is to down the individual interfaces, not the bond interface itself. The result also might indicate that your bonding is not configured correctly. Do the other nodes panic if you gracefully shut down one node? If not, then I suspect bonding issues.
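
To see what the bonding driver thinks of each slave while you down the individual interfaces, you can read /proc/net/bonding/bond1 on the node. The sketch below parses that style of output. The sample text and the interface names (eth1, eth3) are made up for illustration, and the parser only assumes the usual "Slave Interface:" and "MII Status:" lines; check it against the real file on your systems.

```python
def parse_bonding_status(text):
    """Return (bond_mii_status, {slave_name: mii_status}) from bonding proc text."""
    bond_status = None
    slaves = {}
    current_slave = None
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank line or a line without "key: value"
        key, value = key.strip(), value.strip()
        if key == "Slave Interface":
            current_slave = value
        elif key == "MII Status":
            if current_slave is None:
                # The first MII Status, before any slave section, is the bond's own.
                bond_status = value
            else:
                slaves[current_slave] = value
    return bond_status, slaves


# Made-up sample in the style of /proc/net/bonding/bond1 (not real output).
SAMPLE = """\
Bonding Mode: fault-tolerance (active-backup)
Currently Active Slave: eth1
MII Status: up

Slave Interface: eth1
MII Status: up

Slave Interface: eth3
MII Status: down
"""

status, slaves = parse_bonding_status(SAMPLE)
print(status, slaves)
```

If the bond stays up and fails over cleanly when one slave goes down, the bonding itself is probably fine and the problem is elsewhere.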

The way we finally addressed this was... ASM, but we still have lots of OCFS and OCFS2 RAC systems running. I can also tell you that OCFS2 seems to outperform Red Hat GFS for RAC.

-----Original Message-----

On Behalf Of Maureen English
Sent: Thursday, February 05, 2009 8:43 PM
To:
Subject: CSSD panics all nodes when single node loses cluster interconnect

A coworker of mine asked if I could see if anyone on this list has seen anything like this problem we are having, and if there is a solution. We've opened a Service Request with Oracle, so if they have a solution, I'll post it to the list, too.

Three-node RHEL 5.2 x86_64 cluster running Oracle Clusterware
Kernel 2.6.18-92.1.18.el5
OCFS2 2.6.18-92.1.18.el5-1.4.1-1.el5.x86_64

Each node has two 2-port gigabit NICs, using the bonding module and two switches
to provide redundancy. Bond0 is the public interface, Bond1 is the cluster
interconnect. Testing private interconnect failure using 'ifconfig bond1 down'
on any single node causes the entire cluster to panic approximately 90% of
the time.

Looking at log files (/var/log/messages, $CRSHOME/log/$NODE/cssd/ocssd.log)
showed that the two 'live' nodes are losing the voting disks before OCFS2 can
finish evicting the 'dead' node from the cluster, causing cssd to reboot them.
Lowering the 'OCFS_HEARTBEAT_THRESHOLD' and 'Network Idle Timeout' values in
the OCFS2 configuration reduced the likelihood of the entire cluster panicking
to approximately 20% of the time.
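
To make the timing relationship concrete, here is a rough sketch. It assumes the rule given in the OCFS2 documentation as I understand it, that the disk heartbeat timeout in seconds is (threshold - 1) * 2, and that in the o2cb configuration file the setting is named O2CB_HEARTBEAT_THRESHOLD. The function names and the CSS timeout value are hypothetical; substitute whatever your Clusterware version actually uses.

```python
def o2cb_heartbeat_timeout_seconds(threshold: int) -> int:
    """Disk heartbeat timeout implied by the heartbeat threshold.

    Assumed rule: timeout = (threshold - 1) * 2 seconds.
    """
    return (threshold - 1) * 2


def ocfs2_evicts_before_css(threshold: int, css_timeout_seconds: int) -> bool:
    """True if OCFS2 should declare the node dead before the CSS timeout expires."""
    return o2cb_heartbeat_timeout_seconds(threshold) < css_timeout_seconds


# Hypothetical CSS timeout, for illustration only.
css_timeout = 60

for threshold in (16, 31, 61):
    timeout = o2cb_heartbeat_timeout_seconds(threshold)
    print(threshold, timeout, ocfs2_evicts_before_css(threshold, css_timeout))
```

The idea is that OCFS2 has to finish evicting the dead node, and release I/O to the voting disks, comfortably inside whatever window cssd allows; the network idle timeout would need the same kind of comparison.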

The chance of losing both NICs/switches simultaneously is small; however,
management wants it looked at to determine whether it's a known issue with no
fix, a misconfiguration, etc., before the cluster is put into production.
Searching Metalink hasn't turned up anything very useful.

Is this an issue anyone has run into before? If so, how did you end up dealing
with it?

Thanks (on behalf of my coworker, too),

- Maureen


-- Received on Fri Feb 06 2009 - 17:55:08 CST
