CSSD panics all nodes when single node loses cluster interconnect

From: Maureen English <sxmte_at_email.alaska.edu>
Date: Thu, 05 Feb 2009 16:43:20 -0900
Message-ID: <498B95B8.4000103_at_email.alaska.edu>

A coworker of mine asked if I could see if anyone on this list has seen anything like this problem we are having, and if there is a solution. We've opened a Service Request with Oracle, so if they have a solution, I'll post it to the list, too.

Three-node RHEL5.2.x86_64 cluster running Oracle Clusterware
Kernel: 2.6.18-92.1.18.el5
OCFS2: 2.6.18-92.1.18.el5-1.4.1-1.el5.x86_64

Each node has two 2-port gigabit NICs, using the bonding module and two switches to provide redundancy. bond0 is the public interface; bond1 is the cluster interconnect. Testing a private interconnect failure with 'ifconfig bond1 down' on any single node causes the entire cluster to panic approximately 90% of the time.

The log files (/var/log/messages, $CRSHOME/log/$NODE/cssd/ocssd.log) show that the two 'live' nodes lose the voting disks before OCFS2 can finish evicting the 'dead' node from the cluster, which causes cssd to reboot them. Lowering 'O2CB_HEARTBEAT_THRESHOLD' and 'Network Idle Timeout' in the OCFS2 configuration reduced the likelihood of the entire cluster panicking to approximately 20% of the time.
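To make the timing race concrete, here is a rough sketch of the arithmetic, not a description of what our cluster actually does. It assumes the relation given in the OCFS2 documentation, where the heartbeat threshold (O2CB_HEARTBEAT_THRESHOLD in /etc/sysconfig/o2cb) equals (timeout in seconds / 2) + 1. It also assumes, as a simplification, that OCFS2 needs at most the larger of its disk heartbeat and network idle timeouts to notice a dead node. The function names and the CSS timeout passed in are hypothetical placeholders; substitute the values from your own configuration.

```python
def o2cb_disk_heartbeat_seconds(threshold):
    """Convert an O2CB heartbeat threshold to seconds.

    Based on the documented relation:
        threshold = (timeout_seconds / 2) + 1
    """
    if threshold < 2:
        raise ValueError("threshold must be at least 2")
    return (threshold - 1) * 2


def ocfs2_fits_within_css(threshold, idle_timeout_ms, css_timeout_s):
    """Return (fits, worst_case_seconds).

    Assumption (simplified): OCFS2 detects the dead node within the
    larger of its disk heartbeat and network idle timeouts. 'fits' is
    True only if that is strictly shorter than the CSS timeout given.
    """
    disk_s = o2cb_disk_heartbeat_seconds(threshold)
    net_s = idle_timeout_ms / 1000.0
    worst = max(disk_s, net_s)
    return worst < css_timeout_s, worst


if __name__ == "__main__":
    # Hypothetical values for illustration only, not taken from our cluster.
    css_timeout_s = 60
    for threshold, idle_ms in [(31, 30000), (16, 10000)]:
        fits, worst = ocfs2_fits_within_css(threshold, idle_ms, css_timeout_s)
        print(threshold, idle_ms, worst, "ok" if fits else "race likely")
```

This only covers detection time. Eviction and recovery add more time on top of it, which may be why lowering the timeouts reduced the panics without eliminating them.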

The chance of losing both NICs/switches simultaneously is small; however, management wants this looked at to determine whether it's a known issue with no fix, a misconfiguration, etc. before the cluster is put into production. Searching Metalink hasn't turned up anything very useful.

Is this an issue anyone has run into before? If so, how did you end up dealing with it?

Thanks (on behalf of my coworker, too),

- Maureen
Received on Thu Feb 05 2009 - 19:43:20 CST
