CSSD panics all nodes when single node loses cluster interconnect
Date: Thu, 05 Feb 2009 16:43:20 -0900
A coworker asked me to check whether anyone on this list has seen the problem described below, and whether there is a solution. We've opened a Service Request with Oracle; if they come back with a fix, I'll post it to the list as well.
Three-node RHEL 5.2 x86_64 cluster running Oracle Clusterware 10.2.0.4.
Each node has two 2-port gigabit NICs, using the bonding module and two switches to provide redundancy. The public interface is bond0 and the cluster interconnect is bond1. Simulating a private interconnect failure with 'ifconfig bond1 down' on any single node caused the entire cluster to panic approximately 90% of the time.
The log files (/var/log/messages, $CRSHOME/log/$NODE/cssd/ocssd.log) show that the two 'live' nodes lose the voting disks before OCFS2 can finish evicting the 'dead' node from the cluster, which causes cssd to reboot them. Lowering 'O2CB_HEARTBEAT_THRESHOLD' and 'Network Idle Timeout' in the OCFS2 configuration reduced the likelihood of the entire cluster panicking to approximately 20% of the time.
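For anyone who wants to check the arithmetic, here is a rough sketch of how the OCFS2 timeouts might be compared against a CSS timeout. The only part taken from the OCFS2 documentation is the relation between the heartbeat threshold (O2CB_HEARTBEAT_THRESHOLD in /etc/sysconfig/o2cb) and the disk heartbeat timeout, which is (threshold - 1) * 2 seconds. Everything else is an assumption for illustration: the function names are made up, the sample values are not our actual settings, and treating the eviction window as the larger of the two OCFS2 timeouts is a simplification that ignores recovery time.

```python
def disk_heartbeat_timeout_s(threshold):
    """O2CB disk heartbeat timeout in seconds: (threshold - 1) * 2."""
    if threshold < 2:
        raise ValueError("threshold must be at least 2")
    return (threshold - 1) * 2


def ocfs2_eviction_window_s(threshold, idle_timeout_ms):
    """Rough upper bound on how long OCFS2 needs to notice a dead node.

    Assumption: the window is the slower of the disk heartbeat timeout
    and the network idle timeout; real eviction also includes recovery.
    """
    return max(disk_heartbeat_timeout_s(threshold), idle_timeout_ms / 1000.0)


def margin_s(css_timeout_s, threshold, idle_timeout_ms):
    """Seconds left before a (hypothetical) CSS timeout fires.

    A value at or below zero suggests cssd could give up on the voting
    disks before OCFS2 has finished evicting the failed node.
    """
    return css_timeout_s - ocfs2_eviction_window_s(threshold, idle_timeout_ms)


if __name__ == "__main__":
    css_timeout_s = 60  # illustrative value only, not read from our cluster
    for threshold, idle_ms in [(31, 30000), (7, 10000)]:
        print(threshold, idle_ms, margin_s(css_timeout_s, threshold, idle_ms))
```

If this model is anywhere near right, it would fit what we saw: shrinking the OCFS2 timeouts widens the margin but does not guarantee the eviction finishes first.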
The chance of losing both NICs/switches simultaneously is small; however, management wants this investigated to determine whether it is a known issue with no fix, a misconfiguration, etc., before the cluster is put into production. Searching Metalink hasn't turned up anything very useful.
Is this an issue anyone has run into before? If so, how did you end up dealing with it?
Thanks (on behalf of my coworker, too),