
Re: 10.2.0.3 RAC Eviction

From: Pete's <empete2000_at_yahoo.com>
Date: Wed, 29 Aug 2007 12:45:14 -0700
Message-ID: <1188416714.076043.38380@y42g2000hsy.googlegroups.com>


On Aug 29, 12:04 pm, Marc Bruinsma <marc.bruin..._at_chello.nl> wrote:
> Pete's wrote:
> > Vital info, 10.2.0.3 RAC PC6, 2 nodes, AIX 5.3 TL05 SP04.
>
> > Had what appears to be a network event last night. AIX did not log any
> > network link down events on either node, but found that an NFS mount
> > on the surviving node failed. Looking through the ocssd.log file on the
> > surviving node, it appears that node 1 evicted node 2 due to failures
> > of network heartbeats. The public and private interfaces on each
> > node are etherchanneled.
>
> > Does anyone have an idea as to what the following codes mean from the
> > ocssd.log file (particularly state_disk)?
>
> > node(2) timeout(202) state_network(5) state_disk(3)
>
> > Also note, I have a test RAC setup that exhibited the same behavior at
> > nearly the exact same time.
>
> > TIA,
> > Pete's
>
> Pete's,
>
> The question then is, what do the two clusters have in common? Is it the
> connection to the NFS mount/device? (Is there a voting device on there?)
>
> Since you are talking about EtherChannel (sounds a bit like a Cisco switch),
> has the switch been properly configured (think of 802.3ad aggregation, LACP
> mode, etc.)? Under load the NICs can crap out if the switch is not
> properly configured for teaming the NICs.
>
> I've seen something like this happen on RAC 10.2.0.2 on Linux with teamed
> NICs over a Cisco switch. Different OS, I know, but still...
>
> Marc

There are no cluster-related files on the NFS mount. All that is on there are common scripts that are shared with other non-clustered/clustered servers between two different sites.
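
For what it's worth, this is roughly how I confirmed that (assuming $ORA_CRS_HOME points at the CRS home on each node; adjust for your install):

# Confirm where the voting disk(s) and the OCR actually live; none of
# these paths should resolve to the NFS mount if it isn't part of the cluster
$ORA_CRS_HOME/bin/crsctl query css votedisk
$ORA_CRS_HOME/bin/ocrcheck

# And double-check what is NFS-mounted on each node
mount | grep -i nfs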

EtherChannel is configured properly; if it wasn't, the networking would not work properly, and that is a scenario I tested before implementing. Also, the test cluster has no EtherChannel to speak of, so that's not a commonality. I believe the problem was a network event, although my network people don't know whether anything actually happened. The only commonality I've found is that it was this particular site and no others. To take clustering out of the picture, a non-clustered server also experienced issues at this same site, or rather, an application had issues connecting to a database on a non-clustered server there. Too many coincidences at the same time.
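
In case it helps anyone reading along, here is roughly what I ran on each node to satisfy myself the EtherChannel was healthy (ent4 is just an example adapter name, substitute your own):

# EtherChannel pseudo-adapter attributes (member adapters, mode, hash_mode, ...)
lsattr -El ent4

# Detailed statistics, including the state of the aggregation and each member link
entstat -d ent4

# Any link-related entries in the AIX error report around the time of the eviction
errpt | more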

The original question stands: does anyone have an idea as to what the codes mean (particularly state_disk)?

node(2) timeout(202) state_network(5) state_disk(3)

I've searched Metalink on this already and have not come up with anything yet. Back to the logging: I see that there are network heartbeat issues, and I'm attempting to confirm the status of the disk heartbeat.
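
For anyone following along, this is roughly what I'm running on the surviving node (assuming the usual 10.2 log location under the CRS home; the grep strings are just guesses at what CSS writes, so adjust after eyeballing the file):

# ocssd.log for this node
CSSLOG=$ORA_CRS_HOME/log/`hostname`/cssd/ocssd.log

# Network heartbeat complaints
grep -i heartbeat $CSSLOG | tail -50

# Anything mentioning the voting disk / disk heartbeat
egrep -i "voting|votedisk|diskping" $CSSLOG | tail -50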

Tia,
Pete's.
