Re: 10.2.0.3 RAC Eviction

From: Pete's <empete2000_at_yahoo.com>
Date: Tue, 04 Sep 2007 13:17:35 -0700
Message-ID: <1188937055.983357.258000@22g2000hsm.googlegroups.com>


On Sep 4, 11:07 am, DA Morgan <damor..._at_psoug.org> wrote:
> Pete's wrote:
> > On Aug 29, 12:04 pm, Marc Bruinsma <marc.bruin..._at_chello.nl> wrote:
> >> Pete's wrote:
> >>> Vital info, 10.2.0.3 RAC PC6, 2 nodes, AIX 5.3 TL05 SP04.
> >>> Had what appears to be a network event last night. AIX did not log any
> >>> network link-down events on either node, but an NFS mount on the
> >>> surviving node failed. Looking through the ocssd.log file on the
> >>> surviving node, it appears that node 1 evicted node 2 due to failures
> >>> of network heartbeats. The public and private interfaces on each
> >>> node are etherchanneled.
> >>> Does anyone have an idea as to what the following codes from the
> >>> ocssd.log file mean (particularly state_disk)?
> >>> node(2) timeout(202) state_network(5) state_disk(3)
> >>> Also note, I have a test RAC setup that exhibited the same behavior at
> >>> nearly the exact same time.
> >>> TIA,
> >>> Pete's
> >> Pete's,
>
> >> The question then is: what do the two clusters have in common? Is it the
> >> connection to the NFS mount/device (is there a voting disk on there)?
>
> >> Since you are talking about EtherChannel (which sounds like a Cisco
> >> switch), has the switch been properly configured (think of 802.3ad
> >> aggregation, LACP mode, etc.)? Under load, the NICs can crap out if the
> >> switch is not properly configured for teaming the NICs.
>
> >> I've seen something like this happen on RAC 10.2.0.2 on Linux with teamed
> >> NICs over a Cisco switch. Different OS, I know, but still...
>
> >> Marc
>
> > There are no cluster-related files on the NFS mount. All that is on
> > there are common scripts that are shared with other clustered and
> > non-clustered servers across two different sites.
>
> > EtherChannel is configured properly; if it weren't, the networking
> > would not work properly, and that is a scenario I tested before
> > implementing. Also, the test cluster has no EtherChannel to speak
> > of, so that's not a commonality. I believe the problem was a network
> > event, though my network people cannot confirm whether anything
> > happened at all. The only commonality I've found is that it was this
> > particular site and no others. To take clustering out of the
> > picture, a non-clustered server also experienced issues, or rather, an
> > application had trouble connecting to a database on a non-clustered
> > server at this same site. Too many coincidences at the same time.
>
> > The original question stands: does anyone have an idea as to what the
> > codes mean (particularly state_disk)?
>
> > node(2) timeout(202) state_network(5) state_disk(3)
>
> > I've searched Metalink on this already and have not come up with
> > anything yet. Back to the logging: I can see that there are network
> > heartbeat issues, and I'm attempting to confirm the status of the disk
> > heartbeat.
>
> > TIA,
> > Pete's.
>
> Assuming Cisco ... there is a known bug related to TNSNAMES.ORA.
> I don't have the reference handy, but you can look it up on Metalink.
> --
> Daniel A. Morgan
> University of Washington
> damor..._at_x.washington.edu (replace x with u to respond)
> Puget Sound Oracle Users Group
> www.psoug.org

Yes, Cisco. I'll look for it.
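
On the interconnect side, here is a rough sketch of the checks involved,
using the standard 10.2 clusterware tools (ORA_CRS_HOME is assumed to
point at the CRS home; adjust for your install):

    # Show which interfaces CRS has registered as public / cluster_interconnect
    $ORA_CRS_HOME/bin/oifcfg getif

    # Verify node-to-node connectivity over those interfaces
    $ORA_CRS_HOME/bin/cluvfy comp nodecon -n all -verbose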

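To confirm the disk heartbeat status, this is roughly what can be run on
the surviving node. Only a sketch: the paths assume the default 10.2 CRS
log layout, <nodename> is a placeholder, and the disktimeout parameter may
not be exposed on every patch level.

    # Voting disk locations and CSS timeout settings
    $ORA_CRS_HOME/bin/crsctl query css votedisk
    $ORA_CRS_HOME/bin/crsctl get css misscount
    $ORA_CRS_HOME/bin/crsctl get css disktimeout

    # Pull the heartbeat state lines out of ocssd.log
    egrep -i "state_network|state_disk" \
        $ORA_CRS_HOME/log/<nodename>/cssd/ocssd.log
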
Thanks,
Pete's
