Re: 10.2.0.3 RAC Eviction

From: DA Morgan <damorgan_at_psoug.org>
Date: Tue, 04 Sep 2007 09:07:40 -0700
Message-ID: <1188922054.120725@bubbleator.drizzle.com>

Pete's wrote:
> On Aug 29, 12:04 pm, Marc Bruinsma <marc.bruin..._at_chello.nl> wrote:

>> Pete's wrote:
>>> Vital info, 10.2.0.3 RAC PC6, 2 nodes, AIX 5.3 TL05 SP04.
>>> Had what appears to be a network event last night, AIX did not log any
>>> network link down events on either node, but found that an NFS mount
>>> on the surviving node failed.  Looking through the ocssd.log file on the
>>> surviving node, it appears that node 1 evicted node 2 due to failures
>>> of network heart beats.  The public and private interfaces on each
>>> node are etherchanneled.
>>> Does anyone have an idea as to what the following codes mean from the
>>> ocssd.log file (particularly state_disk)?
>>>  node(2) timeout(202) state_network(5) state_disk(3)
>>> Also note, I have a test RAC setup that exhibited the same behavior at
>>> nearly the exact same time.
>>> TIA,
>>> Pete's
>> Pete's,
>>
>> The question then is, what do the two clusters have in common? Is it the
>> connection to the NFS mount/device. (is there a voting device on there?).
>>
>> Since you are talking about Etherchannel (sounds a bit like a Cisco switch),
>> has the switch been properly configured (think of 802.3ad aggregation, lacp
>> mode, etc.), because under load the NICs can crap out if the switch is
>> not properly configured for teaming the NICs.
>>
>> I've seen something like this happen on RAC 10.2.0.2 on Linux with teamed
>> NICs over a Cisco switch. Different OS, I know, but still...
>>
>> Marc

>
> There are no cluster related files on the nfs mount. All that is on
> there are common scripts that are shared with other non-clustered/
> clustered servers between two different sites.
>
> Etherchannel is configured properly; if it weren't, the networking
> would not work properly, and this was a scenario I tested before
> implementing. Also, the test cluster has no Etherchannel to speak
> of, so that's not a commonality. I believe I know what the problem
> was: a network event, although my network people cannot say whether
> anything happened at all. The only commonality I've found is that it
> was this particular site and no others. To take clustering out of the
> picture, a non-clustered server also experienced issues, or rather, an
> application had issues connecting to a db on a non-clustered server at
> this same site. Too many coincidences at the same time.
>
> The original question stands: does anyone have an idea as to what the
> codes mean (particularly state_disk)?
>
> node(2) timeout(202) state_network(5) state_disk(3)
>
> I've searched metalink on this already and did not come up with
> anything, yet. Back to the logging, I see that there are network
> heartbeat issues and I'm attempting to confirm the status of the disk
> heartbeat.
>
> Tia,
> Pete's.

Assuming CISCO ... there is a known bug related to TNSNAMES.ORA. I don't have the reference handy but you can look it up on metalink.
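
On the logging side, if you want to line up the status fields from ocssd.log on both clusters around the time of the event, a small script can pull them out. What follows is only a minimal sketch in Python, assuming the fields always appear as name(number) pairs like the line quoted above. The function names are made up for illustration, and it says nothing about what the numeric codes mean.

```python
import re
from typing import Dict, Iterable, List

# Matches name(number) pairs such as timeout(202) or state_disk(3).
# Assumption: the log writes every status field in this form.
FIELD_RE = re.compile(r"(\w+)\((\d+)\)")


def parse_status_fields(line: str) -> Dict[str, int]:
    """Return the name(number) pairs found in one log line as a dict."""
    return {name: int(value) for name, value in FIELD_RE.findall(line)}


def status_lines(lines: Iterable[str]) -> List[Dict[str, int]]:
    """Keep only lines that carry both state_network and state_disk."""
    found = []
    for line in lines:
        fields = parse_status_fields(line)
        if "state_network" in fields and "state_disk" in fields:
            found.append(fields)
    return found


# The line quoted in the original post.
sample = " node(2) timeout(202) state_network(5) state_disk(3)"
print(parse_status_fields(sample))
```

Feeding each node's ocssd.log through status_lines() gives a list of dicts that can be compared across nodes or clusters to see whether state_disk changed at the same moment as state_network.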

-- 
Daniel A. Morgan
University of Washington
damorgan_at_x.washington.edu (replace x with u to respond)
Puget Sound Oracle Users Group
www.psoug.org

Received on Tue Sep 04 2007 - 11:07:40 CDT