Path: text.usenetserver.com!out04a.usenetserver.com!news.usenetserver.com!in02.usenetserver.com!news.usenetserver.com!postnews.google.com!news2.google.com!border1.nntp.dca.giganews.com!nntp.giganews.com!nx01.iad01.newshosting.com!newshosting.com!post01.iad01!not-for-mail
Date: Tue, 04 Sep 2007 09:07:40 -0700
From: DA Morgan <damorgan@psoug.org>
Organization: Puget Sound Oracle Users Group
User-Agent: Thunderbird 2.0.0.6 (Windows/20070728)
MIME-Version: 1.0
Newsgroups: comp.databases.oracle.server
Subject: Re: 10.2.0.3 RAC Eviction
References: <1188400796.145282.132870@19g2000hsx.googlegroups.com>   <EGhBi.588$9V2.198@amstwist00> <1188416714.076043.38380@y42g2000hsy.googlegroups.com>
In-Reply-To: <1188416714.076043.38380@y42g2000hsy.googlegroups.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Message-ID: <1188922054.120725@bubbleator.drizzle.com>
Cache-Post-Path: bubbleator.drizzle.com!unknown@216.162.218.178
X-Cache: nntpcache 3.0.1 (see http://www.nntpcache.org/)
Lines: 68
X-Complaints-To: abuse@csolutions.net
Xref: usenetserver.com comp.databases.oracle.server:434365
X-Received-Date: Tue, 04 Sep 2007 12:07:34 EDT (text.usenetserver.com)

Pete's wrote:
> On Aug 29, 12:04 pm, Marc Bruinsma <marc.bruin...@chello.nl> wrote:
>> Pete's wrote:
>>> Vital info, 10.2.0.3 RAC PC6, 2 nodes, AIX 5.3 TL05 SP04.
>>> Had what appears to be a network event last night, AIX did not log any
>>> network link down events on either node, but found that an NFS mount
>>> on the surviving node failed.  Looking thru the ocssd.log file on the
>>> surviving node, it appears that node 1 evicted node 2 due to failures
>>> of network heartbeats.  The public and private interfaces on each
>>> node are etherchanneled.
>>> Does anyone have an idea as to what the following codes mean from the
>>> ocssd.log file (particularly state_disk)?
>>>  node(2) timeout(202) state_network(5) state_disk(3)
>>> Also note, I have a test RAC setup that exhibited the same behavior at
>>> nearly the exact same time.
>>> TIA,
>>> Pete's
>> Pete's,
>>
>> The question then is, what do the two clusters have in common? Is it the
>> connection to the NFS mount/device? (Is there a voting device on there?)
>>
>> Since you are talking about Etherchannel (sounds a bit like a Cisco switch),
>> has the switch been properly configured (think of 802.3ad aggregation, LACP
>> mode, etc.), because under load the NICs can crap out if the switch is
>> not properly configured for teaming the NICs.
>>
>> I've seen something like this happen on RAC 10.2.0.2 on Linux with teamed
>> NICs over a Cisco switch. Different OS, I know, but still...
>>
>> Marc
> 
> There are no cluster related files on the nfs mount.  All that is on
> there are common scripts that are shared with other non-clustered/
> clustered servers between two different sites.
> 
> Ether channel is configured properly; if it weren't, the networking
> would not work properly, and this was a scenario I tested before
> implementing.  Also, the test cluster has no ether-channel to speak
> of, so that's not a commonality.  I believe I know what the problem
> was: a network event, though my network people cannot say whether
> anything happened at all.  The only commonality I've found is that it
> was this particular site and no others.  To take clustering out of the
> picture, a non-clustered server also experienced issues, or rather, an
> application had issues connecting to a db on a non-clustered server at
> this same site.  Too many coincidences at the same time.
> 
> The original question stands: does anyone have an idea as to what the
> codes mean (particularly state_disk)?
> 
> node(2) timeout(202) state_network(5) state_disk(3)
> 
> I've searched metalink on this already and did not come up with
> anything, yet.  Back to the logging, I see that there are network
> heartbeat issues and I'm attempting to confirm the status of the disk
> heartbeat.
> 
> Tia,
> Pete's.

Assuming CISCO ... there is a known bug related to TNSNAMES.ORA.
I don't have the reference handy but you can look it up on metalink.
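
For comparing the two clusters' logs side by side, the quoted line
(node(2) timeout(202) state_network(5) state_disk(3)) can be pulled apart
mechanically. This is only a minimal sketch, assuming the fields always
appear as name(number) pairs as in that line; the function name is made up
for illustration, and it extracts the numbers without saying anything
about what the state codes mean.

```python
import re

# Matches name(number) pairs such as "state_disk(3)".
# Assumption: every field of interest is written as name(digits).
FIELD = re.compile(r"(\w+)\((\d+)\)")


def parse_eviction_fields(line):
    """Return the name(value) pairs found on one log line as a dict of ints."""
    return {name: int(value) for name, value in FIELD.findall(line)}


# The line quoted in this thread.
sample = "node(2) timeout(202) state_network(5) state_disk(3)"
fields = parse_eviction_fields(sample)
print(fields)
```

Run over the matching lines from both ocssd.log files, this makes it easy
to see whether state_disk changes at the same moments as state_network.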
-- 
Daniel A. Morgan
University of Washington
damorgan@x.washington.edu (replace x with u to respond)
Puget Sound Oracle Users Group
www.psoug.org
