Fwd:Re: Oracle RAC and VIPs

From: Alessandro Vercelli <alever_at_libero.it>
Date: Wed, 16 Jul 2008 12:25:17 +0200
Message-Id: <K43GY5$99C3C4AB49E8D1FA71669C313853ACA6@libero.it>


Re-sending only to Oracle-L (overquoting....)

> I have a couple of follow up questions:
> 1. When was the last time you executed a successful failover test of this environment?

This was the first time I worked on that platform, which is being developed by another team. Looking into the log files, though, I saw some failed and a few successful attempts to relocate resources from the node which (later) failed to its partner, and vice versa.

> 2. What has changed since that last successful test? (assuming nothing)

I assume nothing, too :)), but I cannot be sure...

> 3. What are the public, private, and VIP IPs for these nodes?

Public and virtual IPs are in the same network, xxx.xxx.xxx.xxx/24; the private IPs are on a completely different one, 10.10.10.xxx/24.
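
As a quick sanity check of that layout, here is a minimal sketch using Python's standard ipaddress module. The public and VIP addresses below are hypothetical placeholders (the real ones are masked in this thread), and the host parts of the private addresses are assumptions too; only the 10.10.10.xxx/24 network comes from the answer above.

```python
import ipaddress

def same_subnet(addresses, prefix_len=24):
    """Return True if every address falls in one network of the given prefix length."""
    networks = {
        ipaddress.ip_interface(f"{addr}/{prefix_len}").network
        for addr in addresses
    }
    return len(networks) == 1

# Hypothetical addresses: the real public and VIP addresses are masked in this thread.
public_ips = ["192.0.2.11", "192.0.2.12"]
vips = ["192.0.2.21", "192.0.2.22"]
# Only the 10.10.10.xxx/24 network is from the thread; the host numbers are made up.
private_ips = ["10.10.10.1", "10.10.10.2"]

print(same_subnet(public_ips + vips))         # public and VIP addresses together
print(same_subnet(public_ips + private_ips))  # public and interconnect together
```

The first check should hold and the second should not if the interconnect is really on its own network.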

> It seems
> at least possible that somehow there's a network misconfiguration
> (however unlikely that may be).
> It seems unusual for a VIP resource to be in UNKNOWN state since VIPs
> are generally lightweight and there's little effort associated with
> failover. When resources are in UNKNOWN, I generally try "crs_stop -f
> <resource_name>" to clear the current state. Then I'd try "crs_start
> -c <resource_name> <node-where-you-want-it-to-start>" to see if you
> can start it manually. Hopefully, that (possibly in combination with
> answers to the above questions) will yield something worth
> investigating.
> Dan

The nodes are remote, so it's difficult to check the whole physical network configuration for problems/conflicts; I didn't try the crs_stop -f command, but I will if this issue arises again.
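
For the next occurrence, the two steps suggested above can be written down as a small dry-run sketch that only builds and prints the commands, without running anything. The function name is hypothetical. The crs_start argument order used here (resource name first, then -c and the cluster member) is an assumption that differs from the order as typed in the quoted message, so it should be checked against the usage text of the installed release before use.

```python
import shlex

def vip_recovery_plan(resource, target_node):
    """Build (but do not run) the two commands suggested for a resource stuck in UNKNOWN."""
    return [
        # Step 1: force-stop the resource to clear its current state.
        ["crs_stop", "-f", resource],
        # Step 2: try a manual start on the chosen member (argument order is an assumption).
        ["crs_start", resource, "-c", target_node],
    ]

# Placeholder names, as in the log excerpts; substitute the real resource and node.
for argv in vip_recovery_plan("ora.<failednode>.vip", "<partnernode>"):
    print(shlex.join(argv))
```

Printing the plan first makes it easy to review the exact command lines before typing them on a remote node.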

Many thanks for your help,

Alessandro

> Alessandro Vercelli wrote:
>
>The exact time of the crash is not clear: it happened on the morning of May 9th,
>and it was a database crash, not a system crash; crsd.log reported many messages like:
>
>2008-05-09 12:32:33.833: [ CRSEVT][3695033264]0CAAMonitorHandler :: 0:Action Script /u01/app/oracle/product/crs/bin/racgwrap(check) timed out for ora.<failednode>.ons! (timeout=600)
>
>each message referred to a different resource.
>
>Last week, I tried to restart the failed node (in the meantime, other people
>made other attempts) and crsd.log reported, among other messages, the following:
>
>2008-07-07 16:10:18.743: [ CRSRES][3781585840]0CRS-1028: Dependency analysis failed because of:
>'Resource in UNKNOWN state: ora.<failednode>.vip'
>
>According to crs_stat -t, the ora.<failednode>.vip resource was allocated on the
>partner node - not the failed one - and its state was UNKNOWN (as expected).
>
>My opinion is that, at the time of the crash, the partner node attempted an
>automatic failover, which failed; crsd.log of the partner node:
>
>2008-05-09 11:55:55.278: [ CRSRES][3686595504]0Attempting to start `ora.<failednode>.vip` on member `<partnernode>`
>2008-05-09 11:56:58.305: [ CRSAPP][3686595504]0StartResource error for ora.<failednode>.vip error code = -2
>2008-05-09 11:57:05.429: [ CRSEVT][3697085360]0CAAMonitorHandler :: 0:Action Script /u01/app/oracle/product/crs/bin/racgwrap(check) timed out for ora.<failednode>.vip! (timeout=60)
>
>and, finally:
>
>2008-05-09 11:58:01.422: [ CRSRES][3686595504]0X_OP_StopResourceFailed : Stop Resource failed
>(File: rti.cpp, line: 1698
>
>2008-05-09 11:58:01.422: [ CRSRES][3686595504][ALERT]0`ora.<failednode>.vip` on member `<partnernode>` has experienced an unrecoverable failure.
>2008-05-09 11:58:01.422: [ CRSRES][3686595504]0Human intervention required to resume its availability.
>2008-05-09 11:58:01.444: [ CRSRES][3686595504]0CRS-1028: Dependency analysis failed because of:
>'Resource in UNKNOWN state: ora.<failednode>.vip'
>
>Sorry for the *mess* of messages.....
>Thanks,
>Alessandro
>
>
>If you think it's related to the resource not starting because of some
>dependency, then I'd suggest looking at
>$CRS_HOME/log/<nodename>/crsd/crsd.log on each node (especially the
>crashed node) and see what's there around the time of startup.
>
>If the node won't boot, try booting it into single user mode and
>disabling clusterware from starting if you think clusterware is what's
>not allowing it to boot completely.
>
>Dan
>
>Alessandro Vercelli wrote:
>
>
>O.S.: RHEL AS4
>Hardware is HP BL45P, 4 x AMD Dual core, 8 GB RAM.
>Oracle 10.2.0.1, RAC and Clusterware

<cut>

>The failed attempts reported on the console that the listener nodeapp could not
>start; looking into the network configuration, I noticed the VIP address for the
>failing listener was not allocated on that node but on its partner; please, which
>log files do you suggest checking for errors?
>Thanks,
>Alessandro
>
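
To make the crsd.log excerpts quoted above easier to scan, here is a minimal sketch of a filter that pulls out the entries pointing at a failure. The line layout (timestamp, [component], [thread id], message) is inferred from the excerpts only, the marker strings are simply phrases that appear in them, and the function name is hypothetical; none of this is an official description of the log format.

```python
import re

# Layout assumed from the excerpts: timestamp, [component], [thread id], then the message.
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}): "
    r"\[\s*(?P<component>\w+)\]\[(?P<thread>\d+)\](?P<message>.*)$"
)
# Phrases taken from the quoted messages; extend as needed.
PROBLEM_MARKERS = ("timed out", "error code", "Dependency analysis failed",
                   "unrecoverable failure")

def find_problems(lines):
    """Return (timestamp, component, message) for entries containing a problem marker.

    Continuation lines without a timestamp do not match LINE_RE and are skipped.
    """
    hits = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and any(marker in match["message"] for marker in PROBLEM_MARKERS):
            hits.append((match["ts"], match["component"], match["message"]))
    return hits

# Sample entries copied from the excerpts (placeholders left as they are).
sample = [
    "2008-05-09 11:55:55.278: [ CRSRES][3686595504]0Attempting to start `ora.<failednode>.vip` on member `<partnernode>`",
    "2008-05-09 11:56:58.305: [ CRSAPP][3686595504]0StartResource error for ora.<failednode>.vip error code = -2",
    "2008-05-09 11:58:01.422: [ CRSRES][3686595504][ALERT]0`ora.<failednode>.vip` on member `<partnernode>` has experienced an unrecoverable failure.",
]
for ts, component, message in find_problems(sample):
    print(ts, component, message)
```

Run against the crsd.log of each node, a filter like this gives a timeline of the failed start/stop attempts that can be compared across nodes.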

--
http://www.freelists.org/webpage/oracle-l
Received on Wed Jul 16 2008 - 05:25:17 CDT
