Re: solaris 10 + 10gr2 + NOT a success story

From: hpuxrac <>
Date: 8 May 2007 14:28:27 -0700
Message-ID: <>

On May 8, 5:11 pm, "pmik" <> wrote:
> Hello everyone,
> We have recently installed 10gR2 RAC + ASM on 2 Sun servers running
> Solaris 10 OS. Since we started using the system (before applying
> patch we have been observing a rather strange behaviour.
> We start the CRS, nodeapps, asm and finally the database. crs_stat -t
> reports everything to be fine. Then, after a random period of time,
> one of the nodes' vip is lost, resulting in the eviction of the node.
> After applying the patch, the node is not evicted, the vip is
> reassigned to the other node and the instance keeps functioning
> properly. BUT, the vip continues to fail at random intervals.
> After searching through the OS's logs, in /var/adm/messages we get a
> message that the qfe (public IP nic) gets turned off and restored in
> a second at the specific moments that the vip service gets lost.
> After we experimented enough, we came to the discovery that if we
> keep pinging endlessly the public IPs of the 2 nodes, the qfes of
> both the servers never get turned off and on. This way the RAC
> performs without any problems.
> Can someone of you explain this behavior? Is there something we are
> missing during or post installation? Is this a Solaris problem, an
> Oracle problem or a HW problem? We have almost eliminated the
> possibility of a NIC problem, since we switched the qfe nic for the
> ce nic (previously used for the interconnect) and we get the same
> behavior, this time for the ce card.
> Any help and/or guidance is extremely welcome since we seem to run
> out of options and directions. If anyone is interested, we will
> gladly provide any logs or elaborate on the matter.
> Thank you very much for your time and interest.
> Petros Mikos

Here are a couple of thoughts, though they may not be what you want to hear.

First, if you are running a production environment on RAC, it is just suicidal not to have an equivalent, ideally identical, test RAC environment. This type of high-availability environment calls for dedicating extra resources, not just personnel but also extra hardware, software, storage, etc., to be able to do adequate testing.

If you don't have that type of environment, and you are trying to run RAC in production by itself while your other systems are not RAC, it's time to keep your resume updated, because some bad things are likely to happen.

See what Moans Nogood has to say about high availability and RAC ... that's the kind of honest perspective that you need to keep in mind.

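One practical thing on the technical side: a timeline of when the public NIC drops, taken from /var/adm/messages, makes it much easier to line the link events up against the vip failures. Below is a rough Python sketch of that idea, not a tested tool. The driver names (qfe, ce) and the log path come from your post, but the "link ... up/down" wording it matches is my assumption about the message text, and the function names are made up for illustration, so adjust the pattern to whatever your log actually says.

```python
import re

# Assumed pattern: an interface such as qfe0 or ce1 somewhere on the line,
# followed by the word "link" and then "up" or "down".
# The real Solaris driver messages may be worded differently.
LINK_EVENT = re.compile(
    r"\b(qfe|ce)\d+\b.*\blink\b.*\b(up|down)\b", re.IGNORECASE
)


def link_events(lines):
    """Return (driver, state, line) for each line that looks like a link event."""
    events = []
    for line in lines:
        match = LINK_EVENT.search(line)
        if match:
            events.append(
                (match.group(1).lower(), match.group(2).lower(), line.rstrip("\n"))
            )
    return events


def scan_file(path="/var/adm/messages"):
    """Read the log at `path` and return the link events found in it."""
    with open(path, errors="replace") as handle:
        return link_events(handle)
```

Running scan_file() on each node and printing the result gives a list of timestamps you can compare against the times the vip was lost.
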
As for your specific problem, it's time to work aggressively with both Oracle and Sun. This should be a sev 1 service request that you keep open until Oracle provides you with the information and support that you need.

Received on Tue May 08 2007 - 16:28:27 CDT

