Re: RAC Full cluster outage (almost)

From: Jason Heinrich <jheinrichdba_at_gmail.com>
Date: Wed, 11 Mar 2009 13:48:58 -0500
Message-ID: <b32e774d0903111148w5d0ab3dbl9e5597fd2e7f4dcc_at_mail.gmail.com>



The OP didn't mention how many voting disks were in the cluster. To tolerate the failure of N voting disks, Oracle recommends configuring 2N+1 of them, so that the surviving nodes can still see a majority. So a 2-node cluster should typically have 3 voting disks.

http://www.freelists.org/post/oracle-l/Voting-disk-TIE,5
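
As a quick sanity check (just a sketch; this is 10.2-era crsctl syntax from
memory, and the device path below is only an example), you can list the
configured voting disks and, with CRS shut down on all nodes, add more:

  crsctl query css votedisk
  # with CRS stopped on all nodes:
  crsctl add css votedisk /dev/rdsk/c1t2d0s6 -force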

--
Jason Heinrich


On Wed, Mar 11, 2009 at 1:09 PM, Christo Kutrovsky <
kutrovsky.oracle_at_gmail.com> wrote:


> Hi,
>
> We had a similar problem, except node 2 evicted node 1 via the voting
> disk, and node 1 rebooted itself.
>
> In reality, a 2-node cluster is not reliable enough during network
> issues, as it is unknown which server should remain up. It's a 50/50
> chance.
>
> One approach is to have a 3-node cluster, with only 2 nodes running
> instances. The clusterware itself does not require any licenses; it is
> free.
>
> The 3rd node only serves as an arbiter deciding which node should
> remain up.
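>
> As a rough sketch (the database, instance, and node names below are
> made up), the third node joins the clusterware but never gets an
> instance registered on it, e.g. with srvctl:
>
>   srvctl add database -d orcl -o /u01/app/oracle/product/10.2.0/db_1
>   srvctl add instance -d orcl -i orcl1 -n node1
>   srvctl add instance -d orcl -i orcl2 -n node2
>   # nothing is registered on node3 - it only votes in CSS membership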
>
> --
> Christo Kutrovsky
> Senior DBA
> The Pythian Group - www.pythian.com
> I blog at http://www.pythian.com/blogs/
>
>
> On Wed, Mar 11, 2009 at 11:35 AM, LS Cheng <exriscer_at_gmail.com> wrote:
> > Hi
> >
> > A couple of days ago one of my customers faced an almost full cluster
> > outage in a 2-node 10.2.0.4 RAC on Sun Solaris 10 SPARC (full Oracle
> > stack).
> >
> > The sequence was as follows
> >
> > 1. node 2 lost the private network, the interface went down
> > 2. node 1 evicted node 2 (as expected)
> > 3. node 1 then evicted itself
> > 4. after node 1 returned to the cluster and the cluster reformed from
> > one node back to two nodes, node 2 lost the private network again, and
> > this time the eviction occurred on node 2
> >
> > So it was not really a full cluster outage, but the evictions occurred
> > one after another, so to the users it looked like a full outage.
> >
> > My doubt is that in a 2-node cluster node 1 always survives, which was
> > not the case here. My only theory is that node 2 was so ill it could
> > not reboot the server, so node 1 then evicted itself to avoid
> > corruption.
> >
> > Any more ideas?
> >
> > Cheers
> >
> > --
> > LSC
>
>
>
>
-- http://www.freelists.org/webpage/oracle-l
Received on Wed Mar 11 2009 - 13:48:58 CDT
