Re: Thoughts on instance crash and RAC

From: Mark D Powell
Date: Tue, 20 Aug 2013 10:51:02 -0700 (PDT)

On Tuesday, August 20, 2013 7:48:13 AM UTC-4, vsevolod afanassiev wrote:
> Some thoughts on instance crash and RAC:
>
> - Main selling point of RAC is that it increases availability: if one node crashes, other nodes are still available. During presentations Oracle guys would crash an instance by killing PMON and show that the database is still accessible through other nodes.
> - However, in my experience this isn't a very realistic scenario. Oracle instances rarely experience such an “instant” crash – an instance was running without problems and then a few seconds later it is completely gone. More often an instance would struggle for a while – maybe for 10 minutes, maybe for half an hour. Things like latch waits, library cache waits, etc. would appear, maybe with ORA-00600 and 7445 errors, maybe 4031. However, a crash is unlikely, or it will take a while. Instance "death" became slow and painful.
> - Once we had a 9.2 instance that experienced shared pool fragmentation and started reporting 4031 errors – initially every 5 min, then every minute, and then every few seconds. Eventually it crashed, but it took almost 24 hours.
> - It seems that instance crash is even less likely in the latest versions – probably Oracle introduced various timeouts, and where an older version would crash, versions 10 (especially and 11.2 tend to freeze for a while and then sort of unfreeze and continue processing.
>
> I think the same applies to OS crashes – when I worked in a Sun-only shop 15 years ago we had frequent SunOS crashes due to kernel panic. This just doesn't happen any more. We may get swap full due to a memory leak or incorrect configuration – in this case running on a virtual server helps, as we could add memory, stop “cannot fork process” errors, and then add swap. But an “out of the blue” crash of Solaris 10/11, AIX 6.1 (even 5.3 TL12), or RHEL 5 seems very, very unlikely.
>
> P.S. "A huge amount of your time is spent on traffic between instances, and the most obvious strategy is to shut down one instance and run single instance. This may not be a politically correct suggestion, though." - this is from Jonathan Lewis:

An alternative to RAC is to buy a machine big enough to support your user load and then run Data Guard to provide a failover instance.

But yes, over the years we have had numerous issues that were RAC-only. That is, if we had been running on a non-RAC system we would not have hit the bug, since it was RAC-only. Then again, we also had a case where, when one instance crashed, we were able to use the surviving instance to make corruption repairs to a SYS-owned base table that was preventing Oracle from restarting. (We used DDL to drop and recreate the affected object.)

HTH -- Mark D Powell --

Received on Tue Aug 20 2013 - 19:51:02 CEST