Thoughts on instance crash and RAC

From: vsevolod afanassiev <vsevolod.afanassiev_at_gmail.com>
Date: Tue, 20 Aug 2013 04:48:13 -0700 (PDT)
Message-ID: <98794f2c-ace0-4567-bad3-ee1a4993a9d5_at_googlegroups.com>



Some thoughts on instance crash and RAC:
- The main selling point of RAC is that it increases availability: if one node crashes, the other nodes remain available. During presentations Oracle guys would crash an instance by killing PMON and show that the database is still accessible through the other nodes.
  • However, in my experience this isn't a very realistic scenario. Oracle instances rarely experience such an “instant” crash, where an instance runs without problems and a few seconds later is completely gone. More often an instance struggles for a while, maybe for 10 minutes, maybe for half an hour. Things like latch waits and library cache waits appear, maybe with ORA-00600 and ORA-07445 errors, maybe ORA-04031. An outright crash is unlikely, or it takes a while to happen. Instance "death" has become slow and painful.
  • Once we had a 9.2 instance that suffered shared pool fragmentation and started reporting ORA-04031 errors: initially every 5 minutes, then every minute, and then every few seconds. Eventually it crashed, but it took almost 24 hours.
  • An instance crash seems even less likely in the latest versions. Probably Oracle introduced various timeouts: where an older version would crash, versions 10 (especially 10.2.0.5) and 11.2 tend to freeze for a while and then sort of unfreeze and continue processing.

I think the same applies to OS crashes. When I worked in a Sun-only shop 15 years ago we had frequent SunOS crashes due to kernel panics. This just doesn't happen any more. We may get swap full due to a memory leak or incorrect configuration; in this case running on a virtual server helps, as we can add memory, stop the “cannot fork process” errors, and then add swap. But an “out of the blue” crash of Solaris 10/11, AIX 6.1 (even 5.3 TL12), or RHEL 5 seems very, very unlikely.

P.S. "A huge amount of your time is spent on traffic between instances, and the most obvious strategy is to shut down one instance and run single instance. This may not be a politically correct suggestion, though." - this is from Jonathan Lewis:

https://forums.oracle.com/thread/1050776