Re: Nasty RAC Bug in 10g. If you are running multi-nodes and one instance or more is not normally running - Read this...

From: Ravi Gaur <ravigaur1_at_gmail.com>
Date: Tue, 7 Apr 2009 10:00:45 -0500
Message-ID: <289232290904070800w430ae1fbv9738fab803187c8c_at_mail.gmail.com>



Thanks Robert for bringing this up!!
I asked Oracle Support about this bug (because we plan to add/drop redo log groups in our RAC environment). I'm sharing the response I got --

Look for all the following:
- RAC configuration (i.e. multiple redo threads)
- looking at the alert logs of all the instances, they show the following sequence of events, in order:
  1) the DBA dropped a specific log group# <x> belonging to a redo thread whose instanceA was already in a shutdown state (i.e. its redo thread was closed)
  2) then the DBA added log group# <x> to the redo thread belonging to a currently running instanceB
  3) then the currently running instanceB switched logs into the new log group# <x> and began writing to the old member logfiles, which were formerly members of log group# <x> before it was dropped
  4) there are LGWR errors, instance failure, and redo log corruption

  For example, here is one possible set of errors, seen in the case where the old logfiles were smaller than the new logfiles (the errors are reported when LGWR tries to write redo beyond the end of the old logfiles).

  LGWR reported:
    ORA-00340: IO error processing online log of thread
    ORA-00345: redo log write error block <blk#> count <cnt>
    ORA-00312: online log <log#> thread <thr#>: '<old_logfile>'
    ORA-17510: Attempt to do i/o beyond file size

  Other processes (e.g. pmon, lmd, lms, lmon) reported:
    ORA-00340: IO error processing online log of thread
  and LGWR terminated the instance.

  On instance restart, crash recovery failed with:
    ORA-00314: log <log#> of thread <thr#>, expected sequence# <seq#> doesn't match 0
    ORA-00312: online log <log#> thread <thr#>: 'new_logfile'

  LGWR stack:
    -> ksbrdp() -> ksbabs() -> kcrfw_redo_write() ->
    -> kcrfw_post() -> -> kcrfwcint() -> ORA-00345

- there is another (less severe) variation of this bug. The sequence of events is almost the same as described above, except that in step (2) the DBA added a different log group#, and the error is different: LGWR reports an ORA-00600:[kcrf_cached_open_log_1] when it tries to switch into the wrong logfile. In this case, no redo is actually written to the wrong logfile, and after the instance has terminated it can be restarted without any problems.
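
To make the first (more severe) sequence concrete, here is a rough Python sketch of how one might check a planned or past list of redo log changes for it. This is only an illustration of the pattern as Support described it, not an Oracle tool: the LogEvent record and the find_risky_group function are hypothetical names, and the events would have to be gathered by hand from the alert logs. It does not cover the less severe variation, where a different group# is added.

```python
from typing import Iterable, NamedTuple, Optional


class LogEvent(NamedTuple):
    """One redo log change, in alert-log order (hypothetical record layout)."""
    action: str        # "drop" or "add"
    group: int         # redo log group#
    thread: int        # redo thread the group belongs to
    thread_open: bool  # was that thread's instance running at the time?


def find_risky_group(events: Iterable[LogEvent]) -> Optional[int]:
    """Return the first group# that was dropped from a closed thread and
    later re-added to an open thread, or None if the pattern is absent."""
    dropped_from_closed = set()
    for ev in events:
        if ev.action == "drop" and not ev.thread_open:
            # Step 1: group dropped while its instance was shut down.
            dropped_from_closed.add(ev.group)
        elif (ev.action == "add" and ev.thread_open
              and ev.group in dropped_from_closed):
            # Step 2: same group# added to a running instance's thread.
            return ev.group
    return None


# Example: group 5 dropped from closed thread 3, then added to open thread 1.
history = [
    LogEvent("drop", 5, thread=3, thread_open=False),
    LogEvent("add", 5, thread=1, thread_open=True),
]
risky = find_risky_group(history)
```

With the example history above, risky ends up as 5, i.e. the group# that matches steps 1) and 2).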
- And the WORKAROUND, mentioned as follows:

To avoid the possibility of encountering this problem in the first place, set the following event in the init.ora files of all the instances:

  event="10468 trace name context forever, level 2"

The side effect of doing this is that instance recovery may be slower in cases where the logfiles are located on ASM disk groups (see bug 4967266).

If the problem has already happened, the instance has terminated, and the database can no longer be opened, then in order to recover the database to a consistent (earlier) state, do media recovery and apply redo up until just before the wrong online redo log file was switched into.
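
Since the event has to be set for every instance, a quick way to confirm it is present in each init.ora might look like the Python sketch below. It is a minimal illustration under some assumptions: the function name has_event_10468 is made up, the file contents are passed in as text, and the optional "sid." prefix (e.g. *.event=...) is assumed to be the form used in a pfile created from an spfile. It only looks for the exact event string quoted by Support.

```python
import re

# Event string quoted from the support response.
REQUIRED_EVENT = "10468 trace name context forever, level 2"

# Matches lines such as  event="..."  or  *.event='...'  and ignores
# commented-out lines, because "#" cannot match at the start.
_EVENT_LINE = re.compile(
    r"""^\s*(?:[\w*]+\.)?event\s*=\s*(['"])(.*?)\1\s*$""",
    re.IGNORECASE,
)


def has_event_10468(pfile_text: str) -> bool:
    """Return True if an uncommented event= line sets event 10468 at level 2."""
    # Normalise whitespace and case before comparing.
    wanted = " ".join(REQUIRED_EVENT.split()).lower()
    # (?!\d) rejects longer levels such as "level 20".
    pattern = re.compile(re.escape(wanted) + r"(?!\d)")
    for line in pfile_text.splitlines():
        match = _EVENT_LINE.match(line)
        if match is None:
            continue
        value = " ".join(match.group(2).split()).lower()
        if pattern.search(value):
            return True
    return False
```

One would read each instance's init.ora and pass its contents to has_event_10468; any file for which it returns False still needs the event added.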

- Currently, there is no backport patch available for bug 6786022 for your platform (Sun Solaris SPARC), but we can request one, if needed, on top of 10.2.0.4

eos


I'm not sure if I follow the sequence so I'm going to question them again.

- Ravi Gaur

On Sun, Apr 5, 2009 at 2:12 AM, Robert Freeman <robertgfreeman_at_yahoo.com> wrote:

>
> So, we ran into a nasty bug last night. We are running 10g (various
> releases) RAC on 3 or 4 node clusters. In this particular configuration we
> had a 4 node cluster, with an instance for this database on each node. 2
> instances were active, two were configured but not running.
>
> DBA went to make redo log adjustments (adding a new group) and database
> crashed. There is a bug in 10g (and apparently 11g) with respect to this
> kind of configuration. If you are running an active/passive kind of RAC
> configuration, you will want to read up on the bug. Be very careful making
> any online redo log changes if you are running in such an environment.
>
> Metalink bug number is 6786022 and it's public. We understand patch is in
> QA to correct. There is also an event you can set to avoid the problem. See
> the bug on Metalink for more information.
>
> I'll also be posting a copy of this on my Blog...
>
> Cheers to all!
>
> RF
>
>
> Robert G. Freeman
> Author:
> Blog: http://robertgfreeman.blogspot.com
> OCP: Oracle Database 11g Administrator Certified Professional Study Guide
> (Sybex)
> Oracle Database 11g New Features (Oracle Press)
> Portable DBA: Oracle (Oracle Press)
> Oracle Database 10g New Features (Oracle Press)
> Oracle9i RMAN Backup and Recovery (Oracle Press)
> Oracle9i New Features (Oracle Press)
> Other various titles out of print now...
> The LDS Church is looking for DBA's. You do have to be a Church member in
> good standing. A lot of kind people write me, concerned I may be breaking
> the law by saying you have to be a Church member. It's legal I promise! :-)
> --
> http://www.freelists.org/webpage/oracle-l
>
>
>

Received on Tue Apr 07 2009 - 10:00:45 CDT