Nasty backup bug

From: Mladen Gogala <gogala.mladen_at_gmail.com>
Date: Sun, 13 Jan 2013 04:29:18 +0000 (UTC)
Message-ID: <pan.2013.01.13.04.29.18_at_gmail.com>



There is a nasty bug in Oracle RAC which causes multi-channel backups (backups from several nodes) to fail:

        Bug 10317487 - RMAN controlfile backup fails with ODM error ORA-17500 or ORA-245 [ID 10317487.8]

Oracle says that this bug is resolved in 11.2.0.3, but I got the symptoms on 11.2.0.3.4 (64-bit). In the RMAN log the symptoms look like this:

ORA-00245: control file backup failed; target is likely on a local file system
continuing other job steps, job failed will not be re-run
channel ch1: starting incremental level 1 datafile backup set
channel ch1: specifying datafile(s) in backup set
including current SPFILE in backup set
channel ch1: starting piece 1 at Jan 13 2013 04:24:55
channel ch1: finished piece 1 at Jan 13 2013 04:24:58
piece handle=3pnv9jgn_1_1 tag=TAG20130113T042436 comment=API Version 2.0,MMS Version 9.0.0.84
channel ch1: backup set complete, elapsed time: 00:00:03
channel ch2: finished piece 1 at Jan 13 2013 04:24:58
piece handle=3onv9jgl_1_1 tag=TAG20130113T042436 comment=API Version 2.0,MMS Version 9.0.0.84
channel ch2: backup set complete, elapsed time: 00:00:05
released channel: ch1
released channel: ch2

RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03009: failure of backup command on ch1 channel at 01/12/2013 18:24:55
ORA-00245: control file backup failed; target is likely on a local file system
RMAN>

The problem is in the SNAPSHOT CONTROLFILE NAME RMAN parameter, which controls where RMAN writes control file snapshots. For some reason, the RAC database with ASM insists that this path must not reside on a local file system. The snapshots cannot reside on an ASM diskgroup either.
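
To see where the snapshot control file currently points, the RMAN SHOW command can be used. This is only a minimal sketch: the default location mentioned in the comment is the usual one on Linux and is an assumption, not something taken from the failing system.

```rman
# Run from an RMAN session connected to the target database.
SHOW SNAPSHOT CONTROLFILE NAME;
# With no explicit setting, this usually reports a file under
# $ORACLE_HOME/dbs (snapcf_<SID>.f), which is node-local on most RAC installs.
```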

There are three ways around it:

  1. Do a single channel backup, just from a single instance. That may be acceptable if the client is large enough and has the IO capacity to handle all that IO.
  2. Put the snapshot on a raw disk device.
  3. Put the snapshot on a cluster file system. I tested with ACFS and it works. ACFS support is rather weak, though: it needs a special Linux kernel and doesn't work on the latest Red Hat kernels. GFS2 might work (I haven't tested it), and NFS probably works because it's supported by RAC.
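
As a rough illustration of the first option, a single-channel run could look like the sketch below. The channel name and the incremental level mirror the log above. The bare sbt allocation is an assumption: a real media manager setup will normally need its own PARMS string, which is not shown here.

```rman
# Hypothetical single-channel job, run from one instance only.
RUN {
  # One channel, so no other node takes part in the backup.
  ALLOCATE CHANNEL ch1 DEVICE TYPE sbt;
  BACKUP INCREMENTAL LEVEL 1 DATABASE;
  RELEASE CHANNEL ch1;
}
```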

Of those options, a single channel is the easiest to configure, followed by the raw device:

CONFIGURE SNAPSHOT CONTROLFILE NAME TO '/dev/raw/raw5';
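
For the cluster file system option the command is the same, only the target changes. In the sketch below the mount point /acfs/rman and the file name are hypothetical; any path that every node sees as shared storage would take their place.

```rman
# Hypothetical shared path on an ACFS (or other cluster) file system.
CONFIGURE SNAPSHOT CONTROLFILE NAME TO '/acfs/rman/snapcf_ORCL.f';

# Revert to the default location if the change needs to be backed out.
CONFIGURE SNAPSHOT CONTROLFILE NAME CLEAR;
```

The setting is stored in the control file, so it only has to be made once from any instance.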

Cluster file systems will need more testing. I will try setting up GFS2 and an NFS server next weekend.

Another general question for the Oracle community: why do backups always fail on weekends? I really don't appreciate being called from the office on weekends. Am I just plain unlucky, or have others experienced the same propensity for failure on weekends? Fortunately, this was not during the night hours, but the night is still young... Whoever that Murphy guy was, I'd wring his neck if he were around just now.

-- 
Mladen Gogala
The Oracle Whisperer
http://mgogala.byethost5.com
Received on Sun Jan 13 2013 - 05:29:18 CET