RE: When control files go bad

From: Goulet, Richard <Richard.Goulet_at_parexel.com>
Date: Mon, 1 Jun 2009 15:18:52 -0400
Message-ID: <23C4836D8E9C5F4280A66C0C247BC16F290E25F2_at_US-BOS-MX011.na.pxl.int>



Rich,

        I'd believe that flushing the SAN's cache to disk did the job. Most SANs mark a write complete as soon as their cache has captured it. The cached data then gets flushed to disk on SAN shutdown (normally) or at predetermined times.
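
To make the write-back behaviour described above concrete, here is a toy Python model. It is only a sketch: the class and method names are made up for illustration and do not correspond to any real SAN interface.

```python
class WriteBackCache:
    """Toy model of a storage array with a write-back cache (illustrative only)."""

    def __init__(self):
        self.cache = {}  # block number -> data acknowledged but not yet on disk
        self.disk = {}   # block number -> data that has reached the disks

    def write(self, block, data):
        # The write is reported complete as soon as the cache holds it.
        self.cache[block] = data
        return True

    def flush(self):
        # Destage every cached block to disk, then empty the cache.
        self.disk.update(self.cache)
        self.cache.clear()


san = WriteBackCache()
san.write(3, b"controlfile block 3")
print(3 in san.disk)  # False: acknowledged, but still only in cache
san.flush()
print(3 in san.disk)  # True: on disk once the cache has been flushed
```

In this model, anything acknowledged before the shutdown only becomes durable once the flush runs, which is the point of the theory above.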

Dick Goulet
Senior Oracle DBA
PAREXEL International

-----Original Message-----

From: oracle-l-bounce_at_freelists.org
[mailto:oracle-l-bounce_at_freelists.org] On Behalf Of Rich Jesse
Sent: Monday, June 01, 2009 3:02 PM
To: Oracle L
Subject: When control files go bad

Hey all,

Our 10.1.0.5.0 DBs on AIX had some "issues" this weekend after the A/C suffered multiple failures in the server room. The DB server itself was OK,
but the SAN did an emergency shutdown from a temperature alarm.

Our SAN houses all datafiles, redo logs, archived logs, FRA, and 2/3 of the
control files (remember that last part!).

The alert.log shows something very close to this:

Sat May 30 18:10:57 2009
Errors in file /oracle/admin/db/bdump/oprd_ckpt_324056.trc:

ORA-00221: error on write to controlfile
ORA-00206: error in writing (block 3, # blocks 1) of controlfile
ORA-00202: controlfile: '/oracle/data/db/control02.ctl'
ORA-27072: File I/O error

IBM AIX RISC System/6000 Error: 5: I/O error
Additional information: 9
Additional information: 3
ORA-00206: error in writing (block 3, # blocks 1) of controlfile
ORA-00202: controlfile: '/oracle/data/db/control01.ctl'
ORA-27072: File I/O error

IBM AIX RISC System/6000 Error: 5: I/O error
Additional information: 9
Additional information: 3
Sat May 30 18:10:57 2009
CKPT: terminating instance due to error 221
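
For anyone sifting through a longer alert.log, a small Python sketch along these lines could list which controlfile copies reported write errors. It assumes the log follows the pattern in the excerpt above (an ORA-00206 line followed by an ORA-00202 line naming the file); the function name is made up for illustration.

```python
import re


def failed_controlfiles(alert_text):
    """Return controlfile paths named by ORA-00202 right after an ORA-00206 write error."""
    failed = []
    saw_write_error = False
    for line in alert_text.splitlines():
        line = line.strip()
        if line.startswith("ORA-00206"):
            saw_write_error = True
        elif line.startswith("ORA-00202") and saw_write_error:
            match = re.search(r"'([^']+)'", line)
            if match and match.group(1) not in failed:
                failed.append(match.group(1))
            saw_write_error = False
    return failed


# Sample lines taken from the excerpt above.
sample = """\
ORA-00221: error on write to controlfile
ORA-00206: error in writing (block 3, # blocks 1) of controlfile
ORA-00202: controlfile: '/oracle/data/db/control02.ctl'
ORA-27072: File I/O error
ORA-00206: error in writing (block 3, # blocks 1) of controlfile
ORA-00202: controlfile: '/oracle/data/db/control01.ctl'
ORA-27072: File I/O error
"""

print(failed_controlfiles(sample))
# ['/oracle/data/db/control02.ctl', '/oracle/data/db/control01.ctl']
```

Only the two SAN-hosted copies show up; the third copy does not appear in the excerpt at all.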

After the A/C was back online and the ambient temp in operating range again,
the SAN was restarted and had its cache flushed to disk. The DB server was
halted (not shut down) and restarted. I started the DB manually with nomount,
mount, and finally open, all successfully.

My question -- why??? I fully expected to have to rebuild the controlfile
or at least copy controlfile 3 back to 1 and 2, but all were apparently
consistent prior to startup (in hindsight, I should have copied them to
another place before attempting a restart!). The same scenario played out
on three DBs across two physical servers.
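
The "copy them somewhere first" step from the hindsight note could look roughly like the Python sketch below, run while the instance is down. It copies each controlfile to a backup directory and compares SHA-256 digests of the copies. Assumptions: the function names and the backup location are hypothetical, and matching digests are treated only as a hint that the multiplexed copies agree, not as proof that they are usable.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large controlfiles are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def preserve_and_compare(controlfiles, backup_dir):
    """Copy each controlfile into backup_dir, then report whether all copies are identical."""
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    digests = {}
    for index, source in enumerate(controlfiles, start=1):
        source = Path(source)
        # Prefix with an index so same-named files from different directories do not collide.
        target = backup_dir / f"{index:02d}_{source.name}"
        shutil.copy2(source, target)
        digests[str(source)] = sha256_of(target)
    all_match = len(set(digests.values())) == 1
    return digests, all_match


# Example call (the backup directory is a hypothetical path):
# digests, all_match = preserve_and_compare(
#     ["/oracle/data/db/control01.ctl", "/oracle/data/db/control02.ctl"],
#     "/tmp/controlfile_backup",
# )
```

Had the digests differed, the preserved copies would still have been available for comparison after the restart.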

The current working theory is that Oracle had nothing to do with the controlfiles being up-to-date, but that it was the SAN flush to disk. Or is
it possible that Oracle determined that controlfile 3 was the up-to-date one
and did the copy back to 1 and 2 for me? I didn't think that functionality
existed since there's nothing in the alert.log about that and scanning the
docs didn't turn up anything either.

The last time I had this happen to me, there was no local controlfile and
the SAN got disconnected. I ended up rebuilding the controlfile from the
daily trace.

Thoughts?

Rich

--

http://www.freelists.org/webpage/oracle-l

Received on Mon Jun 01 2009 - 14:18:52 CDT