Re: DB recovery 'opportunity'

From: Ed Stevens <ed.stevens_at_comcast.net>
Date: 29 Jul 2005 12:37:20 -0700
Message-ID: <1122665840.276869.108640@g47g2000cwa.googlegroups.com>

DA Morgan wrote:
> Ed Stevens wrote:
> > Platform - Oracle 8.1.7 on Win2k server
> >
> > Disclaimer: I was called in to pull this db out of the fire. I have
> > never seen this db before, and had no hand in its setup or current
> > condition. The guy that normally covers this db is unavailable.
> >
> > App and DB started reporting problems on Jul 22. I was called in on
> > July 28.
> > Startup of db fails. Here are the last few lines from the alert log:
> >
> > Completed: ALTER DATABASE MOUNT
> > Thu Jul 28 14:33:18 2005
> > ALTER DATABASE OPEN
> > ARCH: Beginning to archive log# 3 seq# 831
> > Thu Jul 28 14:45:10 2005
> > ARCH: I/O error 19502 archiving log 3 to 'E:\ORAARCH\XVLP\ARCH_831.ARC'
> > ARCH: Archiving not possible: error count exceeded
> > ARCH: Failed to archive log# 3 seq# 831
> > ORA-16038 signalled during: ALTER DATABASE OPEN...
> >
> > The first thing I checked was available disk space at the archive
> > destination. There were several dozen gig available. All web searches
> > (MetaLink, this ng, AskTom, Google ...) keep pointing to disk full
> > conditions. We do know that the server admins have been monkeying with
> > the disks, which are in a SAN unit. We have gotten little info from
> > them ... they (and the server) are located in Mexico, and their English
> > is little better than our Spanish.
> >
> > Further tidbits:
> >
> > There is very little alert log history available. There are scripts on
> > the server for stopping and starting the DB's (two of them) and part of
> > the shutdown renames the alert log to a backup, keeping only three
> > generations. Unfortunately, this was done 3 times in one day -- after
> > the problems began -- so any info on what led into the current
> > situation has been lost.
> >
> > On the day the problems began the original DBA, for reasons unknown,
> > modified the init.ora file, removing references to the 2d and 3d
> > control files. Those files still exist, but of course are out of sync
> > with the one remaining active file.
> >
> > Now, for the real kicker ... there are no backups ......
> >
> > Fortunately, the way this app shares data with the mainframe, we *CAN*
> > recover by recreating the db from scratch and having the app issue a
> > request to reload from the mainframe. But as an educational exercise,
> > I'd like to explore other possibilities -- just in case I find myself
> > in a similar situation with an app that doesn't so easily recover
> > itself.
> >
> > The course of action that seems best is to:
> > 1 - stop the db
> > 2 - copy the one active control file over the two old ones (with
> > corresponding renaming)
> > 3 - re-instate the control file references in init.ora
> > 4 - startup nomount
> > 5 - open resetlogs
> >
> > What say the jury?
> >
> > Yes, I've modified the shutdown script to keep much more alert.log
> > history, and will be addressing the lack of backup ...
>
> I say what in the alert log that you showed us makes you think a control
> file has anything to do with the problem?
>
> I say a quick trip to metalink with the error messages will quickly
> reveal a solution.
>
> Based on your disclaimer I am not in the market to give you a solution
> given what appears to be your lack of experience and concerns that 'the
> solution' might make things worse than they already are.
> --
> Daniel A. Morgan
> http://www.psoug.org
> damorgan_at_x.washington.edu
> (replace x with u to respond)

Daniel,

Well, nothing in the alert log made *me* think the control file had anything to do with the I/O error and the failure to start. I can't answer for the guy who made that change to the init.ora file. I simply brought it up as another irregularity in a bad situation, and another thing that needs to be fixed.

Searches of MetaLink (and the other sites listed in my original post) keep pointing to being out of disk space as the cause, and the cure is simply to create (by whatever means) more space, after which the db will self-recover. I have certainly seen this many times when an archive destination filled up, but in this case we have plenty of space. However, at this point I'm assuming the same applies: fix the disk problem (whatever it is) and the db will self-recover.

And we definitely have a more fundamental disk problem. Further investigation finds the Windows event log flooded with messages about writes to the page file timing out. As an experiment, I went to the archive destination directory and tried to simply copy one of the older archivelog files (only 10 MB in size) to 'dummy.log'. An hour later, when the copy had still not finished and all other tasks on the server seemed to be grinding to a halt, I gave up and killed my remote session.
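For anyone who wants to repeat that kind of check with a smaller footprint than a 10 MB copy, here is a minimal sketch in Python. It was not run on this server. The function name, the 1 MB default size and the `io_probe_` file prefix are my own placeholders. It reports free space at a directory and times a small synced write there. Note that if the storage is truly hung, the write will block just as my copy did, so treat it as a rough probe and not a diagnosis.

```python
import os
import shutil
import tempfile
import time


def probe_directory(directory, size_bytes=1024 * 1024, chunk_bytes=64 * 1024):
    """Report free space in `directory` and time a small synced write there.

    Returns a dict with 'free_bytes' and 'seconds'. The scratch file is
    removed afterwards, even if the write raises an error.
    """
    free_bytes = shutil.disk_usage(directory).free
    chunk = b"\0" * chunk_bytes

    # Create a uniquely named scratch file inside the target directory.
    fd, path = tempfile.mkstemp(dir=directory, prefix="io_probe_", suffix=".tmp")
    start = time.perf_counter()
    try:
        with os.fdopen(fd, "wb") as handle:
            written = 0
            while written < size_bytes:
                step = min(chunk_bytes, size_bytes - written)
                handle.write(chunk[:step])
                written += step
            # Push the data through the OS cache so the timing reflects
            # the disk and not just memory.
            handle.flush()
            os.fsync(handle.fileno())
        seconds = time.perf_counter() - start
    finally:
        os.remove(path)

    return {"free_bytes": free_bytes, "seconds": seconds}


if __name__ == "__main__":
    # The system temp directory is used here only as a harmless example target.
    result = probe_directory(tempfile.gettempdir())
    print("free bytes:", result["free_bytes"])
    print("1 MB synced write took %.3f s" % result["seconds"])
```

Plenty of free bytes combined with a write time measured in minutes would match what we are seeing: space is not the problem, the I/O path is.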

So at this point, we've tossed it back to the server and storage admins and informed all concerned that we can't do any more until the server itself and its disk problems are stabilized. When that is done I'll try a simple restart of the DB and see what happens, then go from there.

Received on Fri Jul 29 2005 - 14:37:20 CDT
