Re: DB recovery 'opportunity'

From: DA Morgan <damorgan_at_psoug.org>
Date: Fri, 29 Jul 2005 18:14:41 -0700
Message-ID: <1122686053.256727@yasure>

Ed Stevens wrote:

> DA Morgan wrote:
>

>>Ed Stevens wrote:
>>
>>>Platform - Oracle 8.1.7 on Win2k server
>>>
>>>Disclaimer: I was called in to pull this db out of the fire. I have
>>>never seen this db before, and had no hand in its setup or current
>>>condition. The guy that normally covers this db is unavailable.
>>>
>>>App and DB started reporting problems on Jul 22. I was called in on
>>>July 28.
>>>Startup of db fails. Here are the last few lines from the alert log:
>>>
>>>Completed: ALTER DATABASE MOUNT
>>>Thu Jul 28 14:33:18 2005
>>>ALTER DATABASE OPEN
>>>ARCH: Beginning to archive log# 3 seq# 831
>>>Thu Jul 28 14:45:10 2005
>>>ARCH: I/O error 19502 archiving log 3 to 'E:\ORAARCH\XVLP\ARCH_831.ARC'
>>>ARCH: Archiving not possible: error count exceeded
>>>ARCH: Failed to archive log# 3 seq# 831
>>>ORA-16038 signalled during: ALTER DATABASE OPEN...
>>>
>>>The first thing I checked was available disk space at the archive
>>>destination. There were several dozen gig available. All web searches
>>>(MetaLink, this ng, AskTom, Google ...) keep pointing to disk full
>>>conditions. We do know that the server admins have been monkeying with
>>>the disks, which are in a SAN unit. We have gotten little info from
>>>them ... they (and the server) are located in Mexico, and their English
>>>is little better than our Spanish.
>>>
>>>Further tidbits:
>>>
>>>There is very little alert log history available. There are scripts on
>>>the server for stopping and starting the DB's (two of them) and part of
>>>the shutdown renames the alert log to a backup, keeping only three
>>>generations. Unfortunately, this was done 3 times in one day -- after
>>>the problems began -- so any info on what led into the current
>>>situation has been lost.
>>>
>>>On the day the problems began the original DBA, for reasons unknown,
>>>modified the init.ora file, removing references to the 2d and 3d
>>>control files. Those files still exist, but of course are out of sync
>>>with the one remaining active file.
>>>
>>>Now, for the real kicker ... there are no backups ......
>>>
>>>Fortunately, the way this app shares data with the mainframe, we *CAN*
>>>recover by recreating the db from scratch and having the app issue a
>>>request to reload from the mainframe. But as an educational exercise,
>>>I'd like to explore other possibilities -- just in case I find myself
>>>in a similar situation with an app that doesn't so easily recover
>>>itself.
>>>
>>>The course of action that seems best is to:
>>>1 - stop the db
>>>2 - copy the one active control file over the two old ones (with
>>>corresponding renaming)
>>>3 - re-instate the control file references in init.ora
>>>4 - startup nomount
>>>5 - open resetlogs
>>>
>>>What say the jury?
>>>
>>>Yes, I've modified the shutdown script to keep much more alert.log
>>>history, and will be addressing the lack of backup ...
>>
>>I say what in the alert log that you showed us makes you think a control
>>file has anything to do with the problem?
>>
>>I say a quick trip to metalink with the error messages will quickly
>>reveal a solution.
>>
>>Based on your disclaimer I am not in the market to give you a solution
>>given what appears to be your lack of experience and concerns that 'the
>>solution' might make things worse than they already are.
>>--
>>Daniel A. Morgan
>>http://www.psoug.org
>>damorgan_at_x.washington.edu
>>(replace x with u to respond)

> 
> 
> Daniel,
> 
> Well, nothing in the alert log made *me* think the control file had
> anything to do with the i/o error and failure to start.  I can't answer
> for the guy who made that change to the init.ora file.  I simply
> brought it up as another irregularity in a bad situation and another
> thing that also needs to be fixed.
> 
> Searches of MetaLink (and other sites listed in my original) keep
> pointing to being out of disk space as the cause, and the cure to
> simply create (by whatever means) more space, and the db will
> self-recover.  I have certainly seen this many times when an archive
> destination filled up, but in this case, we have plenty of space.
> However, at this point I'm assuming the same applies: fix the disk
> problem (whatever it is) and the db will self-recover.
> 
> And we definitely have a more fundamental disk problem.  Further
> investigation finds the Windows event log flooded with msgs about
> writes to the page file timing out.  And as an experiment, I went to
> the archive destination directory and tried to simply copy one of the
> older archivelog files (only 10 MB in size) to 'dummy.log'.  An hour
> later, when it had not finished, and all other tasks on the server
> seemed to be grinding to a halt, I gave up and killed my remote
> session.
> 
> So at this point, we've tossed it back to the server and storage admins
> and informed all concerned that we can't do any more until the server
> itself, and its disk problems are stabilized.  When that is done I'll
> try a simple restart of the DB and see what happens, then go from there.

Ok then here's my advice for what it is worth ....

conn / as sysdba
shutdown abort;
clear up disk space
startup mount exclusive;
alter system archive log all to '<some_other_location>';
alter database noarchivelog;
alter system archive log all to '<original_location>';
alter database archivelog;
alter database open;
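
Given that copying a 10 MB file in the archive destination never finished, it may be worth confirming the destination can take writes again before retrying any of the above. What follows is only a rough sketch in Python, not anything Oracle supplies: the function name probe_write and its parameters size_mb and max_seconds are made up for illustration. It writes a scratch file in 1 MB chunks, forces it to disk, times the result, and gives up if the deadline passes between chunks. A single write call that blocks inside the OS can still hang, so run it from a session you can afford to kill.

```python
import os
import tempfile
import time


def probe_write(directory, size_mb=10, max_seconds=60.0):
    """Write size_mb of zeros into directory, fsync, and time the whole thing.

    Returns (elapsed_seconds, mb_per_second). Raises TimeoutError if the
    deadline passes between 1 MB chunks. The scratch file is always removed.
    """
    chunk = b"\0" * (1024 * 1024)  # 1 MB per write call
    fd, path = tempfile.mkstemp(prefix="probe_", suffix=".tmp", dir=directory)
    start = time.perf_counter()
    try:
        with os.fdopen(fd, "wb") as handle:
            for _ in range(size_mb):
                handle.write(chunk)
                # Deadline is only checked between chunks, not inside a write.
                if time.perf_counter() - start > max_seconds:
                    raise TimeoutError("write probe exceeded %.0f s" % max_seconds)
            handle.flush()
            os.fsync(handle.fileno())  # ask the OS to push the data to disk
        elapsed = time.perf_counter() - start
    finally:
        os.remove(path)  # file is already closed here, so this works on Windows
    return elapsed, size_mb / max(elapsed, 1e-9)
```

Pointing it at the destination from the alert log, for example probe_write(r"E:\ORAARCH\XVLP"), returns the elapsed seconds and the MB/s achieved. If a 10 MB write takes more than a few seconds, or the call raises TimeoutError, the disk problem is probably not fixed yet and restarting the database is unlikely to help.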

-- 
Daniel A. Morgan
http://www.psoug.org
damorgan_at_x.washington.edu
(replace x with u to respond)

Received on Fri Jul 29 2005 - 20:14:41 CDT
