Re: DB recovery 'opportunity'

From: Matthias Hoys <idmwarpzone_NOSPAM__at_yahoo.com>
Date: Fri, 29 Jul 2005 22:47:40 +0200
Message-ID: <42ea95ec$0$18671$ba620e4c@news.skynet.be>

"Ed Stevens" <ed.stevens_at_comcast.net> wrote in message news:1122665840.276869.108640_at_g47g2000cwa.googlegroups.com...

> DA Morgan wrote:

>> Ed Stevens wrote:
>> > Platform - Oracle 8.1.7 on Win2k server
>> >
>> > Disclaimer: I was called in to pull this db out of the fire. I have
>> > never seen this db before, and had no hand in its setup or current
>> > condition. The guy that normally covers this db is unavailable.
>> >
>> > App and DB started reporting problems on Jul 22. I was called in on
>> > July 28.
>> > Startup of db fails. Here are the last few lines from the alert log:
>> >
>> > Completed: ALTER DATABASE MOUNT
>> > Thu Jul 28 14:33:18 2005
>> > ALTER DATABASE OPEN
>> > ARCH: Beginning to archive log# 3 seq# 831
>> > Thu Jul 28 14:45:10 2005
>> > ARCH: I/O error 19502 archiving log 3 to 'E:\ORAARCH\XVLP\ARCH_831.ARC'
>> > ARCH: Archiving not possible: error count exceeded
>> > ARCH: Failed to archive log# 3 seq# 831
>> > ORA-16038 signalled during: ALTER DATABASE OPEN...
>> >
>> > The first thing I checked was available disk space at the archive
>> > destination. There were several dozen gig available. All web searches
>> > (MetaLink, this ng, AskTom, Google ...) keep pointing to disk full
>> > conditions. We do know that the server admins have been monkeying with
>> > the disks, which are in a SAN unit. We have gotten little info from
>> > them ... they (and the server) are located in Mexico, and their English
>> > is little better than our Spanish.
>> >
>> > Further tidbits:
>> >
>> > There is very little alert log history available. There are scripts on
>> > the server for stopping and starting the DB's (two of them) and part of
>> > the shutdown renames the alert log to a backup, keeping only three
>> > generations. Unfortunately, this was done 3 times in one day -- after
>> > the problems began -- so any info on what led into the current
>> > situation has been lost.
>> >
>> > On the day the problems began the original DBA, for reasons unknown,
>> > modified the init.ora file, removing references to the 2nd and 3rd
>> > control files. Those files still exist, but of course are out of sync
>> > with the one remaining active file.
>> >
>> > Now, for the real kicker ... there are no backups ......
>> >
>> > Fortunately, the way this app shares data with the mainframe, we *CAN*
>> > recover by recreating the db from scratch and having the app issue a
>> > request to reload from the mainframe. But as an educational exercise,
>> > I'd like to explore other possibilities -- just in case I find myself
>> > in a similar situation with an app that doesn't so easily recover
>> > itself.
>> >
>> > The course of action that seems best is to:
>> > 1 - stop the db
>> > 2 - copy the one active control file over the two old ones (with
>> > corresponding renaming)
>> > 3 - re-instate the control file references in init.ora
>> > 4 - startup nomount
>> > 5 - open resetlogs
>> >
>> > What say the jury?
>> >
>> > Yes, I've modified the shutdown script to keep much more alert.log
>> > history, and will be addressing the lack of backup ...
>>
>> I say what in the alert log that you showed us makes you think a control
>> file has anything to do with the problem?
>>
>> I say a quick trip to metalink with the error messages will quickly
>> reveal a solution.
>>
>> Based on your disclaimer I am not in the market to give you a solution
>> given what appears to be your lack of experience and concerns that 'the
>> solution' might make things worse than they already are.
>> --
>> Daniel A. Morgan
>> http://www.psoug.org
>> damorgan_at_x.washington.edu
>> (replace x with u to respond)

>
> Daniel,
>
> Well, nothing in the alert log made *me* think the control file had
> anything to do with the i/o error and failure to start.  I can't answer
> for the guy who made that change to the init.ora file.  I simply
> brought it up as another irregularity in a bad situation and another
> thing that also needs to be fixed.
>
> Searches of MetaLink (and other sites listed in my original) keep
> pointing to being out of disk space as the cause, and the cure to
> simply create (by whatever means) more space, and the db will
> self-recover.  I have certainly seen this many times when an archive
> destination filled up, but in this case, we have plenty of space.
> However, at this point I'm assuming the same applies: fix the disk
> problem (whatever it is) and the db will self-recover.
>
> And we definitely have a more fundamental disk problem.  Further
> investigation finds the Windows event log flooded with msgs about
> writes to the page file timing out.  And as an experiment, I went to
> the archive destination directory and tried to simply copy one of the
> older archivelog files (only 10 mb in size) to 'dummy.log'.  An hour
> later, when it had not finished, and all other tasks on the server
> seemed to be grinding to a halt, I gave up and killed my remote
> session.
>
> So at this point, we've tossed it back to the server and storage admins
> and informed all concerned that we can't do any more until the server
> itself, and its disk problems, are stabilized.  When that is done I'll
> try a simple restart of the DB and see what happens, then go from there.
>

Could be a problem with the HBA, the fiber cable, incorrect HBA Windows drivers or config, ... Work for the storage admins! Which SAN are you using? Windows 2000 is not really SAN friendly; Windows 2003 is much better ...
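
If you want a number to hand to the storage admins instead of "the copy never finished", a rough timing probe along these lines might help. This is only a sketch in Python, not anything taken from the server in question: `timed_write` and `throughput_mb_s` are made-up names, and the 10 MB default just mirrors the size of the archivelog you tried to copy. There is no timeout, so on a disk that hangs the way you describe it will hang too; run it from a session you can kill.

```python
import os
import tempfile
import time


def timed_write(directory, size_mb=10, chunk_mb=1):
    """Write size_mb of zeros to a temp file in directory and return elapsed seconds.

    The file is flushed and fsync'ed so the time includes getting the data
    to the device, not just into the OS cache. The temp file is removed
    afterwards, whether or not the write succeeded.
    """
    chunk = b"\0" * (chunk_mb * 1024 * 1024)
    fd, path = tempfile.mkstemp(prefix="ioprobe_", suffix=".tmp", dir=directory)
    start = time.perf_counter()
    try:
        with os.fdopen(fd, "wb") as f:
            for _ in range(size_mb // chunk_mb):
                f.write(chunk)
            f.flush()
            os.fsync(f.fileno())  # force the write through to the disk
        return time.perf_counter() - start
    finally:
        os.remove(path)


def throughput_mb_s(size_mb, seconds):
    """Convert a size and an elapsed time into MB/s."""
    return size_mb / seconds if seconds > 0 else float("inf")
```

Something like `throughput_mb_s(10, timed_write(r"E:\ORAARCH\XVLP"))`, run against the archive destination from the alert log and again against a local disk, would at least show whether the SAN volume is the odd one out.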

Matthias

Received on Fri Jul 29 2005 - 15:47:40 CDT
