RE: Synopsis of a database crash and recovery (or time to bash RA

From: Deshpande, Kirti <Kirti.Deshpande_at_gtedc.gte.com>
Date: Mon, 12 Jun 2000 08:54:48 -0500
Message-Id: <10526.108654@fatcity.com>

In my previous job, the local 'sys admin/DBA type' person at one of our clients left for another job. After that, no one there bothered to pay attention to the error logs, which were showing tape errors. We used to perform nightly exports (of a 6.0.13 database) and then copy the export file to a daily/weekly/monthly backup tape. When the disk controller crashed and a bunch of disks were trashed, IBM replaced all that H/W. We were not called at that time, since it had been determined that those drives needed replacing right away.

Later, I was left to recover the database from the export file on the previous night's backup tape. I found out that the last readable tape was 10 months old. We asked them to ship all those tapes to us, and we tried to read them on different tape units. There was nothing to read and nothing anyone could do (we spent 3 days with Oracle, IBM and OnTrack with no success), while the business switched to a manual mode of operation.

We had to acquire a couple of 'loaner' machines to rebuild the database from the 10-month-old export file and re-enter all the info from microfilm, hard copy reports etc. While this 'rebuilding' was going on, a fresh new database was created to support ongoing business. Then we had to re-synch all this mess on a weekly basis. It took more than 8 months (and thousands of person-hours) to finally 'catch up'. And all this happened in a hospital (a small one, though) where 24x7 was a requirement!!

What I learnt from this: (1) Read that tape after writing to it, before sending it offsite. (2) Do not initialize/rebuild disk volume groups until all avenues of reading them have been explored. (3) A disaster recovery drill is worth the money and effort.
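
Lesson (1) can be automated in a small way. The following is only a sketch, assuming the export file has been read back from the backup medium to a scratch location before the tape goes offsite; the function names and the idea of comparing SHA-256 digests are illustrative assumptions, not something described in the original post.

```python
import hashlib


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large export files do not fill memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_matches(source, restored):
    """True only if the read-back copy is byte-for-byte identical to the source.

    Both arguments are hypothetical paths: the export file on disk and the
    same file after being read back from the backup medium.
    """
    return sha256_of(source) == sha256_of(restored)
```

A nightly job could call backup_matches() on the original export and the read-back copy, and raise an alert when it returns False or when the read-back itself fails.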

Kirti

> -----Original Message-----
> From: Rachel Carmichael [SMTP:carmichr_at_hotmail.com]
> Sent: Sunday, June 11, 2000 10:04 PM
> To: Multiple recipients of list ORACLE-L
> Subject: Re: Synopsis of a database crash and recovery (or time to
> bash RAID 5).
>
> Paul,
>
> having gone through a somewhat similar experience, my heart goes out to
> you.
> I do have one question though:
>
> how come NO ONE noticed that the exports weren't being done, that the
> backups were being done with an open database and that the tape drive was
> gone and no backups were being done?
>
> You said that the server crashed in January and that exports and shutdown of
> database before backup had therefore not been done since then. HOW COME NO
> ONE NOTICED???????????????????????????????? We are talking over 5 months
> here.
>
> Rachel
>
>
> >From: Paul Drake <paled_at_home.com>
> >Reply-To: ORACLE-L_at_fatcity.com
> >To: Multiple recipients of list ORACLE-L <ORACLE-L_at_fatcity.com>
> >Subject: Synopsis of a database crash and recovery (or time to bash RAID
> >5).
> >Date: Sun, 11 Jun 2000 16:54:06 -0800
> >
> >This past week, an Oracle Database (v7.3.4 Workgroup) on WinNT Server
> >4.0 crashed at a remote Client Site. Database running NOARCHIVELOG.
> >Single RAID 5 volume (4 drives), single hardware RAID controller.
> >It was determined that the root cause of the crash was a faulty RAID
> >controller - and that the volume was unavailable for read/write.
> >That's where the problem seemingly started.
> >Okay, not a huge deal yet, as we have 2 options for recovery - last cold
> >backup, or import last full export (executed fresh daily).
> >It turned out that the tape drive had failed weeks earlier - and no
> >backups had been taken in quite some time.
> >Uh oh. Okay, well - we still have the dump file, right?
> >Wrong.
> >In January this server had a catastrophic failure during a move - and
> >had to be restored from tape.
> >Backup was made with NTBackup - without backing up the registry. Had to
> >re-install oracle binaries.
> >Database was restored and online in 4.5 hours after the call was
> >reported - not great, not bad.
> >What did not take place was the re-scheduling of jobs run by the
> >operating system.
> >Without the scheduled jobs running - the database had not been shut down
> >before *cold* backups.
> >So those backups were worthless *hot* backups run without taking
> >tablespaces offline.
> >Without the scheduled jobs running - the daily export job had not
> >executed.
> >So the recovery options are from an export from January, before the
> >crash then.
> >Okay, we'll try to recover the database.
> >Startup mount - no problem. Can view all of the datafiles, status is
> >ONLINE.
> >Can view the online redo logs - all seem to be fine.
> >Alter database open - ORA-03113 - end of file on communication channel.
> >Core dump.
> >Attempted to mount and recover database - received message that no
> >recovery was needed.
> >Called oracle support.
> >
> >Opened a severity 1 TAR.
> >
> >Support stepped me through attempts to re-open and recover the database.
> >Still ORA-03113.
> >Got a full backup of all the existing files before they broke out the
> >jackhammer.
> >After exhausting all options, had to force open the database - which was
> >then corrupted.
> >I purposely forgot that init parameter used to force it open - I never
> >want to see it again.
> >Got most of the data out - still some was inaccessible - so recovery was
> >incomplete.
> >This event cost me more than 2 days of time that I didn't have.
> >Grabbed the compressed export files and imported them into a new
> >instance on my machine at work.
> >The crashed Server was rebuilt during this time - 2 RAID 1 volumes (new
> >RAID controller) - new OS install.
> >
> >** What I really wanted to get across is this: **
> >If you're relatively new to managing Oracle Databases - particularly
> >on WinNT - please understand this:
> >
> >Running all files on a single RAID 5 volume is extremely bad.
> >Log files and control files most certainly should not be stored on RAID
> >5 volumes.
> >Swap space on RAID 5? Are you kidding?
> >(A well-tuned Oracle Instance won't be using the OS pagefile.sys at all
> >anyway)
> >
> >As someone else on the list once said: (to summarize)
> >
> >You're better off running JBOD (just a bunch of drives) than running only
> >RAID 5.
> >Maybe just mirror your OS and oracle binaries, control files, parameter
> >files.
> >Have the other drives set up as single drive RAID 0 volumes (or no
> >RAID).
> >Have a solid backup strategy in place, configure a disaster recovery
> >agent to avoid a bare metal recovery.
> >If the database is going to be at a remote site, use third party backup
> >utilities for hot backups.
> >It's not that hard to write the hot backup script - but it is more
> >difficult to restore from a home-grown script than to have a GUI in
> >front of the user that may be performing the recovery.
> >If you wrote the scripts to perform the hot backup - you *will* be
> >performing the recovery.
> >If it's just a pre-configured restore job to run in a tool such as
> >Veritas NT Backup - even a Mac User could run it.
> >
> >If you get the chance to specify the box - use multiple RAID controllers
> >and DUPLEX across them.
> >When the machine loses a RAID controller - you can keep running until
> >the new one arrives, without even a hiccup.
> >
> >I haven't completely sworn off RAID 5 - I think that it's a good option
> >compared with running RAID 0 for READ ONLY tablespaces. But for anything
> >that you have to write to - I would have to recommend against it.
> >
> >As far as recovery options running NOARCHIVELOG - there are 4:
> > recover from cold backup
> > recover from logical export
> > dice.com (dbajobs.com, etc.).
> > the 10K tool from Oracle.
> >
> >My ideal config uses 2 dual-channel RAID controllers, giving you 4 I/O
> >channels - 2 internal and 2 external. The newer 5U rack mount storage
> >cabinets can contain up to 14 drives.
> >Just demand the "extra hardware".
> >Make sure that the backplanes are split - internal and external. Order
> >the extra cables needed.
> >Duplex all RAID volumes. Yes, you'll take a slight hit on throughput.
> >Big deal.
> >One more pair of drives would meet OFA standards (7 vols). Couldn't fit
> >it in this config.
> >So I put system on volume 0.
> >
> >Volume  RAID  Drives  Size GB  Tablespaces  Stores
> >0       1     2       8.7      System       OS, Oracle Binaries, Control File1
> >1       1     2       8.7      -            4 online redo_logs, archlogs, export files
> >2       1     2       8.7      RBS          control file2
> >3       1     2       8.7      TEMP         control file3
> >4       1     2       8.7      INDEX_DATA   -
> >5       0+1   4+      17.4     USER_DATA    -
> >This config had 6 internal drives, 8 external drives - no hot spares.
> >I like the idea of having a pair of drives that are only writing
> >actively to the redo logs. (except for nightly exports).
> >This keeps the drive heads on the current redo log track - not searching
> >all over the drive for whatever block is asked of it.
> >If the drive heads are already on the right track, 1/10,000th of a
> >second isn't long to wait for a write, compared with a 7 ms avg seek
> >time.
> >With Ultra 160/m drives these days and 64 bit, 66 MHz PCI buses, access
> >times are the rate-limiting factor - not pure I/O throughput.
> >If you have a write-back cache enabled, it's not such an issue - but I'm
> >still a little sceptical to enable that, even with a battery backup on
> >the controller card and a UPS feeding the server.
> >
> >One more thing - the entire GUI concept usually lacks the most important
> >thing - a scripted way to reproduce the configuration that you just
> >made. If you are going to re-create from bare metal, you have to be able
> >to reproduce all of your Database's settings.
> >Don't use the GUI NT Resource Kit scheduler for adding jobs - do it with
Received on Mon Jun 12 2000 - 08:54:48 CDT
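
In the quoted incident, the daily export silently stopped running once the scheduled jobs were lost in the January restore, and nobody noticed for months. A freshness check on the export file is one way to surface that kind of gap. The sketch below is an assumption-laden illustration: the function names, the 26-hour threshold, and the idea of using the file's modification time are all hypothetical choices, not part of the thread.

```python
import os
import time


def hours_since_modified(path, now=None):
    """Hours elapsed since the file at `path` was last modified."""
    now = time.time() if now is None else now
    return (now - os.path.getmtime(path)) / 3600.0


def export_is_stale(path, max_age_hours=26.0, now=None):
    """True if the export file is missing or older than `max_age_hours`.

    The 26-hour default is an assumed threshold: a daily export plus
    two hours of slack. A missing file is treated as stale.
    """
    if not os.path.exists(path):
        return True
    return hours_since_modified(path, now) > max_age_hours
```

Run from a scheduler that is itself monitored, a check like this reports when the export job has not produced a new file, which is the condition that went unnoticed here.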