Oracle FAQ Your Portal to the Oracle Knowledge Grid
HOME | ASK QUESTION | ADD INFO | SEARCH | E-MAIL US
 

Home -> Community -> Mailing Lists -> Oracle-L -> Re: Synopsis of a database crash and recovery (or time to bash RAID 5).

Re: Synopsis of a database crash and recovery (or time to bash RAID 5).

From: Charlie Mengler <charliem_at_mwh.com>
Date: Sun, 11 Jun 2000 18:25:34 -0700
Message-Id: <10525.108578@fatcity.com>


Let this be a MAJOR learning lesson for all DBA's reading this list.

If you are NOT doing test restorations at least ONCE every THREE months, you are courting disaster. Everyone needs the practice so a restoration just becomes second nature & so that you can refine scripts to complete the task in your environment. By using tested & debugged scripts, you are MUCH more likely to produce a successful restored database with the minimum of downtime.

HAND!
>
> Paul Drake wrote:
>
> > This past week, an Oracle Database (v7.3.4 Workgroup) on WinNT Server
> > 4.0 crashed at a remote Client Site. Database running NOARCHIVELOG.
> > Single RAID 5 volume (4 drives), single hardware RAID controller.
> > It was determined that the root cause of the crash was a faulty RAID
> > controller - and that the volume was unavailable for read/write.
> > That's where the problem seemingly started.
> > Okay, not a huge deal yet, as we have 2 options for recovery - last cold
> > backup, or import last full export (executed fresh daily).
> > It turned out that the tape drive had failed weeks earlier - and no
> > backups had been taken in quite some time.
> > Uh oh. Okay, well - we still have the dump file, right?
> > Wrong.
> > In January this server had a catastrophic failure during a move - and
> > had to be restored from tape.
> > Backup was made with NTBackup - without backing up the registry. Had to
> > re-install oracle binaries.
> > Database was restored and online in 4.5 hours after the call was
> > reported - not great, not bad.
> > What did not take place was the re-scheduling of jobs run by the
> > operating system.
> > Without the scheduled jobs running - the database had not been shut down
> > before *cold* backups.
> > So those backups were worthless *hot* backups run without taking
> > tablespaces offline.
> > Without the scheduled jobs running - the daily export job had not
> > executed.
> > So the recovery options are from an export from January, before the
> > crash then.
> > Okay, we'll try to recover the database.
> > Startup mount - no problem. Can view all of the datafiles, status is
> > ONLINE.
> > Can view the online redo logs - all seem to be fine.
> > Alter database open - ORA-03113 - end of file on communication channel.
> > Core dump.
> > Attempted to mount and recover database - received mesage that no
> > recovery was needed.
> > Called oracle support.
> >
> > Opened a severity 1 TAR.
> >
> > Support stepped me through attempts to re-open and recover the database.
> > Still ORA-03113.
> > Got a full backup of all the existing files before they broke out the
> > jackhammer.
> > After exhausting all options, had to force open the database - which was
> > them corrupted.
> > I purposely forgot that init parameter used to force it open - I never
> > want to see it again.
> > Got most of the data out - still some was inaccessible - so recovery was
> > incomplete.
> > This event cost me more than 2 days of time that I didn't have.
> > Grabbed the compressed export files and imported them into a new
> > instance on my machine at work.
> > The crashed Server was rebuilt during this time - 2 RAID 1 volumes (new
> > RAID controller) - new OS install.
> >
> > ** What I really wanted to get across is this: **
> > If you're a relatively new to managing Oracle Databases - particularly
> > on WinNT - please understand this:
> >
> > Running all files on a single RAID 5 volume is extremely bad.
> > Log files and control files most certainly should not be stored on RAID
> > 5 volumes.
> > Swap space on RAID 5? Are you kidding?
> > (A well-tuned Oracle Instance won't be using the OS pagefile.sys at all
> > anyway)
> >
> > As someone else on the list once said: (to summarize)
> >
> > You're better off running JBOD (just a bunch of drives) that run only
> > RAID 5.
> > Maybe just mirror your OS and oracle binaries, control files, parameter
> > files.
> > Have the other drives set up as single drive RAID 0 volumes (or no
> > RAID).
> > Have a solid backup strategy in place, configure a disaster recovery
> > agent to avoid a bare metal recovery.
> > If the database is going to be at a remote site, use third party backup
> > utilities for hot backups.
> > Its not that hard to write the hot backup script - but it is more
> > difficult to restore from a home-grown script than to have a GUI in
> > front of the user that may be performing the recovery.
> > If you wrote the scripts to perform the hot backup - you *will* be
> > performing the recovery.
> > If its just a pre-configured restore job to run in a tool such as
> > Veritas NT Backup - even a Mac User could run it.
> >
> > If you get the chance to specify the box - use multiple RAID controllers
> > and DUPLEX across them.
> > When the machine loses a RAID controller - you can keep running until
> > the new one arrives, without even a hiccup.
> >
> > I haven't completely sworn off RAID 5 - I think that its a good option
> > compared with running RAID 0 for READ ONLY tablespaces. But for anything
> > that you have to write to - I would have to recommend against it.
> >
> > As far as recovery options running NOARCHIVELOG - there are 4:
> > recover from cold backup
> > recover from logical export
> > dice.com (dbajobs.com, etc.).
> > the 10K tool from Oracle.
> >
> > My ideal config uses 2 dual-channel RAID controllers, you have 4 I/O
> > channels - 2 internal and 2 external. The newer 5U rack mount storage
> > cabinets can contain up to 14 drives.
> > Just demand the "extra hardware".
> > Make sure that the backplanes are split - internal and external. Order
> > the extra cables needed.
> > Duplex all RAID volumes. Yes, you'll take a slight hit on throughput.
> > Big deal.
> > One more pair of drives would meet OFA standards (7 vols). Couldn't fit
> > it in this config.
> > So I put system on volume 0.
> >
> > Volume RAID Drives Size GB tablespaces Stores
> > 0 1 2 8.7 System OS, Oracle Binaries, Control File1
> > 1 1 2 8.7 4 online redo_logs, archlogs, export files
> > 2 1 2 8.7 RBS control file2
> > 3 1 2 8.7 TEMP control file3
> > 4 1 2 8.7 INDEX_DATA
> > 5 0+1 4+ 17.4 USER_DATA
> >
> > This config had 6 internal drives, 8 external drives - no hot spares.
> > I like the idea of having a pair of drives that are only writing
> > actively to the redo logs. (except for nightly exports).
> > This keeps the drive heads on the current redo log track - not searching
> > all over the drive for whatever block is asked of it.
> > If the drive heads are already on the right track, 1/10,000th of a
> > second isn't long to wait for a write, compared with a 7 ms avg seek
> > time.
> > With Ultra 160/m drives these days and 64 bit, 66 MHz PCI buses, access
> > times are the rate-limiting factor - not pure I/O throughput.
> > If you have a write-back cache enabled, its not such an issue - but I'm
> > still a little sceptical to enable that, even with a battery backup on
> > the controller card and a UPS feeding the server.
> >
> > One more thing - the entire GUI concpt usually lacks the most important
> > thing - a scripted way to reproduce the configuration that you just
> > made. If you are going to re-create from bare metal, you have to be able
> > to reproduce all of your Database's settings.
> > Don't use the GUI NT Resouce Kit scheduler for adding jobs - do it with
> > a script so that these jobs can be reproduced.
> > Recovery from a tape backup won't restore the scheduled jobs.
> >
> > drakonian.
> > --
> > Author: Paul Drake
> > INET: paled_at_home.com
> >
> > Fat City Network Services -- (858) 538-5051 FAX: (858) 538-5051
> > San Diego, California -- Public Internet access / Mailing Lists
> > --------------------------------------------------------------------
> > To REMOVE yourself from this mailing list, send an E-Mail message
> > to: ListGuru_at_fatcity.com (note EXACT spelling of 'ListGuru') and in
> > the message BODY, include a line containing: UNSUB ORACLE-L
> > (or the name of mailing list you want to be removed from). You may
> > also send the HELP command for other information (like subscribing).
>
> --
> Author: Chris Royce
> INET: Chris_at_Royce.net
>
> Fat City Network Services -- (858) 538-5051 FAX: (858) 538-5051
> San Diego, California -- Public Internet access / Mailing Lists
> --------------------------------------------------------------------
> To REMOVE yourself from this mailing list, send an E-Mail message
> to: ListGuru_at_fatcity.com (note EXACT spelling of 'ListGuru') and in
> the message BODY, include a line containing: UNSUB ORACLE-L
Received on Sun Jun 11 2000 - 20:25:34 CDT

Original text of this message

HOME | ASK QUESTION | ADD INFO | SEARCH | E-MAIL US