Re: Synopsis of a database crash and recovery (or time to bash RAID 5).

From: <rsands_at_lendleaserei.com>
Date: Mon, 12 Jun 2000 09:52:24 -0400
Message-Id: <10526.108647@fatcity.com>

I inherited a production, 'mission critical' (according to damagement) database 'running' on an NT server connected to an EMC box. We kept experiencing corrupt blocks, and after many q&a sessions with the sys admin, I discovered that his predecessor had set up the box so that the database, the Oracle software and the NT operating system were all actually running off the EMC box. (Have no idea what the drives actually on the server were doing.) The EMC box also happened to be located two floors away, and was connected to the server by a cable that wasn't very reliable. I ran nightly hot backups and exports, both of which were used frequently until the sys admin could reconfigure the box (still not unix, though).

NT databases can be dangerous. Not only is the os less reliable (IMHO), but the ease of setup can create a false sense of security and competence, and the defaults are set to keep things simple, which is not always the best approach. I agree with all the 'lessons' that have been posted: stay on unix when possible, configure carefully, run frequent backups, monitor them closely and test frequently. Also, be certain that you and the sys admin really understand each other when discussing what's actually behind a mount point!!

Robyn Sands
Lend Lease REI
(a former Californian, working for an Australian company, located in Atlanta, GA)

"Rachel Carmichael" <carmichr_at_hotmail.com> on 06/11/2000 11:03:58 PM

Please respond to ORACLE-L_at_fatcity.com

To: Multiple recipients of list ORACLE-L <ORACLE-L_at_fatcity.com>
cc: (bcc: Robyn Sands/US1/Lend Lease)

Subject: Re: Synopsis of a database crash and recovery (or time to bash RAID 5).

Paul,

having gone through a somewhat similar experience, my heart goes out to you. I do have one question though:

how come NO ONE noticed that the exports weren't being done, that the backups were being done with an open database and that the tape drive was gone and no backups were being done?

You said that the server crashed in January and that exports and shutdown of database before backup had therefore not been done since then. HOW COME NO ONE NOTICED????????????????????????????????  We are talking over 5 months here.
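
For what it's worth, a check like the following would have flagged the missing exports within a day or so. This is only a minimal sketch in Python, not anything Paul's site actually ran: the file paths and the 26-hour threshold are made-up placeholders, and it assumes the export and backup jobs each rewrite a known file on every run.

```python
import os
import time


def stale_files(paths, max_age_hours, now=None):
    """Return (path, reason) pairs for files that are missing or too old."""
    now = time.time() if now is None else now
    problems = []
    for path in paths:
        if not os.path.exists(path):
            problems.append((path, "missing"))
            continue
        # Age is measured from the file's last modification time.
        age_hours = (now - os.path.getmtime(path)) / 3600.0
        if age_hours > max_age_hours:
            problems.append((path, "last written %.1f hours ago" % age_hours))
    return problems


if __name__ == "__main__":
    # Hypothetical locations: substitute the real export dump and backup log.
    watched = [r"D:\exports\full_export.dmp", r"D:\backup\cold_backup.log"]
    for path, reason in stale_files(watched, max_age_hours=26):
        print("ALERT: %s is %s" % (path, reason))
```

Run from a different machine than the database server if you can, so the check itself survives a crash or a restore that wipes the scheduled jobs.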

Rachel

>From: Paul Drake <paled_at_home.com>
>Reply-To: ORACLE-L_at_fatcity.com
>To: Multiple recipients of list ORACLE-L <ORACLE-L_at_fatcity.com>
>Subject: Synopsis of a database crash and recovery (or time to bash RAID
>5).
>Date: Sun, 11 Jun 2000 16:54:06 -0800
>
>This past week, an Oracle Database (v7.3.4 Workgroup) on WinNT Server
>4.0 crashed at a remote Client Site. Database running NOARCHIVELOG.
>Single RAID 5 volume (4 drives), single hardware RAID controller.
>It was determined that the root cause of the crash was a faulty RAID
>controller - and that the volume was unavailable for read/write.
>That's where the problem seemingly started.
>Okay, not a huge deal yet, as we have 2 options for recovery - last cold
>backup, or import last full export (executed fresh daily).
>It turned out that the tape drive had failed weeks earlier - and no
>backups had been taken in quite some time.
>Uh oh. Okay, well - we still have the dump file, right?
>Wrong.
>In January this server had a catastrophic failure during a move - and
>had to be restored from tape.
>Backup was made with NTBackup - without backing up the registry. Had to
>re-install oracle binaries.
>Database was restored and online in 4.5 hours after the call was
>reported - not great, not bad.
>What did not take place was the re-scheduling of jobs run by the
>operating system.
>Without the scheduled jobs running - the database had not been shut down
>before *cold* backups.
>So those backups were worthless *hot* backups run without taking
>tablespaces offline.
>Without the scheduled jobs running - the daily export job had not
>executed.
>So the recovery options are from an export from January, before the
>crash then.
>Okay, we'll try to recover the database.
>Startup mount - no problem. Can view all of the datafiles, status is
>ONLINE.
>Can view the online redo logs - all seem to be fine.
>Alter database open - ORA-03113 - end of file on communication channel.
>Core dump.
>Attempted to mount and recover database - received message that no
>recovery was needed.
>Called oracle support.
>
>Opened a severity 1 TAR.
>
>Support stepped me through attempts to re-open and recover the database.
>Still ORA-03113.
>Got a full backup of all the existing files before they broke out the
>jackhammer.
>After exhausting all options, had to force open the database - which was
>then corrupted.
>I purposely forgot that init parameter used to force it open - I never
>want to see it again.
>Got most of the data out - still some was inaccessible - so recovery was
>incomplete.
>This event cost me more than 2 days of time that I didn't have.
>Grabbed the compressed export files and imported them into a new
>instance on my machine at work.
>The crashed Server was rebuilt during this time - 2 RAID 1 volumes (new
>RAID controller) - new OS install.
>
>** What I really wanted to get across is this: **
>If you're relatively new to managing Oracle Databases - particularly
>on WinNT - please understand this:
>
>Running all files on a single RAID 5 volume is extremely bad.
>Log files and control files most certainly should not be stored on RAID
>5 volumes.
>Swap space on RAID 5? Are you kidding?
>(A well-tuned Oracle Instance won't be using the OS pagefile.sys at all
>anyway)
>
>As someone else on the list once said: (to summarize)
>
>You're better off running JBOD (just a bunch of drives) than running only
>RAID 5.
>Maybe just mirror your OS and oracle binaries, control files, parameter
>files.
>Have the other drives set up as single drive RAID 0 volumes (or no
>RAID).
>Have a solid backup strategy in place, configure a disaster recovery
>agent to avoid a bare metal recovery.
>If the database is going to be at a remote site, use third party backup
>utilities for hot backups.
>It's not that hard to write the hot backup script - but it is more
>difficult to restore from a home-grown script than to have a GUI in
>front of the user that may be performing the recovery.
>If you wrote the scripts to perform the hot backup - you *will* be
>performing the recovery.
>If it's just a pre-configured restore job to run in a tool such as
>Veritas NT Backup - even a Mac User could run it.
>
>If you get the chance to specify the box - use multiple RAID controllers
>and DUPLEX across them.
>When the machine loses a RAID controller - you can keep running until
>the new one arrives, without even a hiccup.
>
>I haven't completely sworn off RAID 5 - I think that it's a good option
>compared with running RAID 0 for READ ONLY tablespaces. But for anything
>that you have to write to - I would have to recommend against it.
>
>As far as recovery options running NOARCHIVELOG - there are 4:
> recover from cold backup
> recover from logical export
> dice.com (dbajobs.com, etc.).
> the 10K tool from Oracle.
>
>My ideal config uses 2 dual-channel RAID controllers, giving you 4 I/O
>channels - 2 internal and 2 external. The newer 5U rack mount storage
>cabinets can contain up to 14 drives.
>Just demand the "extra hardware".
>Make sure that the backplanes are split - internal and external. Order
>the extra cables needed.
>Duplex all RAID volumes. Yes, you'll take a slight hit on throughput.
>Big deal.
>One more pair of drives would meet OFA standards (7 vols). Couldn't fit
>it in this config.
>So I put system on volume 0.
>
>Volume  RAID  Drives  Size GB  Tablespaces  Stores
>0       1     2       8.7      System       OS, Oracle Binaries, Control File1
>1       1     2       8.7      -            4 online redo_logs, archlogs, export files
>2       1     2       8.7      RBS          control file2
>3       1     2       8.7      TEMP         control file3
>4       1     2       8.7      INDEX_DATA   -
>5       0+1   4+      17.4     USER_DATA    -
>
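
The sizes in the volume table can be sanity-checked with a few lines of Python. This is a rough sketch, assuming every drive is 8.7 GB (which is what the mirrored pairs imply), that "4+" means exactly 4 drives for now, and that mirrored layouts (RAID 1 and 0+1) give half the raw space; none of the names below come from the original post.

```python
DRIVE_GB = 8.7  # assumed per-drive size, implied by the 2-drive RAID 1 rows

# (volume, RAID level, drive count, size in GB as listed in the table)
VOLUMES = [
    (0, "1", 2, 8.7),
    (1, "1", 2, 8.7),
    (2, "1", 2, 8.7),
    (3, "1", 2, 8.7),
    (4, "1", 2, 8.7),
    (5, "0+1", 4, 17.4),  # "4+" in the table, taken as 4 here
]


def usable_gb(raid, drives, drive_gb=DRIVE_GB):
    """Usable capacity of a mirrored layout: half of the raw space."""
    if raid in ("1", "0+1"):
        return drives * drive_gb / 2
    raise ValueError("unhandled RAID level: %s" % raid)


total_drives = sum(drives for _, _, drives, _ in VOLUMES)

for vol, raid, drives, listed_gb in VOLUMES:
    # Each listed size should match the computed usable capacity.
    assert abs(usable_gb(raid, drives) - listed_gb) < 0.01, vol

print("total drives:", total_drives)
```

The drive count comes to 14, which agrees with the 6 internal plus 8 external drives and the 14-drive cabinet mentioned in the post.
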
>This config had 6 internal drives, 8 external drives - no hot spares.
>I like the idea of having a pair of drives that are only writing
>actively to the redo logs. (except for nightly exports).
>This keeps the drive heads on the current redo log track - not searching
>all over the drive for whatever block is asked of it.
>If the drive heads are already on the right track, 1/10,000th of a
>second isn't long to wait for a write, compared with a 7 ms avg seek
>time.
>With Ultra 160/m drives these days and 64 bit, 66 MHz PCI buses, access
>times are the rate-limiting factor - not pure I/O throughput.
>If you have a write-back cache enabled, it's not such an issue - but I'm
>still a little sceptical about enabling that, even with a battery backup on
>the controller card and a UPS feeding the server.
>
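
To put the two figures quoted above side by side, here is the arithmetic as a tiny Python sketch. The only inputs are the numbers from the post (1/10,000th of a second on-track wait, 7 ms average seek); the function and constant names are just for illustration.

```python
ON_TRACK_WAIT_US = 100  # 1/10,000th of a second, in microseconds
AVG_SEEK_US = 7000      # 7 ms average seek time, in microseconds


def seek_penalty(seek_us=AVG_SEEK_US, on_track_us=ON_TRACK_WAIT_US):
    """How many times longer a write waits if the heads must seek first."""
    return seek_us / on_track_us


print(seek_penalty())
```

On those figures a redo write that has to seek first waits roughly 70 times longer than one that finds the heads already on the right track.
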
>One more thing - the entire GUI concept usually lacks the most important
>thing - a scripted way to reproduce the configuration that you just
>made. If you are going to re-create from bare metal, you have to be able
>to reproduce all of your Database's settings.
>Don't use the GUI NT Resource Kit scheduler for adding jobs - do it with
>a script so that these jobs can be reproduced.
>Recovery from a tape backup won't restore the scheduled jobs.
>
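
One possible way to keep the scheduled jobs reproducible is to hold the job list in a small script that writes out the scheduling commands. The sketch below is in Python and assumes the NT `at` command (with the Schedule service running) is what does the scheduling; the times, days and script paths are hypothetical placeholders, not the jobs from the crashed server.

```python
# Hypothetical job list: times, days and script paths are placeholders.
JOBS = [
    ("01:00", "M,T,W,Th,F,S,Su", r"C:\dba\scripts\full_export.cmd"),
    ("02:00", "M,T,W,Th,F,S,Su", r"C:\dba\scripts\cold_backup.cmd"),
]


def at_command(run_time, days, script):
    """Build one NT `at` command line that schedules a recurring job."""
    return 'at %s /every:%s "%s"' % (run_time, days, script)


def schedule_script(jobs):
    """Return the text of a .cmd file that re-creates every job."""
    lines = ["@echo off"]
    lines.extend(at_command(*job) for job in jobs)
    return "\r\n".join(lines) + "\r\n"


if __name__ == "__main__":
    # Redirect the output to a .cmd file and keep it with the backup scripts.
    print(schedule_script(JOBS), end="")
```

Keep the generated .cmd file somewhere that is backed up separately from the server, and re-run it after any restore or rebuild.
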
>drakonian.
>--
>Author: Paul Drake
> INET: paled_at_home.com
>
>Fat City Network Services -- (858) 538-5051 FAX: (858) 538-5051
>San Diego, California -- Public Internet access / Mailing Lists
>--------------------------------------------------------------------
>To REMOVE yourself from this mailing list, send an E-Mail message
>to: ListGuru_at_fatcity.com (note EXACT spelling of 'ListGuru') and in
>the message BODY, include a line containing: UNSUB ORACLE-L
>(or the name of mailing list you want to be removed from). You may
>also send the HELP command for other information (like subscribing).

--
Author: Rachel Carmichael
  INET: carmichr_at_hotmail.com

Received on Mon Jun 12 2000 - 08:52:24 CDT