Re: Synopsis of a database crash and recovery (or time to bash RAID 5).

From: Chris Royce <Chris_at_Royce.net>
Date: Sun, 11 Jun 2000 20:29:55 -0400
Message-Id: <10525.108577@fatcity.com>

A brief comment --- I have been an ORACLE DBA for 14 years - I have supported a whole bunch of

releases  on all of the platforms::::::::::::::: Production :::::::::::::::: Mission Criticle ::::::

ETC ETC :::::::::::: Put it on some form of UNIX (First tier port) and then get concerned !! Sorry

Paul Drake wrote:

> This past week, an Oracle Database (v7.3.4 Workgroup) on WinNT Server
> 4.0 crashed at a remote Client Site. Database running NOARCHIVELOG.
> Single RAID 5 volume (4 drives), single hardware RAID controller.
> It was determined that the root cause of the crash was a faulty RAID
> controller - and that the volume was unavailable for read/write.
> That's where the problem seemingly started.
> Okay, not a huge deal yet, as we have 2 options for recovery - last cold
> backup, or import last full export (executed fresh daily).
> It turned out that the tape drive had failed weeks earlier - and no
> backups had been taken in quite some time.
> Uh oh. Okay, well - we still have the dump file, right?
> Wrong.
> In January this server had a catastrophic failure during a move - and
> had to be restored from tape.
> Backup was made with NTBackup - without backing up the registry. Had to
> re-install oracle binaries.
> Database was restored and online in 4.5 hours after the call was
> reported - not great, not bad.
> What did not take place was the re-scheduling of jobs run by the
> operating system.
> Without the scheduled jobs running - the database had not been shut down
> before *cold* backups.
> So those backups were worthless *hot* backups run without taking
> tablespaces offline.
> Without the scheduled jobs running - the daily export job had not
> executed.
> So the recovery options are from an export from January, before the
> crash then.
> Okay, we'll try to recover the database.
> Startup mount - no problem. Can view all of the datafiles, status is
> ONLINE.
> Can view the online redo logs - all seem to be fine.
> Alter database open - ORA-03113 - end of file on communication channel.
> Core dump.
> Attempted to mount and recover database - received mesage that no
> recovery was needed.
> Called oracle support.
>
> Opened a severity 1 TAR.
>
> Support stepped me through attempts to re-open and recover the database.
> Still ORA-03113.
> Got a full backup of all the existing files before they broke out the
> jackhammer.
> After exhausting all options, had to force open the database - which was
> them corrupted.
> I purposely forgot that init parameter used to force it open - I never
> want to see it again.
> Got most of the data out - still some was inaccessible - so recovery was
> incomplete.
> This event cost me more than 2 days of time that I didn't have.
> Grabbed the compressed export files and imported them into a new
> instance on my machine at work.
> The crashed Server was rebuilt during this time - 2 RAID 1 volumes (new
> RAID controller) - new OS install.
>
> ** What I really wanted to get across is this: **
> If you're a relatively new to managing Oracle Databases - particularly
> on WinNT - please understand this:
>
> Running all files on a single RAID 5 volume is extremely bad.
> Log files and control files most certainly should not be stored on RAID
> 5 volumes.
> Swap space on RAID 5? Are you kidding?
> (A well-tuned Oracle Instance won't be using the OS pagefile.sys at all
> anyway)
>
> As someone else on the list once said: (to summarize)
>
> You're better off running JBOD (just a bunch of drives) that run only
> RAID 5.
> Maybe just mirror your OS and oracle binaries, control files, parameter
> files.
> Have the other drives set up as single drive RAID 0 volumes (or no
> RAID).
> Have a solid backup strategy in place, configure a disaster recovery
> agent to avoid a bare metal recovery.
> If the database is going to be at a remote site, use third party backup
> utilities for hot backups.
> Its not that hard to write the hot backup script - but it is more
> difficult to restore from a home-grown script than to have a GUI in
> front of the user that may be performing the recovery.
> If you wrote the scripts to perform the hot backup - you *will* be
> performing the recovery.
> If its just a pre-configured restore job to run in a tool such as
> Veritas NT Backup - even a Mac User could run it.
>
> If you get the chance to specify the box - use multiple RAID controllers
> and DUPLEX across them.
> When the machine loses a RAID controller - you can keep running until
> the new one arrives, without even a hiccup.
>
> I haven't completely sworn off RAID 5 - I think that its a good option
> compared with running RAID 0 for READ ONLY tablespaces. But for anything
> that you have to write to - I would have to recommend against it.
>
> As far as recovery options running NOARCHIVELOG - there are 4:
> recover from cold backup
> recover from logical export
> dice.com (dbajobs.com, etc.).
> the 10K tool from Oracle.
>
> My ideal config uses 2 dual-channel RAID controllers, you have 4 I/O
> channels - 2 internal and 2 external. The newer 5U rack mount storage
> cabinets can contain up to 14 drives.
> Just demand the "extra hardware".
> Make sure that the backplanes are split - internal and external. Order
> the extra cables needed.
> Duplex all RAID volumes. Yes, you'll take a slight hit on throughput.
> Big deal.
> One more pair of drives would meet OFA standards (7 vols). Couldn't fit
> it in this config.
> So I put system on volume 0.
>
> Volume RAID Drives Size GB tablespaces Stores
> 0 1 2 8.7 System OS, Oracle Binaries, Control File1
> 1 1 2 8.7 4 online redo_logs, archlogs, export files
> 2 1 2 8.7 RBS control file2
> 3 1 2 8.7 TEMP control file3
> 4 1 2 8.7 INDEX_DATA
> 5 0+1 4+ 17.4 USER_DATA
>
> This config had 6 internal drives, 8 external drives - no hot spares.
> I like the idea of having a pair of drives that are only writing
> actively to the redo logs. (except for nightly exports).
> This keeps the drive heads on the current redo log track - not searching
> all over the drive for whatever block is asked of it.
> If the drive heads are already on the right track, 1/10,000th of a
> second isn't long to wait for a write, compared with a 7 ms avg seek
> time.
> With Ultra 160/m drives these days and 64 bit, 66 MHz PCI buses, access
> times are the rate-limiting factor - not pure I/O throughput.
> If you have a write-back cache enabled, its not such an issue - but I'm
> still a little sceptical to enable that, even with a battery backup on
> the controller card and a UPS feeding the server.
>
> One more thing - the entire GUI concpt usually lacks the most important
> thing - a scripted way to reproduce the configuration that you just
> made. If you are going to re-create from bare metal, you have to be able
> to reproduce all of your Database's settings.
> Don't use the GUI NT Resouce Kit scheduler for adding jobs - do it with
> a script so that these jobs can be reproduced.
> Recovery from a tape backup won't restore the scheduled jobs.
>
> drakonian.
> --
> Author: Paul Drake
> INET: paled_at_home.com
>
> Fat City Network Services -- (858) 538-5051 FAX: (858) 538-5051
> San Diego, California -- Public Internet access / Mailing Lists
> --------------------------------------------------------------------
> To REMOVE yourself from this mailing list, send an E-Mail message
> to: ListGuru_at_fatcity.com (note EXACT spelling of 'ListGuru') and in
> the message BODY, include a line containing: UNSUB ORACLE-L
Received on Sun Jun 11 2000 - 19:29:55 CDT