Re: Oracle backups using Snapshot Technology

From: Mark Brinsmead <pythianbrinsmead_at_gmail.com>
Date: Thu, 9 Nov 2006 18:04:06 -0700
Message-ID: <cf3341710611091704l2bb505a2j38da796fe971841b@mail.gmail.com>

Comments inline:

On 11/9/06, Hameed, Amir <Amir.Hameed_at_xerox.com> wrote:
>
> ... "until the split is sent to tape, there is no good
> backup because a disk failure in the primary backup storage can destroy
> the entire snapshot". Even though this is true but there are ways to
> protect the online backup mirror by mirroring it with 1+0 or 0+1. It is
> certainly not a cheap solution but there is no guarantee that a tape
> will not go bad after the snapshot is copied to the tape....

Sure, tapes fail. That is precisely why most "enterprise" backup solutions allow you to duplex (or multiplex) tapes. They not only assume that tapes can fail, but sensibly assume that they will fail. Just as a good backup strategy must assume that RAID-1 storage not only can, but will, fail.

Please bear in mind that I made that quoted comment in the context of people who use a "snapshot" volume as their only backup; that is, they never write the contents of the snapshot volume to tape. (Yes, such sites do exist, as horrible as this is to contemplate.) And besides, no level of RAID on the snapshot volume will protect you from failure of the primary media until after the snapshot has "hardened". (Some types of snapshot never "harden", by the way.)

Of course, at "sane" sites, this is a non-issue. So the "primary" storage fails while I'm in the middle of writing the snapshot volume to tape? Big deal. I still have yesterday's backups (on duplexed tapes), and all of the archive logs. And my datafiles, archive logs, and online redo never share common (physical) spindles, so no matter which RAID-10 volume failed, I can still recover right up to the last committed transaction. At "sane" sites. Alas, I haven't actually seen one of those for a while... :-(

(For some reason, it no longer seems fashionable to place datafiles, online redo, and archived redo on disjoint sets of disks. It almost seems that people have forgotten that we do that for data protection purposes, not for performance purposes...)

Some people might think I am paranoid about data protection. But those people have probably never experienced a situation where 30 disks out of a set of 200 failed simultaneously. (Yes, I have seen that, and lived to talk about it.) Even with RAID-10, you have to be pretty lucky to come through an incident like that unscathed. We didn't get through without downtime, in part because we had RAID-0+1 instead of RAID-10, so the odds against us were astronomical. But we survived, and didn't lose a single committed transaction. Paranoid? Maybe, but events like this end careers. And corporations. It pays to protect against them.

By the way, don't ask about the RAID-0+1 -- it was the best available technology at the time... ;-)
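
To put rough numbers on the RAID-10 versus RAID-0+1 comparison above, here is a back-of-the-envelope sketch in Python. It rests on assumptions that are not taken from the actual incident: all 200 disks are treated as one array (100 mirrored pairs for RAID-10, or two mirrored 100-disk stripes for RAID-0+1), and the 30 failed disks are treated as a uniformly random subset. Real simultaneous failures tend to be correlated (shared shelf, controller, or power), so read this as an illustration of the odds, not a model of what happened. The function names are made up for the example.

```python
from math import comb

def raid10_survival(total_disks: int, failed: int) -> float:
    """Chance that no mirrored pair loses both of its disks (assumed layout:
    total_disks // 2 independent two-disk mirrors, striped together)."""
    pairs = total_disks // 2
    # Pick which pairs are hit (at most one disk each), then which side of each.
    return comb(pairs, failed) * 2**failed / comb(total_disks, failed)

def raid01_survival(total_disks: int, failed: int) -> float:
    """Chance that at least one stripe set is completely untouched (assumed
    layout: two stripe sets of total_disks // 2 disks, mirrored)."""
    if failed == 0:
        return 1.0
    half = total_disks // 2
    # All failures must land in the same stripe set; there are two sets.
    return 2 * comb(half, failed) / comb(total_disks, failed)

if __name__ == "__main__":
    p10 = raid10_survival(200, 30)
    p01 = raid01_survival(200, 30)
    print(f"RAID-10  survives: {p10:.3e}")
    print(f"RAID-0+1 survives: {p01:.3e}")
    print(f"RAID-10 is better by a factor of about {p10 / p01:.3e}")
```

Under these assumptions the two probabilities differ by exactly a factor of 2^29 for 30 failures, since RAID-10 tolerates any choice of side within each affected pair while RAID-0+1 needs every failure on the same side.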

-- 
Cheers,
-- Mark Brinsmead
   Senior DBA,
   The Pythian Group
   http://www.pythian.com/blogs

--
http://www.freelists.org/webpage/oracle-l

Received on Thu Nov 09 2006 - 19:04:06 CST