Re: Online backup: Backup online redologs?

From: Howard J. Rogers <howardjr_at_www.com>
Date: Thu, 10 May 2001 10:47:49 +1000
Message-ID: <3af9e53c@news.iprimus.com.au>

"Charles Fisher" <Charles.Fisher_at_alcoa.com> wrote in message news:Pine.GSO.4.31.0105090923080.1269-100000_at_unknown...
> On Wed, 9 May 2001, Howard J. Rogers wrote:
>
> > It rather depends on the first member in each group (which I think is what
> > your SQL is selecting for each time) never being corrupted, doesn't it?
> > Suppose you get corruption in your first member -the database will run
> > happily by using the second, but you've just backed up the non-functional
> > member. How robust is that?
>
> One has to assume that the normal status of most files is that they are
> not corrupt.

This is funny! We're talking backup and recovery here. You don't make any assumptions (not if you're doing your job properly), and you adopt a backup strategy that is reliable and robust (ditto). The script you posted starts off by making a howler of an assumption, and it's all downhill from there. Your backup strategy is founded on dodgy assumptions and false premises. The fact that you've got away with it at all is something of a wonder.

>However, read the caveat below.
>
> Even so, corruption would not be the end of the world. I could simply pull
> the tape for the previous week, add all the archivelogs, and while the
> recovery would take longer, it would work if there weren't any problems
> with the redologs.
>

Charles, have you any idea what you're posting? Read my lips: your script assumes the REDO LOGS are not corrupt. If they are, your little script is about as useful as a hole in the head.

Why you start yacking on about how to perform a perfectly straightforward complete recovery of a *datafile*, I have no idea.

> Honestly, I have had only one case in the last 6 months where I needed an
> emergency restore, and that was only the recovery of a datafile. Most
> would say that I am overreacting, but I feel that I need to be assured of
> complete recovery.
>

Your script guarantees that one day, you won't be able to perform a complete recovery.

> > > What if the server catches fire? What if some idiot spills a coke into the
> > > disk array? What if we have a tornado and I see my server sailing past my
> > > office window?
> > > It might be a tad difficult to extract the online redo logs in that case.

> > That's what clustering and disaster recovery procedures are for.
>
> Just out of curiosity, do YOU have disaster recovery procedures? I have
> to wonder.
>
> I mean, are you NFS-mounting a server in another city and writing one of
> your redo-log members there?

I think you'll find that Oracle won't even permit that. And it would be a very silly thing to do even if it were to permit it, as you well know, because performance would be a nightmare.

>This would pretty much be the only way of
> assuring recoverability (unless you are in a hardened facility, and even
> then you have to account for operator error).
>
> No, you may not NFS-mount one of my servers. I bet you ask this a lot
> with your recovery methodology.
>

Charles, for the last time: this is not "my" recovery methodology. There is only one recovery methodology that is guaranteed to work, and it's Oracle's.

> I suppose that the other solution to this problem is a standby server, but
> my typical systems have 60gig of Oracle datafiles, and I won't have the
> budget to duplicate that setup for at least a year, if I were to decide
> that I had to have it.
>

No, a standby database doesn't guarantee total data recoverability either, because it relies on the transfer of archives -and yes, there's always one redo log that hasn't been archived yet.

> > > I ABSOLUTELY MUST have everything that I need to restore the database on a
> > > tape that I can hold in my hand, and it must be from a hot backup.

> > And you ABSOLUTELY can't. Not with 100% reliability, anyway.
>
> And Oracle should feel absolutely ashamed about this state of affairs.
> Sybase doesn't have this problem.
>

I have to say that that simply is not true.

> > > If I
> > > can't have that, then I don't want Oracle (and I don't know why anybody
> > > else would).
> > > I don't know why there isn't a furious uproar of people asking the same
> > > set of questions about this that I am. I just don't understand it.

> > I agree with that last sentence. It's *what* you don't understand that
> > worries me.
>
> Like I said, I have to wonder about YOUR disaster recovery
> procedures. If I were your manager, I would not treat you kindly on this
> issue.
>

Does *your* manager know what you're doing with his system? Does he know you are flouting every backup and recovery rule in the book on the grounds that you've got away with it in the past? Does he know that you don't appear to have the distinction between backup and cloning clear in your head?

[Snip]

> > The problem with copying hot redo logs (in fact, that should be singular,
> > because there's only ever one group that's hot, of course) is simply that
> > the operating system will grab chunks to be copied at random -whatever
> > happens to swing into view as the platter rotates. If that chunk happens to
> > be the bit getting currently written to, your copy will be internally
> > corrupt, and never mind 'backup markers' or anything else. The thing will
> > simply be unreadable. I presume you do not have a mechanism to guarantee
> > which bits of a file the O/S should decide to copy at any one time?
> > Therefore, your hot copying method is intrinsically fallible. The one thing
> > that is saving you is that "things are pretty quiescent" -in other words,
> > you try to make sure that no-one is *writing* to the file. Guess what
> > taking a cold backup does? Er, that's right: makes sure (only, 100% sure
> > this time) that no-one is writing to the file.
>
> Well, let's think about this situation for a moment.
>
> LGWR moves through the redologs with a certain "velocity." The path is
> linear, or near-linear. The "velocity" is determined by the number of
> in-flight transactions; no DML means zero velocity (and there is no DDL,
> since DDL is really only DML on the data dictionary).
>
> Recovery can normally be achieved from instance crash or shutdown abort
> when incomplete data is written to the redolog files. Upon this much we
> must agree.
>
> There are two cases where we can be assured that recovery will NOT be
> possible:
>
> 1. SCNs appearing in the datafiles which are not in the redologs.
> 2. Zero or multiple active redologs produced by the redolog hotbackup.
>

And 3. I've internally corrupted my current redo log by attempting to do things to it which all Oracle documentation, all Oracle support personnel and the Wizard of Oz told me not to do - i.e., back it up hot.
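
To make the point about copying a file that is being written to concrete, here is a toy model. It is only a sketch: it says nothing about the real redo log format or about how cp actually schedules its reads, and the function names and the "generation" stamps are invented for illustration. Each block carries the number of the write that last touched it, and a copy is usable only if every block comes from the same moment in time.

```python
def copy_while_writing(blocks, updates_between_reads):
    """Copy a 'file' one block at a time while a writer may be active.

    blocks: list of generation stamps, one per block (hypothetical model).
    updates_between_reads: how many complete rewrites of the file the
    writer manages to finish between two consecutive reads by the copier.
    Returns the generation stamps the copier ended up with.
    """
    source = list(blocks)
    copied = []
    generation = max(source)
    for index in range(len(source)):
        # The copier grabs one block of whatever is on disk right now.
        copied.append(source[index])
        # The writer carries on regardless; the copier is not consulted.
        for _ in range(updates_between_reads):
            generation += 1
            source = [generation] * len(source)
    return copied


def is_consistent(copied):
    """A copy is consistent only if all blocks share one generation."""
    return len(set(copied)) <= 1


quiet_copy = copy_while_writing([0, 0, 0, 0], updates_between_reads=0)
busy_copy = copy_while_writing([0, 0, 0, 0], updates_between_reads=1)

print(quiet_copy, is_consistent(quiet_copy))  # [0, 0, 0, 0] True
print(busy_copy, is_consistent(busy_copy))    # [0, 1, 2, 3] False
```

The quiet case copies cleanly and the busy case does not, and nothing in the copy loop itself tells you which of the two you got. That is the whole objection: the outcome depends on the writer staying idle, which the copier cannot enforce.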

> If we can assure that there is only a single active thread in the
> hotbackup of the redologs, then recovery MIGHT be possible, depending upon
> how closely the redolog resembles a "point" failure (which we must agree
> is almost always recoverable), where the recording of a transaction was
> suddenly stopped (bearing in mind that I have no idea of the internal
> structure of these files).
>
> This gets back to the question of "velocity." Oracle will be recording
> transactions in the redologs, but at the same time, LGWR will be
> communicating via UNIX IPC with the rest of the Oracle servers,
> coordinating activity.
>
> However, the UNIX "cp" process will not be coordinating activity with
> other processes; it spends most of its time in IO system calls. We must
> assume that under most normal circumstances, the "velocity" of cp through
> the files is much greater than LGWR's, and that on a system with no
> in-flight transactions, the hot backup of the active log group will
> closely resemble a "point" failure.
>

"We must assume". Really? No thanks.

> The risk here is if cp (the reader) and LGWR (the writer) move with
> similar "velocity" over a span of several transactions. I assume that if
> this is the case, then recovery is in jeopardy. We must assume that if the
> intersection of these two processes covers only a single transaction, that
> recovery is assured.
>
> There is also an additional risk introduced by multiple redolog members.
> My books say that LGWR will write to the member that is most highly
> available, implying that the redolog on the busier filesystem might not be
> structurally intact at all times (especially via NFS). It would help to
> know if LGWR does something like a "checkpoint," enforcing redolog member
> consistency on transaction boundaries.
>
> However, even if several transactions are corrupted in this way, we are
> guaranteed that these SCNs will not appear in the datafiles (of the
> hotbackup), so recovery might still be possible.
>
> Obviously, a definitive answer on these questions from a responsible
> Oracle developer would certainly make many people happier. However, I have
> since heard directly from Oracle support that Oracle does not support ANY
> recovery methods (to shield themselves from liability). Hey, at least I
> found somebody who was willing to be honest. He did mention that I might
> have trouble with a "DBID" when restoring the redologs, but I don't know
> what that is.
>
> In any case, it seems to me that it is worthwhile to take hotbackups of
> the redologs, given Oracle's inability to perform complete recovery
> without them. This probably explains why so many people adhere to the
> practice when it is directly discouraged.
>

Most people don't. They have far more sense.

> My script does exhibit a certain flaw in this respect - it compresses the
> redologs as it copies them, thus reducing the cp "velocity" and increasing
> the risk of a log switch. I will rectify the flaw.
>
>

I'm not discussing this issue with you anymore, Charles, I'm afraid. You have your head in the sand on this issue, your methodology sucks, and I've explained over and over *why*. And still you insist on a bunch of assumptions, parameters and ideas which have no bearing on the matter (Oracle guarantees complete recoverability of committed transactions, provided you retain all archives since the start of your last backup cycle and don't lose all members of the current redo log group, and that guarantee has sod all to do with "velocity" or any other iffy concepts you want to throw into the pot.) You're making a mountain out of a very simple molehill.
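
For what it's worth, the two conditions in that guarantee can be written down as a simple check. This is a minimal sketch under stated assumptions: a single redo thread, log sequence numbers that increase by one, and made-up function and parameter names. It is not an Oracle API and it does not inspect any real files.

```python
def missing_sequences(backup_start_seq, current_seq, archived_seqs):
    """Return the archived log sequences needed for recovery but not on hand.

    Needed: every sequence from the one that was current when the backup
    cycle started, up to but not including the current (still online) one.
    All arguments are hypothetical inputs supplied by the caller.
    """
    on_hand = set(archived_seqs)
    return [seq for seq in range(backup_start_seq, current_seq)
            if seq not in on_hand]


def complete_recovery_possible(backup_start_seq, current_seq,
                               archived_seqs, surviving_current_members):
    """True only if no archive is missing and at least one member of the
    current redo log group has survived."""
    no_gaps = not missing_sequences(backup_start_seq, current_seq,
                                    archived_seqs)
    return no_gaps and surviving_current_members >= 1


# Archive 102 has gone astray, so recovery stops short of complete.
print(missing_sequences(100, 105, [100, 101, 103, 104]))                 # [102]
print(complete_recovery_possible(100, 105, [100, 101, 102, 103, 104], 1))  # True
print(complete_recovery_possible(100, 105, [100, 101, 102, 103, 104], 0))  # False
```

Note that nothing in the check refers to how fast anything was being written or copied.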

Regards
HJR

> Charles J. Fisher - Consultant
> Alcoa Davenport Works
> (319) 459-2512
>
>
Received on Wed May 09 2001 - 19:47:49 CDT