Re: Online backup: Backup online redologs?

From: Charles Fisher <Charles.Fisher_at_alcoa.com>
Date: Wed, 09 May 2001 15:38:36 GMT
Message-ID: <Pine.GSO.4.31.0105090923080.1269-100000@unknown>

On Wed, 9 May 2001, Howard J. Rogers wrote:

> It rather depends on the first member in each group (which I think is what
> your SQL is selecting for each time) never being corrupted, doesn't it?
> Suppose you get corruption in your first member -the database will run
> happily by using the second, but you've just backed up the non-functional
> member. How robust is that?

One has to assume that the normal status of most files is that they are not corrupt. However, read the caveat below.

Even so, corruption would not be the end of the world. I could simply pull the tape for the previous week, add all the archivelogs, and while the recovery would take longer, it would work if there weren't any problems with the redologs.

Honestly, I have had only one case in the last 6 months where I needed an emergency restore, and that was only the recovery of a datafile. Most would say that I am overreacting, but I feel that I need to be assured of complete recovery.

> > What if the server catches fire? What if some idiot spills a coke into the
> > disk array? What if we have a tornado and I see my server sailing past my
> > office window?
> > It might be a tad difficult to extract the online redo logs in that case.

> That's what clustering and disaster recovery procedures are for.

Just out of curiosity, do YOU have disaster recovery procedures? I have to wonder.

I mean, are you NFS-mounting a server in another city and writing one of your redo-log members there? This would pretty much be the only way of assuring recoverability (unless you are in a hardened facility, and even then you have to account for operator error).

No, you may not NFS-mount one of my servers. I bet you ask this a lot with your recovery methodology.

I suppose that the other solution to this problem is a standby server, but my typical systems have 60gig of Oracle datafiles, and I won't have the budget to duplicate that setup for at least a year, if I were to decide that I had to have it.

> > I ABSOLUTELY MUST have everything that I need to restore the database on a
> > tape that I can hold in my hand, and it must be from a hot backup.

> And you ABSOLUTELY can't. Not with 100% reliability, anyway.

And Oracle should feel absolutely ashamed about this state of affairs. Sybase doesn't have this problem.

> > If I
> > can't have that, then I don't want Oracle (and I don't know why anybody
> > else would).
> > I don't know why there isn't a furious uproar of people asking the same
> > set of questions about this that I am. I just don't understand it.

> I agree with that last sentence. It's *what* you don't understand that
> worries me.

Like I said, I have to wonder about YOUR disaster recovery procedures. If I were your manager, I would not treat you kindly on this issue.

> > I need to clone from time to time from our production to development
> > systems. My backup method works great double-duty.

> It's inherently flawed, and the fact that you've got away with it in the
> past does not make it a reliable method. We've been round this one before.

> > p.s. There is actually a case where I could get stung pretty badly by this
> > procedure. If when I start copying the redo logs, the active log group is
> > the LAST log group, and it is nearly full, and granted that I get a
> > complete copy of the first log group, but also granting that before I get
> > to the last log group, a log switch occurs, then there will be no active
> > log group in the backup set, and my goose will be cooked when I try to
> > restore. However, things are pretty quiescent when I am doing this,

> Ah! I knew there had to be a reason for you getting away with it in the
> past, and this is it.
> The problem with copying hot redo logs (in fact, that should be singular,
> because there's only ever one group that's hot, of course) is simply that
> the operating system will grab chunks to be copied at random -whatever
> happens to swing into view as the platter rotates. If that chunk happens to
> be the bit getting currently written to, your copy will be internally
> corrupt, and never mind 'backup markers' or anything else. The thing will
> simply be unreadable. I presume you do not have a mechanism to guarantee
> which bits of a file the O/S should decide to copy at anyone time?
> Therefore, your hot copying method is intrinsically fallible. The one thing
> that is saving you is that "things are pretty quiescent" -in other words,
> you try to make sure that no-one is *writing* to the file. Guess what
> taking a cold backup does? Er, that's right: makes sure (only, 100% sure
> this time) that no-one is writing to the file.

Well, let's think about this situation for a moment.

LGWR moves through the redologs with a certain "velocity." The path is linear, or near-linear. The "velocity" is determined by the number of in-flight transactions; no DML means zero velocity (and there is no DDL, since DDL is really only DML on the data dictionary).

Recovery can normally be achieved from instance crash or shutdown abort when incomplete data is written to the redolog files. Upon this much we must agree.

There are two cases where we can be assured that recovery will NOT be possible:

  1. SCNs appearing in the datafiles which are not in the redologs.
  2. Zero or multiple active redologs produced by the redolog hotbackup.
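
The second case can at least be detected after the fact by noting which log group is current just before the copy starts and again just after it ends. What follows is only a minimal Python sketch of that comparison, assuming the two snapshots have already been obtained by some means. The LogGroup record, its field names, and the status strings are hypothetical stand-ins for illustration, not Oracle's actual layout. It says nothing about whether the copied file itself is intact.

```python
from collections import namedtuple

# Hypothetical record describing one redo log group at a point in time.
LogGroup = namedtuple("LogGroup", ["group", "sequence", "status"])

def current_groups(snapshot):
    """Return the groups marked CURRENT in a snapshot."""
    return [g for g in snapshot if g.status == "CURRENT"]

def copy_window_is_clean(before, after):
    """True only if exactly one group was CURRENT both before and after the
    copy, and it is the same group with the same sequence (no log switch)."""
    cur_before = current_groups(before)
    cur_after = current_groups(after)
    if len(cur_before) != 1 or len(cur_after) != 1:
        return False
    b, a = cur_before[0], cur_after[0]
    return (b.group, b.sequence) == (a.group, a.sequence)

# Example: the last group (3) was current when the copy began, and a switch
# to group 1 happened before the copy finished.
before = [LogGroup(1, 40, "INACTIVE"), LogGroup(2, 41, "INACTIVE"),
          LogGroup(3, 42, "CURRENT")]
after = [LogGroup(1, 43, "CURRENT"), LogGroup(2, 41, "INACTIVE"),
         LogGroup(3, 42, "ACTIVE")]

print(copy_window_is_clean(before, after))   # False
print(copy_window_is_clean(before, before))  # True
```

If the check fails, the simplest response is to discard the redolog copies and take them again.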

If we can assure that there is only a single active thread in the hotbackup of the redologs, then recovery MIGHT be possible, depending upon how closely the redolog resembles a "point" failure (which we must agree is almost always recoverable), where the recording of a transaction was suddenly stopped (bearing in mind that I have no idea of the internal structure of these files).

This gets back to the question of "velocity." Oracle will be recording transactions in the redologs, but at the same time, LGWR will be communicating via UNIX IPC with the rest of the Oracle servers, coordinating activity.

However, the UNIX "cp" process will not be coordinating activity with other processes; it spends most of its time in IO system calls. We must assume that under most normal circumstances, the "velocity" of cp through the files is much greater than LGWR's, and that on a system with no in-flight transactions, the hot backup of the active log group will closely resemble a "point" failure.

The risk here is if cp (the reader) and LGWR (the writer) move with similar "velocity" over a span of several transactions. I assume that if this is the case, then recovery is in jeopardy. We must assume that if the intersection of these two processes covers only a single transaction, recovery is assured.
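
To put rough numbers on the "velocity" argument, here is a toy Python model. It assumes both the reader and the writer advance linearly through a single file, measured in blocks per tick. The function names and the block/tick units are invented for illustration, and nothing here reflects how Oracle actually lays out redo or how cp schedules its reads.

```python
def blocks_written_during_copy(writer_start, writer_rate, reader_rate):
    """Toy model: the reader (cp) starts at block 0 while the writer (LGWR)
    sits at block writer_start. Returns how many blocks the writer adds
    before the reader catches it, or None if the reader never catches up."""
    if reader_rate <= writer_rate:
        return None  # similar or slower "velocity": the gap never closes
    ticks = writer_start / (reader_rate - writer_rate)
    return writer_rate * ticks

def catch_up_block(writer_start, writer_rate, reader_rate, file_blocks):
    """Block offset where the reader catches the writer, or None if the
    writer reaches the end of the file first (i.e. a log switch occurs)
    or is never caught at all."""
    written = blocks_written_during_copy(writer_start, writer_rate, reader_rate)
    if written is None:
        return None
    meet = writer_start + written
    return meet if meet <= file_blocks else None

# A quiet system: the writer does not move, so the copy sees a clean "point".
print(blocks_written_during_copy(100, 0, 11))   # 0.0
# A nearly full log with a busy writer: the file fills before cp catches up.
print(catch_up_block(900, 5, 10, 1000))         # None
```

In this model, the smaller the first number, the more the copy resembles a "point" failure. A result of None from catch_up_block corresponds to the log-switch case from my p.s.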

There is also an additional risk introduced by multiple redolog members. My books say that LGWR will write to the member that is most highly available, implying that the redolog on the busier filesystem might not be structurally intact at all times (especially via NFS). It would help to know if LGWR does something like a "checkpoint," enforcing redolog member consistency on transaction boundaries.

However, even if several transactions are corrupted in this way, we are guaranteed that these SCNs will not appear in the datafiles (of the hotbackup), so recovery might still be possible.

Obviously, a definitive answer on these questions from a responsible Oracle developer would certainly make many people happier. However, I have since heard directly from Oracle support that Oracle does not support ANY recovery methods (to shield themselves from liability). Hey, at least I found somebody who was willing to be honest. He did mention that I might have trouble with a "DBID" when restoring the redologs, but I don't know what that is.

In any case, it seems to me that it is worthwhile to take hotbackups of the redologs, given Oracle's inability to perform complete recovery without them. This probably explains why so many people adhere to the practice when it is directly discouraged.

My script does exhibit a certain flaw in this respect - it compresses the redologs as it copies them, thus reducing the cp "velocity" and increasing the risk of a log switch. I will rectify the flaw.
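
The obvious fix is to separate the two steps: make a plain copy first, then compress the copy, so that the slow pass never touches the live file. A minimal Python sketch of that ordering follows. My actual script is not shown here, so the function name and the directory layout are hypothetical.

```python
import gzip
import os
import shutil

def copy_then_compress(src, dest_dir):
    """Copy src into dest_dir in one quick pass, then gzip the copy.
    Returns the path of the compressed file."""
    os.makedirs(dest_dir, exist_ok=True)
    raw_copy = os.path.join(dest_dir, os.path.basename(src))
    shutil.copyfile(src, raw_copy)        # fast pass over the live file
    gz_path = raw_copy + ".gz"
    with open(raw_copy, "rb") as fin, gzip.open(gz_path, "wb") as fout:
        shutil.copyfileobj(fin, fout)     # slow pass reads only the copy
    os.remove(raw_copy)                   # keep just the compressed file
    return gz_path
```

The cost is temporary disk space for one uncompressed copy of each redolog.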

Charles J. Fisher - Consultant
Alcoa Davenport Works
(319) 459-2512

Received on Wed May 09 2001 - 10:38:36 CDT