RE: RTO Challenges

From: Mark W. Farnham <mwf_at_rsiz.com>
Date: Tue, 27 Mar 2018 08:35:31 -0400
Message-ID: <028401d3c5c8$3324b0d0$996e1270$_at_rsiz.com>


JL's basically correct. I would add one more flavor: Alternating on-line backups.

With respect to the Recovery Time Objective (RTO) or the overall business continuation policy (which I prefer), you need to consider whether you are talking about:

  1. Logical corruption
  2. On "campus" physical storage loss
  3. A need to continue business operations somewhere remote due to a disaster

Standbys tend to ignore #1. And extending protection against logical corruption backwards in time runs into an unbounded resource requirement. For example, if there is a logical corruption that is not discovered for a month, would you still have the backup image of the files available to roll forward using a corrected application, and would you have the logical transaction data available to apply? Probably not, because most folks don't write re-playable applications anymore. So logical corruption is usually considered separately from physical recovery.

In alternating on-line backups, you place enough storage for multiple copies of your data in cabinets with isolated failure components at the volume level. If the total size of your data is n and the size of your largest individual storage volume is m, then the minimum size requirement for alternating on-line backups is 2n+m. Anything less than 3n requires some thinking and planning.

In this model you never reload volumes but simply point the logical name used by the database at the most recent backup generation of the volume(s) containing tablespaces that need to be rolled forward. Your time to recover is then converted from the time to reload files to the time to apply redo to bring the files current or to a consistent point in time (which you still need to measure, considering peak transaction volume).
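A rough way to see the difference is to compare the two recovery paths numerically. This is only a sketch: the 2 TB per hour restore rate is the figure quoted later in this thread, while the database size, redo volume, and redo apply rate are made-up numbers, and the function names are hypothetical.

```python
def reload_hours(data_tb: float, restore_rate_tb_per_hour: float) -> float:
    """Time to copy backup files back onto the volumes."""
    return data_tb / restore_rate_tb_per_hour


def redo_apply_hours(redo_gb: float, apply_rate_gb_per_hour: float) -> float:
    """Time to roll the files forward by applying redo."""
    return redo_gb / apply_rate_gb_per_hour


# Assumed figures for illustration only.
data_tb = 10.0           # hypothetical database size
restore_rate = 2.0       # TB per hour, as quoted in the thread
redo_gb = 300.0          # hypothetical redo generated since the backup
apply_rate = 400.0       # hypothetical redo apply rate, GB per hour

# Conventional restore: reload the files, then apply redo.
conventional = reload_hours(data_tb, restore_rate) + redo_apply_hours(redo_gb, apply_rate)

# Alternating on-line backups: no reload, only redo apply.
alternating = redo_apply_hours(redo_gb, apply_rate)

print(f"conventional: {conventional:.2f} h, alternating: {alternating:.2f} h")
```

The redo apply rate is the number you have to measure yourself, at peak transaction volume, before relying on an estimate like this.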

At 3n you essentially have backup set "a" and backup set "b", so backup set "a" is still on line and separately failing from backup set "b" while backup set "b" is being made.

At less than 3n but at least 2n+m, you have to do a rotating shell game so that there is always a distinct copy of every volume available while its new backup is being made.
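The sizing rules above can be sketched as a small calculation. This assumes n is the total size of the data and m is the size of the largest single volume, both in the same unit, and reads 3n as the live copy plus backup sets "a" and "b". The function names and sample figures are illustrative, not taken from any tool.

```python
def min_alternating_storage(n: float, m: float) -> float:
    """Minimum total storage for alternating on-line backups: 2n + m."""
    return 2 * n + m


def two_full_sets_storage(n: float) -> float:
    """Total storage at which two complete backup sets fit: 3n."""
    return 3 * n


def needs_rotation(total: float, n: float, m: float) -> bool:
    """True when storage is at least 2n + m but below 3n,
    i.e. volumes must be rotated one at a time (the "shell game")."""
    if total < min_alternating_storage(n, m):
        raise ValueError("not enough storage for alternating on-line backups")
    return total < two_full_sets_storage(n)


# Hypothetical example: 10 TB of data, largest volume 2 TB.
print(min_alternating_storage(10, 2))   # minimum total storage
print(two_full_sets_storage(10))        # storage for sets "a" and "b"
print(needs_rotation(25, 10, 2))        # between the two thresholds
```

With these sample numbers, anything from 22 TB up to just under 30 TB of total storage falls into the rotating case.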

If applications have their storage volume sets segregated, then they can vary in number and frequency of recoverable volume sets. (You can't just let Oracle allocate more from the pool for you in that model.) Then you can arrange for different times to business continuation for different applications and a priority for restoring them to operation.

This practice is obviously expensive in storage, but it was all we had prior to realizing you could do continuous redo application on a separate machine seeded with copies of the database files (which eventually spawned Data Guard). There are quite a few strategies for standbys that should meet your 2 hour requirement. Of course storage volume costs have dropped compared to license costs, so alternation might fit your cost objectives.

Since then (circa 1990), full volume continuous syncing software has also become practical. Mladen might want to chime in on that.

Good luck.

-----Original Message-----

From: oracle-l-bounce_at_freelists.org [mailto:oracle-l-bounce_at_freelists.org] On Behalf Of Jonathan Lewis
Sent: Tuesday, March 27, 2018 6:14 AM
To: ORACLE-L; dombrooks_at_hotmail.com
Subject: Re: RTO Challenges

I think the answer comes in two parts.

There are companies that haven't done a proper analysis of RTO for different disasters and haven't considered simple time calculations for time to restore, volume to recover.

Companies who have done the analysis don't expect to recover "sufficiently large" databases to the original machine but use a standby strategy that allows minimum file shipping and recovery.

Regards
Jonathan Lewis



From: oracle-l-bounce_at_freelists.org <oracle-l-bounce_at_freelists.org> on behalf of Dominic Brooks <dombrooks_at_hotmail.com>
Sent: 27 March 2018 10:51:47
To: ORACLE-L
Subject: RTO Challenges

I'm not a DBA as such and I've always skipped over most of the chapters on RMAN etc so very grateful for expert opinions on this please.

  1. We have multi-TB DBs, as everyone does.
  2. The message from DBA policies is that we can only restore at 2 TB per hour.
  3. We have an RTO of 2 hours

As a result, there is a wide initiative pushed down onto application teams that there is therefore an implied 4TB limit to any of the critical applications' databases, in the event that we run into those scenarios where we need to restore from backup.

Initially, the infrastructure-provided solution was ZDLRA. Our firm's implementation was initially promising a 4TB per hour restore rate but in practice is delivering the above 2TB per hour restore rate, and this is the figure given to the DBAs as a key input into this generic firm-wide policy.
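The implied size limit is just restore rate multiplied by RTO. A minimal sketch of that arithmetic, using the figures from this message (the function name is hypothetical):

```python
def max_restorable_tb(restore_rate_tb_per_hour: float, rto_hours: float) -> float:
    """Largest database that can be restored from backup within the RTO,
    counting file restore time only."""
    return restore_rate_tb_per_hour * rto_hours


rto_hours = 2

# Delivered rate of 2 TB per hour gives the 4 TB limit in the policy.
print(max_restorable_tb(2, rto_hours))

# The originally promised 4 TB per hour would have allowed twice that.
print(max_restorable_tb(4, rto_hours))
```

Note that this counts only the time to restore files. Any time needed to apply redo after the restore comes out of the same 2 hours, so the practical limit would be lower.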

My thoughts are that this is still an infrastructure issue and there are probably plenty of alternative infrastructure solutions to this problem. But now it is being presented as an application issue. Of course applications should all have hygienic practice in place around archiving and purging, whilst also considering regulatory requirements around data retention, etc, etc.

But it seems bizarre to me to have this effective database size limit in practice and I'm not aware of this approach above being common practice. 4TB is nothing by today's standards.

Am I wrong?
What different approaches / solutions could be used?

Thanks

Regards,
Dominic

--

http://www.freelists.org/webpage/oracle-l

--

Received on Tue Mar 27 2018 - 14:35:31 CEST
