Re: Minimize recovery time

From: Tim Gorman <tim.evdbt_at_gmail.com>
Date: Wed, 27 Apr 2022 10:59:30 -0700
Message-ID: <d46b462d-c3de-3667-104d-499958d88a4e_at_gmail.com>



Lok,

I saw in another response that you are indeed using DataGuard to a different data center.  Related question: have you tested DataGuard failover in production?
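
If the failover path hasn't been rehearsed, the DataGuard Broker makes it straightforward to validate and drive.  A minimal sketch, assuming the Broker is configured and the standby database is named STBY (a hypothetical name):

    $ dgmgrl /
    DGMGRL> SHOW CONFIGURATION;
    DGMGRL> VALIDATE DATABASE 'STBY';
    DGMGRL> SWITCHOVER TO 'STBY';

VALIDATE DATABASE (available from 12.1 onward) reports whether the standby is ready for a role change; SWITCHOVER is the reversible rehearsal, while FAILOVER TO 'STBY' is what you'd issue in a genuine disaster (or a full DR test).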

It is important to lay out all known or possible outage scenarios on a whiteboard, then collaboratively decide the best response to each scenario.  In other words, scenario XXX is best handled with a DataGuard failover, scenario YYY is best handled with a full database restore/recovery, scenario ZZZ is best handled with a DataGuard failover, and so on.  Of course, this information should be incorporated into the organization's run book.
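
Purely as an illustration (the scenarios and targets below are invented, not taken from your environment), a run book entry might look like:

    Outage scenario                    Best response               Target RTO
    ---------------------------------  --------------------------  ----------
    Primary server/storage failure     DataGuard failover          minutes
    Loss of primary data center        DataGuard failover          minutes
    Logical corruption / user error    Flashback or RMAN PITR      hours
    Loss of primary AND standby        Full RMAN restore/recover   hours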

Following this exercise, it usually becomes apparent that most outage scenarios are best handled by DataGuard, and that very few outage scenarios need to be resolved by a full database restore/recover from backup.  This might lead to decisions to add new recovery mechanisms or change database platforms, or it might even lead to negotiations with application users for new RTO expectations, perhaps varying by outage scenario.

The Exadata platform appears to support snapshots for cloning purposes, but does not appear to support storage-level snapshots for backup purposes.  Such a backup solution would need to copy snapshot images out to an immutable "recovery vault" in separate storage, and of course be capable of restoring from the "recovery vault" back to the database storage.

Outside of Exadata, storage-level snapshots are usually the only way to avoid the laws of physics inherent in a large database: it always takes time and resources to copy bytes from HERE to THERE, and such /streaming/ to tape (originally) or disk is what RMAN is based upon.  Tricks like incremental backups certainly reduce the time and resources needed to perform a backup, but they do not reduce the time to restore; if anything, such tricks usually increase it.  RMAN does have capabilities for partial recoveries, by tablespace or datafile, in some cases right down to the block, but each team should evaluate in the whiteboard exercise above how often such capabilities are likely to be used.  In my personal experience as a production Oracle DBA since 1993, I have been able to restore individual datafiles perhaps once or twice, individual tablespaces never, and individual blocks never.  Of course, YMMV.
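
For reference, the partial-recovery capabilities mentioned above look roughly like this in RMAN; a sketch only, with the datafile number, tablespace name, and block number invented for illustration:

    RMAN> RESTORE DATAFILE 7;
    RMAN> RECOVER DATAFILE 7;

    RMAN> RESTORE TABLESPACE users;
    RMAN> RECOVER TABLESPACE users;

    RMAN> RECOVER DATAFILE 7 BLOCK 42;

The datafile and tablespace variants require the affected datafile or tablespace to be offline first, while block media recovery (the last command) can repair individual corrupt blocks while the datafile stays online.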

So it may be useful to evaluate whether the application is actually using Exadata smart-scan features to justify staying on the platform.  Not many people venture down to the bottom of modern AWR reports to review the section entitled *Top Databases by IO Throughput*...

Of course, I've anonymized the DB Name and DBID values, but the rest are untouched.  This table shows which of the databases use the Exadata storage cells most, and clearly we see that it is the database named PROD01.  Elsewhere in the AWR report, we may find that PROD01 uses a lot of smart scan features, so apparently it is the database for which the Exadata was originally purchased.
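
If you don't have the AWR report handy, a rough per-database check (a sketch; these are the cumulative-since-startup statistics as they appear on Exadata) is:

    SELECT name, value
      FROM v$sysstat
     WHERE name IN ('cell physical IO bytes eligible for predicate offload',
                    'cell physical IO interconnect bytes returned by smart scan');

If the "eligible for predicate offload" figure is near zero, the database is doing little or no smart scanning and may not be benefiting from the storage cells at all.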

But what about databases PROD02 through PROD08 (or beyond)?  Should these other databases have been consolidated onto the Exadata?  Are they getting any benefit from it?  Or are they in fact getting worse performance because their SGAs are being kept small in order to encourage cell offloading?  Would they do better if moved off Exadata to an alternate platform where larger SGAs are possible and storage-level snapshots can more effectively meet the RTO requirements?

Hopefully this discussion is helpful?

Thanks!

-Tim

On 4/27/2022 2:49 AM, Lok P wrote:
>
> Hello Listers,  We have an Oracle Exadata (X7) database on
> 12.1.0.2.0, and it has now grown to ~12TB.  As per the client
> agreement and the criticality of this application, the RTO (recovery
> time objective) has to be within ~4hrs.  The team looking after
> backup and recovery has communicated a recovery time of ~1hr per
> ~2TB of data with the current infrastructure.  Going by that, the
> current size of the database implies an RTO of ~6hrs, which exceeds
> the client agreement of ~4hrs.
>
> Going through the top space consumers, we see those are table/index
> sub-partitions and non-partitioned indexes.  Should we look into
> table/index compression here?  But then I think there is also a
> downside to that for DML performance.
>
> I wanted to understand: is there any other option (apart from
> exploring a possible data purge) to make this RTO faster, or at
> least bring it under the service agreement?  How should we approach
> this?
>
> Regards
>
> Lok
>

--
http://www.freelists.org/webpage/oracle-l