RE: Minimize recovery time
Date: Wed, 27 Apr 2022 14:25:55 -0400
Message-ID: <067f01d85a64$3c6633f0$b5329bd0$_at_rsiz.com>
Maintain two standby recoveries, either with Data Guard or rolled your own.
Lag the application of logs on one of them long enough that you will be able to detect a cyber attack. (The immediate-apply one is for routine switchover or failover with approximately no delay.)
You will need space to hold and apply the relevant unapplied logs. Unless your generated log volume is huge, coming back up from your outage you will be able to apply many days of logs in 2 or 3 hours.
Application of logs happens in order, without any of the concurrency delays and multi-user issues of the original jobs (and without all those report queries taking up time). Roll forward will pretty much take your breath away, especially if you have space for the logs in the lag period still on whatever "class A" storage is for your environment.
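A minimal sketch of the lagged-apply standby described above, assuming a physical standby and a 24-hour detection window (the standby name and delay value are illustrative, not from the original post):

```sql
-- With the Data Guard broker, set an apply delay on the second standby
-- (DGMGRL prompt):
--   EDIT DATABASE 'STBY2' SET PROPERTY DelayMins = '1440';

-- Without the broker, start managed recovery on the standby with an
-- explicit delay of 1440 minutes (24 hours):
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE
  DELAY 1440 DISCONNECT FROM SESSION;

-- Once you are satisfied no attack occurred, cancel the delay and let the
-- roll forward catch up at full speed:
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE
  NODELAY DISCONNECT FROM SESSION;
```

The delayed standby never sees corruption until the delay expires, which is what buys the detection window.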
I suppose someone could argue that an attacker could also target your log area, so everything must be air gapped for recovery. That's the argument for alternating backups on at least two SANs, of which only one is plugged in at a time; you copy all your logs there as of the time of the backup, and copy your logs after that point through an outbound-only pipe that requires action to mount.
Good luck.
From: Karthikeyan Panchanathan [mailto:keyantech_at_gmail.com]
Sent: Wednesday, April 27, 2022 12:32 PM
To: loknath.73_at_gmail.com; Mark W. Farnham; Andy Sayer
Cc: Oracle L
Subject: Re: Minimize recovery time
Handled a similar scenario: how quickly we could recover when the DB was 10TB. The RTO was less than 3 hours.
In our case we had a lot of old (history) compliance data with a longer retention policy. In my view, that data was using the DB as storage.
We worked with Compliance and the Business to push the history data into one schema/one tablespace, then exported it as a Data Pump dump and archived it on tape with the same data retention policy.
Once exported, that data was purged to reduce the DB size. We were able to bring the RTO under 3 hours.
It worked in our scenario. Sharing here in case it is helpful.
Karth
From: oracle-l-bounce_at_freelists.org <oracle-l-bounce_at_freelists.org> on
behalf of Lok P <loknath.73_at_gmail.com>
Sent: Wednesday, April 27, 2022 11:53:34 AM
To: Mark W. Farnham <mwf_at_rsiz.com>; Andy Sayer <andysayer_at_gmail.com>
Cc: Oracle L <oracle-l_at_freelists.org>
Subject: Re: Minimize recovery time
Just checked: it is really not for guarding against a multi-location disaster but rather against a cyber attack. If the data is corrupted on the primary, the same corruption will be propagated to the secondary/Data Guard site, so in that case we will need to rely on the backup/recovery process RTO.
Also, with regard to table/index compression: in another database we are seeing that table compression with the 'compress for query high' option decreases the data to one third of its original size. So is it safe to go for this compression as an initial approach and test the OLTP application against it? For indexes it appears only key compression is available, so we need to look carefully at any non-unique indexes and at what storage space benefit we would get out of them. Correct me if I am wrong.
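A sketch of testing both kinds of compression mentioned above; the table and index names are illustrative assumptions (COMPRESS FOR QUERY HIGH is Hybrid Columnar Compression, available here because the platform is Exadata):

```sql
-- Rebuild a test copy of the table with HCC query compression and
-- measure the size change before touching the OLTP schema:
ALTER TABLE sales MOVE COMPRESS FOR QUERY HIGH;

-- Confirm the compression setting took effect:
SELECT compression, compress_for
FROM   user_tables
WHERE  table_name = 'SALES';

-- Indexes only support key (prefix) compression; on a non-unique index,
-- compress the leading column(s) -- here, the first column:
ALTER INDEX sales_ix REBUILD COMPRESS 1;
```

Note the DML caveat in the thread is real: HCC-compressed rows that are later updated migrate to a less compressed format, so it fits cold/history segments better than hot OLTP ones.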
On Wed, 27 Apr 2022, 7:08 pm Mark W. Farnham, <mwf_at_rsiz.com> wrote:
If the business requirement is truly that multiple-site disasters still provide business continuation, you have a difficult task.
First, you should try to gain an understanding amongst the stakeholders that
IF you are guarding against multiple data center disasters (otherwise a
dataguard or a remote standby catch-up and fail-over seems sufficient), that
implies you have a third repository of the data far away from the first two,
most likely with an agreement with a third party to spin up hardware to
recover on at their site.
Very likely they will then understand that your current setup for failover is sufficient for the requirement.
IF I am wrong about that, then the most likely solution is to introduce time
based partitioning of all the data that in fact has a date after which it is
not allowed to be changed AND is not required for the operations that must
be available for business continuation. (Rarely are old transaction
histories required with the same immediacy as current inventory quantities,
and so forth).
IF sufficient data meeting those characteristics can be identified that will
permanently keep you within your physical reload recovery window, then you
also need to be in a position to shuffle (probably shrinking down free space
and permanently doing useful attribute clustering) partitions to "slower
recovery okay" tablespaces.
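The partition shuffle described above might look like this; the table, partition, and tablespace names are illustrative assumptions:

```sql
-- Move a closed time-based partition to a "slower recovery okay"
-- tablespace, compressing it and reclaiming free space in the process,
-- while keeping global indexes usable:
ALTER TABLE txn MOVE PARTITION p_2019
  TABLESPACE hist_slow_ts
  COMPRESS FOR QUERY HIGH
  UPDATE INDEXES;

-- Once all its partitions are final, freeze the tablespace so it needs
-- only one last backup and never blocks the critical recovery path:
ALTER TABLESPACE hist_slow_ts READ ONLY;
```

Read-only tablespaces are the payoff: they drop out of the repeated backup workload entirely and can be restored after the critical functions are already back up.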
Then you can practice the plan to bring up immediately only those
tablespaces required for operations that need the stated business
continuation immediacy (continuing the reload and recovery of the other
tablespaces after business of the critical functions resumes.)
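A sketch of that selective bring-up, assuming the critical data lives in a tablespace CRIT_TS and the history in HIST_SLOW_TS (both names illustrative):

```sql
-- In RMAN, with the database mounted, restore and recover only what the
-- critical operations need:
--   RMAN> RESTORE TABLESPACE system, sysaux, undotbs1, crit_ts;
--   RMAN> RECOVER TABLESPACE system, sysaux, undotbs1, crit_ts;

-- Take the not-yet-restored datafiles offline so the database can open
-- without them (path is illustrative):
ALTER DATABASE DATAFILE '/u01/oradata/hist01.dbf' OFFLINE;
ALTER DATABASE OPEN;

-- After the critical functions resume, restore/recover the rest and:
ALTER TABLESPACE hist_slow_ts ONLINE;
```

This is exactly the plan that has to be practiced, since the set of "critical" tablespaces drifts as the application changes.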
Avoiding this entire race dance is the point of online recovery mechanisms: modern systems, often on SSD, quite often grow far too large to "back up" onto more persistent storage, in the sense that you can no longer read that persistent storage back onto storage connected to your machine within any useful window.
Another possible way to do this is to plug multiple SANs into your
machine(s). This, of course, does not handle the multiple site disaster
problem. You don't keep any "current" data on the alternated SANs (of which
you have a minimum of two), because you never start overwriting your only
complete backup file set. After a backup is complete, the relevant SAN is
physically disconnected.
Then, in a "storage disaster", after you clean up the host from the software
hack that was likely the cause, you connect your most recent backup SAN and
away you go.
Not all machines have connections for plugging in multiple SANs, and of
course you can't make these SANs "virtual" storage. You're unplugging one to
make it air gapped from attack. You might have an air gapped machine to plug
it into to run full surface scans and memory checks (SSD), but that entire
set-up is non-networked.
When they balk at the cost, perhaps it is time to engage a certified actuary to explain to them what rare case they are insuring against (and they probably don't have everything they need in place for the plan to possibly succeed).
And, of course, any plan you have that you don't test regularly is just
wishful thinking. Testing plug-in replacement storage is probably a bigger
risk than relying on something like dataguard or storage snapshots.
If they are worried about this, do they have multiple physically independent
communications infrastructure? How about power generators?
Good luck,
mwf
From: oracle-l-bounce_at_freelists.org [mailto:oracle-l-bounce_at_freelists.org]
On Behalf Of Lok P
Yes, they are in different data centers, and those are in different locations too. And backups are being taken of both primary and secondary through ZDLRA, and I believe the respective backups are kept in the respective data centers on their configured ZDLRA storage.
On Wed, 27 Apr 2022, 4:06 pm Andy Sayer, <andysayer_at_gmail.com> wrote:
Your dataguard is using the same storage as your primary? Usually it would
be a whole different data centre. Where are your backups going?
On Wed, 27 Apr 2022 at 11:35, Lok P <loknath.73_at_gmail.com> wrote:
Yes, we have a Data Guard setup, but this agreement is in place for the case where both the primary and the Data Guard DB fail because of disaster, corruption, etc.
On Wed, 27 Apr 2022, 3:30 pm Andy Sayer, <andysayer_at_gmail.com> wrote:
Have you considered Dataguard? You'd have a secondary database always ready
to failover to.
Thanks,
Andy
On Wed, 27 Apr 2022 at 10:50, Lok P <loknath.73_at_gmail.com> wrote:
Hello Listers, we have an Oracle Exadata (X7) database on 12.1.0.2.0, and it has now grown to 12TB. Per the client agreement and the criticality of this application, the RTO (recovery time objective) has to be within ~4 hours. The team looking after backup and recovery has communicated the RTO (recovery
Sent: Wednesday, April 27, 2022 7:21 AM
To: Andy Sayer
Cc: Oracle L
Subject: Re: Minimize recovery time
Going through the top space consumers, we see those are table/index subpartitions and non-partitioned indexes. Should we look into table/index compression here? But then I think there is also a downside to that on DML performance.
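A sketch of the query behind such a top-space-consumers list; DBA_SEGMENTS is a standard dictionary view, and the 100GB threshold is an illustrative assumption:

```sql
-- List the largest segments to find compression and purge candidates:
SELECT owner, segment_name, segment_type,
       ROUND(SUM(bytes) / 1024 / 1024 / 1024, 1) AS gb
FROM   dba_segments
GROUP  BY owner, segment_name, segment_type
HAVING SUM(bytes) > 100 * 1024 * 1024 * 1024   -- segments over 100GB
ORDER  BY gb DESC;
```

Grouping by segment name rolls the sub-partitions mentioned above up into one row per object, which makes the real consumers easier to see.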
Wanted to understand: is there any other option (apart from exploring a possible data purge) to make this RTO faster or bring it under the service agreement? How should we approach this?
Regards
Lok
--
http://www.freelists.org/webpage/oracle-l
Received on Wed Apr 27 2022 - 20:25:55 CEST