RE: Minimize recovery time
Date: Wed, 27 Apr 2022 14:25:55 -0400
Message-ID: <067f01d85a64$3c6633f0$b5329bd0$_at_rsiz.com>
Maintain two standby recoveries, either with Data Guard or rolled your own.
Lag the application of logs on one of them long enough that you will be able to detect a cyber attack. (The immediate-apply one is for routine switchover or failover with approximately no delay.)
You will need space to hold and apply the relevant unapplied logs. Unless your generated log volume is huge, coming back up from your outage you will be able to apply many days of logs in 2 or 3 hours.
Application of logs happens in order, without any of the concurrency delays and multi-user issues of the original jobs (and without all those report queries taking up time). Roll forward will pretty much take your breath away, especially if you have space for the logs in the lag period still on whatever "class A" storage is for your environment.
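A minimal sketch of the lagged-apply standby described above, assuming a physical standby and a 24-hour detection window (the standby name and delay value are illustrative, not from the original post):

```sql
-- With the Data Guard broker, set an apply delay on the second standby
-- (DGMGRL prompt):
--   EDIT DATABASE 'STBY2' SET PROPERTY DelayMins = '1440';

-- Without the broker, start managed recovery on the standby with an
-- explicit delay of 1440 minutes (24 hours):
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE
  DELAY 1440 DISCONNECT FROM SESSION;

-- Once you are satisfied no attack occurred, cancel the delay and let the
-- roll forward catch up at full speed:
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE
  NODELAY DISCONNECT FROM SESSION;
```

The delayed standby never sees corruption until the delay expires, which is what buys the detection window.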
I suppose someone could argue that an attacker could also target your log area, so everything must be air gapped for recovery. That's the argument for alternating backups on at least two SANs, of which only one is plugged in at a time; you copy all your logs there as of the time of the backup, and copy your logs after that point through an outbound-only pipe that requires action to mount.
Good luck.
From: Karthikeyan Panchanathan [mailto:keyantech_at_gmail.com]
Sent: Wednesday, April 27, 2022 12:32 PM
To: loknath.73_at_gmail.com; Mark W. Farnham; Andy Sayer
Cc: Oracle L
Subject: Re: Minimize recovery time
Handled a similar scenario: how quickly we could recover when the DB was 10TB. The RTO was less than 3 hours.
In our case we had a lot of old (history) compliance data with a longer retention policy. In my view, that data was using the DB as storage.
We worked with Compliance and the Business to push the history data into one schema/one tablespace, then exported it as a Data Pump dump and archived it on tape with the same data retention policy.
Once exported, that data was purged to reduce the DB size. We were able to bring the RTO under 3 hours.
It worked in our scenario. Sharing here in case it is helpful.
Karth
From: oracle-l-bounce_at_freelists.org <oracle-l-bounce_at_freelists.org> on
behalf of Lok P <loknath.73_at_gmail.com>
Sent: Wednesday, April 27, 2022 11:53:34 AM
To: Mark W. Farnham <mwf_at_rsiz.com>; Andy Sayer <andysayer_at_gmail.com>
Cc: Oracle L <oracle-l_at_freelists.org>
Subject: Re: Minimize recovery time
Just checked: it is really not for guarding against a multi-location disaster but rather against a cyber attack. If the data is corrupted on the primary, the same corruption will be propagated to the secondary/Data Guard site, so in that case we will need to rely on the backup/recovery process RTO.
Also, with regard to table/index compression: in another database we are seeing that table compression with the 'compress for query high' option decreases the data to one third of its original size. So is it safe to go for this compression as an initial approach and test the OLTP application against it? For indexes it appears only key compression is available, so we need to look carefully at any non-unique indexes and at what storage space benefit we would get out of them. Correct me if I am wrong.
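A sketch of testing both kinds of compression mentioned above; the table and index names are illustrative assumptions (COMPRESS FOR QUERY HIGH is Hybrid Columnar Compression, available here because the platform is Exadata):

```sql
-- Rebuild a test copy of the table with HCC query compression and
-- measure the size change before touching the OLTP schema:
ALTER TABLE sales MOVE COMPRESS FOR QUERY HIGH;

-- Confirm the compression setting took effect:
SELECT compression, compress_for
FROM   user_tables
WHERE  table_name = 'SALES';

-- Indexes only support key (prefix) compression; on a non-unique index,
-- compress the leading column(s) -- here, the first column:
ALTER INDEX sales_ix REBUILD COMPRESS 1;
```

Note the DML caveat in the thread is real: HCC-compressed rows that are later updated migrate to a less compressed format, so it fits cold/history segments better than hot OLTP ones.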
On Wed, 27 Apr 2022, 7:08 pm Mark W. Farnham, <mwf_at_rsiz.com> wrote:
If the business requirement is truly that multiple-site disasters still provide business continuation, you have a difficult task.
First, you should try to gain an understanding amongst the stakeholders that
IF you are guarding against multiple data center disasters (otherwise a
dataguard or a remote standby catch-up and fail-over seems sufficient), that
implies you have a third repository of the data far away from the first two,
most likely with an agreement with a third party to spin up hardware to
recover on at their site.
Very likely they will then understand that your current setup for failover is sufficient for the requirement.
IF I am wrong about that, then the most likely solution is to introduce time
based partitioning of all the data that in fact has a date after which it is
not allowed to be changed AND is not required for the operations that must
be available for business continuation. (Rarely are old transaction
histories required with the same immediacy as current inventory quantities,
and so forth).
IF sufficient data meeting those characteristics can be identified that will
permanently keep you within your physical reload recovery window, then you
also need to be in a position to shuffle (probably shrinking down free space
and permanently doing useful attribute clustering) partitions to "slower
recovery okay" tablespaces.
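The partition shuffle described above might look like this; the table, partition, and tablespace names are illustrative assumptions:

```sql
-- Move a closed time-based partition to a "slower recovery okay"
-- tablespace, compressing it and reclaiming free space in the process,
-- while keeping global indexes usable:
ALTER TABLE txn MOVE PARTITION p_2019
  TABLESPACE hist_slow_ts
  COMPRESS FOR QUERY HIGH
  UPDATE INDEXES;

-- Once all its partitions are final, freeze the tablespace so it needs
-- only one last backup and never blocks the critical recovery path:
ALTER TABLESPACE hist_slow_ts READ ONLY;
```

Read-only tablespaces are the payoff: they drop out of the repeated backup workload entirely and can be restored after the critical functions are already back up.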
Then you can practice the plan to bring up immediately only those
tablespaces required for operations that need the stated business
continuation immediacy (continuing the reload and recovery of the other
tablespaces after business of the critical functions resumes.)
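A sketch of that selective bring-up, assuming the critical data lives in a tablespace CRIT_TS and the history in HIST_SLOW_TS (both names illustrative):

```sql
-- In RMAN, with the database mounted, restore and recover only what the
-- critical operations need:
--   RMAN> RESTORE TABLESPACE system, sysaux, undotbs1, crit_ts;
--   RMAN> RECOVER TABLESPACE system, sysaux, undotbs1, crit_ts;

-- Take the not-yet-restored datafiles offline so the database can open
-- without them (path is illustrative):
ALTER DATABASE DATAFILE '/u01/oradata/hist01.dbf' OFFLINE;
ALTER DATABASE OPEN;

-- After the critical functions resume, restore/recover the rest and:
ALTER TABLESPACE hist_slow_ts ONLINE;
```

This is exactly the plan that has to be practiced, since the set of "critical" tablespaces drifts as the application changes.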
Avoiding this entire race dance is the point of online recovery mechanisms: modern systems, often on SSD, quite often grow far too large to "back up" onto more persistent storage, in the sense that you can no longer read that persistent storage back onto storage connected to your machine within any useful window.
Another possible way to do this is to plug multiple SANs into your
machine(s). This, of course, does not handle the multiple site disaster
problem. You don't keep any "current" data on the alternated SANs (of which
you have a minimum of two), because you never start overwriting your only
complete backup file set. After a backup is complete, the relevant SAN is
physically disconnected.
Then, in a "storage disaster", after you clean up the host from the software
hack that was likely the cause, you connect your most recent backup SAN and
away you go.
Not all machines have connections for plugging in multiple SANs, and of
course you can't make these SANs "virtual" storage. You're unplugging one to
make it air gapped from attack. You might have an air gapped machine to plug
it into to run full surface scans and memory checks (SSD), but that entire
set-up is non-networked.
When they balk at the cost, perhaps it is time to engage a certified actuary to explain to them what rare case they are insuring against (and they probably don't have everything they need in place for the plan to possibly succeed).
And, of course, any plan you have that you don't test regularly is just
wishful thinking. Testing plug-in replacement storage is probably a bigger
risk than relying on something like dataguard or storage snapshots.
If they are worried about this, do they have multiple physically independent
communications infrastructure? How about power generators?
Good luck,
mwf
From: oracle-l-bounce_at_freelists.org [mailto:oracle-l-bounce_at_freelists.org]
On Behalf Of Lok P
Yes, they are in different data centers, and those are in different locations too. And backups are being taken of both primary and secondary through ZDLRA, and I believe the respective backups are kept in the respective data centers on their configured ZDLRA storage.
On Wed, 27 Apr 2022, 4:06 pm Andy Sayer, <andysayer_at_gmail.com> wrote:
Your dataguard is using the same storage as your primary? Usually it would
be a whole different data centre. Where are your backups going?
On Wed, 27 Apr 2022 at 11:35, Lok P <loknath.73_at_gmail.com> wrote:
Yes, we have a Data Guard setup, but this agreement is in place for the case where both the primary and the Data Guard DB fail because of disaster, corruption, etc.
On Wed, 27 Apr 2022, 3:30 pm Andy Sayer, <andysayer_at_gmail.com> wrote:
Have you considered Dataguard? You'd have a secondary database always ready
to failover to.
Thanks,
Andy
On Wed, 27 Apr 2022 at 10:50, Lok P <loknath.73_at_gmail.com> wrote:
Hello Listers, we have an Oracle Exadata (X7) database on 12.1.0.2.0, and it has now grown to 12TB. Per the client agreement and the criticality of this application, the RTO (recovery time objective) has to be within ~4 hours. The team looking after backup and recovery has communicated the RTO (recovery
Sent: Wednesday, April 27, 2022 7:21 AM
To: Andy Sayer
Cc: Oracle L
Subject: Re: Minimize recovery time
Going through the top space consumers, we see those are table/index subpartitions and non-partitioned indexes. Should we look into table/index compression here? But then I think there is also a downside to that on DML performance.
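A sketch of the query behind such a top-space-consumers list; DBA_SEGMENTS is a standard dictionary view, and the 100GB threshold is an illustrative assumption:

```sql
-- List the largest segments to find compression and purge candidates:
SELECT owner, segment_name, segment_type,
       ROUND(SUM(bytes) / 1024 / 1024 / 1024, 1) AS gb
FROM   dba_segments
GROUP  BY owner, segment_name, segment_type
HAVING SUM(bytes) > 100 * 1024 * 1024 * 1024   -- segments over 100GB
ORDER  BY gb DESC;
```

Grouping by segment name rolls the sub-partitions mentioned above up into one row per object, which makes the real consumers easier to see.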
Wanted to understand: is there any other option (apart from exploring a possible data purge) to make this RTO faster or bring it under the service agreement? How should we approach this?
Regards
Lok
--
http://www.freelists.org/webpage/oracle-l
Received on Wed Apr 27 2022 - 20:25:55 CEST