RE: Lightweight method for testing database backup processes

From: Matthew Parker <dimensional.dba_at_comcast.net>
Date: Mon, 21 Aug 2017 12:03:10 -0700
Message-ID: <012f01d31ab0$22f69cf0$68e3d6d0$_at_comcast.net>



You pull them from Glacier and you test them like anything else.  

I was part of the team on the initial design of Glacier at Amazon. One of the big issues we were dealing with was bit rot on cheap commodity SATA drives versus tape. We had determined that, even with implementing Reed-Solomon algorithms like tape has, we would need to read the data on disk every 11 days to ensure it was not going bad or already bad. In the tape world we only needed to read the tape every 6 months.  

How often do you read your backups (i.e., test them to ensure they are good)? If you are not reading them, then you are simply rolling the dice and hoping for a lucky 7 instead of coming up snake eyes.  

Specifically, CHECK, VALIDATE, or any other command substructure you wish to run in RMAN doesn't test the equivalent of an actual restore.  

I have had systems that missed writing a block, whether it was an Oracle bug or a hardware bug in the RAID controller or the SAN. The block that is in the system will pass RMAN VALIDATE, but when you actually perform the restore and roll through the archive logs you have a recovery failure, as the redo that needs to be applied to that block fails with a row-level apply mismatch, and the backup is basically worthless beyond that point.  
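As a toy illustration of the missed-write failure described above, here is a minimal Python sketch. It is not Oracle's block format or recovery code: the `Block`, `RedoRecord`, `validate`, and `apply_redo` names are hypothetical, and a simple version counter stands in for the block's change number. It only shows why a checksum-style validation can pass on a stale block while redo apply fails.

```python
import zlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    version: int      # toy stand-in for the block's change number (assumption)
    payload: bytes
    checksum: int


@dataclass(frozen=True)
class RedoRecord:
    expects_version: int   # block version this change was generated against
    new_payload: bytes


def make_block(version, payload):
    """Build a block whose checksum matches its payload."""
    return Block(version, payload, zlib.crc32(payload))


def validate(block):
    """Checksum-only check: is the block internally consistent?"""
    return zlib.crc32(block.payload) == block.checksum


def apply_redo(block, record):
    """Apply one change, refusing if the block is not at the expected version."""
    if block.version != record.expects_version:
        raise RuntimeError(
            f"redo expects version {record.expects_version}, "
            f"block is at version {block.version}"
        )
    return make_block(block.version + 1, record.new_payload)


def demo():
    # The database believed version 2 reached disk, but the write was lost,
    # so the disk (and therefore the backup) still holds an intact version 1.
    stale_in_backup = make_block(1, b"row=A")
    # Recovery only applies redo generated after the backup was taken.
    redo_after_backup = RedoRecord(expects_version=2, new_payload=b"row=C")

    passed_validation = validate(stale_in_backup)
    try:
        apply_redo(stale_in_backup, redo_after_backup)
        recovered = True
    except RuntimeError:
        recovered = False
    return passed_validation, recovered
```

In this toy model `demo()` reports that validation passed and recovery failed: the stale block is perfectly well formed, and only applying the redo exposes the problem.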

No current tool from Oracle or any backup vendor solves the problem of a missing write. I would prefer a solution that could detect from the undo/redo stream that a write was missing instead of actually performing the restore, but we will never have that without a restore: just as a restore requires the disk space of the database, such a check would require memory equivalent to the database to work. Only real testing will discover that or similar mismatches in the block.  

Most of your disk-based backup systems are not verifying the backup after the initial write to ensure overall integrity. I can't tell you how many SANs and backup appliances I have seen where the admins have turned off disk scrubbing because it was affecting the performance of the system, and therefore there really is no real testing within the backup suite itself.  
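A rough sketch of what re-reading backup pieces after the initial write could look like, in Python. This is an illustration under assumptions, not any vendor's mechanism: the function names and the idea of a manifest of SHA-256 digests recorded at backup time are hypothetical. Note that it catches bit rot and missing pieces only; a block that was stale when it was backed up still hashes the same every time, so this does not replace an actual restore test.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large backup pieces are not loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(backup_dir):
    """Record a digest per backup piece right after the backup is written."""
    return {
        piece.name: sha256_of(piece)
        for piece in sorted(Path(backup_dir).iterdir())
        if piece.is_file()
    }


def scrub(backup_dir, manifest):
    """Re-read every piece; return names that are missing or no longer match."""
    bad = []
    for name, expected in manifest.items():
        piece = Path(backup_dir) / name
        if not piece.is_file() or sha256_of(piece) != expected:
            bad.append(name)
    return bad
```

Run on a schedule, `scrub(backup_dir, manifest)` returning an empty list means every piece was readable and unchanged since the manifest was built.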

Matthew Parker

Chief Technologist

Dimensional DBA

425-891-7934 (cell)

D&B 047931344

CAGE 7J5S7
Dimensional.dba_at_comcast.net

http://www.linkedin.com/pub/matthew-parker/6/51b/944/

http://www.dimensionaldba.com/
   

From: Mladen Gogala [mailto:gogala.mladen_at_gmail.com]
Sent: Monday, August 21, 2017 11:37 AM
To: Matthew Parker <dimensional.dba_at_comcast.net>; nenad.noveljic_at_vontobel.ch; cstephens16_at_gmail.com; oracle-l_at_freelists.org
Subject: Re: Lightweight method for testing database backup processes

I am not working with tapes much these days, mostly with things like Glacier, cheap remote storage like Isilon, or a combination of both. So, backups are kept less than a week on the primary site, around a month on the Isilon, and almost forever in Amazon Glacier. Every modern backup suite has a built-in verification mechanism, which can verify whether a backup is good or not. You can also run "restore validate" on a regular schedule. I don't see a big science here. I have restored a TB-sized database from Glacier, no problems at all. There are also non-RMAN mechanisms like SRDF, HUR and SnapVault which can be used for backing up databases. At the last stage, a file backup of the snapshot is performed and stored to Glacier, to meet regulatory obligations. How do you propose to validate those on a weekly basis?

Regards  

On 08/21/2017 02:25 PM, Matthew Parker wrote:

I have to disagree with you. In most organizations it is not a DR test. A DR test serves a different purpose than ensuring that your backups and processes are good by testing them on a regular basis.  

It is not faith-based testing.

There is a variety of testing that can be performed.  

First there are backups that are offsite besides just database backups.

I have been in a variety of organizations that have quarterly SOX audits where we pull back a set of tapes based on a random selection of files by the auditor to verify that backups are statistically good.

There are also some organizations I have worked for where the requirement was to test all backups yearly. It was not a single yearly test: we were testing backups throughout the year to verify the system was working the whole time, not just at one selected point in time, and by the end of the year we had recovered each of the several thousand databases at least once. You normally set up automation for the onsite backups, but the selection of offsite backups to prove those processes too normally has some manual intervention.  

Testing of a single tablespace is a viability test of the database if you fully recover it to open the database. This is how lots of organizations that have Oracle databases in the 100 TB to 1 PB range test the viability of the backup. They don't necessarily have enough space to restore every portion of the database, but they can restore pieces at a time.

I also restored system, sysaux, undo and 1 tablespace through multiple cycles so that in the end the complete database was restore tested.  
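The rotation described above can be sketched in a few lines of Python. This is only an illustration: the function name `restore_cycles` and the tablespace names are hypothetical (the undo tablespace in particular will have a site-specific name), and the actual restore and recovery of each set would be done with your own backup tooling.

```python
# Tablespaces restored in every cycle so the database can be opened.
# The names are placeholders; substitute the real ones for your database.
CORE = ("SYSTEM", "SYSAUX", "UNDO")


def restore_cycles(tablespaces, core=CORE):
    """Return one restore set per cycle: the core tablespaces plus one other."""
    return [list(core) + [ts] for ts in tablespaces if ts not in core]


def covered(plan):
    """Return the set of tablespaces restore-tested once all cycles have run."""
    return {ts for cycle in plan for ts in cycle}
```

For example, `restore_cycles(["USERS", "DATA01", "DATA02"])` gives three cycles, and `covered(...)` of that plan contains every tablespace, which is the point of the rotation: each cycle needs space for only a small piece, yet the whole database is restore tested by the end.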

Having all your DBAs testing restores also keeps them practiced on the process and increases the interaction between the DBA and backup teams, which is always a good thing. Yes, I have been at organizations where they do not test backups at all, and then when the on-call DBA is pinged to do a restore and something is wrong, they fumble through SOPs trying to figure out what needs to be done, and the recovery takes longer than it should, or others have to become involved, because the DBAs are not practicing their craft.

It also helps your team capture changes in the process, as sometimes the different teams don't communicate well with each other, and it is better to discover a change that could be detrimental to you during a test instead of when you really need the backup.  

When I first started out as a DBA, the Senior DBA in our org set up a test system and put me through 30 days of disaster recovery training. He would destroy the database, and it was my job to restore it and explain how he had destroyed or broken it. It was invaluable training.

Matthew Parker


From: Mladen Gogala [mailto:gogala.mladen_at_gmail.com]
Sent: Monday, August 21, 2017 9:32 AM
To: Matthew Parker <dimensional.dba_at_comcast.net>; nenad.noveljic_at_vontobel.ch; cstephens16_at_gmail.com; oracle-l_at_freelists.org
Subject: Re: Lightweight method for testing database backup processes

On 08/21/2017 11:04 AM, Matthew Parker wrote:

Most organizations who have to participate in any type of compliance requirements such as SOX Compliance are required to test their backups.

And most organizations do perform such testing. Such practice is called a "DR test" and usually occurs once or twice per year. Testing on a daily or weekly schedule is unusual, even if it's only done using "restore validate". Furthermore, the OP proposed testing backup/restore of a specific tablespace. I don't see how correctness of the tablespace backup guarantees the correctness of the full or incremental database backup. That looks like a faith-based testing strategy.

Regards

-- 
Mladen Gogala
Oracle DBA
Tel: (347) 321-1217







--
http://www.freelists.org/webpage/oracle-l
Received on Mon Aug 21 2017 - 21:03:10 CEST
