RE: Lightweight method for testing database backup processes

From: Matthew Parker <dimensional.dba_at_comcast.net>
Date: Mon, 21 Aug 2017 08:04:32 -0700
Message-ID: <014601d31a8e$ccdefb80$669cf280$_at_comcast.net>



Most organizations that are subject to any type of compliance requirement, such as SOX, are required to test their backups.

It is good practice for anyone to test complex processes on a regular basis, as Tim states, to ensure they work as advertised and that no unnoticed problem has crept into the process.

I had one customer who was performing backups with a script set up by a consultant. The script was great and hit all the elements of a good backup. However, somewhere in the process of communicating with the person swapping out the tapes, the archive log backup tapes were not being included in the offsite storage along with the online backups. The client had over 5,000 tapes offsite for that single key database that were basically worthless. A simple test would have identified the problem in the process. I happened to catch their snafu and got a full backup of the database two days before their whole SAN basically went up in smoke, and we were able to recover the system.

I have been at organizations large enough that the restores performed on a regular basis (thousands of databases, 2-3 standby re-instantiations a month, along with 50-75 archive log restores a month to maintain standbys) constituted statistical proof that our backups were working and viable.

However, some organizations, depending on the audit organization, required proof for every single database, as the auditors didn't believe in the concept of statistical sampling (strange for an audit organization, but it happens).

In those cases we set up automated partial and full restores, along with running RMAN VALIDATE against backups that were still onsite. Of course, most audit/compliance teams also require pulling backups from offsite storage and proving that they restore, and even randomly checking tapes in the tape library to verify that the serial numbers in the catalog match the serial numbers in the libraries.
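
The onsite validation piece can be automated fairly simply. Below is a minimal sketch of that kind of automation, assuming rman is on the PATH and a shell-level loop over instances; the recovery window and invocation details are illustrative, not anyone's actual setup. VALIDATE reads the backup pieces without performing a real restore, which is what makes it cheap enough to run on a schedule.

```python
# Sketch of automated RMAN VALIDATE runs against onsite backups.
# Assumptions: rman is on PATH, OS authentication works ("target /"),
# and the instance is selected via ORACLE_SID.
import os
import subprocess

def build_validate_script(recovery_window_days: int) -> str:
    """Render an RMAN command file that validates backups in the window."""
    return "\n".join([
        "RUN {",
        "  RESTORE DATABASE VALIDATE;",
        f"  RESTORE ARCHIVELOG FROM TIME 'SYSDATE-{recovery_window_days}' VALIDATE;",
        "}",
    ])

def run_validate(oracle_sid: str, recovery_window_days: int) -> int:
    """Feed the script to rman for one instance; returns rman's exit code."""
    proc = subprocess.run(
        ["rman", "target", "/"],
        input=build_validate_script(recovery_window_days),
        text=True,
        capture_output=True,
        env={**os.environ, "ORACLE_SID": oracle_sid},
    )
    return proc.returncode
```

A scheduler (cron, OEM, etc.) would call `run_validate` per database and file the output into whatever ticket/reporting system the auditors have access to.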

In some organizations the restores were full restores; in other cases we could get by with only a partial restore of the database (SYSTEM, SYSAUX, UNDO, and one random tablespace), especially over time, once we could show that we had never had an incident where a backup that passed RMAN VALIDATE could not be restored. (From an audit perspective, all restores were required to be tracked in a ticket/reporting system.)

It also helps if you walk the compliance folks through the requirements of the system components and help your organization write the SOX compliance rules surrounding backups. The key part of most compliance-related requirements is to make access to logs and tickets as easy as possible for the auditors, and to ensure that every recovery is documented. The auditors are much happier having a documentation trail to follow for any aberration or failure in the systems. The more perfect a system looks ("we have no failures and the world is perfect"), the deeper the auditors will dig.

Then, as a general business principle: I can't tell you the number of times the CFO has asked about the monthly financial backups going offsite and how sure I am that they are good. In the end it doesn't matter how much confidence I had in the process or how well I could explain all the safeties we had in place; the CFO wasn't comfortable until we could restore a previous offsite backup at the compliance limit (seven years old) and his Director of Accounting could actually query the system and say it was good.

In the end it is all part of the job, and more specifically it is a requirement for each of us as DBAs to ensure that the backups are actually good, not to assume that they are good. Simply look at the number of RMAN bugs in the quarterly PSUs, or backup software bugs in their patches.

At one client, we proved through three months of testing that the backups in Commvault were not complete because of bugs in their software. They have of course since remedied the bugs, but it was pretty scary to learn that a very expensive piece of commercial software, backup software that has to be right, was actually broken and that we were at risk of losing our systems in a real failure.

Matthew Parker

Chief Technologist

Dimensional DBA

425-891-7934 (cell)

D&B 047931344

CAGE 7J5S7
Dimensional.dba_at_comcast.net

View Matthew Parker's profile on LinkedIn: http://www.linkedin.com/pub/matthew-parker/6/51b/944/

<http://www.dimensionaldba.com/> www.dimensionaldba.com
   

From: oracle-l-bounce_at_freelists.org [mailto:oracle-l-bounce_at_freelists.org] On Behalf Of Noveljic Nenad
Sent: Monday, August 21, 2017 7:31 AM
To: 'cstephens16_at_gmail.com' <cstephens16_at_gmail.com>; gogala.mladen_at_gmail.com; oracle-l_at_freelists.org
Subject: RE: Lightweight method for testing database backup processes

I’ve implemented a restore end-to-end test solution similar to your specification, but with the following differences:

  1. I don’t pick databases to restore randomly. First, I take new databases which have never been restored. Second, I prioritize the databases by the last restore time.
  2. Same as you’ve suggested.
  3. No need to convert the time to an SCN. I just specify the time obtained in step 2.
  4. I restore the whole database instead of individual tables. Luckily, that is feasible at my site.
  5. The alerting is integrated in the enterprise monitoring solution (Nagios) instead of sending e-mails around.
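
The prioritization in point 1 can be sketched in a few lines. This is a hedged illustration, not the actual (Perl) implementation: the input shape, a list of (name, last-restore-time) pairs, is an assumption.

```python
# Sketch of point 1: never-restored databases come first, then the rest
# ordered so the database with the oldest last restore is tested next.
from datetime import datetime
from typing import Optional

def prioritize(databases: list[tuple[str, Optional[datetime]]]) -> list[str]:
    # Databases with no recorded restore are the riskiest: test them first.
    never = [name for name, last in databases if last is None]
    # Everything else: oldest last-restore first.
    tested = sorted(
        ((name, last) for name, last in databases if last is not None),
        key=lambda pair: pair[1],
    )
    return never + [name for name, _ in tested]
```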

Besides that, it has proven useful to load the RMAN restore log files into a database, so they can easily be made accessible self-service to anyone (auditors, internal audit, management, etc.) who wants evidence of a successful restore.
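
As a rough sketch of that log-loading idea: RMAN prints a "Finished restore at <date>" line on success, which is enough to build a minimal evidence table. SQLite and the table layout here are illustrative stand-ins for whatever repository the auditors would actually query.

```python
# Hedged sketch: load RMAN restore logs into a query-able evidence store.
# The store (sqlite) and schema are assumptions; only the
# "Finished restore at" marker comes from real RMAN output.
import re
import sqlite3

def init_store(conn: sqlite3.Connection) -> None:
    conn.execute(
        "CREATE TABLE IF NOT EXISTS restore_evidence "
        "(db TEXT, finished_at TEXT, status TEXT)"
    )

def load_restore_log(conn: sqlite3.Connection, db: str, log_text: str) -> str:
    """Record one restore log; returns the derived status."""
    match = re.search(r"Finished restore at (\S+)", log_text)
    status = "SUCCESS" if match else "NO-RESTORE-FOUND"
    conn.execute(
        "INSERT INTO restore_evidence VALUES (?, ?, ?)",
        (db, match.group(1) if match else None, status),
    )
    return status
```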

The solution is implemented in object-oriented Perl enriched with a bunch of useful CPAN modules.  

I sleep much better knowing that my databases have been successfully restored every once in a while.  

Cheers,  

Nenad  

Twitter: _at_NenadNoveljic

Home Page: http://nenadnoveljic.com/blog/          

From: oracle-l-bounce_at_freelists.org [mailto:oracle-l-bounce_at_freelists.org] On Behalf Of Chris Stephens
Sent: Monday, 21 August 2017 16:05
To: gogala.mladen_at_gmail.com; oracle-l_at_freelists.org
Subject: Re: Lightweight method for testing database backup processes

I implicitly test my car every morning by driving it to work. I guess it would no longer be a daily test if I lost my job because my database wasn't recoverable. I get your point, though.

thanks for the link Kellyn!

On Mon, Aug 21, 2017 at 8:55 AM Mladen Gogala <gogala.mladen_at_gmail.com> wrote:

Why would you want to test RMAN backups? Do you have any doubts about their quality? You can do "restore validate" if you suspect a problem. I understand that you should be cautious, but RMAN has proven itself many times over; there is no need to test whether it will work. Do you test your car every morning?

On 08/21/2017 09:39 AM, Chris Stephens wrote:

We are looking for an efficient way to regularly test RMAN backups across a large (and growing) Exadata database environment.  

After watching this video https://youtu.be/Ds1xrfdlZRc I thought about doing the following:

create a dedicated, small tablespace in all databases to hold a single table with a single date/timestamp column. create a scheduler job to insert the current sysdate/systimestamp value once per day and delete all rows older than the RMAN recovery window setting.
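
A rough rendering of that setup, generated as SQL from a script so the recovery window can vary per database. All object names (tablespace, schema, table, job) are hypothetical, and sizing is placeholder only.

```python
# Sketch of the heartbeat tablespace, table, and daily scheduler job.
# Every identifier here is a made-up example; DBMS_SCHEDULER.CREATE_JOB
# and SYSDATE arithmetic are the only real Oracle pieces.
def heartbeat_ddl(recovery_window_days: int) -> str:
    return f"""\
CREATE TABLESPACE canary_ts DATAFILE SIZE 10M;
CREATE TABLE canary.heartbeat (ts DATE) TABLESPACE canary_ts;
BEGIN
  DBMS_SCHEDULER.CREATE_JOB(
    job_name        => 'CANARY_HEARTBEAT',
    job_type        => 'PLSQL_BLOCK',
    job_action      => 'INSERT INTO canary.heartbeat VALUES (SYSDATE);
                        DELETE FROM canary.heartbeat
                         WHERE ts < SYSDATE - {recovery_window_days};
                        COMMIT;',
    repeat_interval => 'FREQ=DAILY',
    enabled         => TRUE);
END;
/"""
```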

write a script to 1) randomly pick a database on each Exadata system 2) randomly pick a day that falls within the recovery window requirement for that database 3) convert that day to a valid SCN 4) use the new table PITR functionality to restore the table 5) confirm the expected table content 6) send a success/failure summary email.
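
The random-selection and table-PITR steps could be sketched as below. The table/database names and the auxiliary destination path are assumptions, and the SCN is presumed to have been obtained already (e.g. via TIMESTAMP_TO_SCN, as in the conversion step); `RECOVER TABLE ... UNTIL SCN ... AUXILIARY DESTINATION ... REMAP TABLE` is the 12c table PITR syntax.

```python
# Sketch: pick a random database and day in the window, then build the
# RMAN table-PITR command. Names and paths are illustrative only.
import random
from datetime import datetime, timedelta

def pick_target(databases: list[str], window_days: int,
                rng: random.Random) -> tuple[str, datetime]:
    """Randomly choose a database and a day inside its recovery window."""
    db = rng.choice(databases)
    day = datetime.now() - timedelta(days=rng.randint(1, window_days))
    return db, day

def recover_table_cmd(scn: int, aux_dest: str = "/u01/aux") -> str:
    """Render the RMAN command that restores the canary table as of the SCN."""
    return (
        f"RECOVER TABLE canary.heartbeat UNTIL SCN {scn} "
        f"AUXILIARY DESTINATION '{aux_dest}' "
        "REMAP TABLE canary.heartbeat:heartbeat_restored;"
    )
```

After the restore, the checking step would simply query `heartbeat_restored` for the row nearest the chosen day and alert on a mismatch.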

execute the script with a frequency that makes us feel comfortable with our backups.  

we also intend to have a process that uses the RMAN "restore preview" command to get a list of backup pieces to run the RMAN "validate" command against, for a randomly chosen SCN that falls within the recovery window.
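
A hedged sketch of that preview-then-validate idea: run RESTORE DATABASE PREVIEW under SET UNTIL SCN, scrape the backup set keys from its output, and feed them to VALIDATE BACKUPSET. The sample output format below is only illustrative (RMAN lists backup sets under a "BS Key" header); parsing real output would need hardening.

```python
# Sketch: preview the restore for an SCN, then validate the backup sets
# the preview names. Output parsing is a rough approximation.
import re

def preview_script(scn: int) -> str:
    return "\n".join([
        "RUN {",
        f"  SET UNTIL SCN {scn};",
        "  RESTORE DATABASE PREVIEW;",
        "}",
    ])

def backupset_keys(preview_output: str) -> list[str]:
    """Pull BS Key values from the preview listing (approximate format)."""
    keys = []
    for line in preview_output.splitlines():
        match = re.match(r"\s*(\d+)\s+(Full|Incr)\b", line)
        if match:
            keys.append(match.group(1))
    return keys

def validate_cmd(keys: list[str]) -> str:
    return f"VALIDATE BACKUPSET {', '.join(keys)};"
```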

Does anyone see any big issues with this process? Any other ideas for efficiently testing database backups? Our databases will soon be large enough to make testing through full restores infeasible.

any feedback is greatly appreciated!  

thanks,

chris    

-- 
Mladen Gogala
Oracle DBA
Tel: (347) 321-1217

____________________________________________________

Please consider the environment before printing this e-mail.







--
http://www.freelists.org/webpage/oracle-l
Received on Mon Aug 21 2017 - 17:04:32 CEST
