RE: Backups versus snapshots

From: Dimensional DBA <dimensional.dba_at_comcast.net>
Date: Sun, 21 Sep 2014 19:05:17 -0700
Message-ID: <005701cfd609$a988d380$fc9a7a80$_at_comcast.net>



That depended on the company.

At the one company I worked at that was very good at mothballing equipment, each team that had a part in the recovery also had a part in producing and testing the documentation before the equipment was mothballed. I happened to be at that company during the Y2K preparations, and we actually performed our tests assuming that the remote DR site personnel had to recover the systems using only the documentation provided, without our help. It is a very good test to let a system admin who is unfamiliar with your systems perform the full recovery just by following your documentation. The documentation covered not just the individual steps for any one system but also the order of recovery: OS, infrastructure, app, and database.

Most places I have been at, we created documentation under the same concept: all groups must participate in the documentation process and in testing. Testing is key. However, there are cases where full testing is not possible and you simply have to plan with as many redundancies and interlocking systems as possible to ensure success at a certain confidence level. One example from my time with Oracle consulting was a backup and recovery process for a specific US Government agency to ensure system recovery after a CBRN attack or disaster. You can test the processes under non-CBRN conditions, but ensuring success under a full CBRN worst-case scenario is essentially impossible to guarantee. You can only provide levels of confidence, with the understanding that even at a 99.99% confidence level you could still fail, and then you put in redundancies based on bootstrapping the system from scratch (fresh install) and starting to run again. Not the option that people want, but an option you always have to consider for worst-case disasters.

During that Y2K testing we had the backup catalog fail to recover and had to read all the media back in to rebuild the catalog. We added that to our process, since we had to engage Symantec support to work through their undocumented tools, which under a real disaster may not have been available; so the testing picked up another use case we hadn't planned for. We had all sorts of use cases covering different failure scenarios, and what you had to do to get past each of them, in the documentation.

In the regular business world at Amazon, all the engineering teams participated in the documentation and testing of their different parts. The backup team's responsibility included ensuring we had isolated all the systems: the kickstart servers, the wiki media databases where documentation was stored, the databases and the SW from the app teams stored in SVN repositories, the database SW images from the database engineering team, and then our backup SW itself.

It takes a lot to restore a complete environment so that everything is available for the next team to do its part. The DBAs are a small part of the big picture. Most of the time is spent in simple thought experiments, coming up with all the possible disasters and holes in your plans that you need to create an SOP for in advance of needing it.

Matthew Parker
Chief Technologist
425-891-7934 (cell)
Dimensional.dba_at_comcast.net

-----Original Message-----
From: Herring, David [mailto:HerringD_at_DNB.com]
Sent: Sunday, September 21, 2014 12:02 PM
To: dimensional.dba_at_comcast.net; yparesh_at_gmail.com; iggy_fernandez_at_hotmail.com
Cc: kmoore_at_zephyrus.com; 'Oracle-L_at_freelists.org'
Subject: RE: Backups versus snapshots

Matthew,

Great info and thanks for fully explaining what you've done in the past. I'm curious about the documentation aspect. Were the DBAs in charge of documenting every procedure? Who validated the procedures were kept up-to-date? Where did you store that documentation so it would be fully accessible from wherever needed (and obviously retained as long as there are backups that require it)?

It's one thing to figure out how something should be done and prove it works. It's another to make sure that information is available to everyone who could need it.

Dave Herring

From: oracle-l-bounce_at_freelists.org [mailto:oracle-l-bounce_at_freelists.org] On Behalf Of Dimensional DBA
Sent: Friday, September 19, 2014 1:20 AM
To: yparesh_at_gmail.com; iggy_fernandez_at_hotmail.com
Cc: kmoore_at_zephyrus.com; 'Oracle-L_at_freelists.org'
Subject: RE: Backups versus snapshots

We didn't follow the version route. The Global backups team at Amazon was 1 manager and 3 engineers, responsible for tape systems and backup SW worldwide (6 PB/month). The objective at Amazon was automation, not scaling the human workforce: solve the problems of today and the future with good design.

The new process was simply to upgrade and uplift media on a 7-year cycle. The 7-year cycle was based on calculations of tape media failure in the more hostile 90-degree-Fahrenheit temperature environment and on the 3-week overwrite reuse of the tapes in most instances. The objective was to eliminate the equipment version problem: as newer versions of the LTO tape drives came into play, and by the second version down the road (which could still read media two versions back), start the retrofit of the old media onto newer-version media, then basically dump all the old equipment. You do what you have to do, and then modify process, procedures, and architecture to eliminate your problems. We also did a lot with process to eliminate humans interacting with the media (which has the greatest potential for tape loss), to monitor the equipment to ensure proper operation of the tape drives, and to use rate detection to adjust backups and eliminate shoe-shining of the tapes.

For the specific recovery I had external disk copies of Oracle back to Version 7. I sort of maintain my own version copies at home. (If anyone has versions older than 7.x, I would love to have them.) I also maintain older copies of Linux, HPUX, Solaris, and MS Windows. As for hardware, since no planning had been done at the time for maintaining older versions of equipment or for what you would really need to restore, I fell back on eBay for the actual drive, as I already had an old server tucked away.

The 1997 copy was actually an export instead of a regular backup, as someone had the foresight at the time to think about the complexity of a full database restore if you were not storing all the other components, although later I found CDs of a backup too. The recovery was actually off of 8mm DAT. I had a few later ones from CDs. Realistically speaking, some of the recoveries were simply luck, given the lack of care of the media. The longer time that you may be thinking of comes from most people thinking linearly instead of performing multiple tasks in parallel where possible. I have seen a backup team not start any work until the tape is actually in their hands, when there is associated work that could be performed in advance, like prepping the server and installing the relevant SW, then starting the restore as soon as the media arrives. I have had to push some teams at certain companies, as the media sat waiting for the DBA team to formally request that the backup team start the restore.
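For what it's worth, the classic export/import route looks roughly like this (a minimal sketch; the credentials, file names, and log names are placeholders, and the old exp/imp utilities have since been superseded by Data Pump):

    # Full export taken on the source system (syntax of the classic exp utility):
    exp system/manager FULL=Y FILE=full_1997.dmp LOG=full_1997_exp.log

    # Years later, import into a freshly built database on whatever compatible release you can stand up:
    imp system/manager FULL=Y FILE=full_1997.dmp LOG=full_1997_imp.log

The appeal for long-term retention is that the dump file only needs a working database and the import tool, not the original backup software, catalog, or block-compatible storage.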

In this case the 8mm tape was actually in a desk drawer I took over from the previous manager, so retrieving the item in question was a simple check off the list, as it was at my desk. If I had had to go through the 12K+ tapes that were stored at Iron Mountain prior to the year 2004, it would have taken much longer, because the previous team had performed an upgrade of the Symantec SW in 2003, had simply installed it net new, and the previous catalog was lost. Also, in 2000 Iron Mountain had upgraded their systems, and everything prior to 12/2000 was listed as ingested by Iron Mountain on that date. So if I had really wanted a specific tape that went to Iron Mountain before 2003, we would have had to retrieve every tape from Iron Mountain prior to 2004 and read them all to rebuild a catalog before we could find anything (estimated time would have been 9 months). There are lots of things that can go wrong in the infrastructure if you are not thinking about the long term. That includes disk storage, if your vendor is not using some technology to counter bit rot, and verification of data moves from point A to point B as they perform data moves to upgrade equipment.
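On that last point, a data move can be verified end to end with nothing fancier than checksums; a minimal sketch, where the array mount points are hypothetical:

    # Generate checksums on the source before the move (paths are examples):
    cd /mnt/old_array/backups && find . -type f -exec sha256sum {} + > /tmp/move.sha256

    # After copying to the new equipment, re-verify every file against the source list:
    cd /mnt/new_array/backups && sha256sum -c /tmp/move.sha256 | grep -v ': OK$'   # prints only mismatches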

Given that I had the media immediately available, it was just a matter of kickstarting the server and installing the OS, then installing Oracle: less than 6 hours. The longest wait was 2 days for the arrival of the drive. (That saved me from having to dig through the hundreds of boxes in my storage shed, where I know tape drives of all sorts remain buried to this day.)

Once we got the processes down, we had tape backups of the OS kickstart servers, with copies of all the OS images used along with all the installable database SW homes, so we could restore to any specific OS and version. You still sometimes have to deal with driver problems in the OS or relink problems with the Oracle homes. In some cases this is why a complete system image, including the database, may be stored. (Every situation differs somewhat in what you are trying to do.)
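On the relink point, a restored Oracle home usually needs its binaries rebuilt against the local OS libraries before it will start cleanly. A minimal sketch for a reasonably modern home (the path is an example; on very old releases such as 7.x you would instead run make against the ins_*.mk files):

    # Point the environment at the restored home (example path):
    export ORACLE_HOME=/u01/app/oracle/product/11.2.0/dbhome_1
    export PATH=$ORACLE_HOME/bin:$PATH

    # Rebuild the executables against the libraries on this OS image:
    $ORACLE_HOME/bin/relink all      # then review the relink log for errors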

As to small business versus large business, you have to have a process and understand the technology. Tape systems are not expensive for the small business if you use the smaller systems from the smaller tape-system vendors. As an example, you can buy a single-tape-drive desktop unit with 8 slots (I had one of the first ones back in 1996) that currently costs only about $4K, whereas anyone who purchases from the large-scale system vendors knows that the list price on a single LTO6 drive, say, is 5 times that. There are also removable disk units, or, as I have seen at some small companies, they simply attach a USB drive to the back of the server and then send it off to storage at Iron Mountain. The longer-term process is that the media needs to be pulled back and converted if necessary. If you are a business under regulatory compliance you will do what is necessary to accomplish the task. How well you accomplish the task varies, sadly, with the humans involved.

I remember writing backups to tape systems with mt and tar and manipulating the library directly with shell scripts before all the nice backup SW existed. It is all doable even for small companies, but you must have a process. Yes, it takes some extra human effort to perform the job. I have seen some really small businesses, such as a local community center (non-profit, which means spend no money if possible, as it should all go to helping the community), that had a single DAT drive on their Windows server and pulled the tape out every morning after the backup ran the night before, as a safety measure for their data. Yes, the kindly old lady at the computer dutifully took the tape backup from the previous night home every night in her purse and kept it in a cabinet for a couple of months before bringing the old tapes back.
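For anyone who never had to do it that way, the mt-and-tar routine was roughly this (a minimal sketch; the device name and path are examples):

    TAPE=/dev/nst0                          # non-rewinding tape device
    mt -f "$TAPE" rewind                    # position at the beginning of the tape
    tar -cvf "$TAPE" /u01/app/oracle/admin  # stream the directory tree to tape
    mt -f "$TAPE" offline                   # rewind and eject when done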

You have a variety of choices to make relative to cost, simplicity, risk, etc. There is not one right answer for every business, as each of your choices (disk, tape, cloud, nothing) has different priorities for the business.

I have worked at a lot of companies, working closely with the CFO on financial systems, and from the CFO's perspective the concept was: do what is necessary to ensure compliance and keep them out of jail. You present options with a risk analysis, and they will choose what level of risk versus cost they are willing to take. Your job is to implement the system with proper monitoring and processes to ensure reliability, including testing restores on a regular basis. Even the best system can be undermined by the humans or by neglect.

Matthew Parker
Chief Technologist
425-891-7934 (cell)
Dimensional.dba_at_comcast.net

From: oracle-l-bounce_at_freelists.org [mailto:oracle-l-bounce_at_freelists.org] On Behalf Of Paresh Yadav
Sent: Thursday, September 18, 2014 9:15 PM
To: iggy_fernandez_at_hotmail.com
Cc: kmoore_at_zephyrus.com; Oracle-L_at_freelists.org
Subject: Re: Backups versus snapshots

Thanks, Matthew, for sharing your valuable experience at Amazon. As Hemant mentioned, you must have preserved all the associated tech (a tape library to read the old tapes, a machine and OS version that can run the old db software version, db software at a version that can restore the backups, etc.). And this needs to be done for all the hardware and software, and every version of it, that gets used over a period of time. Amazon can afford the infrastructure and manpower required to maintain this, but how does an SMB meet a 7-year regulatory retention requirement?

What was the typical time to recover a 1997 Oracle db backup (probably Oracle version 7.x) in 2010, after having to install Oracle 7.x software on a compatible OS and hardware? This involves not only locating the backups but also the software install media and the hardware that can run the software.

Thanks
Paresh
416-688-1003

On Fri, Sep 19, 2014 at 12:04 AM, Iggy Fernandez <iggy_fernandez_at_hotmail.com> wrote:

snapshots or backups are just means to an end; that is, meeting the availability and regulatory requirements within the available budget. if, for example, you have regulatory requirements to store data for a certain number of years, then you could copy the contents of the snapshots to tape.

re: if the database goes poof then the snapshot is gone as well

if the database goes poof, then the snapshot remains

iggy

> To be clear, the snapshots are not physical copies of the database. They only track the differences between the database at the time of the snapshot and the current time. So if the database goes poof then the snapshot is gone as well.

--
http://www.freelists.org/webpage/oracle-l
Received on Mon Sep 22 2014 - 04:05:17 CEST
