RE: recover standby database failure

From: Carel-Jan Engel <careljan_at_dbalert.eu>
Date: Thu, 15 May 2008 23:03:29 +0200
Message-Id: <1210885410.3522.23.camel@lagavulin.dbalert.eu>


At a customer site running Standard Edition with a 'scripted' archive log shipping standby, they want a true DR test. This means: reverse roles, run the business on the DR system for a couple of hours, and reverse roles again. Any site should do this, but most of them don't dare. Why spend money on a DR site if you don't trust it?

Both databases and instances have the same name. Disk layout and naming of mountpoints are identical at primary and standby. 'Awareness' of being a primary or standby is in the control file. Datafiles are identical at primary and standby, if recovery is successful.
You can test the standby by opening it read only, but that doesn't allow the business to use it.

'Activating' the standby would require re-instantiating the primary afterwards. Given the size of the database and the available bandwidth, that is not an option just for testing purposes.

The DR test goes along the following path:

  1. Shut down the primary (shutdown normal).
  2. Ship the last archived redo logfiles to DR.
  3. Make sure the last archived redo log files have been applied at the standby.
  4. Make backups of all control files, parameter files and online redo log files (yes, I wrote online redo log files) at both primary and standby. Maybe you can skip the ORLFs, but I haven't tested that.
  5. 'Swap' control files, parameter files and ORLFs between primary and standby. This keeps the amount of data exchanged over the WAN to a minimum.
  6. Start the instance at the standby as if it were the primary. Actually, because its control file is now that of the primary, it is the primary.
  7. Start the instance at the primary site as if it were the standby. The same reasoning about the control file applies.
  8. Start the listener and applications, and let the users do what they do when they use the system.
  9. Run the archive redo log copy scripts in the reversed direction, from DR to primary.
  10. After the test, go to step 1 to get back to normal.
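
Steps 4 and 5 above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the script the local DBA wrote: the two sites are modelled as two local directories (in reality the exchange goes over the WAN), the file-name patterns in ROLE_FILE_PATTERNS are hypothetical, and both instances are assumed to be shut down cleanly before anything is copied.

```python
import shutil
from pathlib import Path
from typing import List

# Hypothetical patterns. A real site would list its actual control files,
# parameter files and online redo log files explicitly.
ROLE_FILE_PATTERNS = ("*.ctl", "*.ora", "redo*.log")


def role_files(site_dir: Path) -> List[Path]:
    """Return the files in site_dir that match the role-file patterns."""
    found = set()
    for pattern in ROLE_FILE_PATTERNS:
        found.update(p for p in site_dir.glob(pattern) if p.is_file())
    return sorted(found)


def backup_role_files(site_dir: Path, backup_dir: Path) -> None:
    """Step 4: copy every role file into backup_dir before touching it."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    for path in role_files(site_dir):
        shutil.copy2(path, backup_dir / path.name)


def swap_role_files(primary_dir: Path, standby_dir: Path, backup_root: Path) -> None:
    """Steps 4 and 5: back up both sites, then exchange their role files.

    Datafiles are never touched; only the small role files move.
    """
    primary_backup = backup_root / "primary"
    standby_backup = backup_root / "standby"
    backup_role_files(primary_dir, primary_backup)
    backup_role_files(standby_dir, standby_backup)

    # Remove the originals, then fill each site from the OTHER site's backup.
    for site_dir in (primary_dir, standby_dir):
        for path in role_files(site_dir):
            path.unlink()
    for path in role_files(standby_backup):
        shutil.copy2(path, primary_dir / path.name)
    for path in role_files(primary_backup):
        shutil.copy2(path, standby_dir / path.name)
```

Because the backups stay in place under backup_root, a failed test can be undone by copying each site's own backup back. Running the same function a second time (step 10) swaps the files again and returns both sites to their normal roles.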

After testing the whole thing with a test database, the local DBA scripted it. Now the customer has an SE archive log shipping standby with switchover capabilities. No cloning necessary.

About this test database: I always keep a test database, as small as possible, at every production system with an HA setup, just to be able to test all infrastructure components involved. This test database has a standby as well. It is useful for training, testing firewall rules and other LAN/WAN issues, testing anything else regarding the HA setup, and gaining experience and self-confidence.

Best regards,

Carel-Jan Engel

===
If you think education is expensive, try ignorance. (Derek Bok)
===

On Thu, 2008-05-15 at 11:24 -0400, Mark W. Farnham wrote:

> Most likely the operation of opening a standby manually managed as described
> is destructive unless you cancel recovery, shut down, copy clone and do a
> startup rename resetlogs on the clone to test if you have in fact correctly
> manually managed a "roll your own" standby. Then if the open is successful
> you probably need to run a lot of reports to make sure the recovery test was
> actually successful rather than only apparently successful. Why I might not
> be satisfied unless all the weekly and month end reports appeared to be
> perfect! And who better to evaluate whether the reports look correct than
> the folks who otherwise might be running those reports on the production
> primary database?
>
> Since manually managing a recovery standby is error prone, I do recommend
> executing this copy clone open frequently. The renamed database can be used
> as a "frozen reporting database" while recovery is resumed on the standby of
> the original name. If you're rolling your own rather than using Oracle's
> software to manage a standby, there are several things you can do to destroy the
> validity of the standby (such as unlogged actions on the "primary"). Of
> course if you regularly test your standby by doing a clone/rename/open
> resetlogs, that normally will decouple you from a simultaneous actual
> problem with your "primary." Over time you can do the math of frequency of
> errors detected with the standby process versus frequency of problems of the
> "primary" to determine your risk and ask management whether they want to
> spend more money to reduce that risk.

<snip>

--
http://www.freelists.org/webpage/oracle-l
Received on Thu May 15 2008 - 16:03:29 CDT
