Testing rman recovery from block corruption in an 11g ASM environment

From: hpuxrac <johnbhurley_at_sbcglobal.net>
Date: Sun, 22 Mar 2009 16:08:07 -0700 (PDT)
Message-ID: <670bdc4f-663e-4b3c-96dd-8c47189d72b5_at_p11g2000yqe.googlegroups.com>

Disclaimer and caveat:

Testing database recoveries is a DBA responsibility. Each time you migrate environments this really must be done ... you don't want to be in a place where you tell management ... well the last time we tried this ... several years ago on a different os ... different oracle release ... different storage ... it worked ok.

Corrupting database blocks on purpose ... to setup a database recovery attempt ... is really something that goes kind of above and beyond normal recovery testing. After all ... if you want to believe oracle ... it should all be working right? In my mind however ... you really should test this from time to time.

Obviously testing like this can cause problems real problems so before you attempt something like this you want to have good backups and perform your first sets of testing on a test throwaway type of database before thinking about doing testing on anything more than test type system.

So anyways ... when you are thinking about intentionally damaging blocks in the database ASM throws some more wrinkles in the mix. Most of my environments I have run using RAW and to damage blocks in an environment like that dd is the tool of choice usually ... it just gets a little tricky at times in a RAW environment figuring out offsets and how many blocks to skip before throwing the old /dev/zero hammer down.

At some point I will write up what I found about the additional complications in testing this kind of thing in an ASM environment
( probably into a word document ) ... including specific commands
etc. If you want the writeup please let me know by direct email.

Here's my quick overview of one way this can be tested in an ASM environment. ( My environment was 64 bit OEL 5.1 using database and ASM instance on EMC Clariion )

  1. Have an rman disk backup available and a database in archivelog mode using ASM
  2. create a small tablespace ( MINI ) in ASM
  3. create a table in the new tablespace and insert some rows in it
    ( maybe ( col1 char(2000), col2 char(2000) col3 char(2000) so each row
    is in a separate block ) then commit.
  4. Get a new incremental that updates the previous one ( will get a new base copy of the new tablespace ).
  5. Take the new tablespace offline
  6. Use asmcmd cp command to copy the new tablespace to a file system.
  7. Use some kind of hex editor to damage one or more of the blocks in the copy of the tablespace in the file system ( missed my old dd command here but maybe you can do it with dd? ).
  8. Use asmcmd cp command to put the copied tablespace back into diskgroup.
    • Note that at this point there will be 2 versions of the datafile for the new tablespace in ASM. Limitations currently on copying back in using asmcmd force you to do something like this ... ( you cannot put the whole MINI.266.882111791 enchilada directly back into ASM via asmcmd cp ).

asmcmd <<EOF
  cp -f /home/oracle/temp_rec/MINI.266.682111791 +PROD_DG1/prod/ datafile/MINI

9) Now you have to alter to fix the name of the datafile ... something like ...

1' to '+DISKGROUPNAME/ORACLE_SID/datafile/MINI.268.682115173';

  1. alter the tablespace back online
  2. switch logfiles ( why not )
  3. try to do an rman backup or validate ... should get errors if you whacked MINI correctly ...
    • Should be able to use various views to see what blocks rman ( or validate ) has flagged ...
  4. rman ... one way is blockrecover corruption list

Sorry this posting is so long but thought it would be good to have something out somewhere. I thought it would be easy to locate someone else's step by step guide to testing block recovery in an ASM environment but I did not find much readily available or at least bailed out and decided to figure out steps in detail myself.

There's probably some other better ways of doing this ... it is pretty cumbersome to offline it .. asmcmd cp it out ... damage it ... cp it back in ... alter tablespace it ... and online it ... but my procedure does work. Received on Sun Mar 22 2009 - 18:08:07 CDT

Original text of this message