Re: Lost Writes
Date: Thu, 6 Aug 2009 20:17:31 -0600
Sadly, no. A full export will really only assure you that your database contains no *physically* corrupt blocks, i.e., blocks that are unreadable or where the checksums don't match the data.
But in the case of a "lost write" it is entirely *possible* that each and every block in your database will be completely "valid", with completely legitimate checksums. *Some* of them, though, might be a little bit older than they should be.
Imagine taking two "snapshot" backups of your database, five seconds apart. Restore the database from the later snapshot, and randomly then select 100 to 1000 blocks from the older snapshot and copy them the the same locations in the datafiles restored from the new snapshot. You now have a database image that resembles what you *might* encounter following "lost writes". Individually, each block is "valid", and has a legitimate checksum. But some of the blocks simply do not contain the data they *should*.contain.
The discrepancies could be detected while applying redo -- when rows that should be there are not, I imagine -- or they might be detected when queries report that ROWIDs contained in the index do not exist in the table. They might not get noticed until you find your queries return different resultsets when they use different indexes (good luck with that). They may not get detected at all.
Happily, I have encountered this only once - when a firmware error caused a disk array to crash, apparently with the loss of the "non-volatile" cache. In this case, the only symptoms we ever saw were queries reporting that rowids found in certain indexes were not present in the tables. And *this* was not noted until days after the failure of the disk array.
On 10g, an rman BACKUP VALIDATE DATABASE CHECK LOGICAL can detect some of
these errors (e.g., the case where indexes don't match the tables) but
probably won't detect others. As I recall, the documentation for this says
(to paraphrase) :
*BACKUP VALIDATE CHECK LOGICAL will detect all types of corruption that are possible to detect.*
Unfortunately, it does not detect any of the other ones. It is still a useful precaution, though, one that I used myself only last week following a myterious failure of the NFS appliance hosting one of the 9i databases I manage. In this case, at least, RMAN gave me a clean bill of health. I might have been tempted to do a full database recovery anyway, but in the absence of a clear problem and following 12 hours of downtime, I would never have been given permission.
Anyway, this is what I recall on the subject. Now I will sit back and wait for some of the more knowledgeable people on this list to point out where I have gone wrong. :-)
On Thu, Aug 6, 2009 at 2:09 PM, Charlotte Hammond < charlottejanehammond_at_yahoo.com> wrote:
> Thanks to everyone who replied.
> We have replaced both the drives and the controller in this box. But in
> answer to the questions, RMAN did not report any errors with check logical
> and exp to /dev/null equally did not produce any errors. Does this mean
> that the lost writes / corruption was in the archive logs rather than the
> database backup? If so, why didn't RMAN refuse to back them up?
> We didn't attept to repair the corruption as it was a test database and
> transactions after the ORA-600  were disposable.
> Unfortunately DB_LOST_WRITE_PROTECT only helps if you're on 11g, have a
> standby database and can cope with the performance penalty - none of this
> applies here.
> ----- Original Message ----
> From: Robert Freeman <robertgfreeman_at_yahoo.com>
> To: Richard.Goulet_at_parexel.com; charlottejanehammond_at_yahoo.com;
> Sent: Thursday, August 6, 2009 5:26:57 PM
> Subject: Re: Lost Writes
> RMAN should detect a corrupt block during a backup and report it. Of
> course, if the block is not in use then it would not be an issue.
> Did you try a backup validate check logical database from RMAN?
> Did you check v$backup_corruption or v$database_block_corruption to see if
> anything was present in that view ?
> The corrupt blocks you had, were they actually in use? Or allocated and not
> assigned to any objects?
> Did you try to repair the block?
> Robert G. Freeman
> Oracle ACE
> Oracle Database 11g RMAN Backup and Recovery (Oracle Press) - ON IT'S WAY
> OCP: Oracle Database 11g Administrator Certified Professional Study Guide
> Oracle Database 11g New Features (Oracle Press)
> Portable DBA: Oracle (Oracle Press)
> Oracle Database 10g New Features (Oracle Press)
> Oracle9i RMAN Backup and Recovery (Oracle Press)
> Oracle9i New Features (Oracle Press)
> Other various titles out of print now...
> Blog: http://robertgfreeman.blogspot.com
> The LDS Church is looking for DBA's. You do have to be a Church member in
> good standing. A lot of kind people write me, concerned I may be breaking
> the law by saying you have to be a Church member. It's legal I promise! :-)
> ----- Original Message ----
> From: "Goulet, Richard" <Richard.Goulet_at_parexel.com>
> To: charlottejanehammond_at_yahoo.com; oracle-l_at_freelists.org
> Sent: Thursday, August 6, 2009 9:25:46 AM
> Subject: RE: Lost Writes
> Best idea in this case is to check that all of your data is available
> and the best way that I know of to do that is to do an exp full=yes to
> /dev/null. If that finishes cleanly then your 99% sure that everything is
> alright. After that I would take another backup.
> Dick Goulet
> Senior Oracle DBA
> PAREXEL International
> -----Original Message-----
> From: oracle-l-bounce_at_freelists.org [mailto:oracle-l-bounce_at_freelists.org]
> On Behalf Of Charlotte Hammond
> Sent: Wednesday, August 05, 2009 5:29 PM
> To: oracle-l_at_freelists.org
> Subject: Lost Writes
> Hi All,
> We recently had a disk fault on one of our test servers. After replacing
> the drive I recovered the database using RMAN only to encounter an ORA-600
>  part way relating to blocks in the system tablespace. I had to do a
> point-in-time restore to get the database back. We lost transactions (but
> since it was test not a big deal).
> As I understand it this problem is possibly due to lost writes (i/o not
> going to disk but saying it has) due to the disk drive going bad (it may be
> due to other things, but lets assume that's the problem for the sake of
> discussion). What I'm trying to understand is:
> 1. Will this only manifest itself during a recovery or will I get a similar
> error if I try to update a stale block using SQL?
> 2. Given I used the same backup but terminated the recovery earlier,
> presumably the stale block is still there, I just avoiding applying recovery
> to it. Do I now have a "bad" database? - should I exp/imp? DBV shows
> nothing amiss but presumably these will not be seen as logical corruptions
> as the blocks are simply old, not badly formed.
> 3. I see 11g can detect lost writes in conjunction with a standby. Is
> there anything I can do on stand-alone 10g? I'm now very paranoid I'll get
> the same thing in production and really loose data.
-- Cheers, -- Mark Brinsmead Senior DBA, The Pythian Group http://www.pythian.com/blogs -- http://www.freelists.org/webpage/oracle-lReceived on Thu Aug 06 2009 - 21:17:31 CDT