Fwd: anyone ran into ORA-00752 on dataguard node with ASM/11.1.0.7 Linux?

From: Zhu,Chao <zhuchao_at_gmail.com>
Date: Thu, 7 Jul 2011 23:45:15 +0800
Message-ID: <CABs0AEukN7UKci97Rsdy+MmpqNGUwNuHBnvoSp0qZyRDJY7N_Q_at_mail.gmail.com>

Forwarding some offline discussion with a list member to keep others updated; I have removed some possibly confidential information.

In our case, the Oracle SR support engineer thought it might be related to the ASM rebalance bug. My point is:
It should not be related to ASM rebalance. We are on local disks (DL380 G7 with 16 disks in 8 RAID 1 disk sets), so there should be no such rebalance work going on.
And the corruption happens often: sometimes once or twice a day, sometimes once a week.
We have been fighting this for 2 months, and I decided to try an upgrade to 11.2.0.2.
We are now on a somewhat odd combination: 11.2.0.2 ASM + 11.1.0.7.4 RDBMS (on the assumption that 11.2.0.2 ASM would be more stable and bug-free than 11.1 ASM; now I wonder whether this was a good idea).

Also, regarding the theory that the primary itself is corrupted: your analysis is very detailed and convincing, but it misses one point. If it is indeed a lost write on the primary, then the primary database (some datafile) is already corrupted.
We have already bounced it once or twice, and nothing was reported.

Also, when we saw such corruption, we refreshed the corrupted datafile from primary to standby, and everything worked fine: dbv/fts/analyze cascade (the last one is still to be done; I forgot about that).
If the primary were indeed corrupted, the standby would not recover either after being refreshed from the primary (can someone confirm?), and the only solution would be, as you mentioned, to fail over to a clean/consistent copy.

So, my conclusion is that Oracle (or the primary database host) is doing something funky.
What we have done:
1. Suspected an ASM bug with redo, so moved the redo logs on the primary out of the ASM disks; still fails.
2. Also moved the standby redo out of the ASM disks; still fails.
3. Also moved one whole standby out of ASM to EXT3; still fails.
4. Set db_lost_write_protect to none; still can't recover through.

Also, dbv/fts on the primary has been clean before and after the standby issue, even through a bounce (if it were indeed a lost write, wouldn't a bounce make the problem obvious on the primary as well?).
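
For anyone who wants to cross-check the "all standbys fail at the same block" observation from the quoted message below, one option is to pull the (file#, block#) pairs out of the ORA-10567 lines in each standby's alert log and compare them. The following is only a minimal sketch in Python. It assumes the alert log text uses the same ORA-10567 format as the excerpt quoted below; the function names and the standby names passed in are hypothetical and are not part of any Oracle tool.

```python
import re

# Matches lines such as:
#   ORA-10567: Redo is inconsistent with data block (file# 84, block# 266478)
ORA_10567 = re.compile(r"ORA-10567:.*?\(file#\s*(\d+),\s*block#\s*(\d+)\)")


def failing_blocks(alert_log_text):
    """Return the set of (file#, block#) pairs named in ORA-10567 lines."""
    return {(int(f), int(b)) for f, b in ORA_10567.findall(alert_log_text)}


def same_failure_everywhere(logs_by_standby):
    """True if every standby reports the same non-empty set of blocks.

    logs_by_standby maps a (hypothetical) standby name to its alert log text.
    """
    block_sets = [failing_blocks(text) for text in logs_by_standby.values()]
    if not block_sets or not block_sets[0]:
        return False
    return all(s == block_sets[0] for s in block_sets)


# Sample text copied from the alert log excerpt in the quoted message.
sample = (
    "ORA-00752: recovery detected a lost write of a data block\n"
    "ORA-10567: Redo is inconsistent with data block (file# 84, block# 266478)\n"
    "ORA-10564: tablespace USERS\n"
)
print(failing_blocks(sample))
```

If the sets differ between standbys, that would point more towards the individual standby hosts; identical sets everywhere are what one would expect if the problem originates in the redo or on the primary side.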

Thanks for your detailed analysis!

> Ask on mos if all the three standby are failing at the same time
> By the way is your primary facing the same corruption issue
>
> On 6 Jul 2011 20:35, "Zhu,Chao" <zhuchao_at_gmail.com> wrote:
>
> hi, guys
> We ran into weird corruption on the standby nodes in our first production
> Oracle/Linux deployment, which is pretty frustrating.
> I would appreciate it if anyone who has run into similar issues could share
> their solution. Working with Oracle sucks: no progress in the past 2
> months, and now they are pointing fingers at the HW (HP DL380 G7), which
> obviously is not the case, as all 3 standbys consistently fail at the same
> block#/log#, and we tried moving redo to a different
> location/ASM/filesystem (datafiles on the primary are still on ASM).
>
>
> Details:
> Application: Oracle RUEI on Linux(redhat 5.5)
> RDBMS: 11.1.0.7.4
> ASM: 11.2.0.2
> primary is fine;
> dataguard node:
>
> Tue Jul 05 18:10:00 2011
>
> Slave exiting with ORA-752 exception
>
> Errors in file
> /oracle/RUEICEM/home/diag/rdbms/rueicem/RUEICEM/trace/RUEICEM_pr06_18594.trc:
>
> *ORA-00752: recovery detected a lost write of a data block*
>
> *ORA-10567: Redo is inconsistent with data block (file# 84, block# 266478)*
>
> ORA-10564: tablespace USERS
>
> ORA-01110: data file 84: '+DATA/rueicem/datafile/users.404.752266499'
>
> ORA-10561: block type 'TRANSACTION MANAGED DATA BLOCK', data object#
> 5082441
>
>
> or:
>
> Errors in file
> /oracle/RUEICEM/home/products/diag/rdbms/rueicem/RUEICEM/trace/RUEICEM_pr0e_14272.trc
> (incident=101514):
>
> *ORA-00600: internal error code, arguments: [3020], [84], [266478],
> [352588014], [], [], [], [], [], [], [], []*
>
> *ORA-10567: Redo is inconsistent with data block (file# 84, block# 266478)*
>
> ORA-10564: tablespace USERS
>
> ORA-01110: data file 84: '+DATA/rueicem/datafile/rueicem_userruei01_35.dbf'
>
> ORA-10561: block type 'TRANSACTION MANAGED DATA BLOCK', data object#
> 5082441
>
> Incident details in:
> /oracle/RUEICEM/home/products/diag/rdbms/rueicem/RUEICEM/incident/incdir_101514/RUEICEM_pr0e_14272_i101514.trc
>
> Some of the investigation we have done:
>
> 1. All 3 standby databases failed at the same log#/block#, including
> the standby with datafiles and redo all on filesystem.
>
> 2. The standby database can't be recovered even with
> db_lost_write_protect = none.
>
> 3. The standby can still be opened after it fails at that
> block#/log#, and DBV is still clean for that datafile at that point, but
> there is no way to recover through.
>
> 4. The standby can be recovered through with "recover standby
> database allow 1 corruption", but afterwards dbv reports a corrupted
> block and fts fails on a corrupted block.
>
> Thanks for sharing. If we don't make any progress, we will have to
> upgrade the RDBMS to 11.2 as well (as it is a 3rd-party application, we
> don't really want to do that).
>
>
> --
> Regards
> Zhu Chao

-- 
Regards
Zhu Chao

--
http://www.freelists.org/webpage/oracle-l

Received on Thu Jul 07 2011 - 10:45:15 CDT
