Re: SunFire Server Hangs

From: MacGregor, Ian A. <ian_at_slac.stanford.edu>
Date: Sun, 8 Mar 2015 04:49:46 +0000
Message-ID: <A6770168-AB62-4F95-999B-C90524358710_at_slac.stanford.edu>



It’s all internal storage, not using ASM. The Oracle version is 11.2.0.3.

Ian
On Mar 6, 2015, at 4:50 PM, Andrew Kerber <andrew.kerber_at_gmail.com> wrote:

Sounds like a problem with storage to me. Is it on local storage or SAN? ASM or file system? Also, what Oracle version?

Sent from my iPad

On Mar 6, 2015, at 6:31 PM, MacGregor, Ian A. <ian_at_slac.stanford.edu> wrote:

Over the past 26 months or so we have had three SunFire x86 servers hang, two of them quite recently, within a few weeks of each other. The servers show no signs of high activity before the freeze. None of the monitoring scripts we run indicate any problem at all before the freeze. When it happens the machine hangs; it still responds to ping and can be reset through the SP. Looking at the boot events, there is a system downtime which matches when the freeze occurs.

It has a very bad impact on Oracle, which suffers from lost writes:

SLACQA_ora_13043.trc:ORA-00600: internal error code, arguments: [kcrf_resilver_log_1], [0x0E0CF5390], [2], [], [], [], [], [], [], [], [], []

This error is mostly associated with power outages. In this case there was no loss of power.
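
For anyone wanting to scan a trace directory for the same signature, here is a minimal sketch in Python that pulls the trace file name and the non-empty arguments out of a line like the one above. It assumes grep-style "file:message" input; the function name parse_ora600 is made up for illustration and is not part of any Oracle tooling.

```python
import re

# Assumed input shape (grep-style): "<trace file>:ORA-00600: internal error code, arguments: [a], [b], ..."
# The "<trace file>:" prefix is optional.
ORA600 = re.compile(
    r"^(?:(?P<file>[^:\s]+):)?ORA-00600: internal error code, arguments: (?P<args>.*)$"
)

def parse_ora600(line):
    """Return the trace file name and argument list, or None if the line is not an ORA-00600."""
    m = ORA600.match(line.strip())
    if m is None:
        return None
    # Each argument sits in its own pair of square brackets.
    args = re.findall(r"\[([^\]]*)\]", m.group("args"))
    # Drop the trailing empty slots Oracle pads the list with.
    while args and args[-1] == "":
        args.pop()
    return {"file": m.group("file"), "args": args}

line = (
    "SLACQA_ora_13043.trc:ORA-00600: internal error code, arguments: "
    "[kcrf_resilver_log_1], [0x0E0CF5390], [2], [], [], [], [], [], [], [], [], []"
)
result = parse_ora600(line)
print(result["file"], result["args"])
```

The first argument (here kcrf_resilver_log_1) is the one usually used when searching for matching notes.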

I reported the first machine a while back, explaining that automatic failover failed to occur when a machine is not quite dead.

The more recent failures have happened on machines which didn’t have a physical standby.

I could not find the article about fixing this problem by applying the current redo log. All I could find were articles saying the database was unrecoverable. After ensuring the backups were valid, I proceeded.

The database reported a corrupt rollback segment. I switched to manual undo, made sure there were no partially available segments, created a new undo tablespace, and successfully opened the database.

We had one other problem: an index disagreed with its table. The index was missing a row. We were able to ascertain the program which was using the index. It was a CDC job which was no longer needed. Oracle was perfectly happy with the situation. There was no reported corruption unless the affected index block was read by a query. Another problem was that the index and table are bootstrap objects. We eventually used impdp/expdp to move to another machine. But I am getting ahead of myself.

When I brought the database up with the new undo tablespace, all Oracle scheduler jobs reported they could not open a wallet. I’m not sure which wallet is referenced here. Also, clients which called PL/SQL programs to open a wallet could not open a different wallet involved in authenticating to AD. However, if the code was executed on the database server itself, the wallet opened without a problem. Restarting both the database and the listener fixed the problem. I have not been able to find any information on this.

On another database the corrupt rollback segments included some partially available ones. This was a QA database and was refreshed from backup. Again the problem with the wallet occurred.

On the third database both active redo logs were corrupt. It too was recovered from backup. It also had the wallet problem.

So even with no loss of power (yes, the RAID cache batteries were good) and with multiplexed redo logs, we needed to recover from backup on two databases, and on the other had a bootstrap table and index in disagreement.

Ian MacGregor
SLAC National Accelerator Center

--
http://www.freelists.org/webpage/oracle-l
Received on Sun Mar 08 2015 - 05:49:46 CET
