Re: SunFire Server Hangs

From: Mladen Gogala <mgogala_at_yahoo.com>
Date: Sun, 08 Mar 2015 00:50:46 -0500
Message-ID: <54FBE336.5030708_at_yahoo.com>



Is the whole server hanging or just Oracle? Can you ssh into the server? Unfortunately Solaris is not unbreakable, like Linux.
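
A quick way to tell a dead host from a half-alive one is to probe a couple of TCP ports from another machine. What follows is only a minimal sketch, assuming the default ports 22 (sshd) and 1521 (Oracle listener). The helper names `port_open` and `probe` are made up for illustration.

```python
import socket

def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False

def probe(host, ports=(22, 1521)):
    """Map each port to whether a TCP connect succeeded (ports are assumed defaults)."""
    return {port: port_open(host, port) for port in ports}
```

Calling `probe("dbhost")` with a hypothetical host name returns a dict keyed by port. A successful connect only shows that the kernel's TCP stack is still answering. A hung sshd can still complete the handshake, so an actual login remains the real test.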

On 3/7/2015 11:49 PM, MacGregor, Ian A. wrote:
> It’s all internal storage, not using ASM. The Oracle version is
> 11.2.0.3.
>
> Ian
>> On Mar 6, 2015, at 4:50 PM, Andrew Kerber <andrew.kerber_at_gmail.com> wrote:
>>
>> Sounds like a problem with storage to me. Is it on local storage or
>> SAN? ASM or file system? Also, what Oracle version?
>>
>> Sent from my iPad
>>
>> On Mar 6, 2015, at 6:31 PM, MacGregor, Ian A. <ian_at_slac.stanford.edu> wrote:
>>
>>> Over the past 26 months or so we have had three SunFire x86 servers
>>> hang, two of them quite recently, within a few weeks of each other. The
>>> servers show no signs of high activity before the freeze. None of
>>> the monitoring scripts we run indicate any problem at all before
>>> the freeze. When it happens the machine hangs; it still responds to
>>> ping and can be reset through the SP.
>>> Looking at the boot events, there is a system downtime which
>>> matches when the freeze occurred.
>>>
>>> It has a very bad impact on Oracle, which suffers from lost writes:
>>>
>>> SLACQA_ora_13043.trc:ORA-00600: internal error code, arguments:
>>> [kcrf_resilver_log_1], [0x0E0CF5390], [2], [], [], [], [], [], [],
>>> [], [], []
>>>
>>> This error is mostly associated with power outages. In this case
>>> there was no loss of power.
>>>
>>> I reported the first machine a while back, explaining that automatic
>>> failover failed to occur when a machine is not quite dead.
>>>
>>> The more recent failures have happened on machines which didn’t
>>> have a physical standby.
>>>
>>> I could not find the article about fixing this problem by
>>> applying the current redo log. All I could find were articles
>>> saying the database was unrecoverable. After ensuring the backups
>>> were valid, I proceeded.
>>>
>>> The database reported a corrupt rollback segment. I switched to
>>> manual undo, made sure there were no partially available
>>> segments, created a new undo tablespace, and successfully opened the
>>> database.
>>>
>>> We had one other problem: an index disagreed with its table. The
>>> index was missing a row. We were able to ascertain the program
>>> which was using the index. It was a CDC job which was no longer
>>> needed. Oracle was perfectly happy with the situation. There was
>>> no reported corruption unless the affected index block was read by
>>> a query. Another problem was that the index and table are bootstrap
>>> objects. We eventually used impdp/expdp to move to another
>>> machine. But I am getting ahead of myself.
>>>
>>>
>>> When I brought the database up with the new undo tablespace, all
>>> Oracle scheduler jobs reported they could not open a wallet. I’m
>>> not sure which wallet is referenced here. Also clients which
>>> called PL/SQL programs to open a wallet could not open a different
>>> wallet involved in authenticating to AD. However, if the code was
>>> executed on the database server itself, the wallet opened without a
>>> problem. Restarting both the database and the listener fixed the
>>> problem. I have not been able to find any information on this.
>>>
>>> On another database the corrupt rollback segments included some
>>> partially available ones. This was a QA database and was refreshed
>>> from backup. Again the problem with the wallet occurred.
>>>
>>> On the third database both active redo logs were corrupt. It too
>>> was recovered from backup. It also had the wallet problem.
>>>
>>> So even with no loss of power (yes, the RAID cache batteries were
>>> good) and with multiplexed redo logs, we needed to recover from
>>> backup on two databases, and on the other had a bootstrap table and
>>> index in disagreement.
>>>
>>> Ian MacGregor
>>> SLAC National Accelerator Center
>>>
>
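
With several databases hit the same way, it can help to tally the ORA-00600 first arguments across trace files. A minimal sketch, assuming the lines come from something like grep over the trace directory, in the `file:message` shape quoted above. The names `parse_ora600` and `summarize` are made up for illustration.

```python
import re

# Matches grep-style output: "<file>:ORA-00600: internal error code, arguments: [..], [..]"
ORA600 = re.compile(
    r"^(?P<file>[^:]+):ORA-00600: internal error code, arguments:\s*(?P<args>.*)$"
)

def parse_ora600(line):
    """Return (trace_file, [arguments]) for an ORA-00600 line, or None if no match."""
    # Collapse runs of whitespace so a line wrapped by the mail client still matches.
    m = ORA600.match(" ".join(line.split()))
    if m is None:
        return None
    args = re.findall(r"\[([^\]]*)\]", m.group("args"))
    return m.group("file"), args

def summarize(lines):
    """Count ORA-00600 occurrences keyed by their first argument."""
    counts = {}
    for line in lines:
        parsed = parse_ora600(line)
        if parsed and parsed[1]:
            key = parsed[1][0]
            counts[key] = counts.get(key, 0) + 1
    return counts
```

Fed the line quoted above, `parse_ora600` gives the trace file name and `kcrf_resilver_log_1` as the first argument, which is the string to search for in the support notes.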

-- 
Mladen Gogala
Oracle DBA
http://mgogala.freehostia.com


--
http://www.freelists.org/webpage/oracle-l
Received on Sun Mar 08 2015 - 06:50:46 CET
