RE: SunFire Server Hangs

From: MacGregor, Ian A. <ian_at_slac.stanford.edu>
Date: Mon, 9 Mar 2015 16:52:29 +0000
Message-ID: <f7422223da8c452fa19720bc288e2112_at_exch13-mail03.win.slac.stanford.edu>



Well worth considering. I'm not an expert on the machine architecture, but I think there is a separate path for the system disks vs. the "user" disks. If so then it seems that at least two paths would need to be unavailable.

-----Original Message-----
From: Andrew Kerber [mailto:andrew.kerber_at_gmail.com]
Sent: Monday, March 09, 2015 9:14 AM
To: MacGregor, Ian A.
Cc: oracle-l_at_freelists.org
Subject: Re: SunFire Server Hangs

I can't figure out a mechanism that may cause this, but it sounds like the operating system is losing its path to the storage.

Sent from my iPad

> On Mar 9, 2015, at 10:21 AM, MacGregor, Ian A. <ian_at_slac.stanford.edu> wrote:
>
> It's the entire server which hangs. Once we can access the machine, after the reset through the system process, a check of the system downtime shows the time the machine froze. It's as if the server panicked but did not make it all the way down, as it remains pingable. When it happens, all programs which might provide some information as to the cause stop. However, up to that point things look very normal indeed.
>
> All the storage is onboard. These machines accommodate 16 drives internally.
>
> None of the machines which has had this problem is clustered. They are dedicated database machines. The OS is Solaris 10.
>
> Ian MacGregor
> SLAC National Accelerator Center
>
> -----Original Message-----
> From: oracle-l-bounce_at_freelists.org [mailto:oracle-l-bounce_at_freelists.org] On Behalf Of Mladen Gogala
> Sent: Saturday, March 07, 2015 9:51 PM
> To: oracle-l_at_freelists.org
> Subject: Re: SunFire Server Hangs
>
> Is the whole server hanging or just Oracle? Can you ssh into the server? Unfortunately Solaris is not unbreakable, like Linux.
>
> On 3/7/2015 11:49 PM, MacGregor, Ian A. wrote:
>
>
> It’s all internal storage, not using ASM. The Oracle version is 11.2.0.3.
>
> Ian
>
>
> On Mar 6, 2015, at 4:50 PM, Andrew Kerber <andrew.kerber_at_gmail.com> wrote:
>
> Sounds like a problem with storage to me. Is it on local storage or SAN? ASM or file system? Also, what Oracle version?
>
> Sent from my iPad
>
> On Mar 6, 2015, at 6:31 PM, MacGregor, Ian A. <ian_at_slac.stanford.edu> wrote:
>
>
>
> Over the past 26 months or so we have had three SunFire x86 servers hang, two quite recently, within a few weeks of each other. The servers show no signs of high activity before the freeze. None of the monitoring scripts we run indicates any problem at all before the freeze. When it happens the machine hangs; it still responds to ping and can be reset through the SP.
> Looking at the boot events, there is a system downtime which matches when the freeze occurred.
>
>
> It has a very bad impact on Oracle, which suffers from lost writes:
>
> SLACQA_ora_13043.trc:ORA-00600: internal error code, arguments: [kcrf_resilver_log_1], [0x0E0CF5390], [2], [], [], [], [], [], [], [], [], []
>
> This error is mostly associated with power outages. In this case there was no loss of power.
>
> I reported the first machine a while back, explaining that automatic failover failed to occur when a machine is not quite dead.
>
> The more recent failures have happened on machines which didn’t have a physical standby.
>
> I could not find the article about fixing this problem by applying the current redo log. All I could find were articles saying the database was unrecoverable. After ensuring the backups were valid, I proceeded.
>
> The database reported a corrupt rollback segment. I switched to manual undo, made sure there were no partially available segments, created a new undo tablespace, and successfully opened the database.
>
> We had one other problem: an index disagreed with its table. The index was missing a row. We were able to ascertain the program which was using the index. It was a CDC job which was no longer needed. Oracle was perfectly happy with the situation. There was no reported corruption unless the affected index block was read by a query. Another problem was that the index and table are bootstrap objects. We eventually used impdp/expdp to move to another machine. But I am getting ahead of myself.
>
>
> When I brought the database up with the new undo tablespace, all Oracle scheduler jobs reported they could not open a wallet. I’m not sure which wallet is referenced here. Also, clients which called PL/SQL programs to open a wallet could not open a different wallet involved in authenticating to AD. However, if the code was executed on the database server itself, the wallet opened without a problem. Restarting both the database and the listener fixed the problem. I have not been able to find any information on this.
>
> On another database the corrupt rollback segments included some partially available ones. This was a QA database and was refreshed from backup. Again the problem with the wallet occurred.
>
> On the third database both active redo logs were corrupt. It too was recovered from backup. It also had the wallet problem.
>
> So even with no loss of power (yes, the RAID cache batteries were good) and with multiplexed redo logs, we needed to recover from backup on two databases, and on the other had a bootstrap table and index in disagreement.
>
> Ian MacGregor
> SLAC National Accelerator Center
>
>
>
>
>
>
> --
> Mladen Gogala
> Oracle DBA
> http://mgogala.freehostia.com
Received on Mon Mar 09 2015 - 17:52:29 CET
