Re: SunFire Server Hangs

From: MacGregor, Ian A. <ian_at_slac.stanford.edu>
Date: Tue, 10 Mar 2015 16:52:00 +0000
Message-ID: <8A7F308F-AC04-494B-8AAC-9ACD9315BFB3_at_slac.stanford.edu>



Some more information:

The machines are of two types: SunFire X4250 and X4270. /etc/release on one machine is

                      Solaris 10 10/09 s10x_u8wos_08a X86
           Copyright 2009 Sun Microsystems, Inc.  All Rights Reserved.
                        Use is subject to license terms.
                           Assembled 16 September 2009

and on the other two

cat /etc/release

                        Solaris 10 5/09 s10x_u7wos_08 X86
           Copyright 2009 Sun Microsystems, Inc.  All Rights Reserved.
                        Use is subject to license terms.
                             Assembled 30 March 2009

We apply Solaris patches quarterly.

The machine definitely tried to halt. After bringing the machine back up, “last reboot” shows the system down time at the time of the "freeze". The machine continued to ping after that time. The response comes from the server itself, not from any firewall. So despite the reported system down time, at least part of the OS was up.

There is nothing in /var/log/messages, nor in any other file that might be used to monitor the system; all I/O ceased at the time of the freeze. As far as we can tell, no resources were in any danger of exhaustion before the freeze, though that could be because it happened too quickly to be sampled. I don’t think resource exhaustion is at all likely here.

There is a single RAID controller. I thought the system disks might not be under this controller, but it turns out they are. So the complete loss of the I/O system is looking more likely.

I did find we were a bit back level on the BIOS.

On Mar 9, 2015, at 12:03 PM, MARK BRINSMEAD <mark.brinsmead_at_gmail.com> wrote:

Agreed. This sounds likely.

If the system responds to "ping", then the OS is up and running. Unless, I suppose, the ping response is being provided by a firewall. (Network folks often do that.)

Are any other network services responsive?
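One way to answer that question from another host is sketched below. It is only an illustration, not something anyone in this thread ran: the function name `probe_tcp` and its return labels are made up for the example. A completed TCP handshake only shows that the kernel's network stack is alive, so the sketch can optionally wait for a banner (as sshd sends), which needs a userland process to respond.

```python
import socket


def probe_tcp(host, port, timeout=3.0, expect_banner=False):
    """Classify a TCP service as 'no_connect', 'accepts', or 'banner'.

    'accepts' means the handshake completed (kernel is answering).
    'banner'  means the service process also sent data (userland is answering).
    The name and labels are hypothetical, chosen for this sketch.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            if not expect_banner:
                return "accepts"
            conn.settimeout(timeout)
            try:
                data = conn.recv(64)
            except OSError:
                # Timed out or reset while waiting: connected, but no banner.
                return "accepts"
            return "banner" if data else "accepts"
    except OSError:
        # Refused, unreachable, or connect timed out.
        return "no_connect"
```

For example, `probe_tcp("dbhost", 22, expect_banner=True)` returning "accepts" rather than "banner" would suggest sshd is stuck even though the kernel still completes connections ("dbhost" is a placeholder).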

In cases like the ones I have seen (fibre-channel multipath failover initiated but not completed) the symptoms reported fit perfectly -- everything is actually "up" and working, but any processes will freeze in an uninterruptible kernel state the moment they attempt to do an IO to disk. Naturally, this would affect any processes attempting to write to log files. A review of the logfiles would lead you to the conclusion that the operating system had halted, when in fact the operating system had lost the ability to talk to the disks.
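Because a process that blocks in the kernel on disk IO cannot log its own failure, a monitor has to do the risky write in a child process and watch it from a parent that touches no disk itself. The following is a minimal sketch of that idea under stated assumptions: it is POSIX-only (it uses `os.fork`), the function name `disk_write_probe` and its return labels are hypothetical, and it is not the monitoring used on these servers.

```python
import os
import time


def disk_write_probe(path, timeout=10.0, poll=0.1):
    """Return 'ok', 'error', or 'hung' for a write+fsync to `path`.

    Hypothetical helper for illustration. The child does the disk IO;
    the parent only polls waitpid(), so it stays responsive even if
    the child blocks in the kernel.
    """
    pid = os.fork()
    if pid == 0:
        # Child: attempt the write, then exit immediately with a status code.
        status = 1
        try:
            fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
            try:
                os.write(fd, str(time.time()).encode())
                os.fsync(fd)  # force the data to the device, not just the cache
            finally:
                os.close(fd)
            status = 0
        except OSError:
            status = 1
        finally:
            os._exit(status)

    # Parent: wait without blocking, give up after `timeout` seconds.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        done, wait_status = os.waitpid(pid, os.WNOHANG)
        if done == pid:
            ok = os.WIFEXITED(wait_status) and os.WEXITSTATUS(wait_status) == 0
            return "ok" if ok else "error"
        time.sleep(poll)
    # Child is still stuck; it is left behind because a process blocked
    # in uninterruptible IO cannot be killed anyway.
    return "hung"
```

A "hung" result would have to be reported over the network (syslog to a remote host, an SNMP trap, and so on) rather than written to a local file, since a local write would block in the same way.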

The use of "internal" disks does not necessarily preclude this sort of thing. If the system was designed for redundancy, there may be multiple IO paths (with multiple controllers) providing access to the disks.

It might be worth the trouble to ask the sysadmins and/or hardware vendor whether there is any IO multipathing going on here.

Of course, though, the problem is likely to be something else entirely. This is ONE possibility, but I sure would not try to hang my hat on it without a lot more evidence.

On Mon, Mar 9, 2015 at 12:13 PM, Andrew Kerber <andrew.kerber_at_gmail.com> wrote:

I can't figure out a mechanism that may cause this, but it sounds like the operating system is losing its path to the storage.

Sent from my iPad

> On Mar 9, 2015, at 10:21 AM, MacGregor, Ian A. <ian_at_slac.stanford.edu> wrote:
>
> It's the entire server which hangs. Once we can access the machine, after the reset through the system process, a check of the system down time shows the time the machine froze. It's as if the server panicked but did not make it all the way down, as it remains pingable. When it happens, all programs which might provide some information as to the cause stop. However, up to that point things look very normal indeed.
>
> All the storage is onboard. These machines accommodate 16 drives internally.
>
> None of the machines which has had this problem is clustered. They are dedicated database machines. The OS is Solaris 10.
>
> Ian MacGregor
> SLAC National Accelerator Center
>
> -----Original Message-----
> From: oracle-l-bounce_at_freelists.org [mailto:oracle-l-bounce_at_freelists.org] On Behalf Of Mladen Gogala
> Sent: Saturday, March 07, 2015 9:51 PM
> To: oracle-l_at_freelists.org
> Subject: Re: SunFire Server Hangs
>
> Is the whole server hanging or just Oracle? Can you ssh into the server? Unfortunately Solaris is not unbreakable, like Linux.
>
> On 3/7/2015 11:49 PM, MacGregor, Ian A. wrote:
>
>
> It’s all internal storage, not using ASM. The Oracle version is 11.2.0.3.
>
> Ian
>
>
> On Mar 6, 2015, at 4:50 PM, Andrew Kerber <andrew.kerber_at_gmail.com> wrote:
>
> Sounds like a problem with storage to me. Is it on local storage or SAN? ASM or file system? Also, what Oracle version?
>
> Sent from my iPad
>
> On Mar 6, 2015, at 6:31 PM, MacGregor, Ian A. <ian_at_slac.stanford.edu> wrote:
>
>
>
> Over the past 26 months or so we have had three SunFire x86 servers hang, two of them quite recently, within a few weeks of each other. The servers show no signs of high activity before the freeze. None of the monitoring scripts we run indicate any problem at all before the freeze. When it happens the machine hangs; it does ping, and can be reset through the SP.
> Looking at the boot events, there is a system down time which matches when the freeze occurred.
>
>
> It has a very bad impact on Oracle, which suffers from lost writes:
>
> SLACQA_ora_13043.trc:ORA-00600: internal error code, arguments: [kcrf_resilver_log_1], [0x0E0CF5390], [2], [], [], [], [], [], [], [], [], []
>
> This error is mostly associated with power outages. In this case there was no loss of power.
>
> I reported the first machine a while back, explaining that automatic failover fails to occur when a machine is not quite dead.
>
> The more recent failures have happened on machines which didn’t have a physical standby.
>
> I could not find the article about fixing this problem by applying the current redo log. All I could find were articles saying the database was unrecoverable. After ensuring the backups were valid, I proceeded.
>
> The database reported a corrupt rollback segment. I switched to manual undo, made sure there were no partially available
> segments, created a new undo tablespace, and successfully opened the database.
>
> We had one other problem: an index disagreed with its table. The index was missing a row. We were able to ascertain the program which was using the index. It was a CDC job which was no longer needed. Oracle was perfectly happy with the situation. There was no reported corruption unless the affected index block was read by a query. Another problem was that the index and table are bootstrap objects. We eventually used impdp/expdp to move to another machine. But I am getting ahead of myself.
>
>
> When I brought the database up with the new undo tablespace, all Oracle scheduler jobs reported they could not open a wallet. I’m not sure which wallet is referenced here. Also, clients which called PL/SQL programs to open a wallet could not open a different wallet involved in authenticating to AD. However, if the code was executed on the database server itself, the wallet opened without a problem. Restarting both the database and the listener fixed the problem. I have not been able to find any information on this.
>
> On another database the corrupt rollback segments included some partially available ones. This was a QA database and was refreshed from backup. Again the problem with the wallet occurred.
>
> On the third database both active redo logs were corrupt. It too was recovered from backup. It also had the wallet problem.
>
> So even with no loss of power (yes, the RAID cache batteries were good) and with multiplexed redo logs, we needed to recover from backup on two databases, and on the other had a bootstrap table and index in disagreement.
>
> Ian MacGregor
> SLAC National Accelerator Center
>
>
>
>
>
>
> --
> Mladen Gogala
> Oracle DBA
> http://mgogala.freehostia.com
--
http://www.freelists.org/webpage/oracle-l

Received on Tue Mar 10 2015 - 17:52:00 CET
