NFS hanging on AIX

From: Allen, Brandon <Brandon.Allen_at_OneNeck.com>
Date: Tue, 16 Feb 2010 11:02:42 -0700
Message-ID: <64BAF54438380142A0BF94A23224A31E11346EBD5B_at_ONEWS06.oneneck.corp>



Hi list, just a heads up on an issue we've been struggling with for a couple of months that we finally figured out. We're running Oracle 9.2.0.8 on AIX 5.3, with Oracle's files on NFS served by an EMC Celerra NS500 (OS version 5.5.23-2), and have been running this system for years with no problems.

A couple of months ago, we started having weird problems where the database would freeze when trying to access one particular datafile. It wouldn't return any errors, but anything that touched the datafile would hang indefinitely. Other times it wasn't a datafile but a log file or controlfile that got locked up, so the symptoms were a little different each time. Usually our first indication of the problem was a hung full backup: since the backup scans the entire database, it was likely the first process to encounter the bad file.

Once the problem occurred, we weren't able to access the file at all. If we tried to view it with more/cat/tail/head, or run dbv on it, our process would just hang and couldn't be killed, even with kill -9. Oracle wouldn't shut down normally, so we had to abort it, and then Oracle wouldn't start back up because it couldn't access the file either. We had to reboot the host, after which everything came up fine, and scanning the problem file with dbv found no problems. Then everything would run fine for a week or two until the same thing happened again.

The last time the problem occurred, we forced a panic on the primary data mover in the NAS, causing it to fail over to the secondary. The problem resolved immediately, and we were able to access the previously inaccessible files again without rebooting the host.

Today we got a known bug description from EMC support that looks like an exact match for our problem. Apparently there is a problem with stale XID cache entries on the data mover for NFS clients on AIX, and a new parameter, "xidreset", has been added in 5.5.41.1 to deal with it. We'll be upgrading this weekend.
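For anyone who wants to catch this kind of hang before the nightly backup does, here is a rough sketch of a read probe. It is not something we ran. It assumes Python 3 is available on the host, and the names probe_file and find_hung_files and the 10-second timeout are made up for illustration. Each file is read in a separate child process, so the monitoring script itself never touches the file and can't get stuck on it. The parent never waits on a child that timed out, since a reader blocked on NFS may ignore kill -9, as described above.

```python
import subprocess
import sys

# Child program: read the first block of the file named in argv[1].
_READ_SNIPPET = "import sys; open(sys.argv[1], 'rb').read(8192)"


def probe_file(path, timeout_s=10.0):
    """Return 'ok', 'error', or 'hung' for a single file.

    'hung' only means the read did not finish within timeout_s
    (an arbitrary, assumed threshold); it does not diagnose the cause.
    """
    proc = subprocess.Popen(
        [sys.executable, "-c", _READ_SNIPPET, path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        returncode = proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # Best-effort kill. Deliberately no wait() afterwards: a reader
        # stuck in an uninterruptible NFS call may never exit.
        proc.kill()
        return "hung"
    return "ok" if returncode == 0 else "error"


def find_hung_files(paths, timeout_s=10.0):
    """Return the subset of paths whose read probe timed out."""
    return [p for p in paths if probe_file(p, timeout_s) == "hung"]
```

Calling find_hung_files() with the list of datafiles, log files and controlfiles returns the ones that did not respond in time. One caveat: a small read like this may be satisfied from the NFS client's cache, so a clean result does not prove the file is reachable on the server.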

Regards,
Brandon



--
http://www.freelists.org/webpage/oracle-l
Received on Tue Feb 16 2010 - 12:02:42 CST
