ASM Disks Dropping on RHEL 6.3 - Practical Limit to Disks?

From: David Barbour <david.barbour1_at_gmail.com>
Date: Wed, 16 Apr 2014 16:34:08 -0500
Message-ID: <CAFH+ifezBOMDONEXrDR2rwG=M8=qxGTX-R7_vQNer_-AxvANaw_at_mail.gmail.com>



I was warned about a certain SAN. Regardless, what's happening now probably isn't directly related, but it could be.

3-Node RAC
Dell R720
Dual Quad-Core Intel Xeon E5-2643 @ 3.30GHz, 384GB RAM
GI 11.2.0.3
Database 11.2.0.3
488 ASM Disks

Yesterday the bottom fell out of our test RAC. Node2 just lost drives. While I was trying to diagnose the problem, OEM alerted that it had lost contact with Node1. When I tried to log into Node1, there was 'no route to host,' so I engaged the sysadmins on that one and went back to looking at Node2. I couldn't start the failed instances, nor could I stop them, nor could I stop CRS on the node. I should have saved the output, but the bottom line is that crsctl stop crs failed, and srvctl stop database -d failed too. So I logged into the instance and shut it down the old-fashioned way. When I finally got most everything stopped, I rebooted the box. Nothing came up. Here's an abbreviated output from crsctl:

rchr1t02:/oracle/D00 # crsctl stat res -t -init
--------------------------------------------------------------------------------
NAME           TARGET  STATE        SERVER                   STATE_DETAILS
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.asm
      1        ONLINE  INTERMEDIATE rchr1t02                 OCR not started
ora.cluster_interconnect.haip
      1        ONLINE  ONLINE       rchr1t02
ora.crf
      1        ONLINE  ONLINE       rchr1t02
ora.crsd
      1        ONLINE  OFFLINE

ora.crsd being OFFLINE makes sense, because the +GRID diskgroup that holds the OCR didn't mount. I've been through a lot of Oracle docs on this. ocssd.bin, evmd.bin and haip were all running; I just couldn't bring up the diskgroup(s). Oh, and while I was struggling with this, the sysadmin rebooted Node1 because he couldn't log in through the console either. That node ended up in the same state as Node2. Node3, meanwhile, was chugging along. At least until it was rebooted. Now I've got three servers up, but no ASM disks.
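In case it's useful to anyone following along, a quick way to check whether the ASM instance can at least see the disks is a query like this (+ASM2 here is just a placeholder for the local ASM SID, so adjust to taste):

export ORACLE_SID=+ASM2
sqlplus -s / as sysasm <<'EOF'
-- header_status shows what ASM makes of each disk header:
-- MEMBER is good; CANDIDATE, FORMER or UNKNOWN on a disk that should
-- belong to a group means ASM isn't reading its header properly
SELECT path, header_status, mode_status FROM v$asm_disk ORDER BY path;
EOF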

Messages shows stuff like:

Apr 15 19:36:27 rchr1t02 udevd[13929]: worker [29550] unexpectedly returned with status 0x0100
Apr 15 19:37:44 rchr1t02 udevd[13929]: worker [52527] failed while handling '/devices/virtual/block/asm!.asm_ctl_vbg2'

Red Hat suggested a workaround: upping the udev timeout at boot and limiting the number of udev worker processes (which appears to be a function of total memory in this release). That didn't help. Eventually I was able to stop and start multipathd, reload the udev rules, log in to the ASM instance (which was stuck at ONLINE/INTERMEDIATE) and mount the diskgroups manually. I say "eventually" because some mounted right away, while others gave me permission errors but then mounted many minutes later.
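For the record, the sequence that eventually worked was roughly this (reconstructed from memory, so treat it as a sketch rather than gospel; +ASM2 is just the local ASM SID):

service multipathd stop
udevadm control --reload-rules           # re-read /etc/udev/rules.d
udevadm trigger --subsystem-match=block  # replay kernel events for block devices
service multipathd start

export ORACLE_SID=+ASM2
sqlplus / as sysasm <<'EOF'
-- mounts whatever isn't already mounted; groups can also be named one by one
ALTER DISKGROUP ALL MOUNT;
EOF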

Totally freakin' weird.

Has anyone experienced anything like this? I've opened an SR, but folks around here want a root-cause analysis right now, and all I can tell them is that the disks no longer mount on boot, that I may or may not be able to bring them up manually, and that it could happen again.

We've rebooted nodes on this RAC numerous times without incident. Why now? The Storage, Systems and Network folks swear nothing has changed. Except there was a firmware update to the DRAC. Oh, and they put a new route on the boxes to accommodate a new set of IPs we're introducing. But other than that...........

cluvfy comes back clean.
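By cluvfy I mean the usual post-install stage check, something along the lines of:

cluvfy stage -post crsinst -n all -verbose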

Is there a practical limit to the number of disks? I know ASM is limited to 63 diskgroups.
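If anyone wants to sanity-check their own counts against those limits, it's a simple query on the ASM instance (nothing exotic, just v$asm_disk and v$asm_diskgroup):

sqlplus -s / as sysasm <<'EOF'
-- disks per diskgroup, plus the overall total
SELECT g.name, COUNT(*) AS disks
  FROM v$asm_disk d
  JOIN v$asm_diskgroup g ON d.group_number = g.group_number
 GROUP BY g.name
 ORDER BY g.name;
SELECT COUNT(*) AS total_disks FROM v$asm_disk;
EOF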

--
http://www.freelists.org/webpage/oracle-l
Received on Wed Apr 16 2014 - 23:34:08 CEST
