Re: ASM Disks Dropping on RHEL 6.3 - Practical Limit to Disks? (Solved)

From: David Barbour <david.barbour1_at_gmail.com>
Date: Fri, 18 Apr 2014 11:37:37 -0500
Message-ID: <CAFH+ifeMfeLUuVUsgKPazWGHyCxJYFoVGY=FK-S7erfvCSd03w_at_mail.gmail.com>



This was interesting. We had a contractor install an application database on the test RAC. I don't think they understood what they were doing, as it appears they tried to create diskgroups as if they were filesystems. In any event, they ran into the 63-diskgroup limit for ASM. When you create a diskgroup and register it, an entry is added to the spfile parameter asm_diskgroups. When they hit the error, they dropped some diskgroups and created new ones so they wouldn't run into the limit, but the drop didn't remove the corresponding entries from the spfile. There were 64 diskgroups listed in the spfile, which I wouldn't have thought was possible to begin with, and the list included the diskgroups they had dropped. Some of the diskgroups they created after the drop were in the spfile list and some weren't; I suspect they got a registration error on the ones that weren't listed and just mounted them manually. Regardless, some of the ASM diskgroups - including the one containing the ACFS volumes with the OCR - were 'pushed off' (for lack of a better phrase) the spfile asm_diskgroups listing.

Is this a bug, or is there some additional step that needs to occur when you drop a diskgroup to remove it from the spfile? And why would there be 64 entries anyway when the limit is 63?
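
For what it's worth, a rough way to check and clean this up would be something along these lines, run against the ASM instance as SYSASM (the diskgroup names below are just placeholders):

  SQL> show parameter asm_diskgroups
  SQL> select name, state from v$asm_diskgroup;
  -- reset the parameter in the spfile to only the diskgroups that should auto-mount
  SQL> alter system set asm_diskgroups='GRID','DATA','FRA' scope=spfile sid='*';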

On Wed, Apr 16, 2014 at 4:34 PM, David Barbour <david.barbour1_at_gmail.com> wrote:

> I was warned about a certain SAN. What's happening now is probably not
> directly related to that, but it could be.
>
> 3-Node RAC
> Dell R720
> Dual Quad-Core Intel Xeon E5-2643 0 _at_ 3.30GHz
> 384GB RAM
> GI 11.2.0.3
> Database 11.2.0.3
> 488 ASM Disks
>
> Yesterday the bottom fell out of our test RAC. Node2 just lost drives.
> While I was trying to diagnose the problem, OEM alerted that it had lost
> contact with Node1. When I tried to log into Node1, there was 'no route to
> host.' So I engaged the sys admins on that one and went back to looking at
> Node2. I couldn't start the failed instances, nor could I stop them. Nor
> could I stop CRS on the node. I should have saved the output, but the
> bottom line is that when I ran crsctl stop crs, it failed. Running srvctl
> stop database -d also failed. So I logged into the instance and shut it
> down the old-fashioned way. When I finally got most everything stopped, I
> rebooted the box. Nothing came up. Here's an abbreviated output from crsctl:
>
> rchr1t02:/oracle/D00 # crsctl stat res -t -init
> --------------------------------------------------------------------------------
> NAME           TARGET  STATE         SERVER      STATE_DETAILS
> --------------------------------------------------------------------------------
> Cluster Resources
> --------------------------------------------------------------------------------
> ora.asm
>       1        ONLINE  INTERMEDIATE  rchr1t02    OCR not started
> ora.cluster_interconnect.haip
>       1        ONLINE  ONLINE        rchr1t02
> ora.crf
>       1        ONLINE  ONLINE        rchr1t02
> ora.crsd
>       1        ONLINE  OFFLINE
>
>
> Makes sense, because the +GRID diskgroup that holds the OCR didn't mount.
> I've been through a lot of Oracle docs on this. ocssd.bin, evmd.bin and
> haip were all running; I just couldn't bring up the diskgroup(s). Oh, and
> while I was struggling with this, the sys admin rebooted Node1 because he
> couldn't log in through the console either. That node ended up in the same
> state as Node2. Node3, meanwhile, was chugging along - at least until it
> was rebooted. Now I've got three servers up, but no ASM disks.
>
> /var/log/messages shows entries like:
>
> Apr 15 19:36:27 rchr1t02 udevd[13929]: worker [29550] unexpectedly returned with status 0x0100
> Apr 15 19:37:44 rchr1t02 udevd[13929]: worker [52527] failed while handling '/devices/virtual/block/asm!.asm_ctl_vbg2'
>
> Red Hat suggested a workaround by upping the udev timeout and limiting the
> number of udev worker processes (which appears to be a function of total
> memory on this release) on boot. This didn't help. Eventually I was able
> to stop and start multipathd, reload udev rules, log in to the ASM instance
> (which was stuck at 'ONLINE' 'INTERMEDIATE') and mount the disk groups
> manually. I say eventually, because some mounted right away, others gave
> me permission errors but then mounted many minutes later.
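>
> For the record, the sequence that eventually got things going looked
> roughly like this (run as root for the first part, then as the grid owner;
> the diskgroup names are just placeholders):
>
> # bounce multipathd and re-trigger udev so the ASM block devices reappear
> service multipathd stop
> service multipathd start
> udevadm control --reload-rules
> udevadm trigger --subsystem-match=block
>
> # then mount the diskgroups by hand on the ASM instance
> sqlplus / as sysasm
> SQL> alter diskgroup GRID mount;
> SQL> alter diskgroup DATA mount;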
>
> Totally freakin' weird.
>
> Has anyone experienced anything like this? I've opened an SR, but folks
> around here want a root cause analysis right now, and I don't have anything
> to offer except that the disks apparently no longer mount on boot, I may or
> may not be able to bring them up manually, and it could happen again.
>
> We've rebooted nodes on this RAC numerous times without incident. Why
> now? The Storage, Systems and Network folks swear nothing has changed.
> Except there was a firmware update to the DRAC. Oh, and they put a new
> route on the boxes to accommodate a new set of IPs we're introducing. But
> other than that...........
>
> cluvfy comes back clean.
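>
> For reference, the checks were along the lines of:
>
> cluvfy stage -post crsinst -n all -verbose
> cluvfy comp ocr -n all -verbose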
>
> Is there a practical limit to the number of disks? I know ASM is limited
> to 63 diskgroups.
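>
> A quick way to see where a cluster stands against that, for reference:
>
> SQL> select count(*) diskgroups from v$asm_diskgroup;
> SQL> select group_number, count(*) disks from v$asm_disk group by group_number;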
>
>
>
>

--
http://www.freelists.org/webpage/oracle-l
Received on Fri Apr 18 2014 - 18:37:37 CEST
