RE: Linux Native Multipath, ASM and Instance Failures

From: CRISLER, JON A <JC1706_at_att.com>
Date: Thu, 26 Jul 2012 22:08:19 +0000
Message-ID: <9F15274DDC89C24387BE933E68BE3FD3347ECD_at_MISOUT7MSGUSR9D.ITServices.sbc.com>



We used 3PAR extensively and had few problems in that area. Check your multipath.conf to make sure it has all the right options. The time it takes to perform the failover might fall outside the window during which ASM and multipath allow a disk I/O to be suspended (which I think is 70 seconds for clusterware, and might be tunable, AND I might be munging up that whole concept :) ). With NetApp FC we have had similar issues, and it usually took some serious tweaking of multipath.conf to resolve them. There are also known bugs in the device-mapper and multipath RPMs, so make sure you are up to date in that area.

It's been a while since I looked at this stuff, and I am not a storage guru. We did use ASMLib because it just makes life easier overall, and the scanorder was /dev/dm*. I would suggest opening a ticket with Red Hat Support specifying a multipath issue and seeing what they suggest; they were very helpful in our case. You're lucky you are on 5.8, because RH 4 and earlier versions had more obscure issues of that type.

You don't specify which FC driver and adapter vendor you are using. Sometimes firmware updates are helpful, and if it is QLogic, there is also a choice between the RH supplied driver and the QLogic supplied driver. We used the RH supplied driver, but I always suspected that the QLogic driver might be better.
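
On the failover window mentioned above: one rough way to check it is to measure the gap between log timestamps and compare it with the window. The sketch below is only an illustration in Python. It uses the two kernel log lines quoted further down in this thread, and it treats the 70-second figure as an assumption (it is my unconfirmed recollection, not a documented value). The names syslog_time and ASSUMED_IO_WINDOW_SECONDS are made up for this example.

```python
from datetime import datetime

# Assumption: 70 s is an unconfirmed recollection, not a documented limit.
ASSUMED_IO_WINDOW_SECONDS = 70


def syslog_time(line, year=2012):
    """Parse the leading 'Mon DD HH:MM:SS' stamp of a syslog line.

    Syslog lines carry no year, so one has to be supplied.
    """
    stamp = " ".join(line.split()[:3])
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


# The two kernel messages quoted in the original report.
first_io_error = (
    "Jul 26 09:40:41 tvl-p-orep001 kernel: end_request: "
    "I/O error, dev sdbg, sector 4151"
)
path_failed = (
    "Jul 26 09:40:43 tvl-p-orep001 kernel: device-mapper: "
    "multipath: Failing path 65:192."
)

elapsed = (syslog_time(path_failed) - syslog_time(first_io_error)).total_seconds()
within_window = elapsed <= ASSUMED_IO_WINDOW_SECONDS

print(f"{elapsed:.0f} s between the two messages; "
      f"within assumed window: {within_window}")
```

This only measures the gap between two excerpted messages, which may not even refer to the same path. A real check would run from the first I/O error to the last path event for each affected multipath device in the complete log.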

-----Original Message-----
From: oracle-l-bounce_at_freelists.org [mailto:oracle-l-bounce_at_freelists.org] On Behalf Of Radoulov, Dimitre
Sent: Thursday, July 26, 2012 2:29 PM
To: cicciuxdba_at_gmail.com
Cc: oracle-l-freelists
Subject: Re: Linux Native Multipath, ASM and Instance Failures

Hi Guillermo,

On 26/07/2012 18:44, Guillermo Alan Bort wrote:
> We have the following set up:
>
> 1. RHEL 5.8 (standard RH kernel)
> 2. Oracle RAC 11.2.0.3 (Jan PSU)
> 3. Linux Native Multipath (/dev/mapper)
> 4. 3PAR storage (don't know much about the storage layer, though).
> 5. NO ASMLIB is used, the asm diskstring is /dev/mapper/*p1
>
> We were running some redundancy tests (pulling cables and seeing what
> happens) and when the servers lost a path, the instances crashed. I'm
> still gathering logs, but OS errors look like this:
> Jul 26 09:40:41 tvl-p-orep001 kernel: end_request: I/O error, dev sdbg, sector 4151

[...]
> and then
>
> Jul 26 09:40:43 tvl-p-orep001 kernel: device-mapper: multipath: Failing path 65:192.

[...]
> In the meantime ASM logs show this:
>
> WARNING: Read Failed. group:0 disk:22 AU:0 offset:0 size:4096
> Errors in file /u01/ORAUTL/grid/base/diag/asm/+asm/+ASM/trace/+ASM_ora_18784.trc:
> ORA-27061: waiting for async I/Os failed
> Linux-x86_64 Error: 5: Input/output error
> Additional information: -1
> Additional information: 4096

Just for your information:
we have no problems with RHEL 5.7 (RH kernel), RAC 11.2.0.3.2, 3PAR _and_ ASMLib. We ran the same tests without problems: there were messages for the failing paths in the OS logs (as expected), but the Oracle stack remained up and running, with no error messages at all in the various alert logs.

If I recall correctly, some MOS notes suggest setting ORACLEASM_SCANORDER to dm (/dev/dm-* as opposed to /dev/mapper/*). As far as I know, the fact that the dm-* names are not persistent shouldn't be a problem when the clusterware files (voting/OCR) are in ASM disk groups (11.2). I would try setting asm_diskstring to /dev/dm-* and then repeat the tests.
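
To make the difference between the two discovery strings concrete, here is a small sketch in Python that approximates wildcard matching with fnmatch. The device names are hypothetical, and fnmatch is only a stand-in for how ASM expands asm_diskstring, so take it as an illustration of which kinds of names each pattern can pick up, not as ASM's actual discovery logic.

```python
from fnmatch import fnmatch

# Hypothetical device nodes; real names depend on the host's multipath setup.
devices = [
    "/dev/mapper/mpath0",
    "/dev/mapper/mpath0p1",
    "/dev/mapper/mpath1p1",
    "/dev/dm-0",
    "/dev/dm-1",
    "/dev/dm-2",
    "/dev/sdbg",
]


def matching(pattern, names):
    """Return the names that match a shell-style wildcard pattern."""
    return [name for name in names if fnmatch(name, pattern)]


# Current setting in the report: only first partitions under /dev/mapper.
mapper_parts = matching("/dev/mapper/*p1", devices)

# Suggested setting: every dm-N node, whole devices and partitions alike.
dm_nodes = matching("/dev/dm-*", devices)

print("asm_diskstring=/dev/mapper/*p1 ->", mapper_parts)
print("asm_diskstring=/dev/dm-*       ->", dm_nodes)
```

Neither pattern matches the single-path /dev/sd* nodes, which is the point of going through multipath. Note that /dev/dm-* is broader than /dev/mapper/*p1, since it does not filter on the partition suffix.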

Regards
Dimitre

--
http://www.freelists.org/webpage/oracle-l


Received on Thu Jul 26 2012 - 17:08:19 CDT
