Linux Native Multipath, ASM and Instance Failures

From: Guillermo Alan Bort <cicciuxdba_at_gmail.com>
Date: Thu, 26 Jul 2012 13:44:12 -0300
Message-ID: <CAJ2dSGT4DNBYN8E+Pe9K3+KYnWg6mfnRO9WhZsPx6mhxA36gQw_at_mail.gmail.com>



Hi,
  I have an SR with Oracle for this, but perhaps some of you have encountered this issue before.

  We have the following setup:

  1. RHEL 5.8 (standard RH kernel)
  2. Oracle RAC 11.2.0.3 (Jan PSU)
  3. Linux Native Multipath (/dev/mapper)
  4. 3PAR storage (don't know much about the storage layer, though).
  5. ASMLib is NOT used; the ASM disk string is /dev/mapper/*p1

  We were running some redundancy tests (pulling cables and seeing what happens), and when the servers lost a path, the instances crashed. I'm still gathering logs, but the OS errors look like this:

Jul 26 09:40:41 tvl-p-orep001 kernel: end_request: I/O error, dev sdbg, sector 4151
Jul 26 09:40:41 tvl-p-orep001 kernel: sd 2:0:0:49: SCSI error: return code = 0x00010000
Jul 26 09:40:41 tvl-p-orep001 kernel: end_request: I/O error, dev sdcu, sector 2868761111
Jul 26 09:40:41 tvl-p-orep001 kernel: sd 2:0:0:49: SCSI error: return code = 0x00010000
Jul 26 09:40:41 tvl-p-orep001 kernel: end_request: I/O error, dev sdcu, sector 2868762711
Jul 26 09:40:41 tvl-p-orep001 kernel: sd 2:0:0:31: SCSI error: return code = 0x00010000
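For reference, the return code in these messages can be split into its component bytes. The following is only a minimal sketch, assuming the usual Linux SCSI mid-layer layout (driver byte, host byte, message byte, status byte, from high to low) and the host byte names from the kernel's scsi.h; the function names are made up for illustration.

```python
# Decode the 32-bit SCSI result word printed by the kernel as
# "SCSI error: return code = 0x00010000".
# Assumption: layout is driver_byte<<24 | host_byte<<16 | msg_byte<<8 | status_byte.

HOST_BYTE_NAMES = {
    0x00: "DID_OK",
    0x01: "DID_NO_CONNECT",
    0x02: "DID_BUS_BUSY",
    0x03: "DID_TIME_OUT",
    0x04: "DID_BAD_TARGET",
    0x05: "DID_ABORT",
    0x06: "DID_PARITY",
    0x07: "DID_ERROR",
    0x08: "DID_RESET",
    0x09: "DID_BAD_INTR",
}


def decode_scsi_result(code):
    """Split the result word into its four bytes."""
    return {
        "driver_byte": (code >> 24) & 0xFF,
        "host_byte": (code >> 16) & 0xFF,
        "msg_byte": (code >> 8) & 0xFF,
        "status_byte": code & 0xFF,
    }


def host_byte_name(code):
    """Return the symbolic name of the host byte, if it is in the table."""
    host = (code >> 16) & 0xFF
    return HOST_BYTE_NAMES.get(host, "unknown (0x%02x)" % host)


if __name__ == "__main__":
    print(decode_scsi_result(0x00010000))
    print(host_byte_name(0x00010000))
```

Under that assumption, 0x00010000 has only the host byte set, which would be consistent with the HBA losing its connection to the target when the cable was pulled.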

and then

Jul 26 09:40:43 tvl-p-orep001 kernel: device-mapper: multipath: Failing path 65:192.
Jul 26 09:40:43 tvl-p-orep001 kernel: device-mapper: multipath: Failing path 65:208.
Jul 26 09:40:43 tvl-p-orep001 kernel: device-mapper: multipath: Failing path 65:224.
Jul 26 09:40:43 tvl-p-orep001 kernel: device-mapper: multipath: Failing path 66:16.
Jul 26 09:40:43 tvl-p-orep001 kernel: device-mapper: multipath: Failing path 66:32.
Jul 26 09:40:43 tvl-p-orep001 kernel: device-mapper: multipath: Failing path 66:80.
Jul 26 09:40:43 tvl-p-orep001 kernel: device-mapper: multipath: Failing path 66:96.
Jul 26 09:40:43 tvl-p-orep001 kernel: device-mapper: multipath: Failing path 66:144.
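The major:minor pairs in the "Failing path" messages can be mapped back to sd device names. This is a rough sketch, assuming the standard static sd major numbers (8, 65-71, 128-135) with 16 minors per disk; sd_name is a hypothetical helper, not part of any tool.

```python
# Map a block device major:minor pair to its kernel sd name.
# Assumption: standard sd majors, 16 disks per major, 16 minors per disk.

SD_MAJORS = [8] + list(range(65, 72)) + list(range(128, 136))


def sd_name(major, minor):
    """Return e.g. 'sdac' for 65:192; a partition number is appended if non-zero."""
    if major not in SD_MAJORS:
        raise ValueError("not an sd major: %d" % major)
    # Zero-based disk index across all sd majors.
    index = SD_MAJORS.index(major) * 16 + minor // 16
    # Convert the index to the letter suffix: a..z, aa..zz, ...
    suffix = ""
    n = index + 1
    while n > 0:
        n, rem = divmod(n - 1, 26)
        suffix = chr(ord("a") + rem) + suffix
    partition = minor % 16
    return "sd" + suffix + (str(partition) if partition else "")


if __name__ == "__main__":
    for major, minor in [(65, 192), (65, 208), (65, 224)]:
        print("%d:%d -> %s" % (major, minor, sd_name(major, minor)))
```

For 65:192, 65:208 and 65:224 this gives sdac, sdad and sdae, which agrees with the device names multipathd reports for those same paths.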

....

Jul 26 09:40:51 tvl-p-orep001 multipathd: sdac: tur checker reports path is down
Jul 26 09:40:51 tvl-p-orep001 multipathd: checker failed path 65:192 in map <DB>_fg1_data_14
Jul 26 09:40:51 tvl-p-orep001 multipathd: ghtgmp_fg1_data_14: remaining active paths: 1
Jul 26 09:40:51 tvl-p-orep001 multipathd: sdad: tur checker reports path is down
Jul 26 09:40:51 tvl-p-orep001 multipathd: checker failed path 65:208 in map <DB>_fg1_data_15
Jul 26 09:40:51 tvl-p-orep001 multipathd: ghtgmp_fg1_data_15: remaining active paths: 1
Jul 26 09:40:51 tvl-p-orep001 multipathd: sdae: tur checker reports path is down
Jul 26 09:40:51 tvl-p-orep001 multipathd: checker failed path 65:224 in map <DB>_fg1_data_16
Jul 26 09:40:51 tvl-p-orep001 multipathd: ghtgmp_fg1_data_16: remaining active paths: 1

In the meantime ASM logs show this:

WARNING: Read Failed. group:0 disk:22 AU:0 offset:0 size:4096
Errors in file /u01/ORAUTL/grid/base/diag/asm/+asm/+ASM/trace/+ASM_ora_18784.trc:
ORA-27061: waiting for async I/Os failed
Linux-x86_64 Error: 5: Input/output error
Additional information: -1
Additional information: 4096
WARNING: Read Failed. group:1 disk:2 AU:0 offset:0 size:4096
Errors in file /u01/ORAUTL/grid/base/diag/asm/+asm/+ASM/trace/+ASM_ora_18784.trc:
ORA-27061: waiting for async I/Os failed
Linux-x86_64 Error: 5: Input/output error
Additional information: -1
Additional information: 4096
WARNING: Read Failed. group:1 disk:1 AU:0 offset:0 size:4096
NOTE: Assigning number (2,0) to disk (/dev/oracleasm/disks/<DB>_FG1_REDOA_01)
NOTE: ASM client <DB> disconnected unexpectedly.
NOTE: ASM client <DB> disconnected unexpectedly.

I've taken a look in MOS and found a few notes of worth:

  1. Oracle ASM and Multi-Pathing Technologies [ID 294869.1] <--- would seem to indicate that device mapper is supported by ASM.
  2. Database Instance Crashes In Case Of Path Offlined In Multipath Storage [ID 555371.1] <--- deals with ASMLib, so not really our particular test case.
  3. Configuration and Use of Device Mapper Multipathing on Oracle Enterprise Linux (OEL) [ID 555603.1] <--- interesting note; it deals with OEL rather than RHEL, but the two are fairly similar. Our path_grouping_policy is different: the note recommends setting it to failover and we have multibus. I'm not sure this is the issue, though.

The other notes I found were of no relevance to this issue.

Thanks in advance for any input

Cheers
Alan.-

--
http://www.freelists.org/webpage/oracle-l
Received on Thu Jul 26 2012 - 11:44:12 CDT
