Grid (RAC & Standalone) Unexpected Node Reboots Upon Device Path Failures

From: <fmhabash_at_gmail.com>
Date: Mon, 13 Jun 2016 13:04:48 -0400
Message-ID: <575ee7b2.4c54240a.aebc3.3fe1_at_mx.google.com>



We are experiencing a perplexing issue for which we have not been able to arrive at a root cause (RCA) or a resolution. Grid nodes (RAC or standalone) reboot unexpectedly and sporadically (not every time) when we fail over a hardware component such as a UCS fabric interconnect, an HBA, or a storage controller. On some systems, we have also noticed filesystems going read-only.

All devices are configured with multipathing, with a minimum of 4 paths. Multipathing is provided by EMC PowerPath or native Linux DM-MPIO.

All nodes use 11gR2 ASM LVM, with a subset using ASMLIB, running on OEL 6.3-6.6 and RDBMS 11gR2.

I know there are a zillion factors to consider here, but to keep things simple, let’s focus on DM-MPIO for now. We believe all these symptoms are related to how the software (Oracle ASM or Linux LVM) reacts to the loss of a path in a multipathed setup. So we focused on the multipath.conf settings that control I/O path failover, namely:

no_path_retry
queue_if_no_path
polling_interval
rr_min_io
failback immediate
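
To make it easier to compare these settings across nodes, here is a minimal Python sketch that pulls them out of the text of a multipath.conf file. It is only an illustration: the function name watched_settings and the SAMPLE config are hypothetical, the sample values are placeholders rather than our actual settings or a recommendation, and it assumes the retry setting above corresponds to the multipath.conf keyword no_path_retry and that queue_if_no_path appears inside a features string.

```python
# Settings discussed in this post. "features" is watched because
# queue_if_no_path is normally set inside a features string (assumption).
WATCHED = ("no_path_retry", "polling_interval", "rr_min_io", "failback", "features")


def watched_settings(conf_text):
    """Return {section_path: {keyword: value}} for the watched keywords.

    Simplified parser: assumes one keyword per line, braces that open at
    the end of a line and close on their own line, and no '#' inside values.
    Repeated sections with the same path are merged (later values win).
    """
    found = {}
    stack = []  # currently open sections, e.g. ["devices", "device"]
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        if line.endswith("{"):
            stack.append(line[:-1].strip())
            continue
        if line == "}":
            if stack:
                stack.pop()
            continue
        parts = line.split(None, 1)
        if len(parts) == 2 and parts[0] in WATCHED:
            key, value = parts
            found.setdefault("/".join(stack), {})[key] = value.strip().strip('"')
    return found


# Hypothetical sample; values are placeholders, not recommendations.
SAMPLE = """
defaults {
    polling_interval 5
    no_path_retry    12
    rr_min_io        100
    failback         immediate
}
devices {
    device {
        features "1 queue_if_no_path"
    }
}
"""

if __name__ == "__main__":
    for section, settings in watched_settings(SAMPLE).items():
        for key, value in sorted(settings.items()):
            print(f"{section}: {key} = {value}")
```

Running the same function against the contents of /etc/multipath.conf from each node and comparing the resulting dictionaries would show where nodes differ on these settings.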

  1. Have you experienced issues like unexpected node reboots or filesystems going read-only when failing over at the hardware level, as described above?
  2. What was your resolution?
  3. How are the multipath.conf parameters listed above set in your environment?

Thanks all

--
http://www.freelists.org/webpage/oracle-l
Received on Mon Jun 13 2016 - 19:04:48 CEST
