storage and filesystems on linux (ocfs2, ASM...)

From: Li Li <litanli_at_gmail.com>
Date: Mon, 13 Jun 2011 22:02:04 -0500
Message-ID: <BANLkTi=8Ey_a-bG8ghh+cQF9-vboUP8N4Q_at_mail.gmail.com>



Maybe you can look at your fabric zoning. We had a RAC node (with both OCFS2 and ASM) reboot issue three years ago, and it turned out to be fabric zoning. The HBAs and SPs were originally zoned correctly (the recommendation is one initiator to one target, but some people are lazy, so it was zoned one initiator to two targets, which still works). Then, while we were implementing a new enterprise backup product, one of our storage guys got lazy again and added the media agent HBAs to the RAC zones, turning them into many-initiator/many-target zones, which is simply wrong. After that, any zone change on the switch caused timeouts between the storage and the RAC nodes, which in turn caused node reboots. Once the zoning was corrected, everything went back to normal.
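For illustration, correct single-initiator/single-target zoning on a Brocade switch looks roughly like the following (the aliases and WWNs are made up, and other switch vendors use different syntax):

    alicreate "rac1_hba0", "10:00:00:00:c9:11:22:33"       # one initiator: a node HBA port
    alicreate "eva_ctrl_a_p1", "50:01:43:80:aa:bb:cc:01"   # one target: an array controller port
    zonecreate "z_rac1_hba0__eva_a_p1", "rac1_hba0; eva_ctrl_a_p1"
    cfgadd "prod_cfg", "z_rac1_hba0__eva_a_p1"             # add the zone to an existing config
    cfgenable "prod_cfg"                                   # one such zone per initiator/target pair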

On Monday, June 13, 2011, Jonathan Smith <smithj_at_alaska.edu> wrote:
> Thanks for the reply.
>
> We have Oracle, HP (hardware, SAN switches, HP EVA), and Red Hat (OS) tickets open. So far all are pointing fingers at each other.
>
> Our other systems also use qlogic, but ocfs2 is the only one which panics due to a rescan. Other filesystems notice the event, write out a warning in the log, and continue to function. The rescans are happening because of a routine fabric event: adding a path, losing a path, new device zoned in, etc.
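> For reference, the same kind of rescan can be triggered by hand, which may make the hang easier to reproduce (host0 is just an example; the real qlogic ports show up under /sys/class/fc_host):
>
>     echo 1 > /sys/class/fc_host/host0/issue_lip      # force a LIP / fabric rediscovery on that port
>     echo "- - -" > /sys/class/scsi_host/host0/scan   # rescan all channels, targets and LUNs on host0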
>
> When doing the individual node-local installs, how do you make sure everything is in sync?
>
>         Jonathan Smith
>
> On 06/13/2011 01:32 PM, D'Hooge Freek wrote:
>> Normally we use ASM for the database and install the Oracle binaries on local disks (no shared homes). NFS volumes are used for filesystems that need to be shared between the nodes. In 11g you could also use ACFS for that (though I have no real experience with it).
>>
>> The other option we use is NFS for the database as well.
>> In 11g with dNFS we get some very good performance by combining multiple network interfaces.
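>> For example, an oranfstab entry that lets dNFS load-balance over two interfaces looks roughly like this (the server name, addresses and paths are made up):
>>
>>     server: nfsfiler1
>>     path: 192.168.10.1
>>     path: 192.168.20.1
>>     export: /export/oradata  mount: /u02/oradata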
>>
>> But OCFS2 should work.
>> It is very strange that your qlogic driver rescans the fabric, and note that ASM will not protect you against this either. So the first thing I would do is investigate what is causing the rescanning.
>>
>>
>> Regards,
>>
>>
>> Freek D'Hooge
>> Uptime
>> Oracle Database Administrator
>> email: freek.dhooge_at_uptime.be
>> tel +32(0)3 451 23 82
>> http://www.uptime.be
>> disclaimer: www.uptime.be/disclaimer
>>
>> -----Original Message-----
>> From: oracle-l-bounce_at_freelists.org [mailto:oracle-l-bounce_at_freelists.org] On Behalf Of Jonathan Smith
>> Sent: Monday, June 13, 2011 22:34
>> To: oracle-l_at_freelists.org
>> Subject: storage and filesystems on linux (ocfs2, ASM...)
>>
>> We currently run ocfs2 1.4.x on RHEL 5, and we've been having huge problems. Our four-node cluster simultaneously reboots roughly once a week (sometimes more, sometimes less). The qlogic driver periodically rescans the fabric and blocks IO while that happens. Sometimes the rescan exceeds the ocfs2 timeout, and the cluster goes kaboom. We've increased the timeout values to absurd levels (more than a minute), and somehow it still happens.
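>> For context, the knobs in question are the o2cb ones in /etc/sysconfig/o2cb; the values below are illustrative, roughly what "more than a minute" translates to:
>>
>>     O2CB_HEARTBEAT_THRESHOLD=61     # disk heartbeat: (61 - 1) * 2 = 120 seconds before a node is fenced
>>     O2CB_IDLE_TIMEOUT_MS=60000      # network idle timeout between nodes, in milliseconds
>>     O2CB_KEEPALIVE_DELAY_MS=2000
>>     O2CB_RECONNECT_DELAY_MS=2000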
>>
>> So, my question is what other folks are doing for cluster storage on RHEL. We need the ability to share the database files as well as oracle software installs.
>>
>> I know ASM is an option, but we haven't investigated it yet. I think it is instance-specific, and thus can't be used for the Oracle installed files? Other options might be GFS2 (Red Hat's cluster filesystem), Veritas, or something else entirely.
>>
>> What are you folks using, and how do you like it?
>>
>>       Jonathan Smith
>>
>> --
>> http://www.freelists.org/webpage/oracle-l
>>
>>
>
> --
> http://www.freelists.org/webpage/oracle-l
>
>
>

--
http://www.freelists.org/webpage/oracle-l
Received on Mon Jun 13 2011 - 22:02:04 CDT
