Re: RAC on OCFS2 acceptance testing

From: Steve Perry <sperry_at_sprynet.com>
Date: Sat, 30 Dec 2006 11:17:57 -0600
Message-Id: <D9BDA7E7-229D-463D-8FBC-8C689F4DFCFB@sprynet.com>

A customer ran into similar problems with OCFS2 on RHEL4 Update 4 (smp kernel).
Heavy db updates or mixed I/O (cp from OCFS to ext3, Oracle export to ext3) would cause the cluster to become unresponsive and crash a node. Both cp and exp caused a high load average and heavy swapping. We couldn't even ssh to the host.
I didn't understand the heavy swapping, because free -m showed 3GB of cache memory available.
It had something to do with OCFS and low memory usage, but I never got a clear answer on it.

They ended up setting "vm.lower_zone_protection=100", which helped with the swapping issue.
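
For anyone who wants that setting to survive a reboot, here is a minimal Python sketch that sets a key in sysctl.conf-style text. The function name set_sysctl_line is hypothetical, and it assumes the usual "key = value" layout of /etc/sysctl.conf. The value 100 is simply what this customer used, not a recommendation, and the tunable only exists on kernels that ship it.

```python
def set_sysctl_line(conf_text, key, value):
    """Return conf_text with `key = value` set, replacing an existing entry if present."""
    new_line = f"{key} = {value}"
    out = []
    replaced = False
    for line in conf_text.splitlines():
        stripped = line.strip()
        # Leave comments and blank lines alone; only inspect "name = value" lines.
        if stripped and not stripped.startswith(("#", ";")) and "=" in stripped:
            name = stripped.split("=", 1)[0].strip()
            if name == key:
                if not replaced:
                    out.append(new_line)
                    replaced = True
                continue  # drop duplicate entries for the same key
        out.append(line)
    if not replaced:
        out.append(new_line)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    # Hypothetical sample content, not taken from the customer's system.
    sample = "# Kernel sysctl configuration\nkernel.sysrq = 0\n"
    print(set_sysctl_line(sample, "vm.lower_zone_protection", 100), end="")
```

The function only transforms text; writing the result back to /etc/sysctl.conf and loading it with "sysctl -p" would still have to be done as root.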

The fencing problem was attributed to the following init.ora parameters:

filesystemio_options     = asynch
disk_asynch_io           = TRUE

They were changed to:
disk_asynch_io=FALSE
filesystemio_options='DIRECTIO'
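
To check which of these two values a given init.ora actually carries, something like the following Python sketch could be used. The function names parse_init_ora and io_settings are hypothetical, and the parser assumes plain "name = value" lines with optional quotes and "#" comments. It does not handle instance prefixes such as "*." or multi-line values.

```python
def parse_init_ora(text):
    """Parse simple `name = value` lines from a pfile into a dict (names lowercased)."""
    params = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if "=" not in line:
            continue
        name, value = line.split("=", 1)
        # Strip whitespace and optional quotes around the value.
        params[name.strip().lower()] = value.strip().strip("'\"")
    return params


def io_settings(text):
    """Return the two I/O parameters discussed above, or None where unset."""
    params = parse_init_ora(text)
    return {
        "disk_asynch_io": params.get("disk_asynch_io"),
        "filesystemio_options": params.get("filesystemio_options"),
    }


if __name__ == "__main__":
    before = "filesystemio_options     = asynch\ndisk_asynch_io           = TRUE\n"
    after = "disk_asynch_io=FALSE\nfilesystemio_options='DIRECTIO'\n"
    print(io_settings(before))
    print(io_settings(after))
```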

Things have improved since.

I asked Oracle for a good document for OCFS2 and RAC and still haven't got a response.
I also asked for optimal kernel parameter settings for OCFS2.

The closest I got was the following list, but no values.

- vm.swappiness
- vm.lower_zone_protection
- vm.vfs_cache_pressure
- vm.dirty_ratio
- vm.dirty_background_ratio
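
Since no values came with that list, one starting point is to record what a node is currently running with. Below is a minimal Python sketch that reads the five tunables from /proc/sys. The names VM_PARAMS and read_vm_params are hypothetical, and a tunable that the running kernel does not expose is reported as None rather than treated as an error.

```python
import os

# The five tunables from the list above.
VM_PARAMS = [
    "vm.swappiness",
    "vm.lower_zone_protection",
    "vm.vfs_cache_pressure",
    "vm.dirty_ratio",
    "vm.dirty_background_ratio",
]


def read_vm_params(names=VM_PARAMS, proc_sys="/proc/sys"):
    """Return {name: value-or-None} by reading the matching files under proc_sys."""
    values = {}
    for name in names:
        # "vm.swappiness" maps to <proc_sys>/vm/swappiness
        path = os.path.join(proc_sys, *name.split("."))
        try:
            with open(path) as f:
                values[name] = f.read().strip()
        except OSError:
            # Not every kernel exposes every tunable.
            values[name] = None
    return values


if __name__ == "__main__":
    for name, value in read_vm_params().items():
        print(f"{name} = {value if value is not None else '(not available)'}")
```

Running it on each node before and after a change gives a simple record of what was actually in effect.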

I'm not sure about the "unbreakable" Oracle/Linux combo. I'd be happy if they focused on "stable" Oracle/Linux.

It comes back to "You get what you pay for". Customers think that Oracle spends as much money on the "freebies" (e.g. OCFS) as they do on the database.

my 2¢

P.S. I spend as much time on Bugzilla as I do on Metalink these days.

On Dec 28, 2006, at 11:14 AM, Kevin Closson wrote:

>
> And to point out that I'm not being obtuse,
> here is a snippet from
> http://oss.oracle.com/bugzilla/show_bug.cgi?id=822 :
>
>
> Environment:
> Linux x86-64 Redhat 4.0 Update 3
> OCFS2 1.2.3 3-node cluster.
> Problem:
> After installation, created two filesystems to be used for software.
> To limit timeout problems, increased the O2CB_HEARTBEAT_THRESHOLD to 31.
>
> During maintenance window, decided to use the OCFS2 filesystem
> to store a large backup file (about 5-10 gig file).
> SCP'ed the file from an outside server to node1 of the cluster
> using command "scp $file oracle@sachlp10:/ocfs2_fs1/".
>
> After a few minutes, node1 crashed.
> Did not find error messages on node1, but found them in
> /var/log/messages
> on node2:
>
> ...wow, sounds like a pretty aggressive workload, right?
> --
> http://www.freelists.org/webpage/oracle-l
>
>


Received on Sat Dec 30 2006 - 11:17:57 CST