Re: RAC on OCFS2 acceptance testing

From: Mladen Gogala
Date: Sat, 30 Dec 2006 17:42:37 -0500

Comments in-line:

On 12/30/2006 12:17:57 PM, Steve Perry wrote:
> A customer ran into a similar problem(s) with OCFS2 and RHEL4 upd 4
> (smp kernel).
> heavy db updates or mixed io (cp from ocfs to ext3, oracle export to
> ext3) would cause the cluster to become unresponsive and crash a node.
> cp and exp caused a high load avg and heavy swapping. We couldn't
> even ssh to the host.
> I didn't understand the heavy swapping because there was 3GB of cache
> mem available (shown by free -m).
> something to do with ocfs and low mem usage. I never got a clear
> answer on it.
> they ended up setting "vm.lower_zone_protection=100" which helped the
> swapping issue.

The vm.lower_zone_protection parameter makes a certain portion of physical memory non-pageable. On MVS, this used to be known as the "VIRTUAL=REAL boundary". Conveniently, the units are approximately megabytes, which means that you precluded about 100M of memory from being pageable. In particular, it means that the OCFS kernel module will not be able to allocate user buffers from the memory below the 100M boundary. The reason for that is a set of "features" in the Linux kernel, more or less openly admitted in the documentation for this parameter. Here is an excerpt from the documentation:



For some specialised workloads on highmem machines it is dangerous for the kernel to allow process memory to be allocated from the "lowmem" zone. This is because that memory could then be pinned via the mlock() system call, or by unavailability of swapspace.

And on large highmem machines this lack of reclaimable lowmem memory can be fatal.

So the Linux page allocator has a mechanism which prevents allocations which _could_ use highmem from using too much lowmem. This means that a certain amount of lowmem is defended from the possibility of being captured into pinned user memory.

(The same argument applies to the old 16 megabyte ISA DMA region. This mechanism will also defend that region from allocations which could use highmem or lowmem).

The `lower_zone_protection' tunable determines how aggressive the kernel is in defending these lower zones. The default value is zero - no protection at all.

If you have a machine which uses highmem or ISA DMA and your applications are using mlock(), or if you are running with no swap then you probably should increase the lower_zone_protection setting.

The units of this tunable are fairly vague. It is approximately equal to "megabytes". So setting lower_zone_protection=100 will protect around 100 megabytes of the lowmem zone from user allocations. It will also make those 100 megabytes unavailable for use by applications and by pagecache, so there is a cost.

> The fencing problem was attributed to the following init.ora parms.
> filesystemio_options = asynch
> disk_asynch_io = TRUE
> they were changed to:
> disk_asynch_io=FALSE
> filesystemio_options='DIRECTIO'

Neither OCFS nor OCFS2 supports asynchronous I/O. Both allow only direct I/O. If you attempt to use asynchronous I/O, you may crash your system or your database. That is well documented on the OCFS site.
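
For illustration only, here is a minimal Python sketch that checks init.ora-style text against the two settings quoted above. The helper names (parse_init_ora, async_io_problems) are hypothetical and are not part of any Oracle tool. The parser assumes plain name=value lines and does not handle every syntax Oracle accepts.

```python
# Settings the thread reports as safe on OCFS/OCFS2 (taken from the quoted message).
EXPECTED = {"disk_asynch_io": "FALSE", "filesystemio_options": "DIRECTIO"}


def parse_init_ora(text):
    """Parse simple name=value lines from init.ora-style text (hypothetical helper)."""
    params = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and surrounding blanks
        if "=" not in line:
            continue
        name, value = line.split("=", 1)
        # Accept an optional "*." or "SID." prefix as seen in exported parameter files.
        name = name.strip().lower().rsplit(".", 1)[-1]
        params[name] = value.strip().strip("'\"").upper()
    return params


def async_io_problems(text):
    """Return one message per parameter that differs from EXPECTED."""
    params = parse_init_ora(text)
    return [
        f"{name} = {params.get(name, '<unset>')}, expected {want}"
        for name, want in EXPECTED.items()
        if params.get(name) != want
    ]
```

Feeding it the original settings from the quoted message (filesystemio_options = asynch, disk_asynch_io = TRUE) yields two messages; the changed settings yield an empty list.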

> Things have improved since.
> I asked Oracle for a good document for OCFS2 and RAC and still
> haven't got a response.
> I also asked for optimal kernel parameter settings for OCFS2.
> The closest I got was the following list, but no values.
> - vm.swappiness
> - vm.lower_zone_protection
> - vm.vfs_cache_pressure
> - vm.dirty_ratio
> - vm.dirty_background_ratio

Here we have to deal with the fact that the Linux kernel is less than perfect, to say the least. Of those parameters, swappiness and vfs_cache_pressure are so-called "composite parameters" which regulate a "tendency", which means that you don't get an accurate parameter description without plunging into the kernel code. Both of these parameters regulate the "aggressiveness" of the OS in swapping/page stealing or in replacing inodes and directory entries. I find them best set to 0. I was playing with "swappiness" and found that it will turn on aggressive page swapping, which will slow down your system. Dirty ratio and background ratio are parameters for modified page write-back. Due to Linux kernel problems, you don't have any tools that would help you diagnose problems with page write-back. You don't have anything even remotely like the VMS "monitor page" and "monitor pool" commands. Linux itself is inferior to SYSVR4 Unix derivatives like AIX or HP-UX and is certainly inferior to Solaris, which took a significant step ahead and removed itself from the pervasive SYSVR4 standard. Without being able to monitor their effects, those parameters should be left alone.
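
If you do decide to touch any of them, at least record the current values first. The following is a minimal Python sketch, assuming the tunables are exposed as files under /proc/sys/vm (the function name read_vm_tunables is made up for this example). It only reads the values and diagnoses nothing; a tunable that does not exist on a given kernel is reported as None.

```python
import os

# The five tunables named in the quoted list.
TUNABLES = [
    "swappiness",
    "lower_zone_protection",
    "vfs_cache_pressure",
    "dirty_ratio",
    "dirty_background_ratio",
]


def read_vm_tunables(names=TUNABLES, root="/proc/sys/vm"):
    """Return {name: value-as-string, or None if the file cannot be read}."""
    values = {}
    for name in names:
        try:
            with open(os.path.join(root, name)) as f:
                values[name] = f.read().strip()
        except OSError:
            # Not every kernel version exposes every tunable.
            values[name] = None
    return values


if __name__ == "__main__":
    for name, value in read_vm_tunables().items():
        shown = value if value is not None else "(not present on this kernel)"
        print(f"vm.{name} = {shown}")
```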

A parameter that you should set to at least 15% of your memory, to ensure ease of memory allocation, is vm.min_free_kbytes. This parameter sets the target amount of memory for the page replacement daemon to keep free and available for "malloc" calls.
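
To make the arithmetic concrete, here is a rough Python sketch that derives the 15% figure from /proc/meminfo-style text. The function names are invented for this example, and the 15% default is simply the figure suggested above, not a value taken from any documentation.

```python
def mem_total_kb(meminfo_text):
    """Return the MemTotal value (in kB) from /proc/meminfo-style text."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            return int(line.split()[1])
    raise ValueError("MemTotal line not found")


def suggested_min_free_kbytes(meminfo_text, percent=15):
    """Return percent of MemTotal in kB, rounded down (integer arithmetic)."""
    return mem_total_kb(meminfo_text) * percent // 100


def suggested_from_system(path="/proc/meminfo", percent=15):
    """Same calculation, reading the running system's meminfo file."""
    with open(path) as f:
        return suggested_min_free_kbytes(f.read(), percent)
```

For a machine reporting "MemTotal: 4194304 kB" (4 GB), suggested_min_free_kbytes returns 629145, which is the number you would put after "vm.min_free_kbytes =".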

> I'm not sure about "unbreakable" Oracle/Linux combo. I'd be happy if
> they focused on "stable" Oracle/Linux.

They first have to make Linux as capable as the other operating systems and add instrumentation for monitoring and diagnostics. When that is done, the first Mogens theorem will apply:
"Anything that is sufficiently instrumented is obsolete". That is the theorem from the Oak Table book.

> It comes back to "You get what you pay for". Customers think that
> Oracle spends as much money on the "freebies" (i.e. OCFS) as they do
> the database.

That has always been the case. Anybody in their right mind should expect to pay for decent things. There is a reason why a Cadillac costs more than a Yugo.

Mladen Gogala

Received on Sat Dec 30 2006 - 16:42:37 CST
