RE: 64 node Oracle RAC Cluster (The reality of...)

From: Kevin Closson <kevinc_at_polyserve.com>
Date: Wed, 22 Jun 2005 13:42:44 -0700
Message-ID: <B9782AD410794F4687F2B5B4A6FF3501FAA9CF@ex1.ms.polyserve.com>

>Kevin,
>I don't know for the others... but I'd like to keep reading
>this thread and how it is evolving.
>The discussion is interesting.
>
>Fabrizio

This thread gets more traction in this forum than on suse-oracle, as Fabrizio and I can attest. It seems over there that any platform software, regardless of quality, is best as long as it is free and open source...which I find particularly odd when choosing a platform to host the most expensive (and most feature-rich) closed-source software out there (Oracle). Hmmm...

 So the thread is a technical comparison of cluster filesystem architectures. Or at least a tip-toe through the tulips on horseback.

On one side is the central locking and metadata approach of IBM GPFS, Sistina GFS, and Veritas CFS; on the other, the fully symmetric, distributed approach implemented by PolyServe on Linux and Windows.

The central approach is the easiest approach. Period. That does not make such filesystems useless. On the contrary, they are extremely good (better than PolyServe) at HPC workloads. When you compare more commercial-style workloads, like email, the distributed, symmetric approach bears fruit. Workloads like email are great for showing which CFS is general purpose and which isn't. See the following URLs for an independent test of an email system for hundreds of thousands of users comparing the various CFS technologies out there (for Linux):

http://www.polyserve.com/pdf/Caspur_CS.pdf
http://www.linuxelectrons.com/article.php/20050126205113614

Mladen asked about intricacies such as versioning. There is no such concept on the table. A CFS is responsible for keeping filesystem metadata coherent; applications are responsible for keeping file content coherent. Having said that, PolyServe supports positional locking, and we also maintain page cache coherency at per-file granularity. So, if two processes in the cluster use a non-cluster-aware program, like vi, and set out to edit the same file in the CFS, the last process to write the file wins. This is how vi works on a non-CFS, so this should be expected.
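
To make that concrete, here is a minimal sketch of the last-writer-wins behavior with a non-cluster-aware tool. The node names, mount point, and file name are hypothetical, and it assumes the CFS is mounted at the same path on every node:

$ rsh node1 'echo "edit from node1" > /mnt/psfs/notes.txt'
$ rsh node2 'echo "edit from node2" > /mnt/psfs/notes.txt'
$ cat /mnt/psfs/notes.txt    # only the second write remains

The CFS keeps the metadata and page cache coherent across nodes; it does not (and should not) merge the two edits.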

Oracle file access characteristics are an entirely different story. Here, the application is cluster-aware, so we've implemented a mount option for direct I/O (akin to the forcedirectio mount option in Solaris). With it, I/O requests are DMAed directly from the address space of the process to disk - without serialization or inode updates such as [ma]time. The value-add that we implemented, however, is what sets this approach apart.
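
For reference, the Solaris analogue mentioned above looks like the following; the device path is hypothetical, and the exact PolyServe option name is not spelled out here:

# as root on Solaris: force direct I/O for every file in a UFS filesystem
mount -F ufs -o forcedirectio /dev/dsk/c0t0d0s6 /u01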

In the same filesystem, mounted with the direct I/O option, you can have one process performing properly aligned I/Os through the direct I/O path (e.g., lgwr) while another process is doing unaligned, buffered I/O. This comes in handy, for instance, when you have a process like ARCH spooling the archived redo logs (direct I/O) followed by compress/gzip compressing down the file. Tools like compress nearly always produce an output file whose size is not a multiple of 512 bytes, so for that reason alone the compressed file cannot be written with direct I/O on any SCSI-based system. Lots of stuff to consider in making a comprehensive cluster platform for databases...
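
A quick way to see the alignment point for yourself (file names are hypothetical): compress an archived log and check whether the result is still a multiple of 512 bytes - it almost never is, so a plain O_DIRECT write of the whole file would be rejected, while a buffered copy of the same file succeeds.

$ gzip -c arch_0001.arc > arch_0001.arc.gz
$ sz=$(stat -c%s arch_0001.arc.gz)
$ echo "$sz bytes; multiple of 512: $(( sz % 512 == 0 ))"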

The concern about whether a good CFS can handle text mapping (executing binaries stored in the CFS) is a non-issue. The following example is from a small 10-node PolyServe Matrix (cluster). The test compares 1000 executions of the Pro*C executable on the CFS against the same run on a non-CFS (reiserfs in this case).

First, prove that the test binary (proc in this case) is the same inode in the CFS on all 10 nodes:

$ for i in 1 2 3 4 5 6 7 8 9 10; do rsh mxserv$i "ls -i $ORACLE_HOME/bin/proc"; done
2241437 /u01/app/oracle10/product/10.1.0/db_1/bin/proc
2241437 /u01/app/oracle10/product/10.1.0/db_1/bin/proc
2241437 /u01/app/oracle10/product/10.1.0/db_1/bin/proc
2241437 /u01/app/oracle10/product/10.1.0/db_1/bin/proc
2241437 /u01/app/oracle10/product/10.1.0/db_1/bin/proc
2241437 /u01/app/oracle10/product/10.1.0/db_1/bin/proc
2241437 /u01/app/oracle10/product/10.1.0/db_1/bin/proc
2241437 /u01/app/oracle10/product/10.1.0/db_1/bin/proc
2241437 /u01/app/oracle10/product/10.1.0/db_1/bin/proc
2241437 /u01/app/oracle10/product/10.1.0/db_1/bin/proc

Next, copy the proc executable to /tmp to get a baseline comparison of non-CFS (reiserfs) versus the PolyServe CFS:

$ cp $ORACLE_HOME/bin/proc /tmp
$ md5sum $ORACLE_HOME/bin/proc /tmp/proc
af42f080f2ddba7fe90530d15ac1880a  /u01/app/oracle10/product/10.1.0/db_1/bin/proc
af42f080f2ddba7fe90530d15ac1880a  /tmp/proc
$

Next, a quick script to fire off 1000 concurrent invocations of the binary passed as arg1:

$ cat t_proc
#!/bin/bash
# Fire off 1000 concurrent invocations of the binary passed as arg1.
# If arg2 is given, source the oracle environment first (for remote runs).

binary=$1
getenv=$2

[[ -n "$getenv" ]] && cd ~oracle && . ./.bash_profile

cnt=0
until [ $cnt -eq 1000 ]
do
        (( cnt = cnt + 1 ))
        ( $binary sqlcheck=FULL foo.pc > /dev/null 2>&1 ) &
done
###End script

Next, execute the script under time(1) to get the count of minor faults and the execution time. When executed as /tmp/proc (non-CFS), the cost is 1020884 minor faults and about 11.6 seconds of run time.

$ /usr/bin/time ./t_proc /tmp/proc
11.60user 10.42system 0:11.72elapsed 187%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+1020884minor)pagefaults 0swaps

Next, execute the script pointing to the Shared Oracle Home copy of the proc executable:

$ echo $ORACLE_HOME
/u01/app/oracle10/product/10.1.0/db_1
$ /usr/bin/time ./t_proc $ORACLE_HOME/bin/proc
11.43user 10.52system 0:11.08elapsed 198%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+1016753minor)pagefaults 0swaps

So, 1000 invocations parallelized as much as a dual-processor system can muster yield the same execution performance on the CFS as on the non-CFS.

Next, execute the script on 2, 4, and then 10 nodes in parallel. Note that the timing granularity is seconds, using the $SECONDS builtin variable; the echoed values are cumulative from the start of the script.

$ cat para_t_proc
# Run t_proc on 2, then 4, then 10 nodes concurrently.
# $SECONDS is cumulative, so each echo shows total elapsed time so far.
for i in 1 2
do
        rsh mxserv$i "/u01/t_proc $ORACLE_HOME/bin/proc GETENV" &
done
wait

echo $SECONDS

for i in 1 2 3 4
do
        rsh mxserv$i "/u01/t_proc $ORACLE_HOME/bin/proc GETENV" &
done
wait

echo $SECONDS

for i in 1 2 3 4 5 6 7 8 9 10
do
        rsh mxserv$i "/u01/t_proc $ORACLE_HOME/bin/proc GETENV" &
done
wait

echo $SECONDS

$ sh ./para_t_proc
11
22
34

So, parallel and cluster-concurrent execution of bits is 100% linearly scalable...as it should be. Otherwise, as I've ranted before, you would not be able to call it a CFS, or an FS at all for that matter :-)

--
http://www.freelists.org/webpage/oracle-l
Received on Wed Jun 22 2005 - 16:48:11 CDT
