Re: ZFS or UFS? Solaris 11 or better stay with Solaris 10?

From: kyle Hailey <kylelf_at_gmail.com>
Date: Fri, 30 Mar 2012 14:38:00 -0700
Message-ID: <CADsdiQjg9wHKMY9pxCe6-HUzJAXqu2uQ6ULkB48ndF9KogZyfA_at_mail.gmail.com>



The Oracle paper is good.
I work exclusively with Oracle databases on ZFS, and in general it works well. I've listed below the issues I've run into and the solution for each of them. There are many useful options that are only available on ZFS, and the ZFS community is quite active in improving, fixing and adding functionality. There is a lot going on under the hood, and I'm still quite new to all the possibilities in ZFS.

A few things to keep in mind:

For best write performance the pool should be kept under roughly 80% full, i.e. leave at least 20% free space.

ZFS by default writes data twice, similar to Oracle: Oracle writes to the redo logs and then to the datafiles, and ZFS likewise writes to the ZIL (its redo) and then to the files themselves. This can double the amount of write I/O, which can be confusing when benchmarking. It can be modified with logbias, as shown in the Oracle paper: set the datafile filesystems to logbias=throughput so only metadata goes through the ZIL (a bit like nologging operations in Oracle) and the data is written once, directly to the data files. Put the redo logs in logbias=latency mode so they commit through the ZIL; this is faster for small synchronous writes but does cause the double writes.
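
As a rough sketch of those settings (the dataset names dbpool/data and dbpool/redo are just placeholders for your own layout):

  # datafiles: only metadata goes through the ZIL, data is written once
  zfs set logbias=throughput dbpool/data
  # redo logs: commit through the ZIL for low-latency synchronous writes
  zfs set logbias=latency dbpool/redo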

My experience is that ZFS read ahead and caching work well, especially for DSS queries. I can't say for sure, but ZFS appears to be aggressive with read ahead and caching. If you want less read ahead, e.g. you are only doing random 8k reads, you can turn it off with the zfs_prefetch_disable tunable:
http://forums.freenas.org/archive/index.php/t-1076.html
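
On Solaris this is a kernel tunable; a sketch of the two usual ways to set it:

  * in /etc/system (takes effect after a reboot):
  set zfs:zfs_prefetch_disable = 1

  # or live with mdb, without a reboot (not persistent):
  echo zfs_prefetch_disable/W0t1 | pfexec mdb -kw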

Two rare but problematic issues come to mind:

  1. The ZFS ARC (the file system cache) stopped caching and spent all its time kicking pages out. There is a fix, but I'm not sure if it's out in the open source community yet. To monitor, run

       echo '::arc ! egrep -w "c_min|c_max|size|arc_no_grow"' | pfexec mdb -k

     The problem shows up as arc_no_grow set to 1 (i.e. "don't grow the ARC") while the ARC size stays well under the available memory. I've seen this happen maybe 4 times across hundreds of systems; if it happens it may require a reboot.
  2. Write throughput dropped drastically after a flurry of disk errors. The disk errors were the core problem, but it turned out that ZFS became overly protective and throttled writes down too far. Write throttling can be turned off with the zfs_no_write_throttle tunable. I've seen this happen once, but it was quite confusing at the time. You should be able to monitor what ZFS thinks the write limit and throughput are with dtrace:

#!/usr/sbin/dtrace -s
/* replace "domain0" with the name of your pool */
dsl_pool_sync:entry
/stringof(args[0]->dp_spa->spa_name) == "domain0"/
{
        self->dp = args[0];
}

dsl_pool_sync:return
/self->dp/
{
        printf("write_limit %d, write_throughput %d\n",
            self->dp->dp_write_limit,
            self->dp->dp_throughput);
        self->dp = NULL;
}
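
To run it, one option (my sketch, not from the original post) is to save the script above to a file, e.g. dsl_pool_sync.d, and run it as root:

  pfexec dtrace -qs dsl_pool_sync.d

The throttle itself is a kernel tunable; if you do need to disable it, my understanding (an assumption on the module prefix -- check for your release) is that it goes in /etc/system like the other zfs tunables:

  set zfs:zfs_no_write_throttle = 1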

Another thing to be aware of is that a ZFS scrub can show up as a lot of reads when the filesystem would otherwise be idle. The scrub should back off strongly when user load comes on, to give priority to other I/O. A running scrub can be stopped with
zpool scrub -s poolname
but this should only be temporary, as it's important that scrubs run regularly (e.g. weekly).
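
For reference, kicking a scrub off again later and checking on it uses the standard zpool commands (poolname is a placeholder):

  zpool scrub poolname      # start a scrub, e.g. from a weekly cron job
  zpool status poolname     # shows scrub progress and last completion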

There is no direct I/O per se; data will get cached in the ARC. If you want to turn off caching and approximate direct I/O (not suggesting this, but it's useful for testing the actual back-end disks), you can set caching off:
  zfs set primarycache=none poolname
Note that this still leaves anything that is already in the cache; you'd have to export and re-import the pool to clear the cache of an existing pool.
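
A sketch of forcing a cold cache on an existing pool (everything using the pool, including the database, has to be shut down first):

  zpool export poolname
  zpool import poolname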

ZFS will also tell the disks to flush their caches, which matters when there is a cache in the path, as an internal array usually has. This can cause problems if that cache is battery backed and the array honours the flush as a forced write to disk, which defeats the point of the cache. The cache-flush call can be turned off with
zfs_nocacheflush
I think the Oracle paper discusses this.
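
For completeness, this is also a Solaris kernel tunable; a sketch of the /etc/system setting (only if you are certain the array cache is non-volatile):

  set zfs:zfs_nocacheflush = 1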

The properties you want to set (per dataset/filesystem rather than per pool) are the following; a sketch of the corresponding zfs set commands follows the list:

  • compression - on or off, up to you
  • logbias - latency for redo, throughput for datafiles
  • recordsize - the database block size for datafiles, 128K for everything else; the Oracle paper gives recommendations
  • primarycache - all, except for archive logs (and possibly UNDO), which can be set to metadata
  • secondarycache - all for datafiles, none for the others (probably)
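
A rough sketch of what that could look like with one dataset per file type (dataset names and some values are illustrative, not from the Oracle paper; match recordsize to your db_block_size):

  zfs set compression=on        dbpool/data    # optional, test the CPU cost
  zfs set recordsize=8k         dbpool/data    # = db_block_size
  zfs set logbias=throughput    dbpool/data
  zfs set secondarycache=all    dbpool/data
  zfs set recordsize=128k       dbpool/redo
  zfs set logbias=latency       dbpool/redo
  zfs set secondarycache=none   dbpool/redo
  zfs set primarycache=metadata dbpool/arch    # archive logs: metadata only
  zfs set secondarycache=none   dbpool/arch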

ZFS also bases some of its calculations on LUNs, so if you give ZFS a few LUNs that represent many back-end spindles, some of the I/O queue calculations can be off. I believe the Oracle paper goes into this. It has never been a problem that I've seen directly, though I have heard about it.

As in all I/O systems, block alignment is important. Unfortunately that is the point in this list I'm weakest on.

Comments, additions, corrections welcome as this is all new to me :)

  • Kyle

On Thu, Mar 29, 2012 at 9:49 AM, GG <grzegorzof_at_interia.pl> wrote:

> On 2012-03-28 15:34, De DBA wrote:
> > G'day,
> ZFS is a very mature and database-friendly filesystem as long as you
> follow the rules :) mentioned here
> http://developers.sun.com/solaris/docs/wp-oraclezfsconfig-0510_ds_ac2.pdf
> :)
>
> Regards
> GregG
>
> --
> http://www.freelists.org/webpage/oracle-l


--
http://www.freelists.org/webpage/oracle-l
Received on Fri Mar 30 2012 - 16:38:00 CDT
