Re: Unix Filesystem vs Raw Partition

From: Paul Zola <pzola_at_us.oracle.com>
Date: 9 Oct 1994 22:58:57 GMT
Message-ID: <379sji$j4f_at_dcsun4.us.oracle.com>


In article <3758ib$qii_at_ixnews1.ix.netcom.com> wdm2_at_ix.netcom.com (Henry Kwok) writes:
} Using Unix filesystem the data may still be
} in buffer while the DBMS thinks they are already written to disk. If
} the system crashes before the data is written out from buffer to disk,
} there may be serious data integrity problem. I've been to Sybase class
} and they all recommend using raw partition instead of file system
} because of this problem, not because of the gain in performance. Does
} Oracle architecture some how avoid this problem ?

Yes, it does. UNIX variants provide one of two ways to handle this problem: the O_SYNC file descriptor flag and the fsync() system call. (The former is usually found on System V-based systems, while the latter is usually found on BSD-based systems.) If a file descriptor has the O_SYNC flag set on it (either at open() time or by the fcntl() system call), then the UNIX buffer cache works as a write-through cache rather than a write-back cache for that descriptor. Applying the fsync() system call to a file descriptor causes all the dirty buffers in the UNIX buffer cache that belong to the open file to be written to disk by the time the system call returns. (Consult the Stevens book for further details.)
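To make the two mechanisms concrete, here is a minimal sketch in C. The function names open_write_through() and write_and_flush() are made up for illustration, and this is not taken from any Oracle source; it only shows the general shape of each approach on a system that offers both interfaces.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Approach 1 (hypothetical helper): open the file with O_SYNC so that
 * each write() on the descriptor returns only after the data has been
 * written through the buffer cache to the disk. */
int open_write_through(const char *path)
{
    return open(path, O_WRONLY | O_CREAT | O_SYNC, 0600);
}

/* Approach 2 (hypothetical helper): do ordinary buffered writes, then
 * call fsync() to force the file's dirty buffers out before returning.
 * Returns 0 on success, -1 on any write or fsync failure. */
int write_and_flush(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n <= 0)
            return -1;          /* treat a failed or empty write as an error */
        p += n;
        len -= (size_t)n;
    }
    return fsync(fd);           /* returns once the buffers have been written */
}
```

With the first helper the caller pays the synchronous cost on every write(); with the second the caller decides when to pay it, for example once per commit.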

The Oracle RDBMS uses one of these two system calls to guarantee that the data associated with a transaction has "hit the disk" by the time that the COMMIT returns. (I believe that Oracle will use the O_SYNC flag instead of the fsync() system call on platforms that support both interfaces.) On platforms that support asynchronous I/O, the Oracle kernel uses the "write succeeded/write failed" facilities of the asynch I/O subsystem to achieve the same result as turning on the O_SYNC flag.
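As an illustration of checking the "write succeeded/write failed" status of an asynchronous write, here is a hedged sketch using the POSIX AIO calls (aio_write, aio_suspend, aio_error, aio_return). This is an assumption chosen for the example: vendor asynch I/O interfaces differ, and nothing here is claimed to be what the Oracle kernel actually calls. The helper name durable_async_write() is hypothetical. The descriptor is opened with O_SYNC so that completion of the request also means the data is on disk. On some systems the program must be linked with -lrt.

```c
#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: write len bytes to the start of path with one
 * asynchronous request and return 0 only if the request reports success
 * for the full length. Returns -1 otherwise. */
int durable_async_write(const char *path, const void *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_SYNC, 0600);
    if (fd < 0)
        return -1;

    struct aiocb cb;
    memset(&cb, 0, sizeof cb);
    cb.aio_fildes = fd;
    cb.aio_buf = (void *)buf;
    cb.aio_nbytes = len;
    cb.aio_offset = 0;
    cb.aio_sigevent.sigev_notify = SIGEV_NONE;  /* we poll; no signal wanted */

    if (aio_write(&cb) == -1) {
        close(fd);
        return -1;
    }

    /* Block until the request is no longer in progress. */
    const struct aiocb *list[1] = { &cb };
    while (aio_error(&cb) == EINPROGRESS)
        aio_suspend(list, 1, NULL);

    int err = aio_error(&cb);       /* 0 on success, else an errno value */
    ssize_t n = aio_return(&cb);    /* byte count, as write() would return */
    close(fd);

    return (err == 0 && n == (ssize_t)len) ? 0 : -1;
}
```

A caller that acknowledges a commit only when this returns 0 gets the same guarantee as a synchronous write() on an O_SYNC descriptor, while being free to overlap other work between submitting the request and waiting for it.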

In any case, data integrity is guaranteed, provided there are no bugs in the OS vendor's implementation of O_SYNC, fsync() or asynch I/O.

Does Sybase really have problems with data integrity if using filesystem files instead of raw devices? (I have absolutely *no* knowledge about the Sybase implementation.) If so, it seems to me like a pretty glaring oversight, or perhaps (more charitably) a deliberate design decision.

If, however, anyone tells you that UNIX doesn't support reliable write operations on filesystem files, they are Just Plain Wrong.


Paul Zola                              pzola_at_oracle.com
Senior Technical Analyst -- Worldwide Technical Support

Received on Sun Oct 09 1994 - 23:58:57 CET
