Re: Tim Gorman's "...Cost-Based Optimizer.doc"

From: Mladen Gogala <mgogala_at_allegientsystems.com>
Date: Thu, 16 Jun 2005 11:56:56 -0400
Message-ID: <42B1A148.8020200@allegientsystems.com>

Marquez, Chris wrote:

>Tim Gorman's "The Search for Intelligent Life in the Cost-Based
>Optimizer.doc"
>
>
>One comment / question on OPTIMIZER_INDEX_COST_ADJ;
>It seems that on "NON-cached" filesystems (e.g. RAW, OCFS) that there
>would/do NOT be a great difference in "db file scattered reads" vs. "db
>file sequential reads" AVERAGE_WAITS...as every read from disk (on
>non-cached filesystem) is a *real* read from dusk...no OS buffer to
>help, no?
>
>Thanks,
>
>Chris Marquez
>Oracle DBA
>
>

Chris, things aren't that simple. To explain why, I must show the following:

NAME readv, writev - read or write a vector

SYNOPSIS *#include <sys/uio.h <http://www.die.net/doc/linux/include/sys/uio.h>>*

*int readv(int* /fd/*, const struct iovec **/vector/*, int* /count/*);* *int writev(int* /fd/*, const struct iovec **/vector/*, int* /count/*);*

*struct iovec {*

*__ptr_t* /iov_base/*;* /* Starting address */ *size_t* /iov_len/*;* /* Length in bytes */ *};*

DESCRIPTION *readv* reads data from file descriptor /fd/, and puts the result in the buffers described by /vector/. The number of buffers is specified by /count/. The buffers are filled in the order specified. Operates just like *read* except that data is put in /vector/ instead of a contiguous buffer.

*writev* writes data to file descriptor /fd/, and from the buffers described by /vector/. The number of buffers is specified by /count/. The buffers are used in the order specified. Operates just like *write* except that data is taken from /vector/ instead of a contiguous buffer.

RETURN VALUE On success *readv* returns the number of bytes read. On success *writev* returns the number of bytes written. On error, -1 is returned, and /errno/ is set appropriately.

    ERRORS *EINVAL*
    An invalid argument was given. For instance /count/ might be greater     than *MAX_IOVEC*, or zero. /fd/ could also be attached to an object     which is unsuitable for reading (for *readv*) or writing (for     *writev*).
*EFAULT*
    "Segmentation fault." Most likely /vector/ or some of the /iov_base/     pointers points to memory that is not properly allocated. *EBADF*
    The file descriptor /fd/ is not valid. *EINTR*
    The call was interrupted by a signal before any data was read/written. *EAGAIN*
    Non-blocking I/O has been selected using *O_NONBLOCK* and no data     was immediately available for reading. (Or the file descriptor /fd/     is for an object that is locked.)
*EISDIR*
    /fd/ refers to a directory.
*EOPNOTSUPP*
    /fd/ refers to a socket or device that does not support     reading/writing.
*ENOMEM*
    Insufficient kernel memory was available.

Other errors may occur, depending on the object connected to /fd/.

CONFORMING TO 4.4BSD (the *readv* and *writev* functions first appeared in BSD 4.2), Unix98. Linux libc5 uses *size_t* as the type of the /count/ parameter, which is logical but non-standard.

SEE ALSO

*read <http://www.die.net/doc/linux/man/man2/read.2.html>*(2), *write 
<http://www.die.net/doc/linux/man/man2/write.2.html>*(2), *fprintf 
<http://www.die.net/doc/linux/man/man3/fprintf.3.html>*(3), *fscanf 
<http://www.die.net/doc/linux/man/man3/fscanf.3.html>*(3)

REFERENCED BY

*recv <http://www.die.net/doc/linux/man/man2/recv.2.html>*(2), *recvfrom 
<http://www.die.net/doc/linux/man/man2/recvfrom.2.html>*(2), *recvmsg 
<http://www.die.net/doc/linux/man/man2/recvmsg.2.html>*(2), *socket 
<http://www.die.net/doc/linux/man/man7/socket.7.html>*(7), *syscalls 
<http://www.die.net/doc/linux/man/man2/syscalls.2.html>*(2)

As you can see, "readv" or "scattered read" is a single call to the OS and the wait is calculated as the time to completion. Full table scan will attempt to read as many as DB_FILE_MULTIBLOCK_READ_COUNT blocks from the file in various DB Block buffers in SGA (they are memory elements of the "vector"). Internally, OS might break this request into several I/O requests and complete everything as the controller scheduler (yes, disk controllers have I/O schedulers, usually using SPF type schedule ("Shortest Path First")). Your wait time is OS overhead+time to completion of EACH I/O request scheduled by OS to satisfy the call to readv. That may, on average, be significantly more then time necessary to satisfy a single block I/O, which is done by calling "read" or "write" (no trailing "v"). Your oracle waiting times will be reflecting this and blur any comparison. In short, you cannot compare. In the old tuning books, when DBA used to be much more conscientious of the underlying OS, as the best DBA staff was evolving from the primordial mud of system administration, there was an advice that size of DB_FILE_MULTIBLOCK READ_COUNT shouldn't ever exceed the maximum size of a single, atomic I/O that your OS
is capable of performing, but that piece of advice is lost in the newer books. It was a moot point anyway, because I/O could be broken because of the fragmentation as well.

-- 
Mladen Gogala
Oracle DBA
Ext. 121


--
http://www.freelists.org/webpage/oracle-l

Received on Thu Jun 16 2005 - 12:10:26 CDT