Re: Raw Devices: Increased Performance?

From: Paul Zola <pzola_at_us.oracle.com>
Date: 1996/07/15
Message-ID: <4scrou$2ng_at_inet-nntp-gw-1.us.oracle.com>


In <$x2BSJAQKW4xEwZw_at_smooth1.demon.co.uk> David Williams <djw_at_smooth1.demon.co.uk> writes:

} In article <4rpsh3$s99_at_inet-nntp-gw-1.us.oracle.com>, Paul Zola
} <pzola_at_us.oracle.com> writes
} >

 [snip]
} >(3) Under very common circumstances, going to raw devices can actually
} > *decrease* database performance.
}
} ??? Explain - Not going with the UNIX buffer cache and also copying
} (DMAing) directly into user space rather than via kernel space i.e.
} one memory write rather than a memory write and a memory copy is
} SLOWER??
Short answer: yep.

Flippant answer: Trying to reduce expected disk read times by

    reducing the number of memory copies is like trying to optimize
    a program by moving code from the initialization modules into
    the main loop.

Medium-length serious answer: the UNIX kernel is very good at

    optimizing file I/O. In particular, the kernel has the ability
    to optimize I/O by scheduling read and write requests in a way
    that minimizes disk seeks and rotational latency, which are the
    big bottlenecks in any I/O system.

Before answering this in detail, I want to address another issue you raised, in part because my response to it will help explain how the buffer cache can win against raw devices.

You write:

} Agreed - fragmentation does decrease performance dramatically but how
} can filesystems be faster - loading inodes into memory and
} following inodes means more seeking since inodes are not that close to
} the data even with cylinder groups in use. You would expect at least
} one seek from inode table to the data which is not required when
} using raw devices.

 [snip]
} How can it be faster???

I'm afraid that this passage betrays a certain lack of understanding of the UNIX kernel.

The inodes for open files are cached in memory, for as long as the file is open. Since the Oracle background processes open the database files when the database is mounted, and keep them open until the database is un-mounted, all kernel accesses to the inode are through the in-core copy (which, of course, requires no seek time).
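If it helps to see the pattern in code, here's a minimal sketch of what the background processes do (the path name and block size are made up for illustration, and real error handling is omitted). The point is that the inode lookup happens exactly once, at open() time:

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        /* One open() per datafile, at mount time: this is the only
         * point where the on-disk inode must be located and read.
         * From here on, it is cached in core. */
        int fd = open("/u01/oradata/system01.dbf", O_RDWR);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        /* Every subsequent block access goes through the in-core
         * inode; no seek to the inode table is needed. */
        char block[8192];
        if (pread(fd, block, sizeof block, 0) < 0)
            perror("pread");

        /* The descriptor stays open until the database is
         * un-mounted; only then is the in-core inode released. */
        close(fd);
        return 0;
    }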

A somewhat more relevant objection (which you did not raise) is that Oracle database files are big enough to require indirect blocks, and that accessing the data in the indirect blocks would require an extra disk access.

In practice, this extra disk access occurs very rarely. Since the indirect block is a filesystem block, the normal LRU caching algorithms apply to it as well. This means that the indirect blocks for any reasonably frequently-accessed database file will already be in the cache, and will not require an extra I/O. On a typical BSD filesystem, using 8k blocks & 4-byte disk addresses, any access within a 16-meg region of the file will move the indirect block to the head of the LRU queue.

(Why a 16 meg region? Here's the math: A single indirect block, 8k long, can contain 2048 4-byte disk addresses. Each one of these disk addresses points to an 8k disk block. This means that a single indirect block contains disk addresses for 8 * 1024 * 2048 bytes, or 16 meg.)
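Here's a toy model of the LRU behavior just described, in case it isn't obvious why the indirect block stays resident (the cache size and block numbers are arbitrary, and a real buffer cache is hashed rather than linearly searched). Because every data-block read touches the indirect block first, that block keeps getting pulled back to the head of the queue and never ages out:

    #include <stdio.h>
    #include <string.h>

    #define CACHE_SLOTS 8

    static long cache[CACHE_SLOTS];   /* cache[0] is the head (MRU) */
    static int used;

    /* Return 1 on a cache hit, 0 on a miss (a simulated disk read);
     * either way the block ends up at the head of the queue. */
    static int touch(long blkno)
    {
        int i, hit = 0;

        for (i = 0; i < used; i++)
            if (cache[i] == blkno) {
                hit = 1;
                break;
            }
        if (!hit)     /* take a free slot, or evict the tail (LRU) */
            i = (used < CACHE_SLOTS) ? used++ : CACHE_SLOTS - 1;
        memmove(&cache[1], &cache[0], i * sizeof cache[0]);
        cache[0] = blkno;
        return hit;
    }

    int main(void)
    {
        long indirect = 99;       /* stands in for the indirect block */
        long blk;

        for (blk = 1000; blk < 1020; blk++) {
            printf("indirect block: %s   ",
                   touch(indirect) ? "hit " : "MISS");
            printf("data block %ld: %s\n", blk,
                   touch(blk) ? "hit" : "miss (disk read)");
        }
        return 0;
    }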

Finally, you seem to be under the impression that the UNIX kernel schedules disk I/O sequentially: first it reads the indirect block from disk, then it reads the pointed-to block from disk, and then it goes on to service another I/O request. This is not the case. Instead, the UNIX kernel is multi-threaded internally: disk I/O processing occurs asynchronously with respect to user processing, in a separate thread of control. (Note to any kernel hackers reading this: yes, I do know that the traditional UNIX implementation isn't really threaded; I do believe this to be the cleanest conceptual explanation.)

The disk I/O "thread" within UNIX schedules I/O using an "up/down elevator algorithm". The disk scheduler works something like this: when the kernel determines that a request has been made to read or write a disk block, it places a request for that I/O operation into a queue of requests for that particular disk device. Requests are kept in order, sorted by *disk block address*, and not by time of request. The disk driver then services the requests based on the location of the block on disk, and not on the order in which the requests were made.

The "up/down elevator algorithm" maintains a conceptual "hand", much like the hand on a wristwatch. This hand continually traverses the list of I/O requests: first moving from low disk addresses to high disk addresses, and then from high addresses down to low -- hence, "up/down elevator". As the "hand" moves up and down the list of requests, it services the requests in an order which minimizes disk seek times.

It's perhaps easier to conceptualize this based on a disk with a single platter: the "hand" moving up and down the list of disk block addresses exactly corresponds to the disk head seeking in and out on the drive. (On real-life multi-head drives, the sorting algorithm needs to be tuned so that all disk blocks on the same cylinder sort together. This, incidentally, is why mkfs and tunefs care about the number of sectors per track and the number of tracks per cylinder.)
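Before walking through an example, here's a toy model of one sweep, just to make the bookkeeping concrete (the block numbers are arbitrary, and this is a sketch of the idea, not the actual kernel driver code: the real driver also merges newly arriving requests into the queue while servicing it):

    #include <stdio.h>
    #include <stdlib.h>

    static int cmp(const void *a, const void *b)
    {
        long x = *(const long *)a, y = *(const long *)b;
        return (x > y) - (x < y);
    }

    /* Service a batch of queued requests with one up sweep and one
     * down sweep of the "hand"; return the total seek distance. */
    static long elevator(long hand, long *req, int n)
    {
        long total = 0;
        int i, up;

        /* the queue is sorted by block address, not arrival time */
        qsort(req, n, sizeof *req, cmp);

        /* find the first request at or beyond the hand's position */
        for (up = 0; up < n && req[up] < hand; up++)
            ;

        for (i = up; i < n; i++) {        /* up sweep: ascending */
            printf("service block %ld (seek %ld)\n",
                   req[i], labs(req[i] - hand));
            total += labs(req[i] - hand);
            hand = req[i];
        }
        for (i = up - 1; i >= 0; i--) {   /* down sweep: descending */
            printf("service block %ld (seek %ld)\n",
                   req[i], labs(req[i] - hand));
            total += labs(req[i] - hand);
            hand = req[i];
        }
        return total;
    }

    int main(void)
    {
        long req[] = { 21345, 21643, 4000, 30000, 100 };

        printf("total seek distance: %ld blocks\n",
               elevator(15000, req, 5));
        return 0;
    }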

Here's an example based on your (incorrect) example above. Say that we're opening a file for the first time. The kernel gets a request to read the inode into core, and the inode is at absolute disk block number 21345. This request goes into the request queue, and the requesting process is put to sleep. Some time later, the "hand" passes this disk block number, going from higher to lower disk addresses. The process that requested the inode at block 21345 is woken up and tries to read() the first block of that file. The kernel, acting on behalf of that process, determines that the first block of the file is located on disk block 21643, puts the request into the request queue, and puts the requesting process to sleep again. Since the "hand" is moving towards lower addresses, the read request for block 21643 won't be serviced until the "hand" has finished its downward sweep, servicing every queued request below the current position, and then climbed back up past every queued request below 21643.

Why use this algorithm? It can be shown that in the presence of multiple processes making random I/O requests, some variant of the elevator algorithm yields near-minimal average seek times without letting any request starve, and thus the greatest sustained throughput.

}

} How can it be faster???
}

So after having gone through all that, here's the long answer to why using raw devices can be slower than the filesystem.

The important fact about disk optimization is this: disk seek times and rotational delays are measured in milliseconds, while RAM access times are measured in nanoseconds. This means that to maximize disk I/O throughput, the three most important optimizations are (in order):

    (1) Avoid I/O altogether (through caching or some other mechanism).
    (2) Reduce seek times.
    (3) Reduce rotational delays.

Indeed, if we perform 500 memory-to-memory copies, and thereby avoid 1 seek, we're still ahead of the game.
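To put rough numbers on that claim (the per-copy cost below is an assumption for illustration; the 100ms figure is the maximum seek time used in the example that follows):

    #include <stdio.h>

    int main(void)
    {
        double seek_ms = 100.0;  /* max seek time, from the example below */
        double copy_ms = 0.05;   /* assumed ~50 microseconds to copy 8k */

        printf("copies affordable per avoided seek: %.0f\n",
               seek_ms / copy_ms);
        return 0;
    }

Even if you make the copy cost an order of magnitude more pessimistic, avoiding a single seek still pays for hundreds of memory-to-memory copies.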

Let's consider an Oracle database with the datafiles installed on raw devices. Let us say (to avoid the effects of datafile placement) that there's only one tablespace on this disk, and it's only used for indexes. Let us also say that this disk has 1024 cylinders, a maximum seek time of 100ms, and (just to give the raw device every advantage) a track buffer, so that there is no rotational latency delay. Finally, let's say that there are multiple users performing queries and updates simultaneously, and that these work out to 2 processes reading and one process (DBWRITER) writing to this datafile at the same time.

Let us further consider the disk access pattern of these three processes. (In case it isn't clear -- these are the I/Os Oracle must issue when a needed block isn't already in the SGA.) Let's say that over a time interval, they access disk blocks in the following pattern:

    read cylinder 1; read cylinder 1024; write cylinder 2;
    read cylinder 1023; read cylinder 2; read cylinder 1022;
    write cylinder 3; read cylinder 1021; write cylinder 4.

How much time did this take to complete? Well, since raw device accesses are scheduled in the order in which they occur, this disk had to seek almost from one end to the other 8 times, for a total of 800 ms.

Now, let's consider the same access pattern on a filesystem file, using the buffer cache. Assuming that these requests arrived in fairly quick succession, they could all be handled in 3 scans of the request queue (two going up, one coming down) for a total of 300 ms. (Note that since UNIX has a write-back cache, the final write didn't necessarily force a write to disk: yet another optimization made possible by using filesystem files.)

Result? Filesystem wins by more than 100% over raw device.
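You can check the arithmetic with a few lines of C (the cylinder numbers are from the pattern above; the single-sweep figure is the best case, since the 3-scan estimate above allows for the requests trickling in over time):

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        /* the nine accesses, in arrival order */
        long cyl[] = { 1, 1024, 2, 1023, 2, 1022, 3, 1021, 4 };
        int n = sizeof cyl / sizeof cyl[0], i;
        long fifo = 0, pos = cyl[0];

        /* raw device: service strictly in arrival order */
        for (i = 1; i < n; i++) {
            fifo += labs(cyl[i] - pos);
            pos = cyl[i];
        }
        printf("raw device (FIFO):    %ld cylinders of seeking\n", fifo);

        /* buffer cache: an elevator sweep over the same batch
         * crosses the disk roughly once */
        printf("elevator (one sweep): %d cylinders of seeking\n",
               1024 - 1);
        return 0;
    }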

How did this happen? It's a direct result of the access pattern of the database. The UNIX filesystem is optimized for random access, while raw devices work better for sequential access.

Note that if there weren't a track buffer on the hard disk, we could have made the result even more lopsided for the filesystem by having the access pattern force rotational delays when accessing the raw device.

This is why I say:

} >(4) Anyone contemplating going to raw devices should benchmark their
} > application on both raw and filesystem devices to see if there is
} > a significant performance increase in using raw devices.
}
} Not sure it's necessary.

But we've just seen that it is. Let me emphasize: whether raw or filesystem devices will be faster depends DIRECTLY on the disk access patterns of *your* *specific* *application*. The exact same schema could be faster on one or the other, depending on what your users are doing with it.

This is not to say that using raw devices will always be slower. But they are far from the "magic bullet" that lots of DBAs seem to think they are.

And I'll say it again:

} >(5) Anyone who does not have the time, expertise, or resources to
} > perform the raw-vs-filesystem benchmark *should* *not* *consider*
} > using raw devices.

        -paul

References:

    `Operating Systems: Design and Implementation', Andrew S. Tanenbaum,
      Prentice-Hall, ISBN 0-13-637406-9
    `The Design and Implementation of the 4.3BSD UNIX Operating System',
      Samuel Leffler, Kirk McKusick, Michael Karels, John Quarterman,
      1989, Addison-Wesley, ISBN 0-201-06196-1
    `The Design of the UNIX Operating System', Maurice Bach, 1986,
      Prentice-Hall, ISBN 0-13-201757-1

==============================================================================
Paul Zola            Technical Specialist         World-Wide Technical Support
------------------------------------------------------------------------------
GCS H--- s:++ g++ au+ !a w+ v++ C+++ UAV++$ UUOC+++$ UHS++++$ P+>++ E-- N++ n+
W+(++)$ M+ V- po- Y+ !5 !j R- G? !tv b++(+++) !D B-- e++ u** h f-->+ r*


Disclaimer: 	Opinions and statements are mine, and do not necessarily
		reflect the opinions of Oracle Corporation.
