Re: Raw partitions on HPUX 10 and Oracle 7.3.2

From: Joel Garry <joelga_at_rossinc.com>
Date: 1996/10/01
Message-ID: <1996Oct1.172757.3737_at_rossinc.com>


In article <325037AA.3610_at_pop.erols.com> "Michael J. Hillanbrand II" <mjhii_at_pop.erols.com> writes:
>Bolen Coogler wrote:
>>
>> Marvin Suther <marvin_suther_at_jhuapl.edu> wrote:
>>
>> >We are in the process of setting up Oracle 7.3.2 on an HP 9000 Model 821
>> >running HP-UX 10.2. We are trying to figure out any
>> >advantages/disadvantages as to the use of raw partitions. Are there any
>> >whitepapers that deal with this? Any experiences, either good or bad? You
>> >can email me at marvin_suther_at_jhuapl.edu.
 

>> >Thanks.

This is the summary of this long post:

} >(5) Anyone who does not have the time, expertise, or resources to
} >    perform the raw-vs-filesystem benchmark *should* *not* *consider*
} >    using raw devices.

The following collection of articles is from the last go-around of this topic, which really should be a FAQ item. Enjoy. (Apologies to those who have seen it again and again and again.) It is in a strange order due to my newsreader settings.

Article 38721 of comp.databases.oracle:
Path: rossix!openlink.one-o.com!imci5!imci4!newsfeed.internetmci.com!howland.reston.ans.net!news-e2a.gnn.com!newstf01.news.aol.com!newsbf02.news.aol.com!not-for-mail
From: markp28665_at_aol.com (MarkP28665)
Newsgroups: comp.databases.oracle
Subject: Re: Raw Devices: Increased Performance?
Date: 8 Jul 1996 22:40:14 -0400
Organization: America Online, Inc. (1-800-827-6364)
Lines: 42
Sender: root_at_newsbf02.news.aol.com

Message-ID: <4rsgqe$4tn_at_newsbf02.news.aol.com>
References: <4rpsh3$s99_at_inet-nntp-gw-1.us.oracle.com>
Reply-To: markp28665_at_aol.com (MarkP28665)
NNTP-Posting-Host: newsbf02.mail.aol.com

My employer has had two Oracle dbms experts on site to look over our system. We paid a very high hourly rate (rumored to be 275.00/hour) for these people and we kept them each for a minimum of five days. Both recommended that we run on raw partitions and not UNIX file systems.

Raw partitions usually result in a 50% performance improvement for an individual physical I/O, and this generally results in a 10% performance improvement for the database as a whole. Going to raw partitions will NOT help much if the problem is bad code. Most performance improvements will come from rewriting SQL and changing how applications work, not from changing the database.

Raw partitions are not any harder to manage than UNIX file system files if you plan your system out in advance and follow a few simple rules. You can move raw partitions around and redefine where they are located via UNIX without having to rename them via Oracle. Oracle should be stopped at the time, but we have done it several times to move files to new disks. Switch from 'cpio' and 'tar' to 'dd' or the vendor-provided fast character-special data set copy utility for your backups, and what real difficulty do raw partitions present?
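
The dd-style backup described here amounts to a block-for-block copy of the character-special device. A rough sketch of that copy loop in Python (paths and block size are illustrative; a real backup would read something like /dev/rdsk/... and write to tape or a backup file):

```python
# Sketch of a dd-style fixed-block copy, as the post suggests using
# for raw partition backups.  Paths are illustrative only.
BLOCK_SIZE = 64 * 1024  # analogous to dd's bs= parameter

def dd_copy(src_path, dst_path, block_size=BLOCK_SIZE):
    """Copy src to dst in fixed-size blocks; return bytes copied."""
    copied = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            block = src.read(block_size)
            if not block:       # end of device/file
                break
            dst.write(block)
            copied += len(block)
    return copied
```

Note that, as the Kyte post later in this thread points out, such a copy always covers the whole partition, used or not.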

In the old days raw partitions were superior to UNIX files because UNIX controlled the buffering of the I/O, and buffers reported as written by Oracle could in fact not yet be written to disk. With raw partitions the I/O is unbuffered by the OS. Most, but not all, modern UNIX systems provide a write-through-the-buffer function which tells UNIX not to buffer the I/O. Oracle development uses the unbuffered call when available, or so one of the experts told me after talking to a private internal support resource. If your UNIX system does not support the write-thru-the-buffer method, you may want to switch to protect your data.
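
On systems that have it, the write-through behaviour described here is typically requested with the O_SYNC open flag, or approximated after the fact with fsync(). A minimal sketch, assuming a POSIX system (file names are illustrative):

```python
# Sketch: write-through vs. buffered-then-flushed writes on POSIX.
# O_SYNC makes each write() return only after the data reaches stable
# storage -- the "write through the buffer" behaviour the post
# describes.  fsync() flushes an ordinarily buffered file on demand.
import os

def write_through(path, data):
    """Write data so it is on disk before the call returns."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_SYNC, 0o600)
    try:
        os.write(fd, data)
    finally:
        os.close(fd)

def write_buffered_then_flush(path, data):
    """Ordinary buffered write, followed by an explicit flush."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        os.write(fd, data)
        os.fsync(fd)  # force the data out of the OS buffer cache
    finally:
        os.close(fd)
```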

Some Oracle options like parallel server (not parallel query) require the use of raw partitions, so you may not have an option. And as far as the Millsap paper goes, I was advised by a friend with an inside contact in development to read it with a grain of salt, because it was written to address the needs of Oracle support, who mostly support small installations where the depth of knowledge is shallow, and because application developers always point to the database as the problem and not their code.

UNIX file systems work fine for most shops, but if you need every bit of performance you can get, then you should switch to raw partitions. This could be an endless discussion.

Article 39061 of comp.databases.oracle:

Xref: rossix comp.databases.oracle:39061 comp.sys.sequent:2074
Path: rossix!openlink.one-o.com!imci5!imci4!newsfeed.internetmci.com!inet-nntp-gw-1.us.oracle.com!pzola
From: pzola_at_us.oracle.com (Paul Zola)
Newsgroups: comp.databases.oracle,comp.sys.sequent
Subject: Re: Raw Devices: Increased Performance?
Date: 15 Jul 1996 07:25:18 GMT
Organization: Oracle Corporation. Redwood Shores, CA
Lines: 233
Message-ID: <4scrou$2ng_at_inet-nntp-gw-1.us.oracle.com>
References: <31C9F801.33AB_at_tpg.tpg.oz.au> <1996Jun26.140851.7868_at_rossinc.com> <4quudo$eje_at_scel.sequent.com> <1996Jul2.001018.21899_at_rossinc.com> <4rpsh3$s99_at_inet-nntp-gw-1.us.oracle.com> <$x2BSJAQKW4xEwZw_at_smooth1.demon.co.uk>
NNTP-Posting-Host: prodpyr1.us.oracle.com

In <$x2BSJAQKW4xEwZw_at_smooth1.demon.co.uk> David Williams <djw_at_smooth1.demon.co.uk> writes:

} In article <4rpsh3$s99_at_inet-nntp-gw-1.us.oracle.com>, Paul  Zola
} <pzola_at_us.oracle.com> writes
} >

 [snip]
} >(3) Under very common circumstances, going to raw devices can actually
} >    *decrease* database performance.
} 
}    ??? Explain - Not going with the UNIX buffer cache and also copying
}    (DMAing) directly into user space rather than via kernel space i.e.
}    one memory write rather than a memory write and a memory copy is
}    SLOWER??

Short answer: yep.

Flippant answer: Trying to reduce expected disk read times by
    reducing the number of memory copies is like trying to optimize a
    program by moving code from the initialization modules into the
    main loop.

Medium-length serious answer: the UNIX kernel is very good at
    optimizing file I/O. In particular, the kernel has the ability
    to optimize I/O by scheduling read and write requests in a way
    that minimizes disk seeks and rotational latency, which are the
    big bottlenecks in any I/O system.

Before answering this in detail, I want to address another issue you raised, in part because my response to it will help explain how the buffer cache can win against raw devices.

You write:

}    Agreed - fragmentation does decrease performance dramatically but how
}    can filesystems be faster - loading inodes into memory and
}    following inodes means more seeking since inodes are not that close to
}    the data even with cylinder groups in use. You would expect at least
}    one seek from the inode table to the data which is not required when
}    using raw devices.
 [snip]
}     How can it be faster???

I'm afraid that this passage betrays a certain lack of understanding of the UNIX kernel.

The inodes for open files are cached in memory, for as long as the file is open. Since the Oracle background processes open the database files when the database is mounted, and keep them open until the database is un-mounted, all kernel accesses to the inode are through the in-core copy (which, of course, requires no seek time).

A somewhat more relevant objection (which you did not raise) is that Oracle database files are big enough to require indirect blocks, and that accessing the data in the indirect blocks would require an extra disk access.

In practice, this extra disk access occurs very rarely. Since the indirect block is a filesystem block, the normal LRU caching algorithms apply to it as well. This means that the indirect blocks for any reasonably frequently-accessed database file will already be in the cache, and will not require an extra I/O. On a typical BSD filesystem, using 8k blocks & 4-byte disk addresses, any access to a disk within a 16-meg region will cause the indirect block to move to the head of the LRU queue.

(Why a 16 meg region? Here's the math: A single indirect block, 8k long, can contain 2048 4-byte disk addresses. Each one of these disk addresses points to an 8k disk block. This means that a single indirect block contains disk addresses for 8 * 1024 * 2048 bytes, or 16 meg.)
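
The arithmetic can be checked mechanically (a trivial sketch, using the same 8k-block, 4-byte-address figures):

```python
# Region of a file covered by one single-indirect block on a BSD-style
# filesystem with 8k blocks and 4-byte disk addresses, per the math above.
BLOCK_SIZE = 8 * 1024                           # 8k filesystem block
ADDR_SIZE = 4                                   # bytes per disk address
addrs_per_indirect = BLOCK_SIZE // ADDR_SIZE    # addresses in one indirect block
region = addrs_per_indirect * BLOCK_SIZE        # bytes mapped per indirect block

assert addrs_per_indirect == 2048
assert region == 16 * 1024 * 1024               # 16 meg, as claimed
```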

Finally, you seem to be under the impression that the UNIX kernel schedules disk I/O sequentially: first it reads the indirect block from disk, then it reads the pointed-to block from disk, and then it goes on to service another I/O request. This is not the case. Instead, the UNIX kernel is multi-threaded internally: disk I/O processing occurs asynchronously with respect to user processing, in a separate thread of control. (Note to any kernel hackers reading this: yes, I do know that the traditional UNIX implementation isn't really threaded; I do believe this to be the cleanest conceptual explanation.)

The disk I/O "thread" within UNIX schedules I/O using an "up/down elevator algorithm". The disk scheduler works something like this: when the kernel determines that a request has been made to read or write a disk block, it places a request for that I/O operation into a queue of requests for that particular disk device. Requests are kept in order, sorted by *disk block address*, and not by time of request. The disk driver then services the requests based on the location of the block on disk, and not on the order in which the requests were made.

The "up/down elevator algorithm" maintains a conceptual "hand", much like the hand on a wristwatch. This hand continually traverses the list of I/O requests: first moving from low disk addresses to high disk addresses, and then from high addresses down to low -- hence, "up/down elevator". As the "hand" moves up and down the list of requests, it services the requests in an order which minimizes disk seek times.

It's perhaps easier to conceptualize this based on a disk with a single platter: the "hand" moving up and down the list of disk block addresses exactly corresponds to the disk head seeking in and out on the drive. (On real-life multi-head drives, the sorting algorithm needs to be tuned so that all disk blocks on the same cylinder sort together. This, incidentally, is why mkfs and tunefs care about the number of sectors per track and the number of tracks per cylinder.)

Here's an example based on your (incorrect) example above. Say that we're opening a file for the first time. The kernel gets a request to read the inode into core, and the inode is at absolute disk block number 21345. This request goes into the request queue, and the requesting process is put to sleep. Some time later, the "hand" passes this disk block number, going from higher to lower disk addresses. The process that requested to read inode 21345 is woken up and tries to read() the first block of that file. The kernel, acting on behalf of that process, determines that the first block of the file is located on disk block 21643, puts the request into the request queue, and puts the requesting process to sleep again. Since the "hand" is moving towards lower addresses, the read request for block 21643 won't be serviced until the "hand" has serviced all the requests in the queue for blocks between 0 and 21643 twice -- once while traversing the request queue going down, and once while going up.

Why use this algorithm? It's possible to prove that in the presence of multiple processes making random I/O requests, some variant of the elevator algorithm will result in the shortest average seek time, and thus the greatest average throughput.
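
As a rough sketch of the idea (toy Python, not kernel code; the queue here holds bare block addresses rather than buffer headers), the up/down elevator can be modeled like this:

```python
# Toy up/down elevator disk scheduler, as described above: pending
# requests are kept sorted by block address, and the "hand" sweeps up
# and then back down, servicing whatever it passes.
import bisect

class ElevatorQueue:
    def __init__(self):
        self.pending = []   # sorted list of requested block addresses
        self.hand = 0       # current head position
        self.direction = 1  # +1 sweeping up, -1 sweeping down

    def request(self, block):
        bisect.insort(self.pending, block)

    def next_request(self):
        """Service the nearest pending request in the current
        direction, reversing at the ends like an elevator."""
        if not self.pending:
            return None
        if self.direction > 0:
            i = bisect.bisect_left(self.pending, self.hand)
            if i == len(self.pending):   # nothing above: turn around
                self.direction = -1
                return self.next_request()
        else:
            i = bisect.bisect_right(self.pending, self.hand) - 1
            if i < 0:                    # nothing below: turn around
                self.direction = 1
                return self.next_request()
        block = self.pending.pop(i)
        self.hand = block
        return block
```

Starting from block 0, requests for blocks 1, 1024, 2 and 1023 are serviced as 1, 2, 1023, 1024 in a single upward sweep, regardless of arrival order.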

}
}     How can it be faster???
}

So after having gone through all that, here's the long answer to why using raw devices can be slower than the filesystem.

The important fact about disk optimization is this: disk seek times and rotational delays are measured in milliseconds, while RAM access times are measured in nanoseconds. This means that to maximize disk I/O the three most important optimizations are (in order):

    (1) Avoid I/O altogether (through caching or other mechanism).
    (2) Reduce seek times.
    (3) Reduce rotational delays.

Indeed, if we perform 500 memory-to-memory copies, and thereby avoid 1 seek, we're still ahead of the game.

Let's consider an Oracle database with the datafiles installed on raw devices. Let us say (to avoid the effects of datafile placement) that there's only one tablespace on this disk, and it's only used for indexes. Let us also say that this disk has 1024 cylinders, a maximum seek time of 100ms, and (just to give the raw device every advantage) that it has a track buffer so that there is no rotational latency delay. Finally, let's say that there are multiple users performing queries and updates at the same time, and these work out so there are 2 processes reading, and one process (DBWRITER) writing to this datafile at the same time.

Let us further consider the disk access pattern of these three processes. (In case it isn't clear -- this is Oracle having to read a disk block that isn't in the SGA in from the file system.) Let's say that over a time interval, they access disk blocks in the following pattern:

    read cylinder 1; read cylinder 1024; write cylinder 2; read
    cylinder 1023; read cylinder 2; read cylinder 1022; write
    cylinder 3; read cylinder 1021; write cylinder 4.

How much time did this take to complete? Well, since raw device accesses are scheduled in the order in which they occur, this disk had to seek almost from one end to the other 8 times, for a total of 800 ms.

Now, let's consider the same access pattern on a filesystem file, using the buffer cache. Assuming that these accesses came pretty quickly, they could all be handled in 3 scans of the request queue (two going up, one coming back down) for a total of 300 ms. (Note that since UNIX has a write-back cache, the final write didn't necessarily force a write to disk: yet another optimization made possible by using filesystem files.)

Result? Filesystem wins by more than 100% over raw device.
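
The win is easy to reproduce in miniature. Charging one unit of seek cost per cylinder of head movement (a simplification of the 100ms full-stroke figure), a quick sketch compares arrival-order servicing (the raw-device case) against a single sorted sweep over the same requests:

```python
# Seek distance for the access pattern in the example above, under
# raw-device arrival-order servicing vs. one sorted elevator sweep.
# Cost model: one unit per cylinder of head movement.
pattern = [1, 1024, 2, 1023, 2, 1022, 3, 1021, 4]  # arrival order

def seek_distance(order, start=1):
    """Total cylinders of head movement to service 'order' in sequence."""
    total, pos = 0, start
    for cyl in order:
        total += abs(cyl - pos)
        pos = cyl
    return total

fifo = seek_distance(pattern)           # service strictly in arrival order
swept = seek_distance(sorted(pattern))  # one upward sweep over the same set

assert fifo > 8000    # ~8 nearly-full strokes across 1024 cylinders
assert swept < 1100   # roughly one full stroke
```

The exact figures (8161 vs. 1023 cylinder-units here) depend on the cost model, but the roughly 8-to-1 ratio matches the 800ms-vs-one-sweep argument in the text.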

How did this happen? It's a direct result of the access pattern of the database. The UNIX filesystem is optimized for random access, while raw devices work better for sequential access.

Note that if there hadn't been a track buffer on the hard disk, we could have made the result even more lopsided for the filesystem, by having the access pattern force rotational delays when accessing the raw device.

This is why I say:

} >(4) Anyone contemplating going to raw devices should benchmark their
} >    application on both raw and filesystem devices to see if there is
} >    a significant performance increase in using raw devices.
}   
}    Not sure it's necessary.

But we've just seen that it is. Let me emphasize: whether raw or filesystem devices will be faster depends DIRECTLY on the disk access patterns of *your* *specific* *application*. The exact same schema could be faster on one or the other, depending on what your users are doing with it.

This is not to say that using raw devices will always be slower. But they are far from the "magic bullet" that lots of DBAs seem to think they are.

And I'll say it again:

} >(5) Anyone who does not have the time, expertise, or resources to
} >    perform the raw-vs-filesystem benchmark *should* *not* *consider*
} >    using raw devices.


	-paul


References:

    `Operating Systems: Design and Implementation',
      Andrew S. Tanenbaum, Prentice-Hall, ISBN 0-13-637406-9
    `The Design and Implementation of the 4.3BSD Unix Operating System',
      Samuel Leffler, Kirk McKusick, Michael Karels, John Quarterman,
      1989, Addison-Wesley, ISBN 0-201-06196-1
    `The Design of the Unix Operating System', Maurice Bach, 1986,
      Prentice-Hall, ISBN 0-13-201757-1

==============================================================================
Paul Zola            Technical Specialist         World-Wide Technical Support
------------------------------------------------------------------------------
GCS H--- s:++ g++ au+ !a w+ v++ C+++ UAV++$ UUOC+++$ UHS++++$ P+>++ E-- N++ n+

    W+(++)$ M+ V- po- Y+ !5 !j R- G? !tv b++(+++) !D B-- e++ u** h f-->+ r*


Disclaimer: 	Opinions and statements are mine, and do not necessarily
		reflect the opinions of Oracle Corporation.


Article 39519 of comp.databases.oracle:

Xref: rossix comp.databases.oracle:39519 comp.sys.sequent:2082
Path: rossix!openlink.one-o.com!imci5!pull-feed.internetmci.com!news.internetMCI.com!newsfeed.internetmci.com!news.dacom.co.kr!arclight.uoregon.edu!dispatch.news.demon.net!demon!smooth1.demon.co.uk!djw
From: David Williams <djw_at_smooth1.demon.co.uk>
Newsgroups: comp.databases.oracle,comp.sys.sequent
Subject: Re: Raw Devices: Increased Performance?
Date: Mon, 22 Jul 1996 22:38:36 +0100
Organization: not applicable
Lines: 88
Distribution: world
Message-ID: <EiAXoAAcT$8xEwOu_at_smooth1.demon.co.uk>
References: <31C9F801.33AB_at_tpg.tpg.oz.au> <1996Jun26.140851.7868_at_rossinc.com>
 <4quudo$eje_at_scel.sequent.com> <1996Jul2.001018.21899_at_rossinc.com>
 <4rpsh3$s99_at_inet-nntp-gw-1.us.oracle.com>
 <jkennedy-1907962210370001_at_tsde001-ppp5.us.oracle.com>
 <838041693.27945.0_at_gate.norwich-union.com>
NNTP-Posting-Host: smooth1.demon.co.uk
X-NNTP-Posting-Host: smooth1.demon.co.uk
MIME-Version: 1.0
X-Newsreader: Turnpike Version 1.12 <9Hhi+s$5$1$z+7np$T9n6P9fJx>

In article <838041693.27945.0_at_gate.norwich-union.com>, M1069B00@?.? writes
>
>What about SVM???
>
>Having read this thread with interest (we use raw devices with ptx/SVM),
>the basic argument is that filesystems are better than raw devices if using
>OFA because :-
>
>* inodes are not a real overhead because they are in memory and commonly used
> indirect blocks are cached as filesystem blocks.
>

   Yes - sorry, my brain rather overloaded with ideas against filesystems
   in my last post and I got a bit carried away.

>* UNIX kernel uses elevator algorithm for data access on filesystems as opposed
> to the first-come-first-served algorithm inherent in raw device I/O.
>
 

   Generally it is the disk drive controllers themselves which have the
   elevator seek algorithm built into them, and the database engine has
   it built in as well. Why waste time doing it an extra time?
>* In OFA you are spreading the datafiles across the disk bank and so thinning
> the load on any single controller (on average).
>

    Also fragmentation/creation of tables within 'dbspaces' (collections
    of chunks of raw disk) allows spreading of databases/tables across
    disks. This gives a clearer indication of disk layout to the
    database engine (since raw devices are contiguous across the disk).

>This is comparing UNIX disk / filesystem algorithms versus raw devices as-is.
>(i.e. using basic character I/O control)
>But in the case of Sequent systems using ptx/SVM is it not a different issue?
>

    Running an Informix OnLine v7 database against a raw device provides
    all of these benefits without a device manager layer.

    Also, OnLine has better knowledge of where the table is stored when
    doing read-ahead (if the table is not contiguous across the disk,
    then an operating system doing sequential read-ahead may well read
    areas of the disk which are then not used). Using a raw disk
    overcomes this problem.

    How can the UNIX disk/filesystem algorithms be better when all they
    see is a large file rather than the layout of tables/indices?

    Also, the database engine can schedule high-priority disk I/O (e.g.
    transaction log I/O) ahead of, say, a search on a table.

>Raw device (/dev/rvol/,,,) read and writes go through the ptx/SVM layer
>which I thought introduced the same kinds of I/O optimizers as the filesystem
>(e.g. the elevator algorithm).
>It implements striping, and also balances the I/O load between mirrored copies
>(i.e. goes to the least busy side of the mirror for read requests).
>

   OnLine also can implement striping (via table fragmentation) and
   balances I/O load between mirrored copies using Informix mirroring.

>Is it not the case that raw volumes and raw volumes under SVM are very different
>in their performance capabilities?
>What about a new comparison based on the SVM enhanced raw volume?
>
>The post that spoke of a customer considering Oracle Parallel Server would
>clearly be referring to a system with SVM, but the arguments being made are
>not accounting for SVM, just dumb raw devices using basic stream I/O control.
>

  The database engine using raw devices should be faster than using
  filesystems, given that the filesystem code becomes just an extra,
  'less intelligent' layer sitting between the database engine and
  the disk.

>Clarification from a guru requested please ...
>
>Regards, Steve Woolston. (woolsts_at_norwich_union.co.uk)

  Sorry it's been so long since my last post, but I've had flu and have
  just recovered. Also sorry if this sounds like an advert for Informix,
  but it's the database engine I know best. What do you mean, Oracle
  doesn't do all of the above? ;-> Answers on an e-mail please.

-- 
David Williams


Article 35267 of comp.databases.oracle:
Xref: rossix comp.databases.informix:17482 comp.databases.sybase:17397 comp.databases.oracle:35267 comp.databases.ms-sqlserver:472
Path: rossix!openlink.one-o.com!imci5!imci4!newsfeed.internetmci.com!inet-nntp-gw-1.us.oracle.com!news
From: tkyte_at_us.oracle.com (Thomas J Kyte)
Newsgroups: comp.databases.informix,comp.databases.sybase,comp.databases.oracle,comp.databases.ms-sqlserver
Subject: Re: Informix, Sybase, Oracle or MS SQL server
Date: Mon, 29 Apr 1996 13:51:51 GMT
Organization: Oracle Corporation. Redwood Shores, CA
Lines: 126
Message-ID: <4m2hrb$5cp_at_inet-nntp-gw-1.us.oracle.com>
References: <3177C4FA.671_at_tc.net> <4llrps$lhd_at_inet-nntp-gw-1.us.oracle.com> <DqJ278.1J5_at_mv.mv.com>
NNTP-Posting-Host: tkyte-lap.us.oracle.com
X-Newsreader: Forte Free Agent 1.0.82

Paul Chen <pchen_at_cougar.mv.com> wrote:


>tkyte_at_us.oracle.com (Thomas J Kyte) wrote:
>>
>>- Database does not need to be taken offline,
>>hot (online) backups have been part of Oracle since version 6

>But Tom, can we use hot backup on UNIX with Oracle database on raw
>devices?
Yes, absolutely.
>The impression I got from Michael R. Ault's book "Oracle 7.0
>Administration & Management" seems to imply the negative answer
>to the question above. This is what he said in his book, on
>page 287.

> "The hot backup differs from the cold backup in that only
> sections of the database are backed up at one time. Under
> UNIX this will require the use of normal mounted file systems
> and not RAW devices."
There are some cases where Oracle only supports RAW partitions, the Oracle parallel server for loosely coupled systems for example. Hot backup is fully supported on these platforms (and all others). I think the wording chosen by Mike in his book is poor on this page. He says, for example:

<quote>
8.1.3 UNIX Backups
The type of backup you perform in UNIX is dependent on whether you use RAW devices or not. RAW devices will require a backup of the entire device, while the use of the mounted file systems will allow partial backups. <!-- My addition "OF DEVICES">
....
</quote>

If you have a raw device, you have to back up the entire device. If you are using a cooked file system and have lots of little datafiles on it, you back up the little datafiles one at a time; you don't back up the entire filesystem (device). In short, you either back up an entire raw disk partition or you don't. You copy the entire partition, not just the part of the partition containing data (so if you have a 2 gig raw partition and only have a 1 gig datafile on it, you will dd the entire partition, not just the 1 gig part of it you are using. This is OK, since you would never put a 1 gig datafile on a 2 gig raw partition: the other 1 gig would always be wasted and couldn't be used by anyone else. You would never have this case).

He is quite wrong in stating:

> "The hot backup differs from the cold backup in that only
> sections of the database are backed up at one time. Under
> UNIX this will require the use of normal mounted file systems
> and not RAW devices."

I have done many a hot backup on 100+ gig databases consisting entirely of raw partitions. This is absolutely wrong.

On the same page he goes on to say:

<quote>
A hot backup, or one taken while the database is active, can only give a read-consistent copy, but it doesn't handle active transactions.
</quote>

This is wrong as well. A hot backup actually takes INconsistent copies (the datafiles are not consistent with respect to a point in time).
The datafiles continue to be written to during a hot backup. A hot backup requires both the datafile(s) and archived redo log files to recover with. The restore process has you putting back the inconsistent datafile and using the redo log files to roll forward/back transactions to recover the datafile to a point in time and make it consistent again.
>So, if he is correct, are we required to use filesystems to build
>Oracle 7.0 database on UNIX if we want hot backup? If so, will this
>cause data integrity problems? Again, from the same book, on page 18,
>Michael R. Ault said the following.

> "If you use normal UNIX file devices, there is no guarantee that
> all updates will be written to disk in a timely manner. This is
> due to the UNIX file buffers. If a system crash should occur
> between the time data leaves the Oracle buffers and transits
> through the UNIX buffers, data could be lost."
He again is incorrect here. There has been an ongoing thread in these newsgroups (comp.databases.sybase and comp.databases.oracle; "Database Writing Architectures" is the subject of the thread). As pointed out by many in that thread, UNIX allows for either the O_SYNC option when opening a file for update, or the unix system call fsync() can be used on a file to sync it up (flush the buffers).

As was also pointed out in that thread, this can cause severe performance implications in a Single Process, Multi-Threaded database. Since Oracle is not a single process database, we do not suffer from the performance impact. In fact, since the DBWR process is the only process writing to the datafiles, and we can have many DBWR processes going to simulate ASYNC IO on cooked partitions, his statement of "... performance for disk IO by over 50 percent ..." is way overstated. In most implementations I have seen, a modest 10-15% can be gained in certain heavily IO-bound applications by switching to raw partitions; 50% is way overstated.
>Both hot backup and data integrity are essential to many 24x7 shops.
>But Michael R. Ault's book seems to imply that hot backup and
>raw devices are mutually exclusive for Oracle 7.0 on UNIX. Is he
>right or wrong? Tom, can you clarify these issues? Thanks.
He is wrong (mostly due to poor wording in the above cases; the crux of what he writes is for the most part correct).
>Paul Chen, Ph.D.
>Database Consultant
>Disclaimer: My opinions only!
Thomas Kyte
tkyte_at_us.oracle.com
Oracle Government
--------------------------------------------------------
opinions and statements are mine and do not necessarily reflect the
opinions of Oracle Corporation.

-- 
Joel Garry                 joelga_at_rossinc.com      Compuserve 70661,1534
These are my opinions, not necessarily those of Ross Systems, Inc.
<> <> %DCL-W-SOFTONEDGEDONTPUSH, Software On Edge - Don't Push.
 \ V /  panic: ifree: freeing free inodes...
   O
Received on Tue Oct 01 1996 - 00:00:00 CEST
