
Re: "O/S block size" for Windows NT

From: Nuno Souto <nsouto_at_nsw.bigpond.net.au.nospam>
Date: 2000/05/03
Message-ID: <39100edb.5278837@news-server>

On Thu, 27 Apr 2000 21:50:32 +0800, Connor McDonald <connor_mcdonald_at_yahoo.com> wrote:

PMFJI, but this keeps cropping up again and again. Let's see if this will make it sufficiently clear.

>> 1) some absolute value -- usually 2048 bytes

Incorrect.

>> 2) whatever the cluster size is -- which can be 512, 1024, 2048, etc.,
>> up to 64k, and is set (manually or by default) when you format the
>> volume

That's the unit of I/O request, NOT the "block size". The block size can be ANYTHING that an application requests. The NT low-level I/O routines do as many sector-sized I/Os as needed to satisfy the request.

Of course, if the application (whatever it may be) requests EXACT multiples of the cluster size, there is an efficiency advantage in the I/O. Better still if the application's I/O request matches EXACTLY the cluster size of the partition.

This all comes from the confusion in terminology generated by MS and ORACLE. Let's be clear here:

1- "Blocks" do not physically exist anywhere. An application may do "block-oriented" I/O or NOT entirely up to its design. The "blocking" and "unblocking" happen INTERNALLY to the application.

What an application does is request an I/O of SOME given size, period.

Be that 1 byte long, 123456789 bytes long, or any whole number of bytes in between.

If that size matches what the application coder decided to call a block (or a page), then so be it. But as far as NT is concerned, it doesn't stop being a request for a GIVEN size, whatever that may be.

2- NT receives requests for I/O and acts upon them. In this "acting", it takes into consideration how the disks are partitioned, the MFT, the cluster size, the file segments, the nature of the request (I or O), the flags associated with the request, etc.

If the flags say unbuffered, NT ignores the file cache. Otherwise it uses it to check if the requested data is there. Simple.

As for the write-through flag, it comes into play on writes and ensures that the file system cache's potential "write-behind" strategy is not involved. I.e., the write goes to the disk and completes before the application can proceed. To do with recovery, for obvious reasons.
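To make that concrete, here is a minimal Win32 sketch (illustrative only, NOT Oracle's actual code; the datafile name is made up) of opening a file with both of those flags:

#include <windows.h>
#include <stdio.h>

int main(void)
{
    /* Bypass the file system cache and force write-through to disk. */
    HANDLE h = CreateFileA("D:\\oradata\\users01.dbf",
                           GENERIC_READ | GENERIC_WRITE,
                           0, NULL, OPEN_EXISTING,
                           FILE_FLAG_NO_BUFFERING |  /* unbuffered: ignore the file cache      */
                           FILE_FLAG_WRITE_THROUGH,  /* write-through: no write-behind caching */
                           NULL);

    if (h == INVALID_HANDLE_VALUE) {
        fprintf(stderr, "CreateFile failed: %lu\n", GetLastError());
        return 1;
    }

    /* With FILE_FLAG_NO_BUFFERING, every ReadFile/WriteFile on this handle
       must use a sector-aligned buffer and a size that is a multiple of the
       sector size -- see the read sketch further down. */
    CloseHandle(h);
    return 0;
}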

But all that is flavour. Now we deal with the **real** I/O that NT has to do.

Regardless of what was requested, NT cannot do physical I/O in units of size smaller than disk sector size. The hardware won't let it happen. Period.

Disk sector sizes in modern disks (SCSI and IDE) are ALWAYS 512 bytes. Period. Even with LBA and things like that, the ACTUAL physical disk sector size is ALWAYS 512 bytes. Period again!

3- For efficiency reasons, NT tries to do I/O in units of size corresponding to the cluster size of the disk partition, which is ALWAYS a multiple (from 1* to n*) of the disk sector size. This is because in NTFS, space allocation units for files are always in cluster "chunks". You will also find that in NTFS, clusters are ALWAYS a multiple of 512 bytes. Do you see a pattern emerging here?

4- Clusters act as a unit of allocation of space for each partition. Nothing more, nothing less. They are always a multiple of the disk sector size because that optimizes the disk space usage, otherwise you'd have unreachable/unused areas in the disk.

5- As an optimization, NT tries to do I/O in multiples of cluster size. This is because the MFT is organized around cluster addressing (to KNOW which disk sectors and clusters the logical address space of a file maps to).
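You can ask NT for both of those numbers yourself. A small sketch (the drive letter is only an example):

#include <windows.h>
#include <stdio.h>

int main(void)
{
    DWORD sectorsPerCluster, bytesPerSector, freeClusters, totalClusters;

    if (!GetDiskFreeSpaceA("D:\\", &sectorsPerCluster, &bytesPerSector,
                           &freeClusters, &totalClusters)) {
        fprintf(stderr, "GetDiskFreeSpace failed: %lu\n", GetLastError());
        return 1;
    }

    /* Cluster size is sectorsPerCluster * bytesPerSector,
       i.e. always an exact multiple of the sector size. */
    printf("sector size : %lu bytes\n", bytesPerSector);
    printf("cluster size: %lu bytes (%lu sectors)\n",
           sectorsPerCluster * bytesPerSector, sectorsPerCluster);
    return 0;
}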

Derived rules:

1- For efficiency of I/O, applications that BYPASS the file system cache should ALWAYS request I/O in sizes MULTIPLE of disk sector size, ie, multiples of 512.

2- For efficiency of disk head actuator mechanism, applications that BYPASS the file system cache should not request I/O in sizes greater than cluster size (a cluster is guaranteed to be an adjacent set of disk sectors, but logically consecutive clusters in a file MAY NOT be physically adjacent).

3- Ergo, database applications that use unbuffered and write-through techniques should use a unit of I/O size that is a multiple of 512 bytes AND is LESS THAN or EQUAL to the cluster size of the target partition. Personally, I prefer EQUAL, for reasons that will become clear.
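Putting rules 1 to 3 together, an unbuffered read sized to one cluster looks roughly like this (a sketch only; error handling trimmed, file name and drive made up):

#include <windows.h>
#include <stdio.h>

int main(void)
{
    DWORD spc, bps, dummy1, dummy2, got, ioSize;
    HANDLE h;
    void *buf;

    /* I/O unit = cluster size = sectors-per-cluster * bytes-per-sector. */
    if (!GetDiskFreeSpaceA("D:\\", &spc, &bps, &dummy1, &dummy2))
        return 1;
    ioSize = spc * bps;

    /* VirtualAlloc returns page-aligned memory, which satisfies the
       sector alignment required for unbuffered I/O. */
    buf = VirtualAlloc(NULL, ioSize, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
    if (buf == NULL)
        return 1;

    h = CreateFileA("D:\\oradata\\users01.dbf", GENERIC_READ,
                    FILE_SHARE_READ, NULL, OPEN_EXISTING,
                    FILE_FLAG_NO_BUFFERING, NULL);
    if (h == INVALID_HANDLE_VALUE)
        return 1;

    /* One application I/O of exactly one cluster: a multiple of the
       sector size, and never larger than the cluster. */
    if (ReadFile(h, buf, ioSize, &got, NULL))
        printf("read %lu bytes = one cluster = %lu sectors\n", got, spc);

    CloseHandle(h);
    VirtualFree(buf, 0, MEM_RELEASE);
    return 0;
}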

>> scoured) that NT does I/O in, say, 2048-byte units. Nor is there any
>> reference that a cluster is anything more than a unit of disk space
>> allocation, as opposed to a unit of I/O.

In WinNT Magazine, a few years ago, Mark Russinovich dealt with this. At the time he showed that the unit of physical I/O request for a partition (the famous "OS block size") is always the disk sector size, with NT favouring I/O in sizes of the partition's cluster. And he proved it using some MS doco. Can't remember which one. Might be worth an investigation through their archives?

>> In addition, there is documentation that Oracle (and SQL Server for that
>> matter) actually does datafile I/O on the sector level:

Don't forget: application I/O requests are NOT necessarily the same size as the actual I/O request as issued by the low-level I/O routines in NT.

The unit of disk addressing in NT is the sector, and the unit of space allocation in a partition is the cluster (always a multiple of the sector size). This may or may not match the database "block size" or whatever. Low-level I/O routines in NT can only read/write ENTIRE disk sectors, not portions thereof.

>> manage their own cache buffers. Unbuffered I/O requests must be issued
>> in multiples of the disk sector size"[source: NT Workstation 4.0
>> Resource Kit, "Chapter 15 - Detecting Cache Bottlenecks"]

Exactly. You'll also find that clusters are ALWAYS multiples of 512. Guess why? :-)

>> "Oracle uses non-buffered reads and writes on NT. We therefore bypass
>> the NT file system cache and are thus only concerned with I/O in
>> multiples of the O/S blocksize (512 bytes)...Cluster sizes come into
>> play simply as an allocation unit...As far as Oracle is concerned (and
>> NT for that matter), its datafiles are already created...So if we want
>> to read a 2k Oracle block we pass this to the O/S who simply does a 2k
>> read (4 sectors) from the existing file and passes the results back to
>> Oracle and our own buffer cache. The allocation unit (cluster) has no
>> affect on this."

Correct to a certain extent. Of course clustering may have an "affect".

If the application I/O request is for 1024 bytes (two sectors) and the cluster size is 512 (one sector), then two distinct physical I/O operations MAY be required (although the app issued one I/O request), IF the two clusters aren't adjacent. This is highly dependent on the "intelligence" built into the disk controller and the device driver AND the degree of disorganization of the disk space.

Consecutive clusters making up the address space of a file are not necessarily physically adjacent. So clustering MAY have a clear effect.

The reason most "tests" don't show any difference is that they are made starting from freshly created, empty partitions. In that case, consecutive clusters in a database file will in all likelihood be physically adjacent. Try them on an old, used partition and watch the fireworks...

Now you know why I favour sizing application I/O to match the cluster size when using unbuffered I/O. There is NO way I'll get more than I asked for in that case.

In general, the database application should make requests for I/O in sizes that equal the partition's cluster size. This will ensure that one application I/O translates to a single physical I/O in all cases.

And of course, the low-level I/O (transparent to the application) will be sized as a number of disk sectors ("OS block size") that together match the size of the cluster!

>> So, here's what I'm asking: can any of you out there, who feel that the
>> NTFS cluster size, or something else, is the low-level unit of
>> datafile I/O for Oracle on NT, come up with any chapter-and-verse
>> references which substantiate that assertion? I would love to see them
>> for myself, as, I'm sure, would a lot of other people.

Check out the WinNT Mag web site archives; it's there. Also, read the above. The information provided by ORACLE and MS is correct, but it's missing a unifying explanation, which I hope was provided here.

>
>I think Oracle uses direct I/O on NT, which should not be affected by
>the NT block sizes.
>

ALL direct I/O in NT is affected by disk sector sizes, not block sizes. "Block" sizes can be any size, even down to 1 byte in the case of buffered I/O! What happens at the low-level coal front is transparent to the application. And to most monitoring tools, including the NT ones.

Cheers
Nuno Souto
nsouto_at_nsw.bigpond.net.au.nospam
http://www.users.bigpond.net.au/the_Den/index.html
