
Re: Oracle Myths

From: Nuno Souto <nsouto_at_optushome.com.au.nospam>
Date: Sat, 8 Jun 2002 11:37:27 +1000
Message-ID: <3d016125$0$28006$afc38c87@news.optusnet.com.au>


In article <f369a0eb.0206070954.9b8c87c_at_posting.google.com>, you said (and I quote):
>
> That's an interesting one. I've always thought, when a disk system gets a
> request for data in a certain sector on a certain track, that it will take
> the shortest path to get to that location whether there is an LVM etc. sitting
> above. I'd sure like to see any literature which says otherwise.

Most file systems (ufs, NTFS, etc.) will try to allocate space in what they think is an "optimal disk use" pattern. For example, in NTFS it's almost impossible to obtain a single allocation of disk space larger than the MFT size. NTFS will split it in two and place one half on each side of the MFT - which sits smack bang in the middle of the partition. Like it or not, you end up travelling over the MFT on a full table scan. Do two big allocations in that partition and I can guarantee you even more fragmentation than just two halves around the MFT.
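
Just to illustrate the splitting effect, here's a toy model in Python - not NTFS's actual allocator, and the reserved-zone position and sizes are made up for the example:

PARTITION = 1000          # total clusters in the partition (arbitrary units)
MFT_ZONE = (450, 550)     # reserved zone sitting in the middle

def allocate(request):
    """Return the extents (start, length) handed back for a request."""
    left_free = MFT_ZONE[0]                # free clusters before the zone
    right_free = PARTITION - MFT_ZONE[1]   # free clusters after the zone
    if request <= left_free:
        return [(0, request)]              # small enough: one contiguous extent
    # too big for either side: split it, one piece each side of the zone
    return [(0, left_free), (MFT_ZONE[1], min(request - left_free, right_free))]

print(allocate(300))   # [(0, 300)]             - one contiguous extent
print(allocate(800))   # [(0, 450), (550, 350)] - split around the reserved zone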

Traditionally, this was done because makers assumed their systems would be used in an "average" fashion, whatever that means. In NT's case, that meant being mostly a Netware competitor as a file server. That was the "order of the day" when NTFS was thought out, and it makes sense to do things this way in such an environment: you spread the load via the file system software itself, because there is no such thing as a database engine doing it for you.

Of course, when systems are used as database servers, the demands on file space allocation and distribution are completely different. When a database asks for 20Gb of contiguous disk space, it MEANS it. Unfortunately, most file systems with default settings will still second-guess it and carve it up.

If you dig under the covers, there are usually options available at file system creation time to modify this behaviour. For example, in Unix the minimum free space reserved in a partition assigned to a file system is usually changed for database volumes - otherwise you'll never be able to use all the space in the partition. The clustering factor (how many sectors are really allocated when you ask for any given amount of space) is also usually changed. There is also a placement factor, which tells the file system how far (in sectors) to place each file from the previous one. And so on.
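
To see why the minimum-free-space and clustering settings matter for a database volume, here's a back-of-the-envelope sketch in Python. The 8% and 64K figures are just illustrative values I picked for the example, not defaults of any particular OS:

GB = 1024 ** 3

partition = 100 * GB       # raw partition size
minfree_pct = 8            # reserved free space percentage (illustrative)
cluster = 64 * 1024        # allocation unit ("clustering factor") in bytes

usable = partition * (100 - minfree_pct) // 100
print(usable / GB)         # -> 92.0: 8Gb gone before the database writes a byte

# and every allocation is rounded up to a whole number of clusters:
request = 10 * GB + 1                           # one byte over a clean 10Gb
allocated = -(-request // cluster) * cluster    # round up to the next cluster
print(allocated - request)                      # -> 65535 bytes of slack

Drop the minimum free space to (near) zero and pick a clustering factor that matches the database block/extent size, and most of that loss goes away - which is exactly why these settings get changed for database volumes.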

There is also the issue of bitmap sector allocation - yes, I'm afraid the LMT bitmap space control algorithm is nothing new... :-) - it's been used in Unix file systems since the year dot, about 30 years ago or so. That's what all that inode jazz is all about. This has to do with free space reuse after a few files have been created and removed. If the space is fragmented to start with, there is no way subsequent accesses will make it disappear.

Space re-use. This is what happens, for example, when a file system has been in use with files being created and removed all over the partition and someone now asks for a big chunk. Odds are the big chunk will be "created", but in bits and pieces all over the physical disk partition. From then on, all accesses are virtually random, even though we may be full-scanning a table. Hence the recommendation in many books to use only freshly made file systems for database allocation.
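
As a crude illustration of how that plays out, here's a toy first-fit bitmap allocator in Python. It's a sketch of the general idea only - not the actual algorithm of any Unix file system, and certainly not Oracle's LMT implementation:

SIZE = 64
bitmap = [0] * SIZE          # 0 = free block, 1 = allocated block

def allocate(n):
    """First-fit scan: grab the first n free blocks, contiguous or not."""
    extents, start, run = [], None, 0
    for i in range(SIZE):
        if bitmap[i] == 0 and n > 0:
            bitmap[i] = 1
            n -= 1
            if start is None:
                start = i
            run += 1
        elif start is not None:
            extents.append((start, run))
            start, run = None, 0
    if start is not None:
        extents.append((start, run))
    return extents            # list of (start, length) pieces

def free(extents):
    for start, length in extents:
        for i in range(start, start + length):
            bitmap[i] = 0

files = [allocate(4) for _ in range(10)]   # ten 4-block files, laid out back to back
for f in files[::2]:                       # now remove every second one...
    free(f)                                # ...leaving 4-block holes between survivors

print(allocate(20))
# -> [(0, 4), (8, 4), (16, 4), (24, 4), (32, 4)]
#    the "contiguous" 20-block request comes back as five scattered pieces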

There is also the problem of database partitions. If you have a revolving partition window (one that gets partitions added at the end often, with older partitions removed from the start), then you have to be very careful how you map them to file systems. It is a bad idea, for example, to put more than one of these in each file system, for reasons to do with how file systems "optimize" frequent removal and creation of files.

Disk space allocation in Unix and NT file systems is a whole world. I've got some books on all this and will dig them up if you want. Come to think of it, it would be interesting to do a small paper on this for the next meet. I'll see what I can find.

People often ask me why I'm reluctant to use LMTs. Well, they are controlled by bitmaps. So are Unix file systems. I know the kind of problems Unix file systems can create by the very nature of this bitmap allocation process. Until I know more about EXACTLY how Oracle handles those bitmaps in LMTs, I'm reluctant to trust them implicitly.

>
> Five years ago 200GB was considered "LARGE". The trend is for databases to
> get even larger. Big corporations now add at least hundreds of megabytes
> to their data warehouse every day. It's a different view from OLTP systems.
>

Narh, that's just warehouses. Most dbs out there are much smaller than that - around 1Tb, some much less. Of course warehouses go up to tens of Tb and even more. But I've worked on some monster OLTP dbs. The largest one is just under 2Tb and gets about 30Gb of new partitions added per day. It gets purged weekly and monthly into a warehouse, or it would simply not last long. That's when all hell can break loose: we had to make sure re-use of disk space in file-system-controlled partitions was still efficient.

-- 
Cheers
Nuno Souto
nsouto_at_optushome.com.au.nospam
Received on Fri Jun 07 2002 - 20:37:27 CDT
