Re: Implementation of Grid File

From: Benjamin Johnston <superhero_at_benjaminjohnston.com.au>
Date: Sun, 5 Jan 2003 08:53:54 +1000
Message-ID: <OtJR9.16757$jM5.45907_at_newsfeeds.bigpond.com>


> Interesting concept. So what's the difference between what you have
> discussed and z-ordering b-trees? Excuse the ignorance, but don't b-trees
> need to have atomic, single-dimensional data? So, if C&D were to be indexed,
> then what values would be stored in the nodes?

A B-tree only needs totally ordered index values; the values themselves don't need to be atomic.

To index C&D without Z-ordering, you store ordered pairs: (C-value,D-value)
in the index.

Then if you have two values v1 = (p, q) and v2 = (m, n):

  v1 > v2  iff  (p > m) or (p = m and q > n)
  v1 = v2  iff  (p = m) and (q = n)

i.e. you're basically indexing/ordering by C first, and then by D when the value for C is not unique.
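
As a rough sketch of that comparison rule in Java (the layout is just an assumption for illustration -- each key is a two-element int array {c, d}):

// Order by C first, then by D when the C values tie.
class PairComparison {
    static int compare(int[] v1, int[] v2) {
        if (v1[0] != v2[0]) {
            return v1[0] < v2[0] ? -1 : 1;  // decided by C alone
        }
        if (v1[1] != v2[1]) {
            return v1[1] < v2[1] ? -1 : 1;  // C ties, fall back to D
        }
        return 0;                           // equal on both columns
    }
}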

If you have fixed-length strings, or are using integers to some fixed precision (e.g. 32 bit), then the index value is given by concatenating the values and just using the usual comparison operators. For example: 0000101 and 0000011 are indexed as the single value 00001010000011.
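
To make that concrete with integers, here's one way the concatenation could look in Java -- a sketch only, assuming both columns are treated as unsigned 32-bit values packed into a single 64-bit key:

// Concatenate two 32-bit column values into one 64-bit key: C in the high
// half, D in the low half, so the key orders by C first and then by D.
static long concatKey(int c, int d) {
    return ((long) c << 32) | (d & 0xFFFFFFFFL);
}

// Because the values are treated as unsigned, keys should be compared with
// Long.compareUnsigned rather than the ordinary < operator.
static int compareKeys(long k1, long k2) {
    return Long.compareUnsigned(k1, k2);
}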

This is in contrast to Z-ordering, which interleaves the bits (i.e. the numbers above under an interleaved z-ordering might be: 00000000100111).
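
For comparison, a sketch of the bit interleaving (again assuming unsigned 32-bit column values; whether C's bit or D's bit comes first in each pair is an arbitrary choice):

// Z-order (Morton) key: bit i of C goes to bit 2i+1 of the key and bit i
// of D goes to bit 2i, so the bits of the two values alternate.
static long zOrderKey(int c, int d) {
    long key = 0L;
    for (int i = 0; i < 32; i++) {
        key |= ((long) ((c >>> i) & 1)) << (2 * i + 1);
        key |= ((long) ((d >>> i) & 1)) << (2 * i);
    }
    return key;
}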

> I have been looking at hash tables, however they are unusable due to their
> lack of performance when dealing with large amounts of data and their
> single-indexing properties.

Hash tables can be used to index multiple columns. Hash each column individually, then combine the two hash values into a single hash value in some way that keeps the distribution of the result close to random. One possibility is to multiply each hash value by a different large prime number (allowing for overflow) and add the results.
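
A minimal sketch of that in Java (the two prime multipliers here are arbitrary examples, nothing special about them):

// Combine the per-column hash codes into one bucket index.  Overflow in
// the multiplications is deliberately ignored.
static int combinedHash(Object cValue, Object dValue, int tableSize) {
    int h = cValue.hashCode() * 1000003 + dValue.hashCode() * 999983;
    return Math.floorMod(h, tableSize);  // non-negative bucket index
}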

Where did you hear that hash tables give poor performance with large amounts of data? Sure, you'll get bad performance if the hash table is small and you've got lots of records... but you should be getting nearly one-disk-access-per-query performance if the hash table is the right size (the usual advice is to keep it no more than about 70% full). If you're referring to in-memory hash tables, such as java.util.HashMap, then you're right: they'll give bad performance for large amounts of data (particularly when they resize themselves). But hashes stored on disk (either linear hashes, or hashes created at the right size to begin with) should give you more than adequate performance.

I've never had problems with hash tables, even with millions of records. And if you search Google, you'll find all sorts of situations involving huge amounts of data (e.g. bioinformatics, large-scale DBMSs) that are successfully indexed with hash tables (linear hashing). It is important, though, that you use a good hash function.

-Benjamin Johnston

Received on Sat Jan 04 2003 - 23:53:54 CET
