Extendible Hashing question on Splitting to Handle Overflow

From: patrick <tendim_at_vex.net>
Date: Tue, 18 Dec 2001 02:05:23 +0000 (UTC)
Message-ID: <9vm893$2j1r$1_at_news.tht.net>



(The following is for a disk-based extendible hash-table).

I've got a question on extendible hashing, and what exactly occurs when buckets are split to handle overflow.

Say my trie looks like this:

                                .
                               / \
                              0   1
                             /     \
                            A      /\
                                  0  1
                                 /    \
                                B      C

Which results in the following directory:

	Hash-Code-Lower-Bits	Bucket
	00			A
	01			A
	10			B
	11			C

Now, according to my trie, any hash-code that ends with a zero goes to bucket A, any hash-code that ends with 10 goes to B, and any that ends with 11 goes to C. My main question is this: say I generate hash-codes for some entries, and I get the following:

	Entry	HashCode	Bucket
	K1	00		A
	K2	11		C
	K3 	01		A
	K4	01		A
	K5	10		B
	K6	00		A
	K7	10		B

Further, say that each bucket can only hold four entries. Given this, K1, K3, K4 and K6 have all mapped to bucket A. Now say I add another entry, K8, and it also maps to A.

	K8	00		A
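In code, I imagine the overflow being detected at insert time, something like this (my own sketch, with made-up names, using an in-memory list as a stand-in for the on-disk bucket):

```python
BUCKET_CAPACITY = 4   # each bucket holds at most four entries

bucket_a = ["K1", "K3", "K4", "K6"]   # bucket A is already full

def insert(bucket, key):
    """Try to add an entry; report failure when the bucket is full."""
    if len(bucket) >= BUCKET_CAPACITY:
        return False   # overflow: the caller must split the bucket first
    bucket.append(key)
    return True
```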

This means that I have a split. I can extend the 0 branch from the root of my trie by one more bit to make up another bucket, and effectively split A up:

				 /\
				/  \
			       /    \
                              0      1
                             /        \
                            /\        /\
                           0  1      0  1
                          /    \    /    \
                         A      D  B      C
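In directory terms (sticking with my bit-string notation, first-consulted bit on the left), I believe the split only repoints one entry:

```python
# Directory before the split:
directory = {"00": "A", "01": "A", "10": "B", "11": "C"}

# Splitting A into A and D only repoints the entry whose second bit is 1;
# the directory itself does not grow, because its (global) depth is
# already 2 while A's (local) depth was only 1.
directory["01"] = "D"
```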

From the original directory, K3 and K4 both mapped to A. With this new trie, however, they should both map to D. Does this mean that I have to read the data for bucket A back from disk, and then rewrite part of it to bucket D? I.e., go through all of the entries for bucket A on disk (K1, K3, K4 and K6), regenerate their hash-codes (simple enough), and then find out where each one belongs after the split (A or D)? If so, isn't this a severe performance penalty? If it is possible to hold an entire bucket in memory, this means at least three seeks: one to read in the bucket that has been split, one to write out the entries that stay in the old bucket (in this case A), and one to write out the entries that go to the new bucket (in this case D). And if we can *not* hold an entire bucket in memory, this means that there are even more seeks.
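If I'm understanding it right, the redistribution step would look something like this (my own in-memory sketch; entries carry their hash-codes as bit strings in my notation, first-consulted bit first):

```python
def split_bucket(entries, local_depth):
    """Redistribute a full bucket's entries by the next low-order bit.

    entries: list of (key, bits) pairs read back from the old bucket,
             where bits is the hash-code, first-consulted bit first.
    local_depth: number of bits the old bucket was distinguished by
                 before the split (1 for bucket A here).
    Returns (old_bucket_entries, new_bucket_entries).
    """
    old, new = [], []
    for key, bits in entries:
        # The bit at position local_depth now separates old from new.
        (new if bits[local_depth] == "1" else old).append((key, bits))
    return old, new

# Bucket A's contents after K8 arrived, per the tables above:
entries_a = [("K1", "00"), ("K3", "01"), ("K4", "01"),
             ("K6", "00"), ("K8", "00")]
a, d = split_bucket(entries_a, 1)
# a keeps K1, K6 and K8; d gets K3 and K4
```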

Or am I looking at the implementation all wrong? I can understand the principles behind the splitting, but I am confused as to how we guarantee an entry goes to the correct bucket after we start splitting things up.

Many thanks,
Patrick

-- 
patrick_at_tendim.cjb.net
http://www.tendim.cjb.net/
Like anime?   Got hotline?
hotline://twoaday.cjb.net/
Received on Tue Dec 18 2001 - 03:05:23 CET
