High-Performance Tree structures
Date: 15 Aug 2001 15:43:45 -0700
Message-ID: <44d7582c.0108151443.1403f995_at_posting.google.com>
I would like some comments on a tree structure I have, and suggestions as to where I might send it for publication, along with good benchmarks in the field to test it against. I've tested ANSI C versions on LINUX, HP-UX, and Solaris. On LINUX I used the gcc compiler with no optimizations, and with the other OSs I used the native compiler with no optimizations.
I've designed, written, and tested an ordered tree structure that has a constant average amortized access time for additions, removals, and look-ups. Before everyone jumps on my case about Knuth's 1973 proof about comparison based sorting, I will mention that its sorting mechanism is not purely comparison based, but it will operate on any set of keys that can be put in an ascending or descending order.
This tree structure uses a theta(n) average case hierarchical sorting algorithm that has some similarity to the divide and conquer technique, but its tree is not very high, and stays at a rather constant height. The algorithm is O(nlog(n)), along with having better worst-case performance (<O(nlog(n))) for everything but a particular rare distribution. It is demonstrably theta(n) for random, gaussian, multimodal gaussian, and English text string distributions. This is about three times as fast as quicksort (for 100kB-1GB files), including the time this algorithm spends building the tree structure. Its relative worst-case performance is 30% slower than Quicksort in boolean distributions where Quicksort is O(n). Both the Quicksort library call and this sorting mechanism take in an identical and very fast comparison function as an argument, though this mechanism also uses distributional techniques.
This tree structure also uses a searching technique that is compatible with the hierarchy of the sorting technique mentioned above, and has a constant average look-up time. It is O(log(n)), but only performs this badly with a particular rare type of distribution. This look-up technique is about four times as fast as binary search, with no file type showing worse performance (the minimum improvement was 50%). Of course, in an ordered tree structure the search is somewhat different than binary search in a sorted array, and is probably faster. I'm not sure of the other methods currently in usage today, and may be testing against weaker mechanisms than the best.
I've combined these two techniques through a hierarchical tree structure, and think it would be useful for normal to high cardinality databases. I should note that this algorithm ends up using about 8 bytes more per datum than absolutely necessary for storage. The rare distribution which makes this technique take its full O(log(n)) worst-case amortized time is a wide, sparse branching structure that follows a very precise and unusual branch distance. I've never seen a real distribution that was similar to this at all. I don't expect to see it, but for accuracy it should be mentioned.
Does anyone have suggestions as to where I should look for places to publish and/or test/demonstrate this technique? Received on Thu Aug 16 2001 - 00:43:45 CEST