Re: Polygon data modelling question
Date: 20 Jan 2002 12:03:05 -0800
Message-ID: <57da7b56.0201201203.779632ab_at_posting.google.com>
Andreas Bauer <baueran_at_in.tum.de> wrote in message news:<74w28.8060$OO5.1389707291_at_news.tli.de>...
> There are, of course, more than just two data structures
> out there to address spatial data. Recently I have stumbled across the
> UB-Trees which I think are part of Transbase, and then we also have Grid
> Files and GiST for example.
1. GiST isn't strictly an access method either. It's an abstraction, or interface, that supports a variety of tree-structured access methods. You can build a B-Tree, an R-Tree, or even a UB-Tree by implementing the functions the GiST framework needs.
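To make that concrete, here's a minimal sketch of the four extension methods the GiST framework asks the access-method author to supply (the method names follow the GiST paper; the toy key type here is a 1-D interval, which is enough to get B-Tree-like behaviour -- all names and details are illustrative, not any real implementation):

```python
class IntervalKey:
    """Toy GiST key: a 1-D interval [lo, hi]."""

    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi

    def consistent(self, query_lo, query_hi):
        # May this subtree contain entries matching the query range?
        return self.lo <= query_hi and query_lo <= self.hi

    @staticmethod
    def union(keys):
        # Smallest key covering all child keys; used when propagating
        # inserts and splits up the tree.
        return IntervalKey(min(k.lo for k in keys), max(k.hi for k in keys))

    def penalty(self, other):
        # Cost of inserting `other` under this key: how much the
        # bounding interval would have to grow.
        grown = (max(self.hi, other.hi) - min(self.lo, other.lo))
        return grown - (self.hi - self.lo)

    @staticmethod
    def pick_split(keys):
        # Naive split policy: sort by lower bound and cut in half.
        keys = sorted(keys, key=lambda k: k.lo)
        mid = len(keys) // 2
        return keys[:mid], keys[mid:]
```

Swap in bounding boxes for the key type and you're most of the way to an R-Tree; swap in Z-region boundaries and you're heading toward a UB-Tree.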
Open problems in GiST include concurrency and recovery. Some work has been done here -- the Kornacker/Mohan paper -- but I'm not sure that work is as simple as it needs to be from an engineering point of view. For one thing, their concurrency control depends on predicate locking, which is a bit gnarly.

2. UB-Trees look a lot like a variation on a space-filling curve, with the added trick of navigating the index a little more intelligently. Still, I don't see how to use one to build indices that accelerate "Overlap(Polygon, Polygon)", and the style of navigation will either diminish concurrency in the structure or won't do a good job on repeatable read and serializable transactions. (You either block inserts in bigger subsets of the index, or you allow them and accept that you just can't do the higher isolation levels.) Another question in my mind is whether UB-Trees are any better than the space-filling curve techniques (Hilbert, Z-Curve, HHCode) used by other companies. "Better" here means throughput, not single-user response time. At some point it might all boil down to who can build their index with minimal code path length, and we're down to <10% differences.

3. Gridfiles can be very effective for points, but they have one dramatic drawback: you need an accurate understanding of the size and spatial skew of the data set before you can build the index. That is, you need to know the max/min of each dimension and the max density of any area. Otherwise you degenerate into cases with lots of unused space, or grid files that are *very* deep in a few areas (whereas techniques like the KDB-Tree/R-Tree/UB-Tree at least have the advantage of being height balanced). From a DBA point of view, imagine having to give a max/min value for each B-Tree column, and then rebuilding the index whenever a data value turns up that violates your original estimate. Gridfiles work well for demos but . . .
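The space-filling-curve trick behind point 2 is easy to sketch: interleave the bits of the coordinates to get a single Z-curve (Morton) code, then store the codes in an ordinary B-Tree. This is a hedged illustration of the general technique, not how any particular UB-Tree product encodes its keys:

```python
def morton_2d(x, y, bits=16):
    """Interleave the bits of x and y into a Z-curve (Morton) code.

    Points close together in 2-D space tend to get nearby codes, so a
    plain B-Tree over the codes clusters spatially-close points onto
    the same leaf pages. (Real UB-Tree range queries also need the
    smarter navigation mentioned above; this shows only the encoding.)
    """
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)      # x bits at even positions
        code |= ((y >> i) & 1) << (2 * i + 1)  # y bits at odd positions
    return code
```

The 2x2 grid at the origin maps to codes 0..3 in the familiar "Z" order, which is where the curve gets its name.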
Also, nobody has mentioned it, but the good old-fashioned mechanism of conjunctive B-Trees works pretty well too (i.e. scan more than one index and hash-join on the ROW_IDs before going back to the heap, much as star-joins work).

4. Folk who propose an access method that functions at very high D (D > ~40) have a *very* big obstacle to overcome. It looks like this. As the dimensionality of the object being indexed rises, so does its size: if the object's position along each axis is encoded with a fixed-length datum, then its size rises linearly with its axis count. Now, in order to facilitate those pesky DBMS properties like recovery and concurrency control, it is a *very* good idea to allocate data values to fixed-size pages. (If you reject this lemma, you take on the additional responsibility of explaining how to do recovery and concurrency over variable-length objects -- not impossible, but not easy either.) The problem comes when you realize that at sufficiently high D you end up with objects whose indexed values are *larger* than the page size. Even before that, at one object per page, you end up having to deal with a very large number of page reads (say, log base 2 of N for N objects). Bigger pages are a possibility (though they don't solve the underlying theoretical problem), but how much bigger? (GiST might help here.) The alternative is a table scan. Because a DBMS can derive such enormous advantages from read-ahead in scans, and can chunk multiple individual page reads into a single disk operation, you end up finding that scanning the entire data set can beat an index scan. If you write a paper today that doesn't compare your access method with a linear scan and show it to be better, your paper will have a very hard time being accepted.
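The conjunctive-B-Tree trick above can be sketched in a few lines. Assume each single-column index scan has already produced the ROW_IDs satisfying its own predicate; the point is that the (expensive) heap fetches happen only for rows that survive the join. Names and data shapes here are illustrative:

```python
def conjunctive_fetch(rowids_pred1, rowids_pred2, heap):
    """Hash-join two ROW_ID streams before touching the heap.

    rowids_pred1/rowids_pred2: ROW_IDs from two independent index
    scans; heap: mapping of ROW_ID -> row. Only rows satisfying BOTH
    predicates cost a heap access.
    """
    matched = set(rowids_pred1) & set(rowids_pred2)  # the hash join
    return [heap[rid] for rid in sorted(matched)]    # heap fetches
```

Sorting the surviving ROW_IDs before the heap fetch is the usual extra touch: it turns random heap I/O into something closer to a sequential sweep.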
For very high-dimensional structures the Pyramid Tree manages this trick, but its design means it has a number of degenerate "corner cases" where it blows up badly. All of this is known as the "curse of dimensionality". It's a bit of access-method physics: just as you (famously) "can't move heat from a colder to a hotter", you can't index a data set of N multi-dimensional objects where dimensions(N) > log(N). [I just made up that rule of thumb.]
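The page-size arithmetic from point 4 can be made concrete. A back-of-the-envelope sketch, with illustrative figures only (8 KB pages and an assumed 24-byte entry header -- not the numbers of any particular DBMS):

```python
PAGE_SIZE = 8192   # bytes per page (a common default, assumed)
COORD_SIZE = 8     # one 8-byte float per axis
OVERHEAD = 24      # per-entry header, an illustrative guess

def max_entries_per_page(d):
    """How many D-dimensional index entries fit on one fixed-size page.

    Key size grows linearly with D, so fan-out shrinks as D rises,
    and past some D a single entry no longer fits on a page at all.
    """
    key_bytes = OVERHEAD + d * COORD_SIZE
    return PAGE_SIZE // key_bytes
```

With these figures a 4-D entry leaves room for well over a hundred entries per page, but somewhere around a thousand dimensions the fan-out hits one and then zero -- the point where the fixed-page lemma and high D collide.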
> While I see the limitations of R-Trees clearly now (yes, I've read
> Guttman's paper as well), I'd be interested if anyone on the list has
> already made some experience with UB-Trees. When would you prefer them,
> compared to R-Trees for example?
#include <pontification.h>
UB-Trees (and a range of other space-filling curve techniques) probably outperform R-Tree indexing on box-containment queries (show me all the points within this box) where the performance objective is single-user response time over a fixed data set. I don't know enough to judge among these techniques on those problems.
My bet is on R-Trees for throughput and response time on volatile data sets of sized objects (box, circle, polygon, path, period, variant) with a workload of spatial-relationship predicates (overlap, within, contain, equal) up to about 4-D.
Other problem domains strike me as either very hard (high D) or ones where I just don't have a clue (nearest neighbor).
Hope this drives a few more comments/ideas/pontifications.
KR
Pb