Re: Polygon data modelling question

From: Paul G. Brown <paul_geoffrey_brown_at_yahoo.com>
Date: 20 Jan 2002 12:03:05 -0800
Message-ID: <57da7b56.0201201203.779632ab_at_posting.google.com>


Andreas Bauer <baueran_at_in.tum.de> wrote in message news:<74w28.8060$OO5.1389707291_at_news.tli.de>...
> There are, of course, more than just two data structures
> out there to address spatial data. Recently I have stumbled across the
> UB-Trees which I think are part of Transbase, and then we also have Grid
> Files and GiST for example.

  1. GiST isn't strictly an access method either. It's an abstraction, an
     interface supporting a variety of tree-structured access methods. You
     can build a B-Tree, an R-Tree, or even a UB-Tree by implementing the
     necessary functions for the GiST framework to use.

     Open problems in GiST include concurrency and recovery. Although some
     work has been done on this problem -- the Kornacker/Mohan paper -- I'm
     not sure that work is as reasonable/simple as it needs to be from an
     engineering POV. For one thing, K/M concurrency control depends on
     predicate locking, which is a bit gnarly.
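
     To make point 1 concrete, here's a rough C sketch of the kind of
     interface GiST asks an extension writer to supply. The names and
     signatures here are mine, not the real Hellerstein et al. API, but
     the four callbacks are the canonical ones from the paper (I'm
     leaving out Compress/Decompress):

         /* Sketch of a GiST extension interface; illustrative only.
          * The tree machinery (search, insert, split) is generic --
          * you specialize it by supplying these callbacks. */

         typedef struct gist_entry {
             void *key;        /* predicate describing a subtree */
             long  child;      /* page number of that subtree    */
         } gist_entry;

         typedef struct gist_ops {
             /* Can the query predicate hold for some tuple under e? */
             int    (*consistent)(const gist_entry *e,
                                  const void *query);

             /* Smallest key covering all entries on a page. */
             void  *(*union_keys)(const gist_entry *entries, int n);

             /* Cost of inserting new_e below e: area enlargement for
              * an R-Tree, trivially 0 or "impossible" for a B-Tree. */
             double (*penalty)(const gist_entry *e,
                               const gist_entry *new_e);

             /* Partition an overfull page: set goes_left[i] nonzero
              * for entries that stay on the left page. */
             void   (*picksplit)(const gist_entry *entries, int n,
                                 int *goes_left);
         } gist_ops;

     Plug in key-range comparisons and you get a B-Tree; plug in
     bounding-box operations and you get an R-Tree; a bit-interleaved
     range key would get you something UB-Tree-shaped. What the
     framework does *not* hand you is the concurrency and recovery
     story above.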

  2. The UB-Trees look a lot like a variation on a space-filling curve with
     the added trick of navigating the index a little more intelligently. 
     Still, I don't see how I can use it to build indices to accelerate 
     "Overlap(Polygon, Polygon)", and the style of navigation will either
     diminish concurrency in the structure or won't do a good job on
     repeatable read and serializable transactions. (You either block inserts
     in bigger sub-sets of the index, or you allow them and say that you
     just can't do the higher isolation levels.)

     Another question in my mind is whether or not UB-Trees are any better 
     than the space-filling curve techniques (Hilbert, Z-Curve, HHCode) used
     by other companies. "Better" is not about single-user response time, but
     throughput. At some point it might all boil down to who can build their 
     index with minimal code pathlength, and we're down to <10% differences.
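
     For anyone who hasn't seen one, the curve underneath a UB-Tree is
     easy to show: a Z-order (Morton) key is just a point's coordinates
     with their bits interleaved. A toy 2-D version (my code, certainly
     not Transbase's):

         #include <stdint.h>

         /* Toy 2-D Z-order (Morton) key: interleave the bits of x
          * and y.  Points near each other on the key tend to be near
          * each other in space, so an ordinary B-Tree over zkey(x,y)
          * is already a crude spatial index; the UB-Tree's claim is
          * that it walks ranges of this key more intelligently. */
         uint64_t zkey(uint32_t x, uint32_t y)
         {
             uint64_t key = 0;
             int b;
             for (b = 0; b < 32; b++) {
                 key |= (uint64_t)((x >> b) & 1) << (2 * b);
                 key |= (uint64_t)((y >> b) & 1) << (2 * b + 1);
             }
             return key;
         }

     Note that this indexes points. A polygon still has to be mapped to
     one or more key ranges before you can touch the index, which is
     exactly where the Overlap(Polygon, Polygon) trouble above starts.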

  3. Gridfiles can be very effective for points but have one dramatic
     drawback. You need an accurate understanding of the size and spatial
     skew of the data set you're working with before you can build the
     index. That is, you need to know the max/min of each dimension and
     the max density of any area. Otherwise you degenerate to cases with
     lots of unused space, and gridfiles that are *very* deep in a few
     areas (whereas techniques like KDB-Tree/R-Tree/UB-Tree at least have
     the advantage of being height-balanced.)

     From a DBA point of view, imagine having to give a max/min value for
     each B-tree column, and then rebuilding the index whenever a data value
     turns up that violates your original estimate. Gridfiles work well for
     demos but . . .  

     Also, nobody has mentioned it, but the good old-fashioned mechanism of
     conjunctive B-Trees works pretty well too. (i.e. scan more than one
     index & hash-join on the ROW_IDs before going back to the heap, much as
     star-joins work.)
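
     A sketch of what I mean, assuming each index scan hands back a list
     of rowids (toy open-addressing hash table, and the names are mine):

         #include <stdlib.h>

         /* Intersect the rowid lists from two independent B-Tree
          * scans: build a hash table on one list, probe with the
          * other, and only touch the heap for survivors.  Assumes
          * rowids are nonzero and each list is duplicate-free. */
         long *rowid_join(const long *a, int na,
                          const long *b, int nb, int *n_out)
         {
             int cap = 2, i;
             long *tab, *out;
             unsigned h;

             while (cap < 2 * na) cap <<= 1;    /* power-of-two size */
             tab = calloc(cap, sizeof *tab);    /* 0 == empty slot   */
             out = malloc((nb ? nb : 1) * sizeof *out);
             *n_out = 0;

             for (i = 0; i < na; i++) {         /* build on a        */
                 h = (unsigned)a[i] * 2654435761u & (cap - 1);
                 while (tab[h]) h = (h + 1) & (cap - 1);
                 tab[h] = a[i];
             }
             for (i = 0; i < nb; i++) {         /* probe with b      */
                 h = (unsigned)b[i] * 2654435761u & (cap - 1);
                 while (tab[h] && tab[h] != b[i])
                     h = (h + 1) & (cap - 1);
                 if (tab[h] == b[i]) out[(*n_out)++] = b[i];
             }
             free(tab);
             return out;                        /* caller frees      */
         }

     Whether this beats a single composite index is the usual optimizer
     judgment call; the point is that it needs no exotic access method
     at all.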

  4. Folk who propose an access method that functions at very high D (D~>40)
     have a *very* big obstacle to overcome. It looks like this.

     As the dimensions of the object being indexed rises, so does the size of
     the object. If you allow the object's position along each axis to be
     encoded with some fixed length datum then the size of the object rises
     linearly with its axis count. 

     Now, in order to facilitate those pesky DBMS properties like recovery and
     concurrency control it is a *very* good idea to allocate data values to
     fixed size pages. (If you reject this lemma, then you have the additional
     responsibility to explain how to do recovery and concurrency over variable
     length objects - not impossible but not easy either.) The problem comes
     up when you realize that at a sufficiently high D you end up with 
     objects whose indexed values are *larger* than the page size. Even before
     that, at one object per page, you end up having to deal with a very large
     number of page reads (say, log2(N) for N objects). Bigger
     pages are a possibility (though they don't solve the underlying theoretical
     problem) but how much bigger? (GiST might help here.)
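
     Back-of-envelope version of that argument, with invented but
     plausible numbers (8-byte coordinates, 4-byte child pointers, 8K
     pages, a million objects -- no product's actual figures):

         #include <math.h>
         #include <stdio.h>

         /* How fan-out and tree height decay as dimensionality D
          * grows, for a bounding-box key of two coordinates per axis.
          * All the constants are assumptions. */
         int main(void)
         {
             const double page = 8192.0, coord = 8.0, ptr = 4.0;
             const double n = 1e6;
             int d;

             for (d = 2; d <= 512; d *= 2) {
                 double key    = 2.0 * d * coord;  /* lo/hi per axis */
                 double fanout = page / (key + ptr);
                 if (fanout < 2.0) {
                     printf("D=%3d: fanout < 2, index degenerates\n",
                            d);
                     continue;
                 }
                 printf("D=%3d: fanout %6.1f, height %4.1f\n",
                        d, fanout, log(n) / log(fanout));
             }
             return 0;
         }

     At D=2 you get a fanout in the low hundreds and a tree two or three
     levels deep; by D=256 the fanout has dropped below 2, the height
     heads toward log2(N) ~ 20 page reads, and the sequential scan
     discussed next starts to look awfully good.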

     The alternative is a table scan. Because a DBMS can derive such enormous
     advantages from read-ahead in scans, and can chunk multiple individual 
     page reads into a single disk operation, you end up finding that scanning
     the entire data set can result in better performance than an index
     scan. If you try to write a paper today that doesn't compare your
     access method against a linear scan and show it winning, your paper
     will have a very hard time being accepted. For very high-dimensional
     structures the Pyramid Tree manages this trick, but the structure's
     design means it has a number of degenerate "corner cases" where it
     blows up badly.

     This lot are known as the "curse of dimensionality". It's a bit of 
     access method physics; just as you (famously) "can't move heat from a 
     colder to a hotter", you can't index a data set of N
     multi-dimensional objects where dimensions(N) > log(N). [I just made up
     that rule of thumb.]
 

> While I see the limitations of R-Trees clearly now (yes, I've read
> Guttman's paper as well), I'd be interested if anyone on the list has
> already made some experience with UB-Trees. When would you prefer them,
> compared to R-Trees for example?

#include <pontification.h>

    UB-Trees (and a range of other space-filling curve techniques) probably
    outperform R-Tree indexing on box containment queries (show me all the
    points within this box) where the performance objective is single-user
    response time over a fixed data set. I don't know enough to be able to
    judge among these techniques on these problems.

    My bet is on R-Trees for throughput & response time on volatile data
    sets of sized objects (box, circle, polygon, path, period, variant)
    with a workload of spatial relationship predicates (overlap, within,
    contain, equal) up to about 4D.

    Other problem domains strike me as either very hard (high D) or ones
    where I just don't have a clue (nearest neighbor).

     Hope this drives a few more comments/ideas/pontifications.

     KR

            Pb