Re: Nested Sets vs. Nested Intervals

From: Vadim Tropashko <vadimtro_invalid_at_yahoo.com>
Date: 15 Nov 2005 19:02:22 -0800
Message-ID: <1132110142.536887.34410_at_g43g2000cwa.googlegroups.com>


asdf wrote:
> Since I like to use an array of folders, I would prefer to use the
> adjacency list. I just found out that getting an ID out of an array of
> folders in the URL is extremely efficient with the adjacency list, the
> same efficency as nested intervals. Also, adjacency list is extremely
> fast for moving categories. Adjacency list gets the same speed of
> nested intervals if I use an array of folders in the URL.

What is "array of folders"? Materialized path? What is adjacency list? Celko calls adjacency relation as adjacency list.

> If I use an ID in the URL instead of an array of folders, it would turn
> out very inefficient because it has to use multiple joins to find the
> path. However, when I use an array of folders in the URL, the path is
> already in the URL, so I don't have to find the path again. That method
> is very efficient.
>
> Counting all of the listings recursively in a subcat is anyway very
> inefficient

Counting the size of the subtree is efficient for Nested Sets only. It is (relatively) inefficient for all the other models.

> for all of the methods so I would indeed use an adjacency
> set.How would I count all of the listings recursively in a subcat
> efficently? Should I store the number of listings below every cat so I
> don't have to recursively use the count function?

You can indeed store the size of a subtree in each node. The update penalty is refreshing all the nodes on the path to the root (which is small compared to the size of a tree).

> When I was in the
> 'Arts' category on dmoz.org, and I refreshed that category a couple of
> times, on every page load, it showed the numbers of listings
> differently every time in some subcats in the 'Arts' category. Do I
> have to add a row that stores how many listings are in each category,
> or are there any other efficient ways to do this?

Why do you need a row? I thought you need a column.

> Because when every
> time I insert a new listing (web page listing, not a node in a tree) in
> a category, all of the ancestors listings_count row have to be updated.
> That would be inefficient because the higher the category, the more
> often listings_count row have to be updated, and it would slow down the
> directory sufficiently because the hard disk have to rotate to the
> category_count row frequently so it would slow down other stuff (like
> selecting categories) considerably. How does dmoz.org do it?

If you ever find out, please post it here:-)

> However, the nested interval encoding can fit into the adjacency-list
> model, so I could done that with nested intervals with the same
> efficency as adjacency-list. Are there any special advantages for the
> nested intervals model over the extremely efficent "adjacency-list
> model with the array of folders in the URL"? If there aren't any
> special advantages in the nested intervals model, then I would use the
> adjacency list.

Well, I'm confused with your terminology. Here is a brief summary of each model anyway:

  1. Adjacency relation (tree edges; standalone, or combined with the tree nodes): - Have to use proprietory SQL extensions for finding ancestors and descendants; although the queries are efficient - Finding descendants is relatively efficient (i.e. proportional to the size of the subtree) - Finding node's children is trivial - Finding node's parent is trivial - Aggregate queries are relatively slow - Tree reorg is very simple
  2. Nested Sets - Finding descendants is easy and relatively efficient (i.e. proportional to the size of the subtree) - Finding ancestors is easy but inefficient - Finding node's children as all the descendants restricted to the next level is inefficient (e.g. consider root node) - Finding node's parent as ancestor on the previous level is inefficient due to inefficiency of ancestors search - Aggregate queries are relatively slow (except counting, which is super fast)! - Tree reorg is hard
  3. Materialized Path - Finding descendants is easy and relatively efficient (i.e. proportional to the size of the subtree) - Finding ancestors is tricky but efficient - Finding node's children as descendants on next level is inefficient - Finding node's parent as ancestor on the previous level is efficient - All aggregate queries are relatively slow - Tree reorg is easy
  4. Nested Intervals - Same as MP: Finding descendants is easy and and relatively efficient (i.e. proportional to the size of the subtree) - Same as MP: Finding ancestors is tricky but efficient - Same as AR: Finding node's children is trivial - Same as AR: Finding node's parent is trivial - All aggregate queries are relatively slow - Tree reorg is easy (but not as simple as in AR)
Received on Wed Nov 16 2005 - 04:02:22 CET

Original text of this message