Re: Creating a Hierarchical Structure

From: KWillets <kwillets_at_looksmart.net>
Date: 12 Sep 2001 16:32:58 -0700
Message-ID: <eef24980.0109121532.285cc61e_at_posting.google.com>


Bo Gundersen <bg_at_netnord.dk> wrote in message news:<3B963A1F.7080007_at_netnord.dk>...
> KWillets wrote:
> >>How would you, or what is the commonly accepted way, of creating a database
> >>which represents a hierarchical category structure ala Yahoo!?
> >
> > This works if every category has a single parent. However if a
> > category can have more than one parent (as in most web directories),
> > you need a more generalized representation. A typical implementation
> > would be as follows:
> >
> > create table node( node_id int primary key,...)
> > create table node_relationship( parent_node_id int references node,
> > child_node_id int references node,
> > primary key (parent_node_id, child_node_id))
> >
> > This structure can represent any directed graph structure, so some
> > problems can arise if someone creates a loop.
>
> How would you query this schema?
> I have no problem visualising a query of depth one (immideate
> childeren/parents), but I cannot see how/if it is possible to make
> recursive/hierachical queries (eg. find all children)?
>

That's correct, there is no way to query a set recursively in SQL-92. SQL3 (now called SQL 2000? SQL 1900?) allows recursive queries, but is only implemented to my knowledge by DB2, which has had it for 6-7 years.

The methods used for walking hierarchies in SQL-92 all depend on some kind of procedural code being run, either when the data is created or updated, or when the data is queried.

The code usually implements either depth-first search, breadth-first search, or transitive closure (constructing all possible ancestor/descendant pairs).

The depth-first methods are generally the slowest, due to the nature of database systems, the algorithm's single-threadedness, and the overhead of procedural code. Most implementations make a separate query for each edge traversed, yielding a glacial slowness. See Microsoft's "Expanding Hierarchies" chapter in their SQL server documentation for an example of how not to traverse a graph.

Breadth-first is somewhat better (say, 30 seconds instead of 30 hours). It's also less code. Here's a quickie example in SQL server:

create table #found( node_id int primary key)

insert into #found(node_id) values (_at_root_node_id) -- start with root

while( _at__at_rowcount > 0 )
 insert into #found(node_id) select distinct child_node_id  from #found p join node_relationship nr on p.node_id = nr.parent_node_id
 where child_node_id not in (select node_id from #found)

This code is fast, handles loops and multiple paths per node (the same node doesn't appear twice), and it's only three statements.

Another method is to create a table holding the transitive closure of the graph, in the form of (ancestor,descendant) pairs, where descendant is reachable from ancestor via one or arbitrarily many hops. Finding all the nodes under a given root is as simple as getting all descendants where ancestor id is the root node. It's also fairly easy to write the algorithm.

create table tc(anc_id int, desc_id int, primary key (anc_id, desc_id))

insert into tc select parent_node_id, child_node_id from node_relationship

while(_at__at_rowcount > 0 )
 insert into tc select distinct p.anc_id, c.desc_id  from tc p join tc c on p.desc_id = c.anc_id  where not exists (select * from tc t2 where t2.anc_id = p.anc_id

                    and t2.desc_id = c.desc_id)

This approach, of course, assumes that node_relationship won't change after tc is built. There is update logic published, but I'll leave that as an exercise for the reader.

One useful feature is that cycles in the graph appear as rows in the TC having anc_id = desc_id (since each node in a cycle has a path to itself). Received on Thu Sep 13 2001 - 01:32:58 CEST

Original text of this message