Re: Tree structure in Relational DB design

From: --CELKO-- <71062.1056_at_compuserve.com>
Date: 10 Jun 2002 13:26:46 -0700
Message-ID: <c0d87ec0.0206101226.5311c69e_at_posting.google.com>


I am not sure whether my first attempt at a reply posted or not, but I really got into this one and did a re-write.

--CELKO--


>> First of all, Mr. Celko conveniently fails to mention that his
Nested Sets model fails miserably if the hierarchy it needs to model is subject to a lot of update activity (inserting new nodes, deleting, moving branches, etc.), because then it needs to "renumber" a lot of nodes. <<

No, this is not actually a problem in practice.

The number of rows per datapage is quite large, because each row is just (node_key, lft, rgt) -- two integers and whatever the key datatype is. The renumbering tends to be done in main storage. If your SQL product uses indexes (not all do; hashing, bit vectors, etc. are other options) and you have a covering index, it is quite fast.

Say a dozen bytes (three 32-bit integers) per row and 16 Kbytes per datapage; allow a little overhead and you can still get more than 1300 nodes into one datapage. Since the renumbering proceeds left to right, it can be done in a partial table scan, starting at the first gap. Because nodes are added on the right side of their parent, the expected case is that fewer than half the rows will need to be renumbered on an insertion.
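Here is what such an insertion looks like -- a minimal sketch, assuming a Tree table with (node, lft, rgt) columns and a host variable :parent_rgt loaded with the parent's rgt value just before the transaction; the names are mine, for illustration only.

 BEGIN;
 -- open a two-number gap at the right edge of the parent;
 -- only rows from the gap rightward are touched
 UPDATE Tree
    SET rgt = rgt + 2,
        lft = CASE WHEN lft > :parent_rgt
                   THEN lft + 2 ELSE lft END
  WHERE rgt >= :parent_rgt;

 -- the new node drops into the gap as the rightmost child
 INSERT INTO Tree (node, lft, rgt)
 VALUES ('NewGuy', :parent_rgt, :parent_rgt + 1);
 COMMIT;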

>> An elementary improvement might be brought to Mr. Celko's model by
declaring the rgt and lft column as double precision floats instead of integers. Then inserts under a certain node and other tree modifications can be done a lot more conveniently, without affecting the whole tree structure. <<

Nope; try that and it is like skydiving with your lucky anvil. Things go fine for a while, then you hit some problems <g>.

Aside from the size of large floats (64 bits or bigger), most commercial databases do not run on hardware with floating point co-processors. Many commercial databases do not even have a floating point datatype -- Oracle, for example, is one such popular SQL product that you might see in use <g>.

But assuming that you do have hardware or software floating point support, the lack of precision catches up with you. Two different floats test equal if they are within an epsilon of each other. Finding that epsilon is a little tricky under the IEEE floating point standards, and it is hell to write as a CHECK() constraint yourself.

What you can do when you have a lot of insertions is leave gaps in the (lft, rgt) numbering. A 32-bit integer gives a range of over 4 billion unique values (we can use negative numbers too), which is usually big enough. If that is not big enough, there are some other tricks for breaking a tree into a forest, but I digress.
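For example (my numbers, just to illustrate the trick), scale the usual numbering by 1000:

 node   (lft, rgt)       with gaps
 'A'      (1, 6)       (1000, 6000)
 'B'      (2, 3)       (2000, 3000)
 'C'      (4, 5)       (4000, 5000)

A new sibling between 'B' and 'C' can now take, say, (3400, 3500) without renumbering a single row; the containment queries still work, because they depend only on the ordering of the numbers, not on their being consecutive.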

I have a newsgroup thread where a guy was building a message board and needed to keep the threads in a tree. He started with an adjacency list model and found the cursors to traverse and display the trees (message threads) to be too slow for his application. He claimed about a 30% improvement from switching to the Nested Sets model with large gaps.

>> Second, Mr. Celko falsely claims that the adjacency list model is
denormalized "in several ways". That's plain non-sense. It would only be true, if we could drop all the definitions of normal forms that are well established in database theory books, and adhere to his fuzzy ideas of what constitutes "normalized" . <<

Let's go back to Dr. Codd and the Basics and prove it. A table is denormalized when it has redundancies and anomalies.

Non-redundant = one fact, one place, one time

Anomaly = creation of false information, deletion of a fact, or the inability to record a fact in the database.

Now let's run a test:

CREATE TABLE OrgChart
(emp CHAR(10) NOT NULL PRIMARY KEY,
 boss CHAR(10) REFERENCES OrgChart(emp));

 OrgChart
 emp        boss
 =====================
 'Albert'   NULL
 'Bert'     'Albert'
 'Chuck'    'Albert'
 'Donna'    'Chuck'
 'Eddie'    'Chuck'
 'Fred'     'Chuck'

  1. 'Chuck' wants to change his name to 'Charles', but he appears in four places. If I update 'Chuck' only in the emp column, then 'Donna', 'Eddie' and 'Fred' are left reporting to a boss value that no longer matches any employee -- anomaly!
  2. If I update 'Chuck' only in the boss column, then 'Donna', 'Eddie' and 'Fred' are reporting to a boss who does not exist in the emp column -- anomaly!
  3. So I have to update 'Chuck' in both the emp and boss columns, as the sketch after this list shows -- a redundancy which leads to an anomaly!
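In SQL, keeping the table consistent means touching several rows for one fact -- here is a minimal sketch against the OrgChart table above (my statement, not from any product manual). The single-statement CASE version sidesteps the self-reference, which in many products would reject the first of two separate UPDATEs:

 UPDATE OrgChart
    SET emp  = CASE WHEN emp  = 'Chuck'
                    THEN 'Charles' ELSE emp END,
        boss = CASE WHEN boss = 'Chuck'
                    THEN 'Charles' ELSE boss END
  WHERE 'Chuck' IN (emp, boss);

One name change, four rows updated -- exactly the update anomaly defined below.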

>> I'd be very curious to see if Mr. Celko can demonstrate, if he will
or if he can at all, what normal form, as defined in standard database theory books <<

Since you wanted a reference, here is the first one I found grabbing a book off the shelf by my computer.

DEFINITION 5.5.1 Update Anomaly

A table T is subject to an update anomaly when changing a single attribute value for an entity instance or relationship instance represented in the table may require that several rows of T be updated.

(DATABASE DESIGN by Patrick O'Neil, page 328, Morgan-Kaufmann Publishers)

But why stop here? Let's fire 'Chuck' and see what happens.

  1. If I delete only the row where 'Chuck' is the employee, then 'Donna', 'Eddie' and 'Fred' are left reporting to a boss who no longer exists -- anomaly!
  2. If I delete the rows where 'Chuck' is the boss, then 'Donna', 'Eddie' and 'Fred' are also removed from the organization -- the Egyptian Pharaoh has his slaves buried with him, I guess.

If they had subordinates themselves, my organizational chart would become a forest of four separate trees, not a single organization. That is clearly destruction of information. And here is your reference again:

DEFINITION 5.5.2 Delete Anomaly, Insert Anomaly

A table T is subject to a delete anomaly when deleting some row of the table to reflect the disappearance of some instance of an entity or relationship can cause us to lose information about some instance of a different entity or relationship that we do not wish to forget.

The insert anomaly is the other face of this problem for inserts, where we cannot represent information about some entity or instance without including information about some other instance of an entity or relationship that does not exist.

(DATABASE DESIGN by Patrick O'Neil, page 328, Morgan-Kaufmann Publishers)
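Back at the keyboard, the firing looks like this -- a minimal sketch against the OrgChart table, with my comments:

 -- option 1: delete only Chuck's own row;
 -- Donna, Eddie and Fred now reference a boss who is gone
 DELETE FROM OrgChart WHERE emp = 'Chuck';

 -- option 2: delete his subordinates' rows too -- the Pharaoh option
 DELETE FROM OrgChart WHERE boss = 'Chuck';

 -- the only real repair is to re-attach the orphans first
 UPDATE OrgChart SET boss = 'Albert' WHERE boss = 'Chuck';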

>> As commonly accepted a table is normalized if it is in the 5NF. The
highest normalization possible is Domain Key Normal Form, but since generally one cannot normalize a schema up to DKNF, the 5NF is considered enough to say about a schema that is normalized. <<

The truth is that we are lucky to get 3NF as "commonly accepted practice" and we get 3NF only if someone was using an ER diagram. But I digress ...

>> ... Although he constructs a straw man here, because the common
practice is not to put the hierarchy information in the same table (let's say with Employee information in this case), but in a separate table, .. <<

Again, the common practice is to cram both node and structure information into one table. It is wrong, but it is common practice. Look at Oracle's Scott/Tiger database; look at Codd's first response to the objections that the Relational Model could not handle data as well as IMS; look at the postings on any database newsgroup. The adjacency list is the most obvious implementation for a procedural programmer. It is linked lists all over again.

What I find now when I teach a class is that XML and HTML programmers have no trouble with the Nested Sets. They see tags instead of linked lists.
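A quick illustration of what they see (mine, not from the original exchange): write a small org chart as tags, then number the tags from left to right --

 <Albert> <Bert></Bert> <Chuck></Chuck> </Albert>
    1        2     3       4      5         6

The (lft, rgt) pairs are just the tag positions: Albert = (1, 6), Bert = (2, 3), Chuck = (4, 5).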

>> ... and more so because usually a hierarchy has an existence that
is separate from the objects that occupy the nodes of the hierarchy. In this example the hierarchy of positions in the enterprise is separate from the persons that occupy positions at any given time. <<

Agreed -- a table should model a set of entities or a relationship, but not both. But you are ignoring the other problem; the adjacency list model is a tree, not a hierarchy. Subordination is not inherited in the adjacency list model, yet inheritance is the defining characteristic of a hierarchy.

>> In reality, when one models a hierarchy one has to choose a
representation that makes a compromise between the performance of updates to the tree structure and the performance of queries against the tree structure. <<

Not quite right. You can have four situations:

  1. Infrequent structure changes with frequent node changes:

A company organizational chart changes less often than the personnel. Maybe not at Enron, but this is pretty much true everywhere.

  2. Frequent structure changes with infrequent node changes:

Newsgroup message threads; multiple users are inserting all the time, but the nodes (messages) are never changed once they arrive.

  3. Infrequent structure changes with infrequent node changes:

Classification systems. The Dewey Decimal Classification scheme has been pretty stable at the higher levels for over 100 years.

  4. Frequent structure changes with frequent node changes:

Help! I cannot think of a good example for this one.

But I think I would be safe in saying that most applications do MANY more queries than insert/update/delete operations, so tuning for queries is more important.

>> The adjacency list model has the best performance for updates but
might pose some inconveniences in expressing recursive queries (like "get all employees that are subordinated directly or indirectly to employee X"), <<

I will grant the faster updates, but I do not see that as an advantage, because of what you lose in queries. The "inconveniences" are orders of magnitude, not a few microseconds, when the hierarchy is large.
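For reference, that recursive query ("all employees subordinated directly or indirectly to X") is a single self-join in the Nested Sets model -- a sketch assuming a Tree table with (emp, lft, rgt) columns, my names:

 SELECT Subordinates.emp
   FROM Tree AS Superiors, Tree AS Subordinates
  WHERE Subordinates.lft BETWEEN Superiors.lft AND Superiors.rgt
    AND Superiors.emp = 'X';

Add a test on Subordinates.emp <> 'X' if you want to exclude 'X' himself.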

>> However, in newer databases, this isn't a problem (Oracle has had a
CONNECT BY clause for hierarchies for quite some time, and in Oracle 9i support has been improved, and IBM DB2 v7.x supports recursive queries ). <<

Most applications with a hierarchy in their nature need queries that aggregate up that hierarchy. This is where cursors, whether explicit or hidden like Oracle's proprietary, non-relational CONNECT BY syntax, slow down too much. That extension is limited to one table, and the order of the output carries information about the levels (you mention Dr. Codd's information principle later; here is a violation of it).

The WITH operator in DB2 is a hidden loop that iterates over the table, building a UNION ALL result set. You can imagine the speed compared to one SELECT in the Nested Sets model.
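For comparison, here is roughly what the DB2 version of the subordinates query looks like against the OrgChart table -- a sketch, and the exact syntax varies by version:

 WITH Subordinates (emp) AS
  (SELECT emp FROM OrgChart WHERE boss = 'X'
   UNION ALL
   SELECT O.emp
     FROM OrgChart AS O, Subordinates AS S
    WHERE O.boss = S.emp)
 SELECT emp FROM Subordinates;

Each pass of that hidden loop adds one level of the tree to the UNION ALL result set.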

And all of these extensions are proprietary and will not port.

>> ... If you have a largely static hierarchy structure, you might use
your Nested Sets model very well, although in this case there is another structure that performs better. You just create a string artificial primary key where the key of the parent is the prefix for all the children. <<

This is Itzik Ben-Gan's Path Enumeration model. It is pretty fast for smaller, static hierarchy structures. The trade-offs are in the direction of the search; going from the root to the leaf nodes is fast (x LIKE '1.2.%'), but going from the leaf to the root is slow (x LIKE '%.3.4.5').
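A sketch of both directions, assuming a Tree table with a path column holding strings like '1.2.3.' (my layout, not Ben-Gan's exact design):

 -- descendants of node '1.2.': a prefix search, so an index on path can help
 SELECT * FROM Tree WHERE path LIKE '1.2.%';

 -- ancestors of node '1.2.3.4.': every row's path must be tested as a
 -- prefix of the node's own path, so no index on path can help
 SELECT * FROM Tree WHERE '1.2.3.4.' LIKE path || '%';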

The string handling needed for searching and altering the tree structure can get a little complex if the artificial key is poorly designed. Using your numeric example, you need to clean up the data so that '.01.' and '.1.' do not both appear and mean the same thing. Likewise, if the same substring appears in an artificial key more than once, you have to be able to tell what each occurrence means and not replace or delete the wrong one(s). Parts explosions have this problem since the same subassembly or part might be used in several assemblies.

>> The solution with establishing the hierarchy by the property of
string keys of being prefixes of all the children key, is far from perfect if the hierarchy is again subject to heavy updates, but in case of largely stable hierarchies it performs quite well (I have reasons to suspect it performs a lot better than Mr. Celko's Nested Sets model). <<

We did some measurements on this in one newsgroup thread. The results were mixed. In 25 words or less, the Nested Sets did better as the Path Enumeration strings got longer. No great surprise.

Another weakness is that structure is hidden in path enumerations. (I hope this ASCII graphic survives).

                  Robert
                   /   \
                  /     \
                 /       \
                /         \
           Albert        Allen
          /     \         /   \
         /       \       /     \
       Bert     Chuck   Brad    Chico
               /  | \            /| \
              /   |  \          / |  \
             /    |   \        /  |   \
            /     |    \      /   |    \
          Donna Eddie Fred David Ethen Fannie


Albert and Allen supervise subtrees with the identical STRUCTURE, but not the same nodes. I can find this structural similarity easily in the Nested Sets model by using a little algebra on the (lft, rgt) pairs of each subtree to set them to (lft - MIN(lft), rgt - MIN(lft)). Union the canonical forms and see if the table is made of pairs of identical rows.
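Here is a sketch of the idea, assuming a Tree table with (emp, lft, rgt); subtracting the subtree root's lft (which is the MIN(lft) of the subtree) shifts each subtree into canonical form:

 -- canonical form of the subtree under 'Albert', minus that of 'Allen';
 -- run the EXCEPT in both directions (or compare row counts):
 -- if both results are empty, the two structures are identical
 SELECT T.lft - R.lft AS lft, T.rgt - R.lft AS rgt
   FROM Tree AS T, Tree AS R
  WHERE R.emp = 'Albert'
    AND T.lft BETWEEN R.lft AND R.rgt
 EXCEPT
 SELECT T.lft - R.lft, T.rgt - R.lft
   FROM Tree AS T, Tree AS R
  WHERE R.emp = 'Allen'
    AND T.lft BETWEEN R.lft AND R.rgt;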

Now, try to write such a query in either the Path Enumeration or the Adjacency List model without using procedural code. Even with procedural code and recursion this is a great exercise in converting LISP to SQL.

The problem is that the Path Enumeration and Adjacency List models do not represent all of the structure -- each row in them is a part of a fact, but not the whole fact.

>> And further more it kind of breaks Dr. Codd's information principle
that every information should be represented explicitly as a value in a particular column of a particular row. To account for this principle the hierarchical relation would then have to be represented explicitly only in the adjacency list model (the Nested Sets model fails as well), but in practice since we're not dealing with relational databases anyway, we have to adjust to what we have at hand. <<

No, the use of a (lft, rgt) pair as a scalar value is fine in the relational model; SQL just makes you use two columns right now, instead of letting you create a datatype or domain for it.

Remember that "atomic" (i.e. scalar) comes from the Greek word for "without parts" -- if you split it up, the value is destroyed. More common examples are durations of time (start_time, ending_time) and co-ordinates (Cartesian or Polar).

>> The theoretically "right" solution is to use the adjacency list
model, and hope that RDBMSs will eventually come up with the right solution not only to express a recursive query (which can be done currently), but also to index and optimize for speedy retrieval of such queries. <<

The adjacency list model is based on graph theory, and the relational model is based on sets. Graphs are not sets. There are set-oriented tree models other than the Nested Sets model. So I do not understand why you want to break the relational model when there is no need to do so.

Received on Mon Jun 10 2002 - 22:26:46 CEST
