Re: Tree structure in Relational DB design

From: Costin Cozianu <c_cozianu_at_hotmail.com>
Date: 10 Jun 2002 22:24:40 -0700
Message-ID: <2cf20de8.0206102124.36579f60_at_posting.google.com>


Mr. Celko,

On second consideration I must admit I rushed my appreciation of the performance of the nested set model. It is likely that renumbering is not a significant issue in the average practical case.

However, it is extremely dubious what happens under concurrency, if several distinct connections attempt operations that trigger the renumbering. The mere fact that a simple update or insert results in a cascaded update is not very encouraging.

Again, under particular practical conditions one might find this perfectly acceptable; I wasn't advocating dropping the nested set model altogether.

However, I must take issue with your strange idea of what "normalized" means. By your account, any schema whatsoever that has a foreign key is denormalized, because if one updates the primary key or deletes the record altogether, the effects have to be propagated to every place it is referenced.

In fact, this is not the case, or we would have to give up the very idea of normalization altogether. Normalization, as commonly accepted, refers to the elimination of functional and join dependencies.

There may be other dependencies that are not removed by the normalization process, and there are some dependencies for which the current theoretical model cannot offer the necessary support to avoid them. Some dependencies are unavoidable, because that's how they are in the domain being modeled. You are the son of your father, so part of any of your records will depend upon the identifying information of your father, no matter how clever a schema you want to devise.

I'll address your specific definitions below.

>
> >> An elementary improvement might be brought to Mr. Celko's model by
> declaring the rgt and lft column as double precision floats instead of
> integers. Then inserts under a certain node and other tree
> modifications can be done a lot more conveniently, without affecting
> the whole tree structure. <<
>
> Nope; try that and it was like skydiving with your lucky anvil.
> Things go fine for awhile, then you hit some problems <g>.
>
> Aside from the size of large floats (64 bits or bigger), most
> commercial databases do not run on hardware with floating point
> co-processors. Many commercial databases do not even have a floating
> point datatype -- Oracle, for example, is one such popular SQL product
> that you might see in use <g>.

Yep, only 38 digits of precision. Should be enough, though, for many trees :)

> But assuming that you do have hardware or software floating point
> support, the lack of precision catches up with you. Two different
> floats test equal if they are within an epsilon of each other.
> Finding the epsilon is a little tricky in the IEEE Floating Point
> Standards and it is hell to write it as a CHECK () constraint
> yourself.
 

> I have a newsgroup thread where a guy was building a message board and
> needed to keep the threads in a tree. He started with an adjacency
> list model and found the cursors to traverse and display the trees
> (message threads) to be too slow for his application. He claimed
> about a 30% improvement from switching to the Nested Sets model with
> large gaps.

Of course cursors over adjacency lists are slow, unless the DBMS engine offers a declarative way to physically optimize the table layout for such queries.

> >> Second, Mr. Celko falsely claims that the adjacency list model is
> denormalized "in several ways". That's plain non-sense. It would only
> be true, if we could drop all the definitions of normal forms that are
> well established in database theory books, and adhere to his fuzzy
> ideas of what constitutes "normalized" . <<
>
> Let's go back to Dr. Codd and the Basics and prove it. A table is
> denormalized when it has redundancies and anomalies.

Well, that's by your definition. In fact, as I mentioned before, any table related to another by means of a foreign key has "redundant" values. This is not a criterion by which a table is said to be denormalized.

> Non-redundant = one fact, one place, one time

That's exactly the case. When I say INSERT INTO OrgChart VALUES ('Costin', 'Celko'), I declare the proposition that there is an employee with the name 'Costin' (which is also the identifying info) and that he is subordinated to THE employee named 'Celko'. I'm not reasserting anything declared in Celko's row.
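To make this concrete, here is a minimal sketch in Python with SQLite (TEXT instead of CHAR(10) for brevity; the table mirrors the OrgChart example quoted further down): each INSERT asserts exactly one new proposition, and the boss value is a reference to another row, not a restatement of it.

```python
import sqlite3

# In-memory sketch of the OrgChart adjacency list under discussion.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("""
    CREATE TABLE OrgChart (
        emp  TEXT NOT NULL PRIMARY KEY,
        boss TEXT REFERENCES OrgChart(emp)
    )""")

# Each INSERT asserts exactly one proposition: "emp reports to boss".
conn.execute("INSERT INTO OrgChart VALUES ('Celko', NULL)")
conn.execute("INSERT INTO OrgChart VALUES ('Costin', 'Celko')")

# The boss value is a reference, not a redundant copy of Celko's row:
# the fact "Celko has no boss" lives in one row, one place, one time.
rows = conn.execute("SELECT emp, boss FROM OrgChart ORDER BY emp").fetchall()
```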

But some facts are related to others by necessity: I have to refer to a related entity in some way, and that way is through the values of some candidate key. This is not denormalization by any account.

> Anomaly = creation of false information, deletion of a fact, or the
> inability to record a fact in the database.

Yes, with the mention that tables in 5NF are possibly subject to some subtle update anomalies that functional and join dependencies don't account for.

However, they are said to be normalized. Could we agree at least that a table in 5NF is said to be normalized?

> Now let's run a test:
>
> CREATE TABLE OrgChart
> (emp CHAR(10) NOT NULL PRIMARY KEY,
> boss CHAR(10)
> REFERENCES OrgChart(emp));
>
> OrgChart
> emp boss
> ===================
> 'Albert' 'NULL'
> 'Bert' 'Albert'
> 'Chuck' 'Albert'
> 'Donna' 'Chuck'
> 'Eddie' 'Chuck'
> 'Fred' 'Chuck'
>
> 1) 'Chuck' wants to change his name to 'Charles', but he appears in
> four places. If I update 'Chuck' only in the employee column then
> 'Donna', 'Eddie' and 'Fred' are reporting to a false boss value --
> anomaly!
 

> 2) If I update 'Chuck' only in the boss column then 'Donna', 'Eddie'
> and 'Fred' are still reporting to a false boss value -- anomaly!
>
> 3) So I have to update 'Chuck' in both employee and boss column -- a
> redundancy which leads to an anomaly!

Following your logic, no schema ever designed by Celko was normalized, because I'm sure it must have contained at least one foreign key. And if we try to update or delete the master record, we run into trouble.

> >> I'd be very curious to see if Mr. Celko can demonstrate, if he will
> or if he can at all, what normal form, as defined in standard database
> theory books <<
>
> Since you wanted a reference, here is the first one I found grabbing a
> book off the shelf by my computer.
>
> DEFINITION 5.5.1 Update Anomaly
>
> A table T is subject to an update anomaly when changing a single
> attribute value for an entity instance or relationship instance
> represented in the table may require that several rows of T be
> updated.

>

> (DATABASE DESIGN by Patrick O'Neil, page 328, Morgan-Kaufmann
> Publishers)

Well, I was asking precisely for the definition of a normal form (1, 2, 3, 4, or 5, you choose) and how OrgChart breaks it; this you have failed to produce.

Update anomalies may be a consequence of denormalization, but they are not in general a criterion for it. Some of them are unavoidable; they belong to the very nature of modeling. Some can be resolved with declarations such as ON UPDATE CASCADE; others need to be handled by other means (at the application level or through database stored procedures/triggers).
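As a concrete illustration of the ON UPDATE CASCADE point, here is a sketch with Python's sqlite3, assuming an engine that enforces cascading foreign key actions (the table mirrors Mr. Celko's OrgChart example): renaming 'Chuck' is a single declarative update, and the engine propagates it to every referencing row, so no false boss value ever appears.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite needs this opt-in
conn.execute("""
    CREATE TABLE OrgChart (
        emp  TEXT NOT NULL PRIMARY KEY,
        boss TEXT REFERENCES OrgChart(emp) ON UPDATE CASCADE
    )""")
conn.executemany("INSERT INTO OrgChart VALUES (?, ?)", [
    ('Albert', None), ('Chuck', 'Albert'),
    ('Donna', 'Chuck'), ('Eddie', 'Chuck'), ('Fred', 'Chuck'),
])

# One declarative statement; the engine keeps all references consistent.
conn.execute("UPDATE OrgChart SET emp = 'Charles' WHERE emp = 'Chuck'")
bosses = {r[0] for r in conn.execute(
    "SELECT DISTINCT boss FROM OrgChart WHERE boss IS NOT NULL")}
```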

For example, how would you model in a relational database the grid of an electronic circuit, with the constraint that the grid must at any given time be a planar graph?

>
> But why stop here? Let's fire 'Chuck' and see what happens.
>
> 1) If I delete 'Chuck' only in the employee column then 'Donna',
> 'Eddie' and 'Fred' are reporting to a false boss value -- anomaly!
>
> 2) If I delete 'Chuck' only in the boss column then 'Donna', 'Eddie'
> and 'Fred' are also removed from the Organization -- The Egyptian
> Pharaoh has his slaves buried with him, I guess.
>
> If they had subordinates themselves, my organizational chart would
> become a forest of four separate trees, not a single organization.
> That is clearly destruction of information. And here is your
> reference again:
>
> DEFINITION 5.5.2 Delete Anomaly, Insert Anomaly
>
> A table T is subject to an delete anomaly when deleting some row of
> the table to reflect the disappearance of some instance of an entity
> or relationship can cause us to lose information about some instance
> of a different entity or relationship that we do not wish to forget.
>
> The insert anomaly is the other face of this problem for inserts,
> where we cannot represent information about some entity or instance
> without including information about some other instance of an entity
> or relationship that does not exist.

>

> (DATABASE DESIGN by Patrick O'Neil, page 328, Morgan-Kaufmann
> Publishers)

Exactly by this definition, if I insert a node under a "leftish" node in your nested set model, I'd generate lots of invalid facts (a big part of the tree would need to be renumbered).
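The cascaded renumbering can be sketched as follows (Python with SQLite, a toy three-node tree; the insert logic is the textbook nested set insert, not anyone's particular implementation): even in this tiny tree, adding one node under the "leftish" node B forces three of the four pre-existing lft/rgt values to change.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Tree (node TEXT PRIMARY KEY, lft INT, rgt INT)")
# A tiny tree: A contains B and C.   A(1,6)  B(2,3)  C(4,5)
conn.executemany("INSERT INTO Tree VALUES (?, ?, ?)",
                 [('A', 1, 6), ('B', 2, 3), ('C', 4, 5)])

def insert_under(conn, parent, child):
    """Insert child as the last child of parent; every row to the
    'right' of the insertion point must be renumbered."""
    (rgt,) = conn.execute("SELECT rgt FROM Tree WHERE node = ?",
                          (parent,)).fetchone()
    # The cascaded update: it touches a big part of the tree.
    conn.execute("UPDATE Tree SET rgt = rgt + 2 WHERE rgt >= ?", (rgt,))
    conn.execute("UPDATE Tree SET lft = lft + 2 WHERE lft > ?", (rgt,))
    conn.execute("INSERT INTO Tree VALUES (?, ?, ?)", (child, rgt, rgt + 1))

insert_under(conn, 'B', 'D')  # insert under the "leftish" node B
renumbered = conn.execute(
    "SELECT node, lft, rgt FROM Tree ORDER BY lft").fetchall()
```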

> >> As commonly accepted a table is normalized if it is in the 5NF. The
> highest normalization possible is Domain Key Normal Form, but since
> generally one cannot normalize a schema up to DKNF, the 5NF is
> considered enough to say about a schema that is normalized. <<
>
> The truth is that we are lucky to get 3NF as "commonly accepted
> practice" and we get 3NF only if someone was using an ER diagram. But
> I digress ...

The sad truth is that practitioners are unaware of what 5NF is, and the "apparent" difficulty of 5NF is used as a scarecrow. The reality is that once one reaches BCNF (most would reach BCNF instinctively, because schemas that violate BCNF are usually easy to spot just by common sense) there is absolutely no reason whatsoever not to have the schema in 5NF.

Many practical schemas are in 5NF anyway, because they don't deal with ternary or higher-order relationships (the only case in which one could break 4NF or 5NF), like, for example, the very common Microsoft Northwind sample. It is amusing, though, how practitioners are scared of 5NF, as if it were something out of the ordinary being required of them.
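The kind of dependency that separates 4NF/5NF from BCNF shows up only with such ternary relationships. A small self-contained sketch in plain Python sets, with a hypothetical supplier-part-project instance (names are illustrative): when the cyclic join dependency holds, decomposing into the three binary projections is lossless; drop one tuple and the three-way join manufactures it back as a spurious tuple.

```python
# Hypothetical supplier-part-project instance satisfying the cyclic
# join dependency *(SP, PJ, JS); names are illustrative only.
spj = {('s1', 'p1', 'j2'), ('s1', 'p2', 'j1'),
       ('s2', 'p1', 'j1'), ('s1', 'p1', 'j1')}

def join_of_projections(r):
    """Project onto the three binary relations and join them back."""
    sp = {(s, p) for s, p, j in r}
    pj = {(p, j) for s, p, j in r}
    js = {(j, s) for s, p, j in r}
    return {(s, p, j) for s, p in sp
                      for p2, j in pj if p2 == p
                      if (j, s) in js}

# The join dependency holds: the three-way decomposition is lossless.
lossless = join_of_projections(spj) == spj

# Remove one tuple and the join manufactures it back -- a spurious
# tuple, so the smaller instance violates the dependency.
smaller = spj - {('s1', 'p1', 'j1')}
spurious = join_of_projections(smaller) - smaller
```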

> >> ... Although he constructs a straw man here, because the common
> practice is not to put the hierarchy information in the same table
> (let's say with Employee information in this case), but in a separate
> table, .. <<
>
> Again, the common practice is to cram both node and structure
> information into one table. It is wrong, but it is common practice.
> Look at the Oracle's Scott/Tiger database; look at Codd' first
> response to the objections that the Relational Model could not handle
> data as well as IMS; look at the posting on any database newsgroup.
> The adjacency list is the most obvious implementation for a procedural
> programmer. It is linked lists all over again.

Well, I was referring to common practice "among the connoisseurs". Really,
it is quite natural to model the hierarchy separately from its occupants, and from my very limited point of view it is (or should be) common.

> >> ... and more so because usually a hierarchy has an existence that
> is separate from the objects that occupy the nodes of the hierarchy.
> In this example the hierarchy of positions in the enterprise is
> separate from the persons that occupy positions at any given time. <<
>
> Agreed -- a table is a set of entities or relationship, but not both.
> But you are ignoring the other problem; the adjacency list model is a
> tree, not a hierarchy. Subordination is not inherited in the
> adjacency list model. Yet inheritance is the defining characteristic
> of a hierarchy.

A describing attribute of an entity may be its relation to another entity.
For example, in an Employee table you might have a DeptNo attribute without any need whatsoever to maintain this relationship in a separate table.

> >> In reality, when one models a hierarchy one has to choose a
> representation that makes a compromise between the performance of
> updates to the tree structure and the performance of queries against
> the tree structure. <<
>
> Not quite right. You can have four situations:
>
> 1) Infrequent structure changes with frequent node changes:
>
> A company organizational chart changes less often than the personnel.
> Maybe not at Enron, but this is pretty much true everywhere.
>
> 2) Frequent structure changes with infrequent node changes:
>
> Newsgroup message threads; multiple users are inserting all the time
> (threads), but the nodes (messages) are never changed once they get
> there.

>

> 3) Infrequent structure changes with infrequent node changes:
>
> Classification systems. The Dewey decimal Library Classification
> scheme has been pretty stable at the higher levels for over 100 years.
>
> 4) Frequent structure changes with frequent node changes:
>
> Help! I cannot think of a good example for this one.
>
> But I think I would be safe in saying that most applications do MANY
> more queries than insert/update/delete operations, so tuning for
> queries is more important.
>
> >> The adjacency list model has the best performance for updates but
> might pose some inconveniences in expressing recursive queries (like
> "get all employees that are subordinated directly or indirectly to
> employee X"), <<
>
> I will grant the faster updates, but I do not see it as an advantage
> because of what you lose in queries. The "inconveniences" are orders
> of magnitude, not a few micro-seconds, when the hierarchy is large.
>
> And all these extensions are both proprietary and will not port.

Well, all databases are proprietary to a non-trivial degree anyway. Considering how few of the queries issued in a typical application system are recursive queries, one might find a proprietary extension acceptable.

> The adjacency list model is based on graph theory and the relational
> model is based on sets. Graphs are not sets. There are set-oriented
> tree models other than the Nested Sets model. So, I do not understand
> why you want to break the relational model when there is no need to do
> so.

Sure enough, graphs are not sets. Trees are not sets, directed acyclic graphs are not sets, forests are not sets, and so on, so forth. But in essence graphs, trees, and forests can easily be modeled as binary relations over the set of nodes. That's what the adjacency list does.

The relatively difficult part is to provide a general theory of operators over such structures that can fit into a relational database engine. Encoding trees as nested sets is a clever mathematical artifice, I grant you that, but it is not the essence of the facts we need to model. It is merely an encoding, and the update anomalies are a reflection of that.

A binary relation over the set of nodes (such as parent-child) is essentially what this is all about, and binary relations are modeled by a "relationship" table. The query we need to answer, "all the children of a given node", is in essence a recursive query, and to this end an ideal DBMS engine might provide a special physical layout (maybe a specialized index) and an optimizer hook that take it into account.
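Where an engine does support recursion, "all employees subordinated directly or indirectly to X" over the plain adjacency list is a short query. A sketch with Python's sqlite3 (SQLite's WITH RECURSIVE; data from the OrgChart example; note that recursive common table expressions were standardized only later, in SQL:1999):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE OrgChart (emp TEXT PRIMARY KEY, boss TEXT)")
conn.executemany("INSERT INTO OrgChart VALUES (?, ?)", [
    ('Albert', None), ('Bert', 'Albert'), ('Chuck', 'Albert'),
    ('Donna', 'Chuck'), ('Eddie', 'Chuck'), ('Fred', 'Chuck'),
])

# All direct and indirect subordinates of 'Albert', computed over the
# bare binary parent-child relation -- no encoding of the tree needed.
subs = [r[0] for r in conn.execute("""
    WITH RECURSIVE under(emp) AS (
        SELECT emp FROM OrgChart WHERE boss = 'Albert'
        UNION
        SELECT o.emp FROM OrgChart o JOIN under u ON o.boss = u.emp
    )
    SELECT emp FROM under ORDER BY emp""")]
```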

The theory for providing such features is already there, if I'm not mistaken, but DBMS vendors are busy implementing XML and other UFOs.

Received on Tue Jun 11 2002 - 07:24:40 CEST
