Re: Bill of materials / Groups in Groups

From: Harry Chomsky <harryc_at_chomsky.net>
Date: 2000/01/13
Message-ID: <EWqf4.23$Sf.2906_at_nnrp3-w.snfc21.pbi.net>


joe_celko_at_my-deja.com wrote in message <85l2o8$8ri$1_at_nnrp1.deja.com>...
>
>>> Now I'm very confused. What definition of normalization applies
>here? Are we talking about 3NF or BCNF or some other mathematical
>definition? Or is it just the intuitive idea that "each fact should be
>stored in only one place"? <<
>
>Actually that is a cute phrase to define normalization -- the more
>formal version is that inserts, updates and deletes do not cause
>anomalies in the database. That is, the database is always in a valid
>and internally consistent state at the end of a transaction. 1NF to
>5NF, BCNF, DKNF, et al are _kinds_ of normal forms which prevent
>certain kinds of anomalies.

OK, you're using a definition of normalization that is quite different from any I've seen in a textbook. Essentially, given any database design and any set of business rules, you can make the DB be "normalized" (in this sense) with respect to the rules simply by adding constraints and triggers. So I'm not sure how useful your definition really is.

For instance, here's a standard example of a non-normalized database design:

CREATE TABLE Employee (
  EmployeeID int,
  EmployeeName char(30),
  DepartmentID int,
  DepartmentName char(30))

The standard RDB approach to fixing this table is to break it into two tables, one for employees and one for departments. But we could instead fix it by adding a bunch of constraints, triggers and stored procedures and leaving the table as is. For example, a constraint might state that whenever two rows share the same DepartmentID they must also share the same DepartmentName. A stored procedure for adding an employee might make the DepartmentName field optional and look it up among the already-existing rows.
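
To make the contrast concrete, here's a rough sketch of both fixes. The syntax is approximately SQL-92, the key and constraint names are mine, and CREATE ASSERTION is rarely implemented in real products, so take this as illustration rather than working code:

-- (a) The standard fix: split the single table in two.
CREATE TABLE Department (
  DepartmentID int PRIMARY KEY,
  DepartmentName char(30) NOT NULL)

CREATE TABLE Employee (
  EmployeeID int PRIMARY KEY,
  EmployeeName char(30),
  DepartmentID int REFERENCES Department (DepartmentID))

-- (b) The "leave the table alone" fix: keep the original single table and
-- enforce the dependency DepartmentID -> DepartmentName with a constraint.
CREATE ASSERTION DepartmentNamesAgree CHECK (NOT EXISTS
  (SELECT *
     FROM Employee AS E1, Employee AS E2
    WHERE E1.DepartmentID = E2.DepartmentID
      AND E1.DepartmentName <> E2.DepartmentName))

Approach (b) keeps the database internally consistent, but every insert and update now has to re-check that assertion, and the department name is still stored redundantly in every employee row.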

I'm obviously not advocating this approach! However, it would make the database "normalized" according to your definition. I'd recommend that you adopt a stronger and more traditional definition if you want to make your arguments useful.

>>> Getting back to the employee/boss database... Consider the standard
>formulation, where the Employee table has a Boss attribute [adjacency
>list model]... I don't understand why you say it's not normalized. <<
>
>There is nothing wrong, per se, with a primary and foreign key in the
>same table. In fact, the FIPS-127 SQL Test suite had a test for this.
>That is not the problem. When I delete Chuck, I destroy the fact that
>the big boss is still the superior of all of Chuck's subordinates
>(deletion anomaly -- I destroyed a fact). The nested sets model
>preserves that fact. If I want to create a new position in the
>company, I cannot do it until I have a person to fill it (insertion
>anomaly).

Well, in your model, if you naively make changes to the lft and rgt attributes of some rows, you will destroy the integrity of your data. So you have to prevent updates from being made by any mechanism other than a carefully written stored procedure that maintains the integrity. If you're allowed to write stored procedures, then so am I. In the adjacency model, I can write a stored procedure to delete Chuck and automatically remap all of his subordinates to have Chuck's boss as their boss. In both cases, naive individual updates to the raw data can damage its integrity, but stored procedures can easily be written to make the necessary changes while maintaining it. I don't see how one design is "more normalized" than the other, either by your weak definition or by any standard definition.
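
For what it's worth, here is roughly what the body of that stored procedure would look like in the adjacency list model. I'm assuming the columns are EmployeeID, EmployeeName and Boss, with Boss holding the EmployeeID of the immediate superior; the procedure wrapper and transaction syntax are vendor-specific, so I'll just show the two statements:

-- Re-point Chuck's direct subordinates at Chuck's own boss...
-- (assumes EmployeeName identifies Chuck uniquely)
UPDATE Employee
   SET Boss = (SELECT Boss
                 FROM Employee
                WHERE EmployeeName = 'Chuck')
 WHERE Boss = (SELECT EmployeeID
                 FROM Employee
                WHERE EmployeeName = 'Chuck')

-- ...then remove Chuck himself.
DELETE FROM Employee
 WHERE EmployeeName = 'Chuck'

Two statements, no navigation, and no fact about who ultimately reports to whom is lost.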

As for "creat[ing] a new position in the company" without an employee to fill it, well, here it seems that you're changing the business rules on me. Now you want the database to represent "positions in the company", rather than employees. If you'd like to discuss possible designs for such a database, I'd prefer to start over. I'm assuming that this discussion is about how to represent employees in a company, not positions.

>>> I suspect you really want the lft and rgt attributes to have a
>continuous datatype such as float or fraction, rather than integer. If
>you use integers, then adding a new node to the tree may require a
>complex and expensive series of cascaded changes to the rest of the
>tree. This would be a serious violation of the intuitive idea of
>"normalization" that you seem to be working with. But with fractions
>you can probably avoid this difficulty. <<
>
>Nope, I want integers, altho any numeric datatype will work. Integers
>are small and fast to work with. A lot of the good tricks with the
>nested sets model depend on the fact that the number of nodes
>subordinate to node x can be computed by (rgt - lft +1)/2.
>
>Changing the structure of the tree requires ONE insert or delete
>statement to add or remove a subtree, followed by ONE update statement
>to recompute the lft and rgt values. They are a little messy, so I
>would put them into a stored procedure.

I haven't done a detailed complexity analysis, but it seems to me that there's a trade-off here between how quickly you can answer queries about the existing data versus how quickly you can make changes to the data. I agree that using integers allows queries to run a bit faster, and using consecutive integers allows certain queries to run significantly faster. However, it also makes all updates to the database MUCH slower than using the standard adjacency model. I was trying to offer you a way to speed up these updates by using floats or fractions instead.

If you know that you'll be querying the structure much more than you'll be updating it, your consecutive-integer approach might be a performance winner.
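
To show where the update cost comes from, here is roughly what the renumbering looks like when a new rightmost subordinate is added under a boss whose current rgt value is :boss_rgt (consecutive-integer nested sets, as you describe; :boss_rgt and the 'NewHire' row are placeholders of mine):

-- Open a gap of width 2 to the right of the insertion point...
UPDATE Employee
   SET rgt = rgt + 2
 WHERE rgt >= :boss_rgt

UPDATE Employee
   SET lft = lft + 2
 WHERE lft > :boss_rgt

-- ...then drop the new node into the gap.
INSERT INTO Employee (EmployeeName, lft, rgt)
VALUES ('NewHire', :boss_rgt, :boss_rgt + 1)

(You could fold the two UPDATEs into one statement with CASE expressions -- presumably the "messy" single update you have in mind -- but the number of rows touched is the same.) Those UPDATEs hit every row to the right of the insertion point, on average about half the table. With floats or fractions, the new node could simply be given lft and rgt values strictly between those of its neighbors, leaving every other row alone, at the price of losing tricks like the (rgt - lft + 1)/2 node count.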

>Remember that I am working with SETS and do not have to do row at a
>time processing. You are still thinking in terms of navigation of the
>graph in the adjacency list model.

YOU may not have to do row-at-a-time processing, but the engine still does! I'm concerned about run-time cost here, not elegance of design or conciseness of source code. I'm not sure why you say that I'm thinking in terms of navigation of the graph in the adjacency list model. I was simply describing performance characteristics of _your_ model.
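
Just so we're talking about the same thing, the kind of set-oriented query at issue is the usual nested-sets subtree lookup, something like this (a sketch, using the lft and rgt columns from your model):

-- All of Chuck's subordinates, at every level, in one statement.
SELECT Subs.*
  FROM Employee AS Boss, Employee AS Subs
 WHERE Boss.EmployeeName = 'Chuck'
   AND Subs.lft > Boss.lft
   AND Subs.lft < Boss.rgt

It's one statement, and a nice one, but the engine still has to evaluate that range predicate against the rows (hopefully with an index on lft to help). "Set-based" describes the source code, not the amount of work done at run time.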
