Re: Bill of materials / Groups in Groups

From: Harry Chomsky <harryc_at_chomsky.net>
Date: 2000/01/27
Message-ID: <3v4k4.40$Ro5.3336_at_nnrp3-w.snfc21.pbi.net>


Sorry to be so long-winded, but this discussion has brought up some pretty intriguing ideas (and misunderstandings) about relational theory that I think are interesting to explore.

joe_celko_at_my-deja.com wrote in message <86nt0i$27i$1_at_nnrp1.deja.com>...
>Why do compound domains bother you? Nobody has any trouble with a
>compound key.

Well, the short answer is simply that a key (or, more generally, an FD) is a characteristic of a group of attributes, while a domain is a characteristic of an individual attribute. FDs and domains are different things, and they're defined in different ways.

Now, you might wonder _why_ these concepts are defined the way they are. What would happen if we allowed a domain constraint on a group of attributes? Consider a table like the following:

CREATE TABLE Employee(
  Name CHAR(20),
  MonthlySalary MONEY,
  AnnualSalary MONEY)

Obviously it's not well normalized -- it's not in BCNF because of the two FDs MonthlySalary -> AnnualSalary and AnnualSalary -> MonthlySalary. Now let's write down a constraint that requires the pair (MonthlySalary, AnnualSalary) to belong to the set of ordered pairs of money values (a, b) satisfying b = a * 12. This is the kind of "compound domain constraint" you are trying to justify. In this case, this constraint _implies_ the two FDs that I mentioned above. In fact, all of the constraints for this table follow from the following two constraints:

  1. Attribute Name is a key.
  2. Attributes (MonthlySalary, AnnualSalary) belong to the "domain" of money value pairs (a, b) such that b = a * 12.

According to your logic, that means that the table is in DKNF! I don't think this altered notion of DKNF is very useful.

>Would you think that (x,y) and (x,y,z) co-ordinates are
>not a domain? Longtitude and latitude? Polar co-ordinates? (lft, rgt)
>are not scalar attributes -- one without the other is meaningless, just
>like (x,y) co-ordinates.

Ah, so now you're suggesting something very different from what you wrote down before: now you want to view (lft, rgt) as a single attribute defined on a complex domain, rather than two separate attributes each defined on a simple domain. That makes all the difference. Continuing with my example above, let's modify the table to have just two attributes, Name and Salary, where Salary is defined on the domain of money value pairs (a, b) such that b = a * 12. Now the table is in BCNF and DKNF. (In fact, the complex domain that I've described is isomorphic to the simple domain MONEY, so a Salary value could be represented physically using a single value of type MONEY, which would be multiplied by 12 to express the annual salary.)

It's important to note that I've described two different table structures, one normalized (BCNF and DKNF) and the other not (neither BCNF nor DKNF). There's a fundamental difference between a pair of attributes that must together satisfy some constraint, and a single attribute containing an ordered pair of two values. In neither case is it reasonable to talk about a "domain constraint" applying to a combination of attributes.

Getting back to the nested sets model, if you want the constraint "lft < rgt" to be a domain constraint, you'll have to define the values (lft, rgt) as a single attribute defined on a compound domain. I don't know how to do this in SQL, but I do know what it means, and I'll assume that's what you intend to do.

Note that you can no longer state the uniqueness constraint on lft as a key constraint, because lft is now no longer an attribute of your table. You can still express the uniqueness constraint on pairs (lft, rgt). If you want to require each lft value to be unique, you'll have to show that that follows from your other constraint, "subordination". (I'm not sure offhand whether it follows or not. If not, it's probably easy to modify the constraint so that it will.) But that leads us into the next part of the discussion...

>The set for a domain does not have to be pre-specified. It can be
>defined by extension (i.e. by examples: sex = {0,1,2,9} in the ISO
>Standard, with the meanings 'Unknown', 'male', 'female' and "Not
>Applicable - lawful person'). But it scan be deined by intention as
>well (i.e. by a rule: odd = {i : i is an INTEGER AND MOD(i, 2) = 1}.

I don't really care whether you define the set for a domain by extension or intension -- as long as it's a set. My objection is that you are trying to use a _set-valued function_ instead of a set. Remember, the constraints for a relation scheme are defined along with the _scheme_, not along with any particular instance of the relation. A domain constraint states that for every instance of the relation, for every tuple in the instance, for some attribute A, the value of that attribute of that tuple must belong to a certain set. What is the set in this case? "The set of (lft, rgt) pairs which do not overlap any of the intervals (lft', rgt') for the other tuples in the relation instance." Unfortunately, that sentence is meaningless. It refers to "the relation instance", but in the context in which the sentence is written, there is no instance -- there is only a scheme. In logical terms, we might say that the specification of the set contains a free variable. Therefore the specification really describes a _function_ that takes a relation instance and returns a set. A constraint that uses such a function does not qualify as a domain constraint.

Unfortunately, this is one of those concepts that can be quite obvious to a mathematician or programmer but quite hard to express clearly. So forgive me if I haven't gotten the point across. But it's a very fundamental point.

Let's see what would happen if we amended the definition of "domain constraint" to allow this sort of set-valued function in place of a set. Let's call this kind of constraint a "Celko constraint" instead of a "domain constraint", and let's call the resulting normal form "JCNF" instead of "DKNF". Can we formalize the notion?

First attempt: A Celko constraint specifies an attribute A and a function f from relation instances to sets. An instance r satisfies the constraint if for every tuple t in r, the value of attribute A of tuple t is an element of the set f(r).

Unfortunately, this definition doesn't work very well. Almost every relation scheme is in JCNF by this definition. Let R be any relation scheme with at least one attribute. Let A be an attribute of R, defined on some domain D. Let C be the set of constraints defined on R. The constraints can be as complex as you want. Now write down the following Celko constraint:

"The value in attribute A must belong to the set which is D (if the relation instance r satisfies constraints C) or the empty set (if r doesn't satisfy C)."

This is a valid Celko constraint, and it's almost completely equivalent in strength to the set of constraints C. The only possible difference is that the empty instance always satisfies this constraint, while it may or may not satisfy C.

It follows that if a relation scheme has at least one attribute and allows the empty instance, then it is in JCNF. Clearly not a useful notion of normalization!

Another problem with this notion of JCNF is that it admits insertion anomalies and deletion anomalies. A tuple to be inserted may satisfy the Celko constraint, yet after it's inserted it may no longer satisfy the constraint. A relation instance may satisfy the constraint, but inserting a tuple t may cause some pre-existing tuple t' to suddenly violate the constraint, even though t' was valid before t was inserted. Deleting a tuple t from a valid instance may cause some other tuple t' to violate the constraint even though t' was valid before the deletion. There can even be "anti-anomalies": a tuple to be inserted may appear to violate the constraint, but inserting it may be valid anyway if it will satisfy the constraint _after_ being inserted.

As you can see, there's no end of trouble with this kind of free-form constraint. We can improve matters somewhat by adding a "contravariance" requirement to our definition. Let's call the set-valued function "contravariant" if, whenever a relation instance r' is a subset of an instance r, then f(r) is a subset of f(r'). Now let's try the definition again.

Second attempt: A Celko constraint specifies an attribute A and a _contravariant_ function f from relation instances to sets. (the rest as before)

Your "subordination" example fits this definition: the function "all (lft, rgt) pairs not overlapping an existing pair" is contravariant.

It's no longer the case that virtually every relation scheme is in JCNF. It's easy to show that relation schemes in JCNF don't suffer from deletion anomalies or anti-anomalies. But they still may suffer from insertion anomalies. With contravariance, the constraint may become more severe as tuples are added to the table; in particular, inserting a tuple that looked ok before the insertion may suddenly make the new tuple become invalid, or may make other previously valid tuples become invalid.

I haven't figured out any way to restrict the definition of the function f to ensure that a relation in JCNF will not suffer from insertion anomalies. Can anybody see a good way to do this, yet still allow the nested sets model to qualify as JCNF?

In defense of the nested sets model, I should point out that it actually doesn't suffer from insertion anomalies (assuming that we define "insertion anomaly" in a manner appropriate to JCNF rather than DKNF). But it would be nice if we could come up with a good definition of JCNF, then say "The nested sets model is in JCNF, therefore it doesn't suffer from insertion or deletion anomalies." I hope someone can help me do this.

We also have to consider whether the set-valued function should be allowed to have other arguments besides the relation instance itself. For instance, it might take "the tuple in question" as an additional argument. The "subordination" constraint, as stated, does require this, since it refers to another attribute of the tuple in question (OrgChart.emp) as well as attributes of the variable C1 representing an arbitrary tuple from the table. In this case it should be easy to restate it so that it doesn't make this reference. But in general, the question of reference to the tuple in question may be complicated, and may be closely tied to the issue of insertion anomalies (where the "tuple in question" is part of the relation in one analysis and not in the other analysis). I don't have good answers to these questions. I think the fact that they have to be asked highlights the difficulties of trying to define a new category of constraint and a new normal form.

>What you have with (lft,rgt) is a self-reference. Given (a,b) I can
>determine if it is in the domain of ordered pairs of integers for my
>tree by a few tests:
> 1) (a < b)
> 2) (a > 0)
> 3) (b > 1)
> 4) UNIQUE(a,b) -- which is just part of being a set
> 5) UNIQUE(a)
> 6) UNIQUE(b)
> 7) Does the range of (a,b) overlap the range of an existing pair?
>
>The last one is the self-reference. I suppose if you wanted to consider
>the structure of every possible tree as a nested set model to be a
>domain set, you could.

Sorry, I don't understand your last sentence at all. What set (or set-valued function) are you proposing as the domain for the pair (lft, rgt)? Received on Thu Jan 27 2000 - 00:00:00 CET

Original text of this message