Re: foreign key constraint versus referential integrity constraint

From: Sampo Syreeni <decoy_at_iki.fi>
Date: Sun, 25 Oct 2009 15:54:16 -0700 (PDT)
Message-ID: <ffc9385a-99a6-4b11-a8e1-684d5c5d4f74_at_p9g2000vbl.googlegroups.com>


On Oct 24, 7:53 pm, Keith H Duggar <dug..._at_alum.mit.edu> wrote:

(Sorry for the resend, Google Groups seems to be taunting me...)

> I agree that it is possible to enforce such join semantics with
> domains but I'm not sure it is entirely appropriate.

Uhm, being the practical kind of guy, I wonder where you're going with this. I mean, you too agree that domains/types do get the job done. Was there something else you wanted in addition to that?

> To put this another way, the RM should be just as complete and
> effective for a universe with a single domain as it is for a
> universe with a rich domain set (such as you propose to handle
> my GenericID join constraint example).

And it is. There is no reason why the relational model wouldn't work just fine with only a single type, such as an integer. (Encoding then, say, textual data, would become somewhat convoluted, but it could be done.) So the type system is indeed completely orthogonal to the model.

However, that only means that when we model data, the relational model by itself isn't sufficient to fully handle all of our conceptual modelling needs. It's a very general theory, and as such it can benefit from the addition of more specific constraints and semantics, for example in the form of a rich type system.

From this point of view I would then say that your original question about joinability is something that sits squarely within the type system, and thus is itself fully orthogonal to the relational model. Trying to somehow make the skeletal relational model responsible for enforcing such constraints seems much akin to asking why the hammer ain't working too well with the cream, right after casting aside the whisk.

> By pushing such expression constraints off to domains, we are
> admitting that the RM itself has no support for constraining
> relational expressions and I'm not prepared to admit that
> limitation. Especially since it does provide support for
> constraining the values of relational variables.

To reiterate, it seems to me that you're now trying to have your cake and eat it too. First you say that domains, a particular implementation of a type system, are outside of the relational model. Then you go on to claim that the relational model has a problem because in this amputated form it cannot handle a problem that is caused by the lack of a type system.

The way I see it, both Date and Codd advocate a view of the relational model which already includes a rich type system, and actively utilizes it to express e.g. join semantics. Domains, they're Codd's particular idea of how to handle typing in a form that actually reflects much of the relevant metadata about the types back into a shared encoding that is accessible under relational semantics. Granted, there will be other ways of handling types, but Codd's idea doesn't seem all that bad to me, and is also amenable to straightforward extension should we find it somehow lacking.

> I will also note that both Codd and Date proposed RM mechanisms
> to operate the other way around ie to allow designers to force
> cross-domain operations. Codd with "Domain Check Overrides" and
> date with the THE_ operators that provide type coercion.

True, but I would be inclined to believe that in practice those mechanisms would only be used to circumvent modelling errors, before the conceptual model was changed to explicitly allow the join. Most commonly that would probably include the addition of a common supertype, which Bob Badour already mentioned as a necessary prerequisite for joinability. Under the domain model, this fact would also be explicitly recorded in the data dictionary.

> I think there are other views, see above.

As I said, I could think of "at least two" ways to view it. However, I don't think your quotation from Date falls outside of either of them.

> In short, we should not let our lack of imagination guide our design principles.

There we agree fully. My point is just that a) yes, joinability is something that probably should be explicitly constrained in the database, b) domains at least get the job done, c) unlike abstract typing, domains also reflect type metadata back into a formal, shared and accessible form within the database which is a clear plus, so that d) in the absence of a viable alternative, they or something closely related to them is what we should use until we think of something even better.

> It's not for us to wonder such whys because our imagination is
> usually quite limited. For example, GenericID seems an entirely
> reasonable thing to me if I labelled every item I sent out (even
> the coupons) with a barcode and have other relations expressing
> facts about those barcodes apart from what they label.

In that case, it's the systematic barcoding of certain real world items that makes the union type sensible. The union type could then be given the name of BarcodedItem, and could perhaps have its own dependent data, like type of barcode, printing method, and so on.

My point was simply that you have to have a valid, real life/business reason, or semantic justification for the union type/generalization. Since you just gave one, I'm happy with that. At the same time you should then note that under these semantics, you cannot add to the union type willy nilly. For example including people in it is out of the picture, unless you actually insist on branding your customers with a barcode. ;)

> Anyhow, the question here is not one of our imagination but rather
> simply this: if it makes sense for the RM to support constraints
> on relational /values/ (taken on by variables) why does it not
> make sense to support constraints on relational /expressions/?

I don't really see what the connection with this and the above is, but again we're in full agreement: indeed support for constraints on relational expressions should be available. And it in fact would be if RDBMS vendors just implemented generic support of SQL assertions.

> > If you have the means of keeping such ID's private -- this would
> > necessitate making the ID's fully opaque from an outside user's point
> > of view, which cannot be done in any DBMS I know of -- you might just
> > be able to justify their existence as surrogates. But this is still
> > stuff that requires tremendous discipline and insight into the entity
> > integrity issues that need to be addressed in addition to the
> > referential integrity between the surrogates. I wouldn't recommend
> > going down this slippery slope unless you're *absolutely* sure you
> > know what you're doing -- I haven't seen a single wider scale
> > deployment which got this right and didn't suffer integrity issues in
> > the long run.
>
> Frankly I don't understand the point of the above paragraph at
> all.

The point is that when you start creating union types, and so the keys used to them willy nilly, you tend to rapidly get to the point where the resulting space of identifiers has no semantic relevance. What you then have is a space of invented, autogenerated, and so on, identifiers. Those cannot be correlated against the real world entities they represent in any way except by reference to the data dependent on them. What you then have is surrogate keys, and if you take a look at RM/T and RM/V2, Codd makes an eloquent case against revealing them to the users. Because making them truly opaque is then not supported by current RDBMSes, I would hesitate to use them at all.

> Also I don't see how entity integrity enters into this at
> all; I'm assuming we follow the sage advice of having no NULLs
> to begin with; so entity integrity is non-issue.

Heh, I did say going that way requires tremendous discipline and insight...

At its most general, entity integrity has nothing to do with nulls. Rather it has to do with whether you can reliably correlate the data in your database with real life facts. It is essentially a semantic construct which tries to ensure that it is possible to unambiguously update the contents of your database against the changing reality we're trying to capture/encode.

This is not a process that can be done strictly at the logical, relational level. All we can do there is to to put up syntactic constraints to help in the process along. The main, commonest, simplest, and especially easiest to implement example is the key. It tries to capture the semantics of real life things/entities a) always having a certain set of characteristics we can use to correlate them with the data we have on them (i.e. the not null syntactic constraint), and b) being separable from each other once we're privy to that set of characteristics (i.e. the uniqueness constraint).

That means that entity integrity is only certain to be maintained when a) the set of facts we actually have on real life objects/concepts is sufficient to uniquely identify their representation in the database, b) in full generality, that does *not* mean that all of the data on the primary key is available, only that all of the data required to identify any candidate key is present, and that in case c) does not mean that the identification has to take a form that only involves one table; the data might be distributed all over your database, and in general will be if you're using something other than full, natural keys to link your data together.

Maybe a recycled example could make this clearer... Suppose you have persons and the cars that they own. You're keying them both using surrogates, which obviously have no real world significance because you cannot take a look at a person or a car and immediately read off the corresponding surrogate. (Tattooing the number on each person or carving it into the manufacturer's stamp in a car would make the key natural, and not a surrogate.) You can clearly have a not null and unique primary key for each of the entities, but that no longer guarantees entity integrity: there might be more than one person called John Smith, and more than two cars with the make of Audi. In order to maintain entity integrity, the syntactic constraint that the surrogates be not null and unique is not sufficient; when you try to update your database to reflect reality, you don't necessarily know which John Smith is which, and certainly you can't tell one Audi apart from the other. In particular, you don't have any reliable means of stopping somebody from adding yet another John Smith or Audi that are already described in your database; suddenly duplicates and the resulting update anomalies will abound.

So the first lesson is that surrogates don't really work towards entity integrity. Only keys referring to real life, measurable/ observable things, that is natural keys, do that.

Then, secondly, there is the more complicated case of distributed information over multiple tables. Suppose that mess of surrogates you have tells you that there are two John Smith's, which cannot be easily told apart from each other. But then you notice that doing a join against the table of vehicles owned tells you that the two Johns own mutually disjoint sets of vehicles, which you *can* tell apart from each other by their licence plate numbers (i.e. a natural key). Suddenly you can tell those people apart at some level of certainty -- and counterwise if the sets overlapped, you could be perfectly certain that entity integrity has been violated, the two Johns are actually the same person, their records need to be unified/deduplicated, and the update has to be cascaded into the rest of the database to reflect the new knowledge.

Here the lesson is that keys and the related single table reasoning aren't all there is to entity integrity, and that surrogates usually make the situation so complicated that establishing entity integrity even at the syntactic level can easily force you to scan through your entire database. Which of course nobody ever does when using surrogates, which then almost surely leads to integrity issues in the long run.

Formally speaking, then, entity integrity is an inseparable part of integrity as a whole, and it involves not just syntax but semantics as well; especially model theory. Since a database essentially contains a logical theory, the first part is about internal consistency; the theory has to be consistent in order to have a real life model in the first place. The second, more difficult part is that the theory also has to be refined enough to exclude real life models which are inappropriate, such as the one where we have one John Smith in the database and a million of them in the wild, with no means of telling which one is which at the logical level. And finally, third, the hardest part is to strike the proper engineering balance between maintainable detail, availability of real life identifying data and the processes of efficiently correlating reality with the symbolic universe represented by the database.

More often than not, then, surrogates and the improper generalizations they imply hinder this process. Most often by creating a false sense of complacency which obscures the real, underlying, logical issues which need to be addressed. Then the result is cumulative damage to overall integrity via simple laziness and neglect.

> I don't think I'm talking about "physical level" types (posreps)
> at all.

You seem to be, because your conflating the two types of identifiers, simply because they're, well, "identifiers". That don't mean that they semantically are of the same type, eventhough they might look like that when they carry the same physical type (posrep).

> Consider Date's point that RM is orthogonal to type support.
> Then what argument from principle do you have for prohibiting
> the RM from defining constraints on relational expressions by
> pushing that responsibility off to domain support?

None. Thus I think we're actually in what they call "vigorous agreement".

> How would we express the constraints in a universe with one domain?

In a highly convoluted fashion. Such a convoluted fashion that I'm not inclined to give an example. Nor would I probably be inclined to go through such an exercise when designing a database.

That, to me, suggests that types and domains simply make life easier, and should be actively utilized when modelling data.

--
Sampo
Received on Sun Oct 25 2009 - 23:54:16 CET

Original text of this message