Re: foreign key constraint versus referential integrity constraint

From: vldm10 <vldm10_at_yahoo.com>
Date: Mon, 26 Oct 2009 03:20:52 -0700 (PDT)
Message-ID: <ff8af9ba-5136-4e0d-a328-3cb0ba3012fe_at_j9g2000vbp.googlegroups.com>


On Oct 21, 11:36 pm, Sampo Syreeni <de..._at_iki.fi> wrote:
> > (Since I first saw Codd's 1970 paper I've never been able to understand
> > why on the one hand he talked about "symmetrical" access but on the
> > other he used the very restrictive term "foreign key".)  For me, a
> > "reference" is a reference even if most dbms' have similar restrictions
> > to Codd's term.
>
> I think the later talk about domains and e-relations in RM/T (and to a
> lesser extent in RM/V2) makes that a bit easier to understand. As I
> interpret it, Codd started with a very clean and purely syntactic
> model in his seminal article, which will definitely lead to pure
> symmetry in access. But from the very start he also worried deeply
> about the semantics and making them explicit as metadata, which leads
> to the second influential discovery of his: the integral, reflective
> data dictionary. Later on his development of the relational model not
> only added surrounding functionality (which weighs in so heavily in RM/
> V2, what with all of'em T-joins and the lot) but semantics as well.
>
> I think what was driving Codd was the need to make his model fully
> self-contained. If you really think about it, that tends to exclude
> any and all free-form input by the user. I mean, let's say you let an
> end user input a free form name. It's pretty much guaranteed that lots
> of nonsense is going to be input then. After that the name field
> doesn't really constitute a well-defined domain at all; to a
> mathematician like Codd its proper type should probably be more or
> less 'general-ape-shit;no-real-type;use-at-own-risk'. I.e. the field
> could be straightforwardly likened to a blob of text as we call the
> more sustained kind of drivel today; certainly not something one would
> make into a domain, and not something which would fit Codd's neat idea
> of formal, repetitive business data.
>
> So, what he did was to go the way we'd now call Master Data
> Management. He went on to separate all of the data we'd intuitively
> call entities, and more formally would define as a) relatively stable
> data b) centrally and well managed as a fully controlled vocabulary,
> so that c) applying the closed world assumption to it is actually well
> founded because there is a formal process to keep the real world and
> the minimal model of the database trying to approximate it in close
> correspondence, d) attaching to things that have real, societally
> established permanence, uniqueness and omnipresence which can be
> likened to "a candidate key on an established class of real world
> entities which is at most slowly varying", and most of all e)
> discrete, enumerated data that is mostly used to classify/qualify
> other data, the latter usually being composed of abstract measurements
> on a continuous scale (originally account balances, given that we're
> after all talking about developments which happened within the
> International *Business* Machines Corporation). What we'd nowadays
> call "complex", "semistructured", "unstructured", "unconventional" or
> even "nasty" data was explicitly excluded by Codd from the very
> beginning.
>
> That then lead to a sort of revival of the earlier intuitive idea of
> entities as opposed to relationships. Only this time there was a clear
> distinction in the criteria: relations were always relations, and they
> represented a theory of the outlying reality that was perfectly
> symmetric. Nevertheless, the connection to the real world was also
> taken into account in that some relations had additional semantics
> bolted on in addition to just the logical constraints. They referred
> to real world entities with the kinds of properties that would make it
> especially easy to keep the correspondence between the minimal model
> of the DB and the real world the DB was trying to represent in synch.
>
> The difference to today's MDM and OLAP-dimension frenzy was then that
> the logical level design criteria were fully unaffected by such
> semantic, real world concerns. They remained completely syntactic, as
> they should have. The only thing added was the idea that some
> relations which happened to have a certain commonly reoccurring
> structure, and in addition were meant to semantically correspond to
> real life in what was also a commonplace correspondence, could have
> those facts formally notarized in the data dictionary, using a well-
> known syntax.
>
> That, and only that, was what made a relation an e-relation. You
> neither could nor had to reference the e-relation, but still its
> semantics necessarily made it so that a) it made no sense to refer to
> any real life modelable thing that wasn't already modeled by the e-
> relation, b) thus semantically speaking we had an inclusion dependency
> towards the e-relation already, c) of course all such semantic
> constraints should be enforced if possible, d) a foreign key
> constraint is pretty much The Mechanism to do that in the databases of
> the past couple of decades, and finally e) foreign keys seem an awful
> lot like unidirectional links, eventhough they could and should be a
> lot more besides.
>
> So, in addition to the semantics, we also have the direct implication
> that an e-relation fully defines a domain (though not all domains need
> to be defined via such enumeration). Then it's quite clear that all
> all relations referring in any way to the kind of real life object the
> e-relation represents, should be limited to utilizing the
> corresponding domain/type determined by it as the type of the
> attribute. (In SQL databases that means declaring foreign keys
> referring to primary keys. In my ideal database it would amount to
> declaring domains, and in addition to those also full inclusion
> dependencies which could span any combination of attributes on both
> sides, perhaps even using views in the middle to cater for the more
> complex cases; I take constraints rather seriously.)
>
> At the logical level all that was done was to declare some
> dependencies, syntactic constraints, and telling a little bit about
> how to read/update the data to a live person, using a controlled
> vocabulary to do so (e.g. "social security number is the primary key
> and is supposed to uniquely identify persons; thus if you have a
> duplicate number, it's semantically fucked, leaving you only the
> option of gathering even more data, until you can again make the
> minimal model of the database correspond to reality").
>
> But of course the all-round inclusion dependency also closely mimics
> earlier referential semantics in its effects, and the usual way we
> tend to use data typically reinforces the notion. You rarely see the
> symmetry ("there is no link, there are just equal values"). Still, I
> can tell you that when you do ad hoc OLAP queries long enough, you'll
> eventually find yourself starting the query "against the tide". That
> is then when you finally get it -- you simply couldn't have done it in
> the presence of a directional link without "going through the whole
> fucking thing using a for-loop" -- and off you are, into the
> relational brotherhood. :)
>
> > It seems arbitrary and unnecessarily doctrinaire to
> > restrict references to only primary keys and unique indexes because I
> > think if one wants a particular constraint, one should not have specify
> > two constraints, eg., a 'reference' as well as a key or index.  This
> > doesn't prevent a language from offering shortcuts that are understood
> > to involve both.)
>
> That is arbitrary indeed, and one of the few things where genuine
> progress has been made in the theory of dependency and normalization
> on the issues Codd originated. I also think that it is one of the very
> few points where Codd as well fell victim to the earlier record
> centric thought -- "referential integrity" often implies the mentality
> that we're "linking" from some dependent data to stuff that "is
> defined and primarily lives at another place". That is precisely the
> record and index centric way of doing things, whereas the central
> relational innovations were to a) place such references squarely at
> the logical level, which makes them fully symmetric (i.e. in the
> current case we might as well say that rvOrders refers to
> rvRedemptions as we could say the opposite; it's not about directional
> linking, but about referring to the same logical entity in two
> different places at once, and just defining the propositional
> semantics correctly), b) focus attention on the formal properties,
> modelling and constraints (e.g. inclusion dependencies) within the
> data instead of playing by real-world intuition alone, and c)
> abolishing even the idea that base entities and relationships should
> somehow be separated (i.e. what matters is only the formally
> verifiable, syntactic dependencies that will necessarily result from
> faithfully modeling the data, which then reveals that relationships
> and base entities really behave much the same way).
> --
> Sampo

E-relations in RM/T have been introduced in order for the properties of an entity to be represented as binary relations. As Codd wrote, “split into as many binary relations as there are properties to be recorded.” In these binary relations, surrogates have a key role. E. Codd writes in RM/T, “Each surrogate appearing in this e-attribute uniquely identifies the entity being described.” E. Codd, however, does not show or solve the only important thing in this paper : how to decompose any relation into binary relations. In fact, in the RM/T he only expressed the desire for any relation to be represented through binary relations.

E. Codd is undoubtedly among the most significant people in the history of computer science and his work is of great and lasting value. However, the e-relations and surrogates introduced in his paper RM/T don’t have theoretical significance.

Vladimir Odrljin Received on Mon Oct 26 2009 - 11:20:52 CET

Original text of this message