Re: two nasty schemata, union types and surrogate keys

From: Brian <>
Date: Sun, 20 Sep 2009 22:05:00 -0700 (PDT)
Message-ID: <>

On Sep 15, 2:12 pm, Sampo Syreeni <> wrote:
> I recently bumped into two patterns in schema design which I found
> interesting, especially since they seem to imply a real use for
> disjoint union types/domains and/or surrogate keys in schema design.
> Since that sort of thing is bound to be controversial, I thought I
> should share the patterns for discussion. The first concerns a design
> where it is seems a bit messy to go without composite keys containing
> nulls or similar placeholders.
> In this one we're cataloguing real persons. Since we have a global
> data set, we cannot rely on there being anything like a national
> identifier for each person, though for some nations we do have them.
> This is a partial key, then. For some other people we might have to
> resort to just recording a full name and a birthdate, for another
> partial key -- legislation might state that a national identifier is
> something that can be freely shared, but details of a person's age are
> not. Yet another country might only allow us to collect cell phone
> numbers and names, but neither birthdates nor id's. Without further
> information, we now have a collection of partial keys, none of which
> constitutes a key for all of the data. We also might have cases where
> one or more of the partial, and overlapping, composite keys exist at
> the same time.
> Mapping something like that relationally is a bit challenging. The
> first idea (and the actual pattern I've seen) is to allow nulls in
> keys (no primary key in sql, but a unique key) and to simultaneously
> enforce each partial key where applicable. The solution is workable,
> but a bit dirty: we don't really like nulls and inclusion dependencies/
> referential integrity constraints become rather difficult to define
> and enforce.
> The second option would be to separate all of the cases and create new
> relations, which then need not contain nulls at all. We'd have one
> relation for the cases where there is just an id, one for name and
> birthdate, one for name and phone number, one for id, name and
> birthdate but no phone number, and so on until all of the combinations
> are exhausted. But this rapidly leads to combinatorial explosion.
> Also, applying such an approach would also split every single relation
> which refers to a person into similarly many subrelations (because of
> the combinatory nature of the possible presence of partial key
> fields), with further combination if even a single extra entity
> referred to has a similar problem. While theoretically clearly the
> purest way to approach the problem in strict relational terms, in
> practice the solution would simply become unmanageable.
> Third, we might treat this as an example of the usefulness of
> surrogate keys. The complicated assembly of overlapping partial keys
> would then be treated as an intensional restriction, while the
> surrogate key would support referential integrity, both on abstract
> and concrete levels. (Of course such a key should really be opaque to
> the user, as Codd suggests.) But once again, even the use of
> surrogates is contentious.
> So, how about a typing solution? Suppose we enhance our type system to
> fully handle disjunctive and dependent types? So that we can have a
> field containing a tagged union of (phone number, birthdate and nid).
> That'd take care of a number of subcases. And if a field could
> actually contain a fully structured, composite, disjunctive, dependent
> type, like the union type of all of the combinations possible in the
> example, then one would only need to declare the mess to be a valid
> domain, and let the DMBS handle the specifics. The actual
> implementation would likely take the form of flattened relational
> tables with user-invisible nulls/marks/field-combination-coding, with
> added processing to enforce all of the possible partial keys
> simultaneously, and foreign keys containing any of the subsets for
> matching. The user visible model would be constructed along the lines
> of Codd's composite domains, so that it would be possible to
> simultaneously get at both the individual fields of the composite, and
> at the composite as a whole, with the DBMS enforcing propagation of
> present values over all relations utilizing the composite domain.
> Then a second nasty example which would actually seem to mandate
> surrogate keys, exclusively. The basic requirement that key
> constraints try to capture is unambiguous identifiability of all of
> the data in the system, so that a) updates can be told apart from
> insertions so as to be sure that no data duplication (or the attendant
> update anomalies/inefficiencies) arises within the database, and b) to
> guarantee that the data in the database can be reliably correlated
> with real world entities (i.e. that there is a clear, unambiguous
> means tell whether or not the logical theory encoded in the database
> really has the real world as one of its models).
> This way of looking at the motivation behind keys suggests that, in
> fact, natural keys might not always suffice to model all of the
> theories which we want to encode. The example comes from a schema
> already using surrogate keys, without any easy way to get rid of them.
> Here the theory is rather sparse: we want to encode people and their
> phones. The problem is that we simply cannot capture enough
> information on the people to form a proper, natural, primary key. We
> can assume for example that the only data we have on them is the name,
> so that there may be hundreds of John Smiths in the database, with
> little extra data. By itself, this sort of database would be
> impossible to maintain: there is no way to tell which possible updates
> should be applied to which John Smith.
> Then suppose each of the Johns will always have a personal cell phone
> account. Now the phone number combined with the name will usually be
> good enough to disambiguate between them. If each John was always
> guaranteed to have precisely one number, we'd define a composite
> primary key and be done with it. Except that people can have multiple
> numbers/accounts, each with their own dependent information like hours
> at which to call and the like. After that normalization tells us to
> make the number(n)-to-person(1) relation into a freestanding one.
> Suddenly unambiguous identification of a person is possible only if we
> take into consideration two different relations at the same time ("the
> John at (+358-555 or +358-666)), with no shared natural key.
> I can see only two ways in which to sort this all out. First, we could
> note that there is a kind of duality between this example and the
> first one: there we dealt with keys that are structurally complicated
> because of the combination of fields present, now we're dealing the
> keys similarly complicated because of the combinatorial structure
> within the field. As such, we could ostensibly go with type system
> modification as in the first example. We could permit set datatypes on
> the phone number, complicated key enforcement semantics, and domains
> for foreign keys which can express e.g. the semantics that "I'm
> referring to the John whose set of phone numbers contains at least one
> of the following..." That I think would totally break the relational
> model, and lead to a type system that is too powerful to ever be
> tractable.
> Or we could use surrogates.
> My own thinking is that opaque surrogates are the right way to go
> because they seem to solve both problems. But because of their
> inherent problems, we should also include fully general, Turing
> complete, yet preferably declarative integrity checking mechanisms
> (i.e. akin to Prolog) which can enforce both single table integrity
> constraints, and constraints relating to the database as a whole. That
> way we'd aim at efficient, single table checks and index based cross-
> table checks, but would also be able to automatically enforce more
> complicated integrity rules, of the kind my two examples illustrate.
> Tell me what you think, and sorry for going into essay length.

Databases don't record objects: they record facts about objects. In the first scenario, the facts you're recording don't have the same predicate, even though they all are about objects that have the property 'being a person.' 'being a person with national id <nid>' implies 'being a person.' 'being a person named <name> with birth date <bdate>' implies 'being a person.' The obvious solution is to introduce a relation with the predicate 'being a person <p>' and to modify the other relations' predicates 'being a person <p> with national id <nid>' and 'being a person <p> named <name> with birth date <bdate>.' This way there is one relation for facts that state just that there is a particular person. By isolating the implied predicate--the one that asserts just the objects' existence, there can be a bijective mapping from the tuples in that relation to the objects in the universe of discourse that have the property 'being a person.' Of course this requires the introduction of surrogates, but logically, surrogates are nothing more than individual constants: arbitrary symbols for particular objects in the universe. I see absolutely no difference between assigning arbitrary symbols to objects that have the property 'being a person' and assigning arbitrary symbols to objects that have the property 'being a natural number.' The symbols '1' and '2' are in essence 'surrogates' for the first two natural numbers.

I think this is what Codd had in mind when he introduced the Erelations  of RM/T. E-relations are primarily for asserting entities' existence. Received on Mon Sep 21 2009 - 07:05:00 CEST

Original text of this message