two nasty schemata, union types and surrogate keys

From: Sampo Syreeni <decoy_at_iki.fi>
Date: Tue, 15 Sep 2009 11:12:59 -0700 (PDT)
Message-ID: <3ab735ba-6273-4b6e-8551-127148c4e55a_at_q35g2000vbi.googlegroups.com>



I recently bumped into two patterns in schema design which I found interesting, especially since they seem to imply a real use for disjoint union types/domains and/or surrogate keys in schema design. Since that sort of thing is bound to be controversial, I thought I should share the patterns for discussion. The first concerns a design where it is seems a bit messy to go without composite keys containing nulls or similar placeholders.

In this one we're cataloguing real persons. Since we have a global data set, we cannot rely on there being anything like a national identifier for each person, though for some nations we do have them. This is a partial key, then. For some other people we might have to resort to just recording a full name and a birthdate, for another partial key -- legislation might state that a national identifier is something that can be freely shared, but details of a person's age are not. Yet another country might only allow us to collect cell phone numbers and names, but neither birthdates nor id's. Without further information, we now have a collection of partial keys, none of which constitutes a key for all of the data. We also might have cases where one or more of the partial, and overlapping, composite keys exist at the same time.

Mapping something like that relationally is a bit challenging. The first idea (and the actual pattern I've seen) is to allow nulls in keys (no primary key in sql, but a unique key) and to simultaneously enforce each partial key where applicable. The solution is workable, but a bit dirty: we don't really like nulls and inclusion dependencies/ referential integrity constraints become rather difficult to define and enforce.

The second option would be to separate all of the cases and create new relations, which then need not contain nulls at all. We'd have one relation for the cases where there is just an id, one for name and birthdate, one for name and phone number, one for id, name and birthdate but no phone number, and so on until all of the combinations are exhausted. But this rapidly leads to combinatorial explosion. Also, applying such an approach would also split every single relation which refers to a person into similarly many subrelations (because of the combinatory nature of the possible presence of partial key fields), with further combination if even a single extra entity referred to has a similar problem. While theoretically clearly the purest way to approach the problem in strict relational terms, in practice the solution would simply become unmanageable.

Third, we might treat this as an example of the usefulness of surrogate keys. The complicated assembly of overlapping partial keys would then be treated as an intensional restriction, while the surrogate key would support referential integrity, both on abstract and concrete levels. (Of course such a key should really be opaque to the user, as Codd suggests.) But once again, even the use of surrogates is contentious.

So, how about a typing solution? Suppose we enhance our type system to fully handle disjunctive and dependent types? So that we can have a field containing a tagged union of (phone number, birthdate and nid). That'd take care of a number of subcases. And if a field could actually contain a fully structured, composite, disjunctive, dependent type, like the union type of all of the combinations possible in the example, then one would only need to declare the mess to be a valid domain, and let the DMBS handle the specifics. The actual implementation would likely take the form of flattened relational tables with user-invisible nulls/marks/field-combination-coding, with added processing to enforce all of the possible partial keys simultaneously, and foreign keys containing any of the subsets for matching. The user visible model would be constructed along the lines of Codd's composite domains, so that it would be possible to simultaneously get at both the individual fields of the composite, and at the composite as a whole, with the DBMS enforcing propagation of present values over all relations utilizing the composite domain.

Then a second nasty example which would actually seem to mandate surrogate keys, exclusively. The basic requirement that key constraints try to capture is unambiguous identifiability of all of the data in the system, so that a) updates can be told apart from insertions so as to be sure that no data duplication (or the attendant update anomalies/inefficiencies) arises within the database, and b) to guarantee that the data in the database can be reliably correlated with real world entities (i.e. that there is a clear, unambiguous means tell whether or not the logical theory encoded in the database really has the real world as one of its models).

This way of looking at the motivation behind keys suggests that, in fact, natural keys might not always suffice to model all of the theories which we want to encode. The example comes from a schema already using surrogate keys, without any easy way to get rid of them. Here the theory is rather sparse: we want to encode people and their phones. The problem is that we simply cannot capture enough information on the people to form a proper, natural, primary key. We can assume for example that the only data we have on them is the name, so that there may be hundreds of John Smiths in the database, with little extra data. By itself, this sort of database would be impossible to maintain: there is no way to tell which possible updates should be applied to which John Smith.

Then suppose each of the Johns will always have a personal cell phone account. Now the phone number combined with the name will usually be good enough to disambiguate between them. If each John was always guaranteed to have precisely one number, we'd define a composite primary key and be done with it. Except that people can have multiple numbers/accounts, each with their own dependent information like hours at which to call and the like. After that normalization tells us to make the number(n)-to-person(1) relation into a freestanding one. Suddenly unambiguous identification of a person is possible only if we take into consideration two different relations at the same time ("the John at (+358-555 or +358-666)), with no shared natural key.

I can see only two ways in which to sort this all out. First, we could note that there is a kind of duality between this example and the first one: there we dealt with keys that are structurally complicated because of the combination of fields present, now we're dealing the keys similarly complicated because of the combinatorial structure within the field. As such, we could ostensibly go with type system modification as in the first example. We could permit set datatypes on the phone number, complicated key enforcement semantics, and domains for foreign keys which can express e.g. the semantics that "I'm referring to the John whose set of phone numbers contains at least one of the following..." That I think would totally break the relational model, and lead to a type system that is too powerful to ever be tractable.

Or we could use surrogates.

My own thinking is that opaque surrogates are the right way to go because they seem to solve both problems. But because of their inherent problems, we should also include fully general, Turing complete, yet preferably declarative integrity checking mechanisms (i.e. akin to Prolog) which can enforce both single table integrity constraints, and constraints relating to the database as a whole. That way we'd aim at efficient, single table checks and index based crosstable  checks, but would also be able to automatically enforce more complicated integrity rules, of the kind my two examples illustrate.

Tell me what you think, and sorry for going into essay length. Received on Tue Sep 15 2009 - 20:12:59 CEST

Original text of this message