Re: two nasty schemata, union types and surrogate keys

From: Sampo Syreeni <>
Date: Sun, 20 Sep 2009 11:47:22 -0700 (PDT)
Message-ID: <>

> The natural key for persons all around the globe that comes to my mind
> would be some DNA profile, what would however not be a practical
> solution (difficult to get, too long, data protection issues).

It ain't, because of identical twins. That sort of key is also too fragile at the moment; it would be kind of like keying people off their photos. I mean, usually the problem with formalized descriptors like phone numbers, names, ages and the like is that the theory embodied by the database could fit multiple real life models. I.e. we have a total surjective relation from data to the representative real life entities. In this case, however, the problem goes in the opposite direction: there is simply too much redundancy and noise in the real world fact for it to count as a key.

So in the end, we'd really like a total invertible function between the entities and their representations in the database. Or in other words a theory that uniquely fixes a single model.

> But what about using an artificial key?

That's what surrogate keys are all about. But as Codd suggested, such keys should remain internal to the DBMS. Only strictly typed equality comparisons should be allowed on them. Precisely because they are artificial, and as such will easily lead to an invented mismatch between the logical content of the database and the reality it is supposed to represent.

> I. e. create your own encoding scheme, for example SMITHPA00001 for a person called
> "Paul Smith".

The problem is that you could have multiple Paul Smiths. Okay, then you can enumerate them, and what you have is basically a specially formatted surrogate key. But how precisely do you then know which record should be updated when e.g. Paul's age changes? And if you cannot discern that, why precisely do you have those two separate Pauls as separate records?

The point is that a) if you cannot tell two real life entities apart based on their representation within the database, then b) you cannot tell any dependent data apart either, so that c) the representations within the database should actually be merged, in order to avoid data duplication, update anomalies and the like. Received on Sun Sep 20 2009 - 13:47:22 CDT

Original text of this message