Re: candidate keys in abstract parent relations

From: Forrest L Norvell <spankysyourpal_at_gmail.com>
Date: 22 Jan 2006 15:20:08 -0800
Message-ID: <1137972007.999349.245450_at_g43g2000cwa.googlegroups.com>


vldm10 wrote:
> David Portas wrote:
>
> > Yes it's a bad idea. Here's a generic example. Consider two tuples:
> >
> > a tuple = {'A','B','C'}
> > a tuple = {'A','B','C'}
> >
> > They are the same. It serves no purpose to represent the same
> > information twice, which is why we can safely say that a relvar must
> > have at least one candidate key - the generality of the model isn't
>
> Here we have the following problem: Is it always possible to represent
> the entity's ( or relationship's) instance in a RM? If you know
> for some theory which support these ideas (even if it is a
> philosophical theory) let us know.
> I don't think that regarding the real world situation "we can
> safely say that relvar must have at least one candidate key" as you
> stated. Regarding some abstract mathematical relation, which doesn't
> represent reality, your sentence is true.

It's a mistake to believe that the "real world" and the world of "pure theory" are somehow incompatible; a theory that provides no help when confronting practical problems isn't a very useful theory. The relational model has no problem a priori with surrogate keys, it only states that a relation cannot contain duplicate values. I may be mistaken, but the phrase "surrogate key" is an ex post facto addition to relational database theory and isn't part of the underlying relational model at all.

A common rule within the database design community is to abjure surrogate keys, although a lot of totally handwavy rationales are provided for discouraging them (Celko's explanation that they "expose the physical layout" of the table is egregious in this regard: any table with auto-generated sequence identifiers and the ability to delete rows is going to develop holes in the sequence over time which aren't reflected in the physical representation of the database, and I'd be willing to bet at least a few DBMS implementations allow their instances to move rows around in the physical files with little respect for the order in which they were initially added). The real problem with surrogate keys is how frequently they're used as a remedy for a poorly-considered schema design, due to their automatic and innate guarantee of row uniqueness (presuming the generator functions aren't totally broken, which isn't always a safe assumption).

So the "real world" isn't really the problem here: the real-world solution to the problem of ensuring uniqueness is to use a surrogate key. But by the same token, the word "surrogate" is important. It can have (at least) two meanings in this case: a surrogate key can act as a proxy for another, natural candidate key on the relation, for whatever reason (surrogate keys make foreign key relationships very straightforward to express, which is an important consideration in managing the complexity of highly normalized schemas), and a surrogate key can also act as a stand-in for an ephemeral or hard-to-capture notion of uniqueness in the world of this particular database's design.

In both cases (especially the latter), the overall integrity of the database is incumbent more on the designer's vision and design skills than depending upon formal methods to rescue designers from their own bad decisions; this is one of the risks of allowing the use of surrogate keys in designs. But there are lots of ways to screw up relational designs, and surrogate keys suffer more than their fair share of abuse and mystification precisely because of their popularity and the lazy thinking their use can betray.

After thinking this all through, the methodology I've derived is pretty simple: I'm going to continue associating surrogate keys with the tables that aren't defined purely in terms of foreign keys (such as the tables used in many-to-many associations), but in each case I'm going to ensure that I've identified (and created a corresponding uniqueness constraint to enforce) at least one natural key for each table. This is a practical accomodation of a theoretical constraint: being obligated to identify a candidate key that emerges from attributes associated with my entities keeps my tables and design relational, and has the side benefit of making me confront what defines identity for my entities, which can result in clearer and more logical designs. At the same time, my DBMS of choice penalizes me for using non-integer indexes as primary keys, and I can't imagine the situation improves when considering the use of primary keys defined over multiple attributes.

> Example:
> In the case that the entity's (or relationship's) instance repeats
> the same values or in the case when we have the two entity's
> (relationship's) instances with the equal values and the relation
> which has "natural key", then it is not possible to represent these
> entities' (relationships') instances in the relation. But, if the
> key is "surrogate key", then we can represent mentioned instances
> in the relation of the RM.
>
> In the case that your database is in a production and that you have
> "natural key" which can't solve above mantioned problems, you can
> consider to redesign your application (this can be very complex and
> with millions of data). This redesign can have the limitations what is
> another long story.

As long as the users of the database can be confident that integrity is being maintained, it doesn't make sense to go to huge efforts to remove surrogate keys in the name of design purity. However, if poorly-chosen surrogate keys / badly-designed tables allow applications to erode relational integrity, you gotta bite the bullet and fix the problem, one way or another.

> Let me give you one example from the business environment but from a
> different point of view.
> A Honda dealer has received 100 new 2006 Honda civic cars and these
> 100 cars have the list with the 50 same attributes. You should design
> the Database for these entities using "natural key" and without
> using Vehicle Identification Number (VIN) or any other "surrogate
> key" or "artificial key".

This betrays a misunderstanding (perhaps not yours, I need to add) of what Codd (and Date, and others) mean by a "surrogate key" -- while a VIN is clearly a generated, synthetic key, it's not a surrogate, because it's handed to you by a source external to your database's universe; surrogates are created and managed internal to the DBMS. The reason I make this distinction is because I think VINs make an excellent candidate key for cars: they're guaranteed to be present, they're guaranteed to be unique, and they're guaranteed to last as long as the the relevant entity is in existence (a car with its VIN filed off is very unlikely to remain anything recognizable as the same car, as anyone who's had their car stolen and chopped learns). Just because they're incomprehensible gibberish handed to you by an external agency doesn't make them surrogate keys. This is why Codd's stricture that surrogate keys not be exposed outside the database is important: a surrogate key handed out to the world at large ceases to be a surrogate key, and becomes an artificial identifier that's fair game for use by parties over which the originating database has little influence or control. If it meets a set of criteria (some of which I mentioned above), it's even a *good* choice for a unique identifier, and therefore a key.

> And you will see that this is
> impossible, because all "fields" have the same corresponding
> values. The point is not that you can't work here with "natural
> key", the point is that you can't even represent these entities in
> the RM if you are using "natural key". Same thing if you are
> selling the Microsoft Windows on CD or the desktop computers...

I think this is really a consequence of having an overly restrictive definition of what can be included in a natural key. However, one of the consequences of using generated identifiers is that they end up being the sole component of the key, due to the principle of necessary irreducibility. Which is why there's typically something weird / bad going on if a surrogate identifier is used as an element in a multi-attribute candidate key.

Forrest L Norvell Received on Mon Jan 23 2006 - 00:20:08 CET

Original text of this message