Re: Normalization, Natural Keys, Surrogate Keys

From: Rich Dillon <richdillon_at_mindspring.com>
Date: Thu, 16 May 2002 13:57:06 -0700
Message-ID: <eiWjfwR$BHA.2384_at_tkmsftngp02>


> creating a databse design which uses Natural Keys vs. Surrogate Keys
> because they "were necessary for the nomalization of the data model."
> The document went on to say, "natural keys are necessary to determine
> functional dependence and the efficient normalization of the the data
> model."
>
> Now, I may not be a database genius, but this doesn't sound right. It

As a simple example, let's say we record information about physicians in a medical practice on a lined piece of paper -- in this case, the equivalent of a table with no key declarations whatsoever.

fnm     lnm    gndr   upin
Jill     Smith    F     AB1234
Jill     Smith    F     AB1234
Jill     Smith    F     AB1234

The problems with this are obvious.

So what if we number the lines. Let's assume that this number represents a surrogate key which is generated by any means you like and is unknown to the outside world. The novice might be content with his new ability to identify each line uniquely. It should be clear though that this new scenario is little better than the old for the obvious reason that the # does not say anything about the physician -- it only describes the line on the paper.

#  fnm     lnm    gndr   upin
1  Jill     Smith    F     AB1234
2  Jill     Smith    F     AB1234
3  Jill     Smith    F     AB1234

And so it is with a database. We speak of rows in a table out of convenience. But rows aren't the things we model. Physicians (in this case) are. And identifying a row uniquely is of little value unless it is equivalent to identifying a physician uniquely -- an equivalence that can only be ensured if the natural key has been declared.

This last bit is also the answer to the question of functional dependency. The functional dependency above is *in reality* (upin -> fnm,lnm,gndr). That is, in the real world the value of upin implies the rest. The same can not be said about # unless we take it as a synonym for upin -- still requiring that the natural key is declared.

This doesn't mean that the lack of a natural key at the start has to be a show-stopper. We don't all have barcodes on our foreheads and as data architects this is a problem we have to keep in mind. In the end, though, something which is known in the world outside of the database needs to be declared as a key even if you've generated it yourself. Your bank gives you an account number which is printed on a card and on your statements and which you use to identify yourself to it. The utilities give you customer numbers which are printed on your bills. Your consultants are likely using the term "natural key" to include these -- well, I hope they are. :)

The surrogate key which is private to the database and declared alone (without any other keys) is the real problem because it has no ability to protect against duplicate entry and the consequences of failing to control redundancy can be serious.

All of the rules with which we define normalization in a data model deal with the prevention of redundancy and the update anomalies which result from redundancy. We can look at a schema in which only surrogate keys were declared and call it nth normal form while in reality the choice of keys has defeated the goal which that normal form is supposed to achieve. Neglecting to declare a natural key undermines your ability to protect the integrity of your data.

As an aside, I've not seen a clear guide for the use of the terms "surrogate key" and "natural key" to describe what are really four kinds of keys -- keys for internal use by the dbms, keys which are declared in the data model but not used by applications (other than for joining) and are unknown to the outside world, keys which are generated but externally used (account #s, invoice #s, etc), and keys which are natural attributes of the entity being modeled (from names to DNA). I've been using "surrogate" to represent the second of these and "natural" to cover the third and fourth. Received on Thu May 16 2002 - 22:57:06 CEST

Original text of this message