Re: Primary vs. Surrogate! What a nightmare debate.

From: --CELKO-- <jcelko212_at_earthlink.net>
Date: 17 Oct 2004 13:38:47 -0700
Message-ID: <18c7b3c2.0410171238.3096021a_at_posting.google.com>


We have a lot of problems with terminology on this one, so let me get that out of the way.

There is no such thing as a "universal, one-size-fits-all" key. Just as no two sets of entities are the same, the attributes that make them unique have to be found in the reality of the data. Here is my classification of types of keys:

                             natural artificial exposed surrogate 
==================================================================
Constructed from reality    |   
of the data model           |  Y         N         N         Y
                            |
verifiable in reality       |  Y         N         N         N
                            |
verifiable in itself        |  Y         Y         N         N
                            |
visible to the user         |  Y         Y         Y         N

  1. A natural key is a subset of attributes which occur in a table and act as a unique identifier. They are seen by the user. You can go to the external reality and verify them. you would also like to have some validation rule. Example: UPC codes on consumer goods (read the package barcode) and validate them with a check digit or a manufacturer's website, geographical co-ordinates (get a GPS).
  2. An artificial key is an extra attribute added to the table which is seen by the user. It does not exist in the external reality, but can be verified for syntax or check digits inside itself.

Example: the open codes in the UPC scheme which a user can assign to his own stuff. The check digits still work, but you have to verify them inside your own enterprise.

If you have to construct a key yourself, it takes time to deisgn them, to invetn a validation rule, etc.

3) An "exposed physical locator" is not based on attributes in the data model and is exposed to user. There is no way to predict it or verify it. The system obtains a value thru some physical process in the storage hardware totally unrelated to the logical data model. Example: IDENTITY columns, other proprietary, non-relaitonal auto-numbering devices.

Technically, these are not really keys at all, sinc they are attributes of the PHYSICAL storage and are not even part of the LOGICAL data model. But they are handy for lazry, non-RDBMS programmers who don't want to research or think! This is the worst way to program in SQL.

4) A surrogate key is system generated to replace the actual key behind the covers where the user never sees it. It is based on attributes in the table. Example: Teradata hashing algorithms, pointer chains.

The fact that you can never see it or use it for DELETE and UPDATE or create it for INSERT is vital. When users can get to them, they will screw up the data integrity by getting the real keys and these physical locators out of synch. The system must maintain them.

  • Notice that people get "exposed physical locator" and surrogate mixed up; they are totally different concepts. **

An appeal to authority, with a quote from Dr. Codd: "..Database users may cause the system to generate or delete a surrogate, but they have no control over its value, nor is its value ever displayed to them ..."(Dr. Codd in ACM TODS, pp 409-410) and Codd, E. (1979), Extending the database relational model to capture more meaning. ACM Transactions on Database Systems, 4(4). pp. 397-434.

This means that a surrogate ought to act like an index; created by the user, managed by the system and NEVER seen by a user. That means never used in queries, DRI or anything else that a user does.

Codd also wrote the following:

"There are three difficulties in employing user-controlled keys as permanent surrogates for entities.

(1) The actual values of user-controlled keys are determined by users and must therefore be subject to change by them (e.g. if two companies merge, the two employee databases might be combined with the result that some or all of the serial numbers might be changed.).

(2) Two relations may have user-controlled keys defined on distinct domains (e.g. one uses social security, while the other uses employee serial numbers) and yet the entities denoted are the same.

(3) It may be necessary to carry information about an entity either before it has been assigned a user-controlled key value or after it has ceased to have one (e.g. and applicant for a job and a retiree).

These difficulties have the important consequence that an equi-join on common key values may not yield the same result as a join on common entities. A solution - proposed in part [4] and more fully in [14] - is to introduce entity domains which contain system-assigned surrogates. Database users may cause the system to generate or delete a surrogate, but they have no control over its value, nor is its value ever displayed to them....." (Codd in ACM TODS, pp 409-410).

References

Codd, E. (1979), Extending the database relational model to capture more meaning. ACM Transactions on Database Systems, 4(4). pp. 397-434

>> My conclusion is that for small applications try to go with the
natural key
when it is readily available, but for large-scale warehouse applications
surrogates should be seriously considered. <<

Natural keys are vital to large-scale warehouse applications because you have to be able to verify and validate that data. You have to be able to link it to other data. How do you do that if you have your own little universe without any way to match on industry standard codes? Example: Wal-Mart uses Census data to plan its logisitics (the thongs that they sell in the senior citizens communities are for the feet; the thongs that they sell in the 18-25 year old resort beach communities are not). Why would they ignore the Census data codes and invent their own?

If you don't care about data integrity, why have a warehouse at all?

The steps for finding a key are:

  1. Look for an industry standard and the trusted source that maintains it.
  2. Look for a natural key in the attributes. Example: (longitude, latitude) makes a good key for a geographical location.
  3. If you must design a new identifier, plan it carefully -- especially if people will see and use it. You have to be able to validate it in application programs and the schema, so you need a regular expression, other syntax rule and/or check digits. You have to be able to be verify in the reality of the model or with a trusted source.

Example: a bank account number carry a lot of information about which bank and who the account belongs to in them. Look up the IBAN (International Bank Account Number) Standard. Received on Sun Oct 17 2004 - 22:38:47 CEST

Original text of this message