Re: Primary vs. Surrogate! What a nightmare debate.

From: Kostas <noemail_at_noemail.net>
Date: Sun, 17 Oct 2004 19:01:13 -0400
Message-ID: <10n5v3g8u3jvac9_at_corp.supernews.com>


Thanks for the lengthy and informative response. I have seen parts of this response actually either posted by you or quoted from others on different db forums and websites.

Cheers!
Konstantinos

"--CELKO--" <jcelko212_at_earthlink.net> wrote in message news:18c7b3c2.0410171238.3096021a_at_posting.google.com...
> We have a lot of problems with terminology on this one, so let me get
> that out of the way.
>
> There is no such thing as a "universal, one-size-fits-all" key. Just
> as no two sets of entities are the same, the attributes that make them
> unique have to be found in the reality of the data. Here is my
> classification of types of keys:
>
> natural artificial exposed surrogate
> ==================================================================
> Constructed from reality |
> of the data model | Y N N Y
> |
> verifiable in reality | Y N N N
> |
> verifiable in itself | Y Y N N
> |
> visible to the user | Y Y Y N
>
> 1) A natural key is a subset of attributes which occur in a table and
> act as a unique identifier. They are seen by the user. You can go to
> the external reality and verify them. you would also like to have
> some validation rule. Example: UPC codes on consumer goods (read the
> package barcode) and validate them with a check digit or a
> manufacturer's website, geographical co-ordinates (get a GPS).
>
> 2) An artificial key is an extra attribute added to the table which is
> seen by the user. It does not exist in the external reality, but can
> be verified for syntax or check digits inside itself.
>
> Example: the open codes in the UPC scheme which a user can assign to
> his own stuff. The check digits still work, but you have to verify
> them inside your own enterprise.
>
> If you have to construct a key yourself, it takes time to deisgn them,
> to invetn a validation rule, etc.
>
> 3) An "exposed physical locator" is not based on attributes in the
> data model and is exposed to user. There is no way to predict it or
> verify it. The system obtains a value thru some physical process in
> the storage hardware totally unrelated to the logical data model.
> Example: IDENTITY columns, other proprietary, non-relaitonal
> auto-numbering devices.
>
> Technically, these are not really keys at all, sinc they are
> attributes of the PHYSICAL storage and are not even part of the
> LOGICAL data model. But they are handy for lazry, non-RDBMS
> programmers who don't want to research or think! This is the worst
> way to program in SQL.
>
> 4) A surrogate key is system generated to replace the actual key
> behind the covers where the user never sees it. It is based on
> attributes in the table. Example: Teradata hashing algorithms,
> pointer chains.
>
> The fact that you can never see it or use it for DELETE and UPDATE or
> create it for INSERT is vital. When users can get to them, they will
> screw up the data integrity by getting the real keys and these
> physical locators out of synch. The system must maintain them.
>
> ** Notice that people get "exposed physical locator" and surrogate
> mixed up; they are totally different concepts. **
>
> An appeal to authority, with a quote from Dr. Codd: "..Database users
> may cause the system to generate or delete a surrogate, but they have
> no control over its value, nor is its value ever displayed to them
> ..."(Dr. Codd in ACM TODS, pp 409-410) and Codd, E. (1979), Extending
> the database relational model to capture more meaning. ACM
> Transactions on Database Systems, 4(4). pp. 397-434.
>
> This means that a surrogate ought to act like an index; created by the
> user, managed by the system and NEVER seen by a user. That means
> never used in queries, DRI or anything else that a user does.
>
> Codd also wrote the following:
>
> "There are three difficulties in employing user-controlled keys as
> permanent surrogates for entities.
>
> (1) The actual values of user-controlled keys are determined by users
> and must therefore be subject to change by them (e.g. if two companies
> merge, the two employee databases might be combined with the result
> that some or all of the serial numbers might be changed.).
>
> (2) Two relations may have user-controlled keys defined on distinct
> domains (e.g. one uses social security, while the other uses employee
> serial numbers) and yet the entities denoted are the same.
>
> (3) It may be necessary to carry information about an entity either
> before it has been assigned a user-controlled key value or after it
> has ceased to have one (e.g. and applicant for a job and a retiree).
>
> These difficulties have the important consequence that an equi-join on
> common key values may not yield the same result as a join on common
> entities. A solution - proposed in part [4] and more fully in [14] -
> is to introduce entity domains which contain system-assigned
> surrogates. Database users may cause the system to generate or delete
> a surrogate, but they have no control over its value, nor is its value
> ever displayed to them....." (Codd in ACM TODS, pp 409-410).
>
> References
>
> Codd, E. (1979), Extending the database relational model to capture
> more meaning. ACM Transactions on Database Systems, 4(4). pp.
> 397-434
>
>>> My conclusion is that for small applications try to go with the
> natural key
> when it is readily available, but for large-scale warehouse
> applications
> surrogates should be seriously considered. <<
>
> Natural keys are vital to large-scale warehouse applications because
> you have to be able to verify and validate that data. You have to be
> able to link it to other data. How do you do that if you have your
> own little universe without any way to match on industry standard
> codes? Example: Wal-Mart uses Census data to plan its logisitics (the
> thongs that they sell in the senior citizens communities are for the
> feet; the thongs that they sell in the 18-25 year old resort beach
> communities are not). Why would they ignore the Census data codes and
> invent their own?
>
> If you don't care about data integrity, why have a warehouse at all?
>
> The steps for finding a key are:
>
> 1) Look for an industry standard and the trusted source that maintains
> it.
>
> 2) Look for a natural key in the attributes. Example: (longitude,
> latitude) makes a good key for a geographical location.
>
> 3) If you must design a new identifier, plan it carefully --
> especially if people will see and use it. You have to be able to
> validate it in application programs and the schema, so you need a
> regular expression, other syntax rule and/or check digits. You have
> to be able to be verify in the reality of the model or with a trusted
> source.
>
> Example: a bank account number carry a lot of information about which
> bank and who the account belongs to in them. Look up the IBAN
> (International Bank Account Number) Standard.
Received on Mon Oct 18 2004 - 01:01:13 CEST

Original text of this message