Re: Surrogate Key vs Production Key

From: Mark C. Stock <mcstockX_at_Xenquery>
Date: Wed, 13 Oct 2004 09:22:49 -0400
Message-ID: <RYGdnZkHtp-rsfDcRVn-og@comcast.com>

"Noons" <wizofoz2k_at_yahoo.com.au> wrote in message news:73e20c6c.0410130324.15cb8460_at_posting.google.com...
| "Mark C. Stock" <mcstockX_at_Xenquery .com> wrote in message
news:<WKGdnU6c4swsT_bcRVn-qQ_at_comcast.com>...
| >
| > i'm jumping into this one a little late, so maybe i've missed a few
| > comments...
|
| Yeah. One very important one: why is it that no significant
| commercial app uses natural keys? Not one!
| Another even more significant one: why is it that Oracle's
| OWN apps do not use natural keys?
|

you're talking about implementation decisions -- and in a commercial app there will be a much higher level of compromise, either for generic functionality or because of marketing pressure. and generally, generalization are never true ;-)

big difference -- perhaps the entire reason for apparent disagreement

you appear to sometimes use 'natural key' to refer to an 'intelligent key', which is quite different from a natural key structure, that is, a composite (multi-column) primary key or foreign key, and sometimes use it to refer to a multi-column composite key

you also appear to use the term 'surrogate key' as a synomym for 'system-assigned', which it is not. surrogate is a replacement, which could just as easily be user-assigned, but typically is system assigned. a system-assigned key may very well be a component of a natural key structure

VIN was mentioned as a natural key -- but it actually is an intelligent key which has not been decomposed. If it is decomposed into its component columns, then it becomes a natural key structure and significant information about what is and is not in the scope of the data model becomes apparent, as each component and the relationship that it represents is exposed, and hopefully properly analyzed. That analysis will not happen if the data modeler simply grabs a data values that is currently used in the business as a unique identifier without determining how that value is established and if it needs to be decomposed, or if that value is simply used as an attribute and a system-assigned key is used as a surrogate key (which may not be a bad thing, the point is, the analysis of the business's current practices must be peformed)

Data Modeling 101: the conceptual model precedes the implementation (referred to sometimes as the logical model or the physical model), and implementation details are compromises made to apply the conceptual model to a particular technology or environment. The relationship between the two models is essential to good design.

| > but when establishing the data model, introducing a surrogate key column
to
| > replace a natural key column structure seriously weakens the data model
and
| > often hides (or impedes discovery of) very important facts about the
| > business
|
| Discovery of what? If you haven't found it at conceptual model
| time already, anything you find now probably is incorrect, a
| misconception or an arbitrary decision.
|
|

My comments were specific to the data modeling phase (conceptual model) but discovery does continue during initial implementation as well as ongoing use of the data. No analyst, and no data model, perfectly captures every nuance of a business at the first take -- new discoveries typically clarify earlier incorrect, arbitrary or misconceived ideas. I expect my knowledge (and the users' knowledge) of a business process to grow during the life of an application, not peak during the initial analysis. (I know that last comment leaves me open to criticism of doing insufficient analysis, but the careful reader will see that my point is that more will always be learned as the application is used and evolves, new observations and discoveries are not automatically incorrect).

|
| >
| > by modeling this without a surrogate key, the business rule 'a person
can
| > only be assigned to use a vehicle that is assigned to her department' is
| > easily enforced via the relationship 'each person may currently be
assigned
| > one and only one department vehicle'
|
| "easily enforced". Above, it was "a serious fact about the
| business" that would be "hidden" by a surrogate key.
| You see, you can't just change the facts to match the example.
| Either you provide a concrete example of that "weakness", or
| you provide an example of a convenient way of specifying
| an arbitrary relationship. The two are not the same.
|

since i presented a hypothetical case, i covered two possible ways the business rule could be viewed by the sponsoring organization. in real life, the facts would have to be established and then used to decide on the appropriate model -- no changing of facts here, just presenting variations of a scenario and how they can be handled using the natural key structure

i think it's quite concrete that using a column as an FK to a parent table and part of a multi-part FK to a sibling table maintains the sibling part of the relationship quite efficiently -- using surrogate keys would require additional code

the second example of not overlapping the FKs still has implementation benefits over the surrogate key approach, while providing more flexibility for the business rule to change (note that in real life facts do change, and the analyst must assess the probability of change when making design and implementation decisions)

and, more important, preserving this key structure in the conceptual model supports more thorough analysis of the business problem

since you don't like natural keys, i don't think i could provide any additional examples that you'd be any happier with ;-)

well, i think an FK or CHECK violation is quite automatic

if the early model is 'test-driven', it becomes very apparent that if an employee has been assigned to department 10, and a vehicle 37 is assigned to department 14, the employee cannot ever be assigned to vehicle 37, since the assigned department does not match the vehicle department -- and that just requires FK constraints (conversely, if i try to reassign the employee to another department, i am prevented until the current vehicle assignment is removed)

if a surrogate key is used, the vehicle 37 assigned to department 10 would be accessed via vehicle assignment #3213 -- there is nothing in the model that prevents a department 10 employee from being assigned that vehicle... additionaly code would need to be required to lookup the assigned vehicle's department and compare it with the employee's assigned department. no code versus some code seems easier to me (plus additional code to handle checking of the vehicle assignement before reassigning the employee to another department)

perhaps you'd care to provide an example of implementing this scenario with surrogate keys?

|
|
| > time and time again i have seen very poor data models developed because
an
| > organization standardized on surrogate keys and replaced every
multi-column
| > primary key with a single-column surrogate key as soon as the entity was
| > discovered
|
| time and again I've seen very poor data models developed because
| an organization insists on using "natural keys"
| were there ain't any. And time and again I've seen every commercial
| system out there never using natural keys. I wonder why?
| Ah yes: they are ALL "very poorly" designed?
|

you are confusing a number of issues.

i made clear that by 'natural key' i am refering to the structure. the key in the primary entity will likely be system-assigned.

your comment seems to focus on user-assigned values that are not properly decomposed. and, yes, they do cause problems (see my comments on VIN as a key column vs VIN as a decomposed multi-column key)

and yes, many commercial systems are very poorly designed, but not all. and many do use natural key structures (ie composite PK and FK structures), typically referencing system-assigned PK values from the primary entities.

running the following query against a schema in a commercial application will prove quite instructive:

SELECT   c.table_name
        ,c.constraint_type
        ,c.constraint_name
        ,COUNT(*)
    FROM dba_cons_columns cc
        ,dba_constraints c
   WHERE c.owner = :owner
     AND constraint_type IN ('P', 'R')
     AND c.owner = cc.owner
     AND c.constraint_name = cc.constraint_name
GROUP BY c.table_name
        ,c.constraint_type
        ,c.constraint_name

HAVING COUNT(*) > 1
ORDER BY c.table_name

        ,constraint_type
        ,constraint_name

if any rows show up, there very likely is some use of natural key structures (composite keys)

using Oracle Portal as an example:

the table WWA_FORM$ has a 3-column PK, which consists of SUBSCRIBER_ID, ID, and VERSION_ID). it is referenced by WWA_FORM_ITEM$, which as a corresponding 3-column FK, and a 4-column PK.

this correctly models a dependent relationship with a couple exceptions -- the WWA_FORM$ table is actually a Form Version table, the designer left out the actual Form table in favor of storing data denormalize in the WWA_FORM$ table (or perhaps after determining that the true Form table would have no attributes other than the key). there also is no FK to a Subcriber table -- again, the designer likely determined that that table was out of scope.

so, looks like there is at least one commercial app that used natural key structures (composite keys) rather than surrogate keys

maybe we agree after all, to a point. but if you advocate always using surrogate keys for all tables, then you end up with a very broad, and arbitrary, divergence between the conceptual model and the implementation -- and restrict yourself from employing some very useful data modeling techniques

that's the point -- but the reasons for the compromises should also be documented, and there should by no IFs about having a conceptual data model

thanks for the post -- i was having trouble waking up this morning and this got my mind moving.

++ mcs Received on Wed Oct 13 2004 - 08:22:49 CDT