Little design mistakes that can be easily avoided (2): Listenning to CELKO (and CELKO alikes)

From: Cimode <cimode_at_hotmail.com>
Date: 26 May 2007 08:26:28 -0700
Message-ID: <1180193188.041346.71010_at_q66g2000hsg.googlegroups.com>



Hi,

Chapter 2 of the little things that can be done at little cost to avoid misconceptions about RM design.

Lately, I opened a thread that gave opportunity for CELKO to post the exact essence of what should *not* be done in design: let subjectiveness regulate design instead of logic. I take the opportunity of his thread to point out how and why it is wiser to ignore such ideas to build better databases. For that I will take all the posted comments and try to make sense out of them.

Regard...

<<I have a taxonomy of key types. We have a lot of problems with terminology on this one, so let me get that out of the way.>> The solution starts with saying we have a problem. Indeed, that's promising for the rest of the post.

<<There is no such thing as a "universal, one-size-fits-all" key. Just
as no two sets of entities are the same>> or how to use a pointless analogy to prove a arbitrary meaningless idea.

<<, the attributes that make them unique have to be found in the reality of the data. >>
What is reality of data ? Does it mean that some data is real and some
is not?

<<Here is my classification of types of keys (needs a monofont):>> Given what I read until know, I am already brassing for impact.

                             natural artificial exposed surrogate
==================================================================
Constructed from reality    |
of the data model           |  Y         N         N         Y
                            |
verifiable in reality       |  Y         N         N         N
                            |
verifiable in itself        |  Y         Y         N         N
                            |
visible to the user         |  Y         Y         Y         N





<<1) A natural key is a subset of attributes which occur in a table and
act as a unique identifier. They are seen by the user.>>>> And that which subset of attribute that would be ?

Let me rephrase that : *because they would be seen by the user, a subset of attribute would be a natural key*. First example on how to introduce subjectivity: define concept of natural key according to human perception.

<<You can go to the external reality and verify them.>> Can somebody define what is *external reality*? Is there an *internal reality* ? I would have sweared there was only but one reality. Second example on how to introduce subjectivity: considering there are several realities based on perception.

<<you would also like to have some validation rule. Example: UPC codes on consumer goods (read the package barcode) and validate them with a check digit or a
manufacturer's website, geographical co-ordinates (get a GPS).>> So here is the final example using the *observe method*. While defining a natural key, it is funny to observe that we finally get to UPC code that may be itself a non natural key. What is natural key again ?

To summarize that first attempt by CELKO at defining a natural key:

--> A natural key is a subset of attribute. Question: Which one ?
--> That undefined subset of attributes is indeed a natural key as
soon as the user sees it.
--> There are several realities. In one of these realities called
*external reality*, the same user that saw the subset will be able to verify them.
--> That verification can for instance be done on a UPC code. A UPC

code is then a natural key (If I got this right).

I let the above definition to the evaluation of people. Now let's see the next element of the *taxonomy* exposed for classifying of primary keys.

<<2) An artificial key is an extra attribute added to the table which is
seen by the user. >>
So an artificial key is an extra attribute added to *some* table which *de facto* makes it an artificial key. Let's see the consequence of this intriguing statement: Take a table ADD an attribute and TA TA you got an *artificial key*. It reminds of the old magic tricks, you
know the one where a magician pulls a rabit from his hat. Third example of how to introduce subjectivity in key selection: believing in magic.

<It does not exist in the external reality, >> Again this *external reality* . Can somebody please explain what is *external reality* again (this is the second time I ask).

<<but can be verified for syntax>>
Ah, the verification process. That, my feeble mind begins to understand ... BUT WAIT...NO: it should be verified for syntax. I was positive I understood but then my mind does not grasp anymore: what is verifying for syntax? What syntax ? How do I verify syntax ?

<< or check digits inside itself.>>
I give up on understanding. Can somebody explain show me exactly how one *checks digits inside themselves*. This sentence is a masterpiece of things apparently only CELKO understands.

<<It is up to the DBA to maintain a trusted source for them.>> I thought we were defining an artificial key. Nw we are defining a DBA's role

<<Example: the open codes in the UPC scheme which a user can assign to his own stuff. The check
digits still work, but you have to verify them inside your own enterprise.>>
Again that UPC code. But I thought the UPC code was supposed to be a natural key from the previous definition of natural key.

<<If you have to construct a key yourself, it takes time to deisgn them,
to invent a validation rule, set up audit trails, etc.>> What key again? The artificial ? the natural ? THe one I can verify in *external* reality ? By syntax? by checking digits inside itself ? The one that DBA maintains ? Help I am lost!

To summarize that first attempt by CELKO at defining an artificial key (pay attention!):

--> An artificial key (not a natural key) is an extra attribute added
to *some* table
--> An artificial key does not exists in *external reality*.
--> An artificial key must be verified for *some* syntax ?

--> An artificial key should have it digits checked inside themselves
(I should have known!)
--> It's a DBA's role to maintain ALL the above source from a thrusted
source ? What source ?

Now that I am perfectly positive that I have no clue of what CELKO is refering to, I am going to try again (note that I don't give up easily!). Let' see with the third element. It seems not to talk

<<3) An "exposed physical locator" is not based on attributes in the data model and is exposed to user. >>
So that's a new key right ? (Oh yes he said <<Here is my classification of types of keys (needs a monofont):>> previously) So an exposed physical locator is a key right ? Yes I begin to understand. WAIT NO : I must first understand what is an exposed physical locator. (I am finally beginning to have some hope on understanding some day what CELKO refers to).

<<There is no reasonable way to predict it or verify it,>> NOOOOOOO. When I finally had some hope, CELKO tells me that there is no way I could verify or predict what an *exposed physical locator* is. Can somebody explain how one can define something he neither can predict nor verify ? OH YEAH it's magic (DUH)

<< since it usually comes from the physical state of the hardware at the time of data insertion.>>
Oh I get it!!! The third type of key called (the one I can neither verify nor I can predict) depends on the state of hardware at time of insertion. How silly of me: I was really convinced a key was a logical key but I understand now (thanks to CELKO) that a key is indeed a physical concept. Fourth example of how to introduce subjectivity in key selection: believing that a key is determined at some point by the physical layer.

<<The system obtains a value thru some physical process in the storage hardware
totally unrelated to the logical data model. Example: IDENTITY columns, other proprietary, non-relaitonal auto-numbering devices.>> Can someone explain how an arbitrary counter is *de facto* a possible primary key.

<<Technically, these are not really keys at all, since they are attributes of the PHYSICAL storage and are not even part of the LOGICAL data model. But they are handy for lazy, non-RDBMS programmers who don't want to research or think! This is the worst way to program in SQL.>>
BUT WAIT!!! Is CELKO saying that he created an entire taxonomy for keys just to let me know that the third element of that taxonomy in fact is NOT a key and wwas simply made up for lazy asses ? Fifth example of how to introduce subjectivity in key selection: selecting keys in function of lazy people.

Let me summarize that third type of key know as *exposed physical locator*

--> An *exposed physical locator* is not based on attributes
--> An *exposed physical locator* is determined physically
--> An *exposed physical locator* can not be verified, predicted but
it can defined
--> An *exposed physical locator* is a key for lazy people. BUT it is
not a key.

I am beginning to lose hope in understanding some day CELKO's taxonomy and definition for keys. After all this is the third element I don't get. BUT WAIT I am a tenacious DBA who will get to the bottom of this. I will read the last element and I shall understand CELKO...

<<4) A surrogate key is system generated to replace the actual key behind the covers where the user never sees it.>> A suurogate key replaces the key ? Which one of the above does it replace ? The natural key (element 1) ? The artificial key (element 2)? The "exposed physical locator" (element 3). Let me guess: A surrogate key replaces the natural key (element 1) . Therefore, a surrogate key is a key that a user does not see which replaces the natural key (you know the one that is a subset of attributes which becomes a key as soon as the user sees it).

<<It is based on attributes in the table. Example: Teradata hashing algorithms,
indexes, pointer chains, ADABASE numbers, etc.>> What attributes again ? AH the attributes...

<<The fact that you can never see it or use it for DELETE and UPDATE or
create it for INSERT is vital. When users can get to them, they will screw up the data integrity by getting the real keys and these physical locators out of synch. The system must maintain them. >>
That's one thing I understand. But I have the impression I heard that before from previous reading.

<<** Notice that people get "exposed physical locator" and surrogate mixed up; they are totally different concepts. **>> Thanks god I was cautious. Let me get this right: the *exposed physical locator* is not a *surrogate key*.

<<An appeal to authority, with a quote from Dr. Codd: "..Database users
may cause the system to generate or delete a surrogate, but they have no control over its value, nor is its value ever displayed to them ..."(Dr. Codd in ACM TODS, pp 409-410) and Codd, E. (1979), Extending the database relational model to capture more meaning. ACM Transactions on Database Systems, 4(4). pp. 397-434.>> That I understand but I don't understand how does that explains all the things that I have not understood so far.

<<This means that a surrogate ought to act like an index; created by the
user, managed by the system and NEVER seen by a user. That means never used in queries, DRI or anything else that a user does.>> Oh I get it a surrogate key is an index. An index is physical therefore a surrogate key is physical. But WAIT an exposed physical locator is physical too...(?) but they are totally different concepts. I am lost.

<<Codd also wrote the following:

There are three difficulties in employing user-controlled keys as permanent surrogates for entities.>>
Again that Codd ? Who's Codd anyway?

<<
(1) The actual values of user-controlled keys are determined by users and must therefore be subject to change by them (e.g. if two companies merge, the two employee databases might be combined with the result that some or all of the serial numbers might be changed.).

(2) Two relations may have user-controlled keys defined on distinct domains (e.g. one uses social security, while the other uses employee serial numbers) and yet the entities denoted are the same.

(3) It may be necessary to carry information about an entity either before it has been assigned a user-controlled key value or after it has ceased to have one (e.g. and applicant for a job and a retiree).

These difficulties have the important consequence that an equi-join on common key values may not yield the same result as a join on common entities. A solution - proposed in part [4] and more fully in [14] - is to introduce entity domains which contain system-assigned surrogates. Database users may cause the system to generate or delete a surrogate, but they have no control over its value, nor is its value ever displayed to them....." (Codd in ACM TODS, pp 409-410). >>

What again is the relationship between the above and the taxonomy of four elements established. Oh I understand at least what a surrogate key is: This Codd is really good. One out of 4 type I finally understand thanks to him. Not too bad. I was right not to give up.

Let me summarize that fourth of key know as *surrogate key*

--> Surrogate keys have nothing to do with *exposed physical locator*,
even though they are both physical
--> Surrogate keys are defined by Codd

[Snipped guidelines on getting a key according the above]

<<This is hard work. I have a few chapters in SQL PROGRAMMING STYLE on
scales, measurements and how to design encoding schemes.>> Hard work!!! CELKO is probably kidding. What *is* hard work is basically making sense of his *taxonomy* (Probably lost twelve pounds into reading the three first elements).
Sixth example on how to introduce subjectvity into key selection: making sense out of CELKO wrting and taxonomy



FINAL SUMMARY A summary of the little traps to avoid and examples exposed in this thread:

--> 1 Defining keys according to human perception
--> 2 Considering there are several realities (external/internal -
based on human perception)
--> 3 Believing in magic
--> 4 Believing that a key is physical concept
--> 5 Defining keys in function of lazy people.
--> 6 Making sense out of CELKO writing and *taxonomies*
Received on Sat May 26 2007 - 17:26:28 CEST

Original text of this message