Re: Surrogate Key vs Production Key

From: Noons <wizofoz2k_at_yahoo.com.au>
Date: 9 Oct 2004 00:14:50 -0700
Message-ID: <73e20c6c.0410082314.33358956@posting.google.com>

jcelko212_at_earthlink.net (--CELKO--) wrote in message news:<18c7b3c2.0410081648.6958dfc4_at_posting.google.com>...

Oh boy, I wish you were not so verbose and roundabout, Joe. Since your days at DBMS magazine nearly 20 years ago, it has been a trait of yours.

>
> 1) Can I validate the key in itself? For example, an ISBN has a check
> digit, I parse a VIN to see that it is in the format with a simple
> regular expression, etc.
>
> 2) Can I verify the in the reality I am modeling? For example, I can
> read the barcode on the back of a book for the ISBN, I can read the
> VIN off of the engine block of my car, etc.

You should add:
3) Can I ensure that a user entering the key will not mistype it?

That is by far the biggest source of errors. Hope I don't need to go into why changing a key after data entry is bad?
Codd in your quote below points out this problem very clearly. Why did you omit it?

> Exposed physical locators like IDENTITY in T-SQL or Sequence in Oracle
> fail both tests.

Your tests are arbitrary and not specified anywhere as a mandatory requirement. Besides like I said: you omitted the most important.

> natural two or three integer compound key with a huge GUID that no
> human being or other system can possibly understand because they think
> it will be faster and easy to program.

I've got a bit of news for you: it IS faster and easier to program. And that means cost savings.

> database. A tree index requires more and more probes as the size of
> the database increases.

And that is exactly what you get six relationships into the design, with natural keys: a compound key that is so large it will be a total disaster to index and store...

> A3: A long key is not always a bad thing for performance. For
> example, if I use (city, state) as my key, I get a free index on just
> (city). I can also add extra columns to the key to make it a
> super-key when such a super-key gives me a covering index (i.e. an
> index which contains all of the columns required for a query, so that
> the base table does not have to be accessed at all).

and you believe all that is easier to program and faster than a surrogate key based on a sequence, because?....

>
> A handle to the row? Oh, you mean faking a sequential file's
> positional record number, so I can reference the physical storage
> location?

No. They mean a handle to the row. Its sequence is immaterial, always was, always will be. Or else it is not a surrogate key, it is a natural key.

> Experienced database designers tend toward intelligent keys they find
> in industry standard codes, such as UPC, VIN, GTIN, ISBN, etc. They

Except when they are using datawarehouses or any system that deals with human data entry. Try a search in google and you'll see what I mean. Actually, search google with "referential integrity surrogate key vs natural key" and you'll see what I mean.

> But physically contiguous storage is only one way of building a
> relational database and it is not the best one. The basic idea of a
> relational database is that user is not supposed to know *how* or
> *where* things are stored at all, much less write code that depends on
> the particular physical representation in a particular release of a
> particular product on particular hardware at a particular time.
>

Actually, the basic idea of a relatioal database is that the user is not supposed to know how or where thigns are stored at all, PERIOD.

The rest of the stuff about coding is your own addition and is nowhere to be seen in the relational model: it deals with data management, not programming.

> The first practical consideration is that IDENTITY is proprietary and
> non-portable, so you know that you will have maintenance problems when
> you change releases or port your system to other products. Newbies

You missed all the discussions here about how unimportant database portability really is, didn't you? :)

> their code is such crap nobody else want their application. But
> professionals write code that has some ability to endure.

Is that why they use "pattern" programming nowadays? Which BTW uses OID all over...

> But let's look at the logical problems. First try to create a table
> with two columns and try to make them both IDENTITY. If you cannot
> declare more than one column to be of a certain data type, then that
> thing is not a datatype at all, by definition. It is a property which
> belongs to the PHYSICAL table, not the LOGICAL data in the table.

No,you see: you cannot come up with an idiotic example and make it stick as a valid interpretation. NO ONE advocates creating a table with two columns of type IDENTITY. It's not been done anywhere, it's not needed for surrogate keys. So, your example is totally misconstrued.

> that column in that particular table. This is how we did record
> numbers in pre-allocated sequential files in the 1950's, by the way. A
> utility program would then "pack" or "compress" the records that were
> flagged as deleted to move the empty space to the physical end of the
> physical file.

You notion that a surrogate key needs to be compacted is straight out of the 50s, quite frankly...

> Since a query result is a table, and a table is a set which has no
> ordering, what should the IDENTITY numbers be? The entire, whole,
> completed set is presented to Foobar all at once, not a row at a time.
> There are (n!) ways to number (n) rows, so which one do you pick?
> The answer has been to use whatever the *physical* order of the result
> set happened to be.

No. The answer is to ignore ANY "meaning" of the surrogate or IDENTITY key. It is YOU that is attributing a meaning to it. Uncalled for and totally incorrect and misleading.

> But it is actually worse than that. If the same query is executed
> again, but with new statistics or after an index has been dropped or
> added, the new execution plan could bring the result set back in a
> different physical order. Non-deterministic behavior is not good for
> data quality.

Oh boy! So now all of a sudden, the order of storage of rows in a relational database IS important? Wat was that bit about it not being ANY concern of the user? What was the ORDER BY clause for? You missed it, didn't you? Why are you expecting two queries on same data WITHOUT an ORDER BY to return the same ordered set at two different times? Do you realise how many relational pragmas that breaks?

> Using IDENTITY as a primary key is a sign that there is no data model,
> only an imitation of a sequential file system. Since this "magic,

Using a surrogate key means absolutely NOTHING in terms of order of data. You are attributing a sequence to it that only exists in your imagination.

> database product, how do you verify that an entity has such a number
> in the reality you are modeling?

You don't "verify" anything against a surrogate key other than referential integrity, that is the whole point.

> You will see newbies who design tables like this:
>
> CREATE Drivers
> (driver_id IDENTITY (1,1) NOT NULL PRIMARY KEY,
> ssn CHAR(9) NOT NULL REFERENCES Personnel(ssn),
> vin CHAR(17) NOT NULL REFERENCES Motorpool(vin));
>
> Now input data and submit the same row a thousand times, a million
> times. Your data integrity is trashed. The natural key was this:

That would indeed be a newbie!

> Finally, an appeal to authority, with a quote from Dr. Codd:
> "..Database users may cause the system to generate or delete a
> surrogate, but they have no control over its value, nor is its value
> ever displayed to them ..."(Dr. Codd in ACM TODS, pp 409-410) and
> Codd, E. (1979), Extending the database relational model to capture
> more meaning. ACM Transactions on Database Systems, 4(4). pp.
> 397-434.

EXACTLY! Why are YOU ascribing a meaning and an implied sequence, to it?

>
> This means that a surrogate ought to act like an index; created by the
> user, managed by the system and NEVER seen by a user. That means
> never used in queries, DRI or anything else that a user does.

Precisely. So, why do you insist and persist in doing exactly the opposite in your examples?

> Codd also wrote the following:
>
> "There are three difficulties in employing user-controlled keys as
> permanent surrogates for entities.
>
> (1) The actual values of user-controlled keys are determined by users
> and must therefore be subject to change by them (e.g. if two companies
> merge, the two employee databases might be combined with the result
> that some or all of the serial numbers might be changed.).
>
> (2) Two relations may have user-controlled keys defined on distinct
> domains (e.g. one uses social security, while the other uses employee
> serial numbers) and yet the entities denoted are the same.
>
> (3) It may be necessary to carry information about an entity either
> before it has been assigned a user-controlled key value or after it
> has ceased to have one (e.g. and applicant for a job and a retiree).
>
> These difficulties have the important consequence that an equi-join on
> common key values may not yield the same result as a join on common
> entities. A solution - proposed in part [4] and more fully in [14] -
> is to introduce entity domains which contain system-assigned
> surrogates. Database users may cause the system to generate or delete
> a surrogate, but they have no control over its value, nor is its value
> ever displayed to them....." (Codd in ACM TODS, pp 409-410).

And yet,since entity domains are not implemented or supported by ANY database (and few support cascaded updates), and we all have to live and design and program systems in between, can we close the case and just say:

surrogate keys are the way to go UNTIL such time as the alternative is fully supported and viable in actual real life commercial database systems?

I'd LOVE to use a pure relational database. One that fully supports domains, for example. But the reality is: I'm stuck with products like Oracle. And I've got to pump out systems, not theory. Therefore, surrogate keys stay. The real ones, not the abortion that you use in your examples.

> Working with Oracle and not having ON UPDATE CASCADE is a HUGE pain
> for you. You will have to make up for their failure to provide you
> with even a minimal SQL implementation in their sub-standard product.
> if you can switch to a real SQL, I would do so immediately.

That is a VERY good point. About time Oracle fixed the RI implementation in their engine. About time too that I could create a FK referencing a view column, come to think of it. It is accepted by the syntax since V8, but inactive in terms of RI. Some rumours that it made it to 10g, but needs to be confirmed.

Nice to hear from you, BTW. Received on Sat Oct 09 2004 - 02:14:50 CDT