Re: Surrogate Key vs Production Key

From: --CELKO-- <jcelko212_at_earthlink.net>
Date: 9 Oct 2004 11:28:49 -0700
Message-ID: <18c7b3c2.0410091028.2a9a36c0@posting.google.com>

>> Oh boy, I wish you were not so verbose and roundabout, Joe. Since
your days at DBMS magazine nearly 20 years ago, it has been a trait of yours. <<

I get paid by the word in magazines :)
I am positively concise compared to most writers

>> You should add:

3) Can I ensure that a user entering the key will not mistype it? <<

That is part of validation -- the check digits are an old favorite of mine that new programmers simply do not know about. What are they teaching in schools? I understand not knowing the fancy ones like Diherdal Five, but Mod 11 is easy. The one thing that ought to have a CHECK() constraint on it is the keys.

>> That is by far the biggest source of errors. <<

Done a data warehouse yet? My favorite was the part number "I Hate my Job" which appeared thousands of times in legacy data and nobody had spotted it until the warehouse project.

>> Your tests are arbitrary and not specified anywhere as a mandatory
requirement. <<

I would not call validation and verification arbitrary ...

>> I've got a bit of news for you: it IS faster and easier to program.
And that means cost savings. <<

And what is the cost of the bad data you created?

>> And that is exactly what you get six relationships into the design,
with natural keys: a compound key that is so large it will be a total disaster to index and store... <<

This is the year 2004; storage is insanely cheap, why we make it from sand! And if there is a huge compound key, then you must make it unique to have a valid model -- unless doing it fast was more important than doing it right. What I find in reality with larger keys is that they are very natural to the users -- best examples: physical locations like (longitiude, latitude) or bin address systems (area, aisle, stack, bin).

You also forget that indexes are only one kind of access method. For compound keys and VLDB problems, I'd use a hash or compressed bit vector. Teradata for example takes an absolute max of two probes to locate any row regardless of the key; 90%+ are done in one probe.

>> and you believe all that is easier to program and faster than a
surrogate key based on a sequence, because?.... <<

Because I already have the columns that make up the key and I already have made them unique. In theory, if I were to type in a VIN in this posting, you could go to the Internet, and find out all about that car. If I were to type the exposed physical locator used in a particlar release of a particular product on one particular machine, what do you do with it?

Get on the internet and tell me what automobile is referenced by row_id (IDENTITY or whatever) 42 on my machine. Cannot do it. You have a file system with a physical index that is tied to the hardware, not a data model. if the machine wants to hide this "pointer chain", that's fine. But I never write code with it.

>> You missed all the discussions here about how unimportant database
portability really is, didn't you? :) <<

You haven't seen what I charge for that "unimportant" service, have you :)?

>> Is that why they use "pattern" programming nowadays? Which BTW
uses OID all over... <<

Because we have is bunch of certificate holders without any problem solving ability or education. You have to give them "paint by number" programming. Why don't newbie know about coupling, cohesion and basic software engineering nowdays? Rant, rant, rant.

>> No,you see: you cannot come up with an idiotic example and make it
stick as a valid interpretation. NO ONE advocates creating a table with two columns of type IDENTITY. <<

Look at the NUMBER(*) function in Sybase SQL Anywhere (nee WATCOM SQL); it could be used like any other function and you could have multiple calls. IDENTITY is a physical property and by its nature cannot be a proper datatype inside a table. The WATCOM guys took a different approach to auto-numbering things and they can do two or more columns.

I have worked with or advised 17 diffrent SQL products over the years --I've seen it all :)

>> You notion that a surrogate key needs to be compacted is straight
out of the 50s, quite frankly... <<

Yeah, so why do the SQL Server newsgroups have that question asked about every 3-4 months? becauyse the mental model is a 1950's file system!

>> No. The answer is to ignore ANY "meaning" of the surrogate or
IDENTITY key. It is YOU that is attributing a meaning to it. Uncalled for and totally incorrect and misleading. <<

Then you must never see it or code with it, as per Dr. Codd. My gripe is that newbies do code with it and expect it to maintain uniqueness for them.

>> Oh boy! So now all of a sudden, the order of storage of rows in a
relational database IS important? Wat was that bit about it not being ANY concern of the user? <<

If you use it as a key, then you should be able to verify it agains the reality of the model. Would you use a VIN number for your car title if it changed everytime you went to the DMV?

>>

CREATE Drivers
(driver_id IDENTITY (1,1) NOT NULL PRIMARY KEY, ssn CHAR(9) NOT NULL REFERENCES Personnel(ssn), vin CHAR(17) NOT NULL REFERENCES Motorpool(vin));

... That would indeed be a newbie! <<

Yep! and that design earns me a good living doing repair work :) But let's extend it a bit and add a natural key.

CREATE Drivers
(driver_id IDENTITY (1,1) NOT NULL PRIMARY KEY, ssn CHAR(9) NOT NULL UNIQUE -- one driver per vehicle

REFERENCES Personnel(ssn),
vin CHAR(17) NOT NULL REFERENCES Motorpool(vin));

If you update via ssn and I update via driver_id, what happens when they get out of synch?

UPDATE Drivers

SET ssn = '666-66-6666'
WHERE ssn = '777-77-7777';

Sure hope I got the right IDENTITY value when I say, Remember how worried you were about typos? Ever try to type a dozen GUIDs?

UPDATE Drivers

SET ssn = '666-66-6666'
WHERE driver_id = 42;

But worse:

BEGIN ATOMIC
DELETE FROM Drivers
WHERE ssn = ''777-77-7777';
-- opps! I did not mean to do that, so i need to re-build him

INSERT INTO Drivers (ssn, vin)
VALUES ('666-66-6666', 'KKNJH-JVH55-47984'); END; There are some other ways to screw up things -- rollbacks on the PK side after the FK side values are set, etc. IDENTITY was not meant to be used as a real key, since it lives only at the table level and not at the schema level. My point: programmer do not maintain fake pointer chains and when they try they are human will screw up.

When we had IDMS and other network databases, the system prevented most of the screw ups and utility programs to rebuild pointer chains. IDENTITY has nothing to do that. If I make a bulk copy error, the IDENTITY values are re-numbered and my PK-FK fake pointer chains are garbage.

>> And yet, since entity domains are not implemented or supported by
ANY database <<

Rdb, Mimer, Solid (I think)

>> ... (and few support cascaded updates), and we all have to live and
design and program systems in between, <<

DB2, Mimer, Ocelot, SQL Server, Sybase, etc.

You can use -- ugh!-- triggers until the vendors catch up to SQL-92. This avoids the need for exposed physical locators and lets you fake DRI cascades. I hate it because it is procedural code, but that's what we have. It is far safer than exposed physical locators.

>> I'd LOVE to use a pure relational database. One that fully supports
domains, for example. But the reality is: I'm stuck with products like Oracle. <<

Oracle is total piece of crap and I really feel sorry for the poor bastards that get stuck with it. You look at most production Oracle code and it is COBOL with sequential algorithms in a thing disguise. You just know that you are going to see cursors when you do the audit. Rant, rant, rant.

>> That is a VERY good point. About time Oracle fixed the RI
implementation in their engine. <<

How about the datatypes?

>> About time too that I could create a FK referencing a view column,
come to think of it. It is accepted by the syntax since V8, but inactive in terms of RI. Some rumours that it made it to 10g, but needs to be confirmed. <<

I don't understand that one. A VIEW does not exists until it is invoked then it disappears at the end of the query or statement. Any RI rule would have to be enforced in the defining SELECT in the VIEW definition. Is this "Oracle speak"?

Now that would make sense in the ANSI temp table model. The [LOCAL|GLOBAL] TEMPORARY TABLE cleans itself out either at the end of session or on COMMIT. But the structure is persistent and part of the schema, like any other table, so it can have any table constraint.

>> Nice to hear from you, BTW. <<

Likewise! Most of the kids these days on newsgroup are whining over-sensitive little wimps and you cannot get any good flow with them. Received on Sat Oct 09 2004 - 13:28:49 CDT