Re: dbdebunk 'Quote of Week' comment

From: -CELKO- <jcelko212_at_earthlink.net>
Date: 18 Aug 2005 15:06:06 -0700
Message-ID: <1124402766.125463.57460_at_f14g2000cwb.googlegroups.com>


The fact that this question is being considered is a sign that we have a lot of really bad SQL programmers. By definition a relational key is a subset of attributes of an entity or relationship being modeled in the schema. This is absolute foundations of the Relational Model kind of stuff.

IDENTITY is an exposed PHYSICAL locator. What does that phrase mean? The value is created by looking at the internal state of the hardware at the time a PHYSICAL record containing a row is inserted into storage. Its purpose is to locate the row without any regard to what the data means.

Think about using a pointer or a track/sector number; same thing but different mechanism. But SQL does not have a pointer data type or the mechanisms to handle pointer operators, garbage collection and housekeeping, so 20+ years ago the original Sybase SQL Server exposed an integer that can map back to the contiguous storage model used under the covers.

IDENTITY is not an attribute in the data model and can never be an attribute in the data model because it does not exist in the reality from which you derive your data model. The purpose is to fake a sequential file's positional record number, so I can reference the PHYSICAL storage location. Sure, I lose all the advantages of an abstract data model, SQL set oriented programming, carry extra data, destroy the portability of code and have no data integrity. Who cares? It is easier to write than a real normalized RDBMS schema.

More and more programmers who have absolutely no database training are being told to design a database. It takes six years to become a Journeyman Union Carpenter in New York State. But a programmer with only a few years in a procedural or OO language is expected to produce a correct and usable database.

What is he going to do? The smart ones will get some help and beg for training, knowing they are over their heads and can ruin the company. Most of them simply start programming SQL as if it were their native programming language. They grab at GUIDs, IDENTITY, ROWID and other proprietary auto-numbering "features" in SQL products to imitate either a record number (sequential file system mindset) or OID (OO mindset) since they don't know anything else.

They write code with cursors to mimic record-at-a-time file handling -- it is easy do a global replace of READ() with FETCH. They write to temp tables to mimic scratch files in a series of procedural steps - it is hard to think of a single declarative statement to get the answer. They use dynamic SQL and let the user figure out how the system should work on the fly - they never had a software engineering course and don't know what coupling and cohesion are.

Experienced database designers look for industry standard codes for their keys first. Try to tell the IRS you don't have anybody's SSN at tax time but Cindy Lou Who was employee 42 when you put her into the Personnel table and you have the IDENTITY value to prove it. Try to sell a car without a VIN - using IDENTITY for this would be like identifying a vehicle by the current local parking space number (think about it - that is exactly what an IDENTITY value is -- a parking space in one physical location).

We know that we have to use industry standard codes, either de facto or de jure. We also appreciate the fact that we can exchange data with the rest of the world. We appreciate the fact that someone else will maintain and define these codes. A trusted external source is a good thing to have.

If you don't have an industry standard or natural key and have to design your own codes, it is hard work to do it right. I know that newbie programmers want to start coding first and thinking later. It does not work that way and all the "wish magic" in the world will not change that fact. I have several chapters in my books on how to design encoding schemes, but that is another topic.

The first practical consideration is that IDENTITY is proprietary and non-portable, so you know that you will have maintenance problems when you change releases or port your system to other SQL products. Newbies actually think they will never port code! Perhaps they only work for companies that are failing and will be gone. Perhaps their code is such crap nobody else want their application. Otherwise, you will port code; you will share data with some other database; data does not exist in isolation.

But let's look at the logical problems. First try to create a table with two columns and try to make them both IDENTITY. If you cannot declare more than one column to be of a certain data type, then that thing is not a data type at all, by definition. It is a property which belongs to the PHYSICAL table, not the LOGICAL data in the table.

Next, create a table with one column and make it an IDENTITY. Now try to insert, update and delete different numbers from it. If you cannot insert, update and delete rows from a table, then it is not a table by definition.

Finally create a simple table with one IDENTITY and a few other columns. Use a few statements like

INSERT INTO Foobar (a, b, c) VALUES ('a1', 'b1', 'c1');
INSERT INTO Foobar (a, b, c) VALUES ('a2', 'b2', 'c2');
INSERT INTO Foobar (a, b, c) VALUES ('a3', 'b3', 'c3');

To put a few rows into the table and notice that the IDENTITY sequentially numbered them in the PHYSICAL order they were presented. If you delete a row, the gap in the sequence is not filled in and the sequence continues from the highest number that has ever been used in that column in that particular table. This is how we did record numbers in pre-allocated sequential files in the 1950's, by the way. A utility program would then "pack" or "compress" the records that were flagged as deleted or unused to move the empty space to the PHYSICAL end of the PHYSICAL file. IDENTITY leaves the gaps unless you write your own routine to do the compression.

But now use a statement with a query expression in it, like this:

INSERT INTO Foobar (a, b, c)
SELECT x, y, z
FROM Floob;

Since a query result is a table, and a table is a set which has no ordering, what should the IDENTITY numbers be? The entire, whole, completed set is presented to Foobar all at once, not a row at a time. There are (n!) ways to number (n) rows, so which one did you pick? Why? The answer has been to use whatever the *PHYSICAL* order of the result set happened to be. That non-relational phrase "PHYSICAL order" again!

But it is actually worse than that. If the same query is executed again, but with new statistics or after an index has been dropped or added, the new execution plan could bring the result set back in a different PHYSICAL order. Indexes and statistics are not part of the logical model.

Can you explain from a logical model why the same rows in the second query get different IDENTITY numbers? In the relational model, they should be treated the same if all the values of all the attributes are identical.

How do you verify that an entity has the right key when you use an IDENTITY? If I use a VIN for a vehicle, I can go to parking lot and read it off the dashboard. If I use a UPC for a candy bar, I can read the bar code on the wrapper. But why is little Cindy Lou Who employee 42? If I turn her upside, will I find that she has 42 tattooed somewhere? If I call an external trusted source, will they know that she is employee 42?

In the Relational Model, you do not invent a key in the storage. You discover a key (and the other attributes) in the real world and model it. If you create your own encoding for a key, then you have to maintain it, provide audit trails and do all the work that an industry standard organization would do for you.

If I lean on a mouse button, I can insert the same data with a new IDENTITY over and over. Cindy Lou Who is now employed two times and none of my reports are right! Now I have to write some procedural code like a trigger or a UNIQUE constraint on her SSN to prevent this, thus making the IDENTITY redundant. But we were assuming that we use only IDENTITY as a key, so we are screwed.

Newbies often design tables without bothering to look for a relational key, so they are so surprised when they do a data warehouse and nobody else has any idea what they are doing in their subsystem.

Cindy Lou Who now has two rows in Personnel. When we sign her up for the Dental Plan, we get the row with 42. When we sign her up for the Bowling Team, we get the row with 43. We find our error, and delete the row with 42 because we have a row that was created later and we assume it is more current.

Another common way to get this is to have two procedures, one for inserting a new employee to the Dental Plan and one for inserting a new employee to the Bowling Team. Both procedures create a row in Personnel since they use only IDENTITY as a key.

Well, now we have an orphan row in Personnel. In fact, in such systems, you will find a lot of orphans. I worked for a company that used GUIDs for OIDs substitutes and our software fell apart in about a year of actual use by a client.

Another cute way to destroy data integrity:

BEGIN ATOMIC
DELETE FROM Foobar
WHERE id = <<some identity value>>;
INSERT INTO Foobar
VALUES (<<recreate deleted row>>)
END; Logically this should do nothing, but since IDENTITY has gaps, it trashes the data. When one query uses the IDENTITY and another uses the real key, you are like a man with two watches, you are never sure what time it is.

Finally, an appeal to authority, with a quote from Dr. Codd: "..Database users may cause the system to generate or delete a surrogate, but they have no control over its value, nor is its value ever displayed to them ..."(Dr. Codd in ACM TODS, pp 409-410) and Codd, E. (1979), Extending the database relational model to capture more meaning. ACM Transactions on Database Systems, 4(4). pp. 397-434.

This means that a surrogate ought to act like an index; created by the user, managed by the system and NEVER seen by a user. That means never used in queries, DRI or anything else that a user does.

Codd also wrote the following:

"There are three difficulties in employing user-controlled keys as permanent surrogates for entities.

(1) The actual values of user-controlled keys are determined by users
and must therefore be subject to change by them (e.g. if two companies merge, the two employee databases might be combined with the result that some or all of the serial numbers might be changed.).

(2) Two relations may have user-controlled keys defined on distinct
domains (e.g. one uses social security, while the other uses employee serial numbers) and yet the entities denoted are the same.

(3) It may be necessary to carry information about an entity either
before it has been assigned a user-controlled key value or after it has ceased to have one (e.g. and applicant for a job and a retiree).

These difficulties have the important consequence that an equi-join on common key values may not yield the same result as a join on common entities. A solution - proposed in part [4] and more fully in [14] - is to introduce entity domains which contain system-assigned surrogates. Database users may cause the system to generate or delete a surrogate, but they have no control over its value, nor is its value ever displayed to them....." (Codd in ACM TODS, pp 409-410).

References

Codd, E. (1979), Extending the database relational model to capture more meaning. ACM Transactions on Database Systems, 4(4). pp. 397-434

See this article for problems in replication with IDENTITY columns. http://www.simple-talk.com/2005/07/05/replication/  

--CELKO--
jcelko212_at_earthlink.net Received on Fri Aug 19 2005 - 00:06:06 CEST

Original text of this message