Re: dbdebunk 'Quote of Week' comment

From: Dan <>
Date: Fri, 19 Aug 2005 10:44:58 -0500
Message-ID: <%7nNe.30$>

On 8/18/2005 5:06 PM, -CELKO- wrote:
> The fact that this question is being considered is a sign that we have
> a lot of really bad SQL programmers. By definition a relational key is
> a subset of attributes of an entity or relationship being modeled in
> the schema. This is absolute foundations of the Relational Model kind
> of stuff.
> IDENTITY is an exposed PHYSICAL locator. What does that phrase mean?
> The value is created by looking at the internal state of the hardware
> at the time a PHYSICAL record containing a row is inserted into
> storage. Its purpose is to locate the row without any regard to what
> the data means.
> Think about using a pointer or a track/sector number; same thing but
> different mechanism. But SQL does not have a pointer data type or the
> mechanisms to handle pointer operators, garbage collection and
> housekeeping, so 20+ years ago the original Sybase SQL Server exposed
> an integer that can map back to the contiguous storage model used under
> the covers.
> IDENTITY is not an attribute in the data model and can never be an
> attribute in the data model because it does not exist in the reality
> from which you derive your data model. The purpose is to fake a
> sequential file's positional record number, so I can reference the
> PHYSICAL storage location. Sure, I lose all the advantages of an
> abstract data model, SQL set oriented programming, carry extra data,
> destroy the portability of code and have no data integrity. Who cares?
> It is easier to write than a real normalized RDBMS schema.
> More and more programmers who have absolutely no database training are
> being told to design a database. It takes six years to become a
> Journeyman Union Carpenter in New York State. But a programmer with
> only a few years in a procedural or OO language is expected to produce
> a correct and usable database.
> What is he going to do? The smart ones will get some help and beg for
> training, knowing they are over their heads and can ruin the company.
> Most of them simply start programming SQL as if it were their native
> programming language. They grab at GUIDs, IDENTITY, ROWID and other
> proprietary auto-numbering "features" in SQL products to imitate either
> a record number (sequential file system mindset) or OID (OO mindset)
> since they don't know anything else.
> They write code with cursors to mimic record-at-a-time file handling --
> it is easy do a global replace of READ() with FETCH. They write to temp
> tables to mimic scratch files in a series of procedural steps - it is
> hard to think of a single declarative statement to get the answer. They
> use dynamic SQL and let the user figure out how the system should work
> on the fly - they never had a software engineering course and don't
> know what coupling and cohesion are.
> Experienced database designers look for industry standard codes for
> their keys first. Try to tell the IRS you don't have anybody's SSN at
> tax time but Cindy Lou Who was employee 42 when you put her into the
> Personnel table and you have the IDENTITY value to prove it. Try to
> sell a car without a VIN - using IDENTITY for this would be like
> identifying a vehicle by the current local parking space number (think
> about it - that is exactly what an IDENTITY value is -- a parking space
> in one physical location).
> We know that we have to use industry standard codes, either de facto or
> de jure. We also appreciate the fact that we can exchange data with the
> rest of the world. We appreciate the fact that someone else will
> maintain and define these codes. A trusted external source is a good
> thing to have.
> If you don't have an industry standard or natural key and have to
> design your own codes, it is hard work to do it right. I know that
> newbie programmers want to start coding first and thinking later. It
> does not work that way and all the "wish magic" in the world will
> not change that fact. I have several chapters in my books on how to
> design encoding schemes, but that is another topic.
> The first practical consideration is that IDENTITY is proprietary and
> non-portable, so you know that you will have maintenance problems when
> you change releases or port your system to other SQL products. Newbies
> actually think they will never port code! Perhaps they only work for
> companies that are failing and will be gone. Perhaps their code is such
> crap nobody else want their application. Otherwise, you will port code;
> you will share data with some other database; data does not exist in
> isolation.
> But let's look at the logical problems. First try to create a table
> with two columns and try to make them both IDENTITY. If you cannot
> declare more than one column to be of a certain data type, then that
> thing is not a data type at all, by definition. It is a property which
> belongs to the PHYSICAL table, not the LOGICAL data in the table.
> Next, create a table with one column and make it an IDENTITY. Now try
> to insert, update and delete different numbers from it. If you cannot
> insert, update and delete rows from a table, then it is not a table by
> definition.
> Finally create a simple table with one IDENTITY and a few other
> columns. Use a few statements like
> INSERT INTO Foobar (a, b, c) VALUES ('a1', 'b1', 'c1');
> INSERT INTO Foobar (a, b, c) VALUES ('a2', 'b2', 'c2');
> INSERT INTO Foobar (a, b, c) VALUES ('a3', 'b3', 'c3');
> To put a few rows into the table and notice that the IDENTITY
> sequentially numbered them in the PHYSICAL order they were presented.
> If you delete a row, the gap in the sequence is not filled in and the
> sequence continues from the highest number that has ever been used in
> that column in that particular table. This is how we did record numbers
> in pre-allocated sequential files in the 1950's, by the way. A utility
> program would then "pack" or "compress" the records that were flagged
> as deleted or unused to move the empty space to the PHYSICAL end of the
> PHYSICAL file. IDENTITY leaves the gaps unless you write your own
> routine to do the compression.
> But now use a statement with a query expression in it, like this:
> INSERT INTO Foobar (a, b, c)
> SELECT x, y, z
> FROM Floob;
> Since a query result is a table, and a table is a set which has no
> ordering, what should the IDENTITY numbers be? The entire, whole,
> completed set is presented to Foobar all at once, not a row at a time.
> There are (n!) ways to number (n) rows, so which one did you pick? Why?
> The answer has been to use whatever the *PHYSICAL* order of the result
> set happened to be. That non-relational phrase "PHYSICAL order" again!
> But it is actually worse than that. If the same query is executed
> again, but with new statistics or after an index has been dropped or
> added, the new execution plan could bring the result set back in a
> different PHYSICAL order. Indexes and statistics are not part of the
> logical model.
> Can you explain from a logical model why the same rows in the second
> query get different IDENTITY numbers? In the relational model, they
> should be treated the same if all the values of all the attributes are
> identical.
> How do you verify that an entity has the right key when you use an
> IDENTITY? If I use a VIN for a vehicle, I can go to parking lot and
> read it off the dashboard. If I use a UPC for a candy bar, I can read
> the bar code on the wrapper. But why is little Cindy Lou Who employee
> 42? If I turn her upside, will I find that she has 42 tattooed
> somewhere? If I call an external trusted source, will they know that
> she is employee 42?
> In the Relational Model, you do not invent a key in the storage. You
> discover a key (and the other attributes) in the real world and model
> it. If you create your own encoding for a key, then you have to
> maintain it, provide audit trails and do all the work that an industry
> standard organization would do for you.
> If I lean on a mouse button, I can insert the same data with a new
> IDENTITY over and over. Cindy Lou Who is now employed two times and
> none of my reports are right! Now I have to write some procedural code
> like a trigger or a UNIQUE constraint on her SSN to prevent this, thus
> making the IDENTITY redundant. But we were assuming that we use only
> IDENTITY as a key, so we are screwed.
> Newbies often design tables without bothering to look for a relational
> key, so they are so surprised when they do a data warehouse and nobody
> else has any idea what they are doing in their subsystem.
> Cindy Lou Who now has two rows in Personnel. When we sign her up for
> the Dental Plan, we get the row with 42. When we sign her up for the
> Bowling Team, we get the row with 43. We find our error, and delete the
> row with 42 because we have a row that was created later and we assume
> it is more current.
> Another common way to get this is to have two procedures, one for
> inserting a new employee to the Dental Plan and one for inserting a new
> employee to the Bowling Team. Both procedures create a row in Personnel
> since they use only IDENTITY as a key.
> Well, now we have an orphan row in Personnel. In fact, in such systems,
> you will find a lot of orphans. I worked for a company that used GUIDs
> for OIDs substitutes and our software fell apart in about a year of
> actual use by a client.
> Another cute way to destroy data integrity:
> WHERE id = <<some identity value>>;
> VALUES (<<recreate deleted row>>)
> END;
> Logically this should do nothing, but since IDENTITY has gaps, it
> trashes the data. When one query uses the IDENTITY and another uses the
> real key, you are like a man with two watches, you are never sure what
> time it is.
> Finally, an appeal to authority, with a quote from Dr. Codd:
> "..Database users may cause the system to generate or delete a
> surrogate, but they have no control over its value, nor is its value
> ever displayed to them ..."(Dr. Codd in ACM TODS, pp 409-410) and Codd,
> E. (1979), Extending the database relational model to capture more
> meaning. ACM Transactions on Database Systems, 4(4). pp. 397-434.
> This means that a surrogate ought to act like an index; created by the
> user, managed by the system and NEVER seen by a user. That means never
> used in queries, DRI or anything else that a user does.
> Codd also wrote the following:
> "There are three difficulties in employing user-controlled keys as
> permanent surrogates for entities.
> (1) The actual values of user-controlled keys are determined by users
> and must therefore be subject to change by them (e.g. if two companies
> merge, the two employee databases might be combined with the result
> that some or all of the serial numbers might be changed.).
> (2) Two relations may have user-controlled keys defined on distinct
> domains (e.g. one uses social security, while the other uses employee
> serial numbers) and yet the entities denoted are the same.
> (3) It may be necessary to carry information about an entity either
> before it has been assigned a user-controlled key value or after it has
> ceased to have one (e.g. and applicant for a job and a retiree).
> These difficulties have the important consequence that an equi-join on
> common key values may not yield the same result as a join on common
> entities. A solution - proposed in part [4] and more fully in [14] - is
> to introduce entity domains which contain system-assigned surrogates.
> Database users may cause the system to generate or delete a surrogate,
> but they have no control over its value, nor is its value ever
> displayed to them....." (Codd in ACM TODS, pp 409-410).
> References
> Codd, E. (1979), Extending the database relational model to capture
> more meaning. ACM Transactions on Database Systems, 4(4). pp. 397-434
> See this article for problems in replication with IDENTITY columns.
> --CELKO--

I am strong believer in separating the data model from any and all client programming model, OOP be dammed (thanks Mr. Fabian Pascal, your concise book on RDMS should be required reading for all computer "professionals".). An analogy is collecting real world data for population studies and then using a variety of statistical models to analyze it. Received on Fri Aug 19 2005 - 17:44:58 CEST

Original text of this message