Re: dbdebunk 'Quote of Week' comment
Date: Fri, 19 Aug 2005 00:58:40 -0000
Message-ID: <slrndgabmk.1pe.finarfinjge_at_eggertj.foo.bar>
On 2005-08-18, -CELKO- <jcelko212_at_earthlink.net> wrote:
> The fact that this question is being considered is a sign that we have
> a lot of really bad SQL programmers. By definition a relational key is
> a subset of attributes of an entity or relationship being modeled in
> the schema. This is absolute foundations of the Relational Model kind
> of stuff.
>
> IDENTITY is an exposed PHYSICAL locator. What does that phrase mean?
> The value is created by looking at the internal state of the hardware
> at the time a PHYSICAL record containing a row is inserted into
> storage. Its purpose is to locate the row without any regard to what
> the data means.
>
> Think about using a pointer or a track/sector number; same thing but
> different mechanism. But SQL does not have a pointer data type or the
> mechanisms to handle pointer operators, garbage collection and
> housekeeping, so 20+ years ago the original Sybase SQL Server exposed
> an integer that can map back to the contiguous storage model used under
> the covers.
>
> IDENTITY is not an attribute in the data model and can never be an
> attribute in the data model because it does not exist in the reality
> from which you derive your data model. The purpose is to fake a
> sequential file's positional record number, so I can reference the
> PHYSICAL storage location. Sure, I lose all the advantages of an
> abstract data model, SQL set oriented programming, carry extra data,
> destroy the portability of code and have no data integrity. Who cares?
> It is easier to write than a real normalized RDBMS schema.
>
> More and more programmers who have absolutely no database training are
> being told to design a database. It takes six years to become a
> Journeyman Union Carpenter in New York State. But a programmer with
> only a few years in a procedural or OO language is expected to produce
> a correct and usable database.
>
> What is he going to do? The smart ones will get some help and beg for
> training, knowing they are over their heads and can ruin the company.
> Most of them simply start programming SQL as if it were their native
> programming language. They grab at GUIDs, IDENTITY, ROWID and other
> proprietary auto-numbering "features" in SQL products to imitate either
> a record number (sequential file system mindset) or OID (OO mindset)
> since they don't know anything else.
>
> They write code with cursors to mimic record-at-a-time file handling --
> it is easy do a global replace of READ() with FETCH. They write to temp
> tables to mimic scratch files in a series of procedural steps - it is
> hard to think of a single declarative statement to get the answer. They
> use dynamic SQL and let the user figure out how the system should work
> on the fly - they never had a software engineering course and don't
> know what coupling and cohesion are.
>
> Experienced database designers look for industry standard codes for
> their keys first. Try to tell the IRS you don't have anybody's SSN at
> tax time but Cindy Lou Who was employee 42 when you put her into the
> Personnel table and you have the IDENTITY value to prove it. Try to
> sell a car without a VIN - using IDENTITY for this would be like
> identifying a vehicle by the current local parking space number (think
> about it - that is exactly what an IDENTITY value is -- a parking space
> in one physical location).
>
> We know that we have to use industry standard codes, either de facto or
> de jure. We also appreciate the fact that we can exchange data with the
> rest of the world. We appreciate the fact that someone else will
> maintain and define these codes. A trusted external source is a good
> thing to have.
>
> If you don't have an industry standard or natural key and have to
> design your own codes, it is hard work to do it right. I know that
> newbie programmers want to start coding first and thinking later. It
> does not work that way and all the "wish magic" in the world will
> not change that fact. I have several chapters in my books on how to
> design encoding schemes, but that is another topic.
>
> The first practical consideration is that IDENTITY is proprietary and
> non-portable, so you know that you will have maintenance problems when
> you change releases or port your system to other SQL products. Newbies
> actually think they will never port code! Perhaps they only work for
> companies that are failing and will be gone. Perhaps their code is such
> crap nobody else want their application. Otherwise, you will port code;
> you will share data with some other database; data does not exist in
> isolation.
>
> But let's look at the logical problems. First try to create a table
> with two columns and try to make them both IDENTITY. If you cannot
> declare more than one column to be of a certain data type, then that
> thing is not a data type at all, by definition. It is a property which
> belongs to the PHYSICAL table, not the LOGICAL data in the table.
>
> Next, create a table with one column and make it an IDENTITY. Now try
> to insert, update and delete different numbers from it. If you cannot
> insert, update and delete rows from a table, then it is not a table by
> definition.
>
> Finally create a simple table with one IDENTITY and a few other
> columns. Use a few statements like
>
> INSERT INTO Foobar (a, b, c) VALUES ('a1', 'b1', 'c1');
> INSERT INTO Foobar (a, b, c) VALUES ('a2', 'b2', 'c2');
> INSERT INTO Foobar (a, b, c) VALUES ('a3', 'b3', 'c3');
>
> To put a few rows into the table and notice that the IDENTITY
> sequentially numbered them in the PHYSICAL order they were presented.
> If you delete a row, the gap in the sequence is not filled in and the
> sequence continues from the highest number that has ever been used in
> that column in that particular table. This is how we did record numbers
> in pre-allocated sequential files in the 1950's, by the way. A utility
> program would then "pack" or "compress" the records that were flagged
> as deleted or unused to move the empty space to the PHYSICAL end of the
> PHYSICAL file. IDENTITY leaves the gaps unless you write your own
> routine to do the compression.
>
> But now use a statement with a query expression in it, like this:
>
> INSERT INTO Foobar (a, b, c)
> SELECT x, y, z
> FROM Floob;
>
> Since a query result is a table, and a table is a set which has no
> ordering, what should the IDENTITY numbers be? The entire, whole,
> completed set is presented to Foobar all at once, not a row at a time.
> There are (n!) ways to number (n) rows, so which one did you pick? Why?
> The answer has been to use whatever the *PHYSICAL* order of the result
> set happened to be. That non-relational phrase "PHYSICAL order" again!
>
> But it is actually worse than that. If the same query is executed
> again, but with new statistics or after an index has been dropped or
> added, the new execution plan could bring the result set back in a
> different PHYSICAL order. Indexes and statistics are not part of the
> logical model.
>
> Can you explain from a logical model why the same rows in the second
> query get different IDENTITY numbers? In the relational model, they
> should be treated the same if all the values of all the attributes are
> identical.
>
> How do you verify that an entity has the right key when you use an
> IDENTITY? If I use a VIN for a vehicle, I can go to parking lot and
> read it off the dashboard. If I use a UPC for a candy bar, I can read
> the bar code on the wrapper. But why is little Cindy Lou Who employee
> 42? If I turn her upside, will I find that she has 42 tattooed
> somewhere? If I call an external trusted source, will they know that
> she is employee 42?
>
> In the Relational Model, you do not invent a key in the storage. You
> discover a key (and the other attributes) in the real world and model
> it. If you create your own encoding for a key, then you have to
> maintain it, provide audit trails and do all the work that an industry
> standard organization would do for you.
>
> If I lean on a mouse button, I can insert the same data with a new
> IDENTITY over and over. Cindy Lou Who is now employed two times and
> none of my reports are right! Now I have to write some procedural code
> like a trigger or a UNIQUE constraint on her SSN to prevent this, thus
> making the IDENTITY redundant. But we were assuming that we use only
> IDENTITY as a key, so we are screwed.
>
> Newbies often design tables without bothering to look for a relational
> key, so they are so surprised when they do a data warehouse and nobody
> else has any idea what they are doing in their subsystem.
>
> Cindy Lou Who now has two rows in Personnel. When we sign her up for
> the Dental Plan, we get the row with 42. When we sign her up for the
> Bowling Team, we get the row with 43. We find our error, and delete the
> row with 42 because we have a row that was created later and we assume
> it is more current.
>
> Another common way to get this is to have two procedures, one for
> inserting a new employee to the Dental Plan and one for inserting a new
> employee to the Bowling Team. Both procedures create a row in Personnel
> since they use only IDENTITY as a key.
>
> Well, now we have an orphan row in Personnel. In fact, in such systems,
> you will find a lot of orphans. I worked for a company that used GUIDs
> for OIDs substitutes and our software fell apart in about a year of
> actual use by a client.
>
> Another cute way to destroy data integrity:
>
> BEGIN ATOMIC
> DELETE FROM Foobar
> WHERE id = <<some identity value>>;
> INSERT INTO Foobar
> VALUES (<<recreate deleted row>>)
> END;
>
> Logically this should do nothing, but since IDENTITY has gaps, it
> trashes the data. When one query uses the IDENTITY and another uses the
> real key, you are like a man with two watches, you are never sure what
> time it is.
>
> Finally, an appeal to authority, with a quote from Dr. Codd:
> "..Database users may cause the system to generate or delete a
> surrogate, but they have no control over its value, nor is its value
> ever displayed to them ..."(Dr. Codd in ACM TODS, pp 409-410) and Codd,
> E. (1979), Extending the database relational model to capture more
> meaning. ACM Transactions on Database Systems, 4(4). pp. 397-434.
>
> This means that a surrogate ought to act like an index; created by the
> user, managed by the system and NEVER seen by a user. That means never
> used in queries, DRI or anything else that a user does.
>
> Codd also wrote the following:
>
> "There are three difficulties in employing user-controlled keys as
> permanent surrogates for entities.
>
> (1) The actual values of user-controlled keys are determined by users
> and must therefore be subject to change by them (e.g. if two companies
> merge, the two employee databases might be combined with the result
> that some or all of the serial numbers might be changed.).
>
> (2) Two relations may have user-controlled keys defined on distinct
> domains (e.g. one uses social security, while the other uses employee
> serial numbers) and yet the entities denoted are the same.
>
> (3) It may be necessary to carry information about an entity either
> before it has been assigned a user-controlled key value or after it has
> ceased to have one (e.g. and applicant for a job and a retiree).
>
> These difficulties have the important consequence that an equi-join on
> common key values may not yield the same result as a join on common
> entities. A solution - proposed in part [4] and more fully in [14] - is
> to introduce entity domains which contain system-assigned surrogates.
> Database users may cause the system to generate or delete a surrogate,
> but they have no control over its value, nor is its value ever
> displayed to them....." (Codd in ACM TODS, pp 409-410).
>
> References
>
> Codd, E. (1979), Extending the database relational model to capture
> more meaning. ACM Transactions on Database Systems, 4(4). pp. 397-434
>
> See this article for problems in replication with IDENTITY columns.
> http://www.simple-talk.com/2005/07/05/replication/
>
> --CELKO--
> jcelko212_at_earthlink.net
>
Hello Joe:
I'd be an idiot to question anything above, but I will make a comment about
SSN's. Here in Canada, a SIN. There are many situations where one needs to
Cheers.
JE Received on Fri Aug 19 2005 - 02:58:40 CEST