Re: dbdebunk 'Quote of Week' comment

From: Hugo Kornelis <hugo_at_pe_NO_rFact.in_SPAM_fo>
Date: Tue, 23 Aug 2005 00:20:09 +0200
Message-ID: <bpgkg11qccbksdrggs7jeas55vg5noi2iq_at_4ax.com>


On 18 Aug 2005 15:06:06 -0700, -CELKO- wrote:

(snip)
>IDENTITY is an exposed PHYSICAL locator. What does that phrase mean?
>The value is created by looking at the internal state of the hardware
>at the time a PHYSICAL record containing a row is inserted into
>storage. Its purpose is to locate the row without any regard to what
>the data means.

Hi Joe,

I can't comment on IDENTITY in Sybase, but in MS SQL Server, IDENTITY is not "an exposed PHYSICAL locator" at all. Instead, it is a reliable and scalable mechanism for the generation of surrogate key values. The value created is not based on the internal state of the hardware, but at the number of preceding inserted rows, including the failed INSERTS that were rolled back. If you would say that anything that ignores ROLLBACK has no place in a relational database, you'd have a point. But your statement that IDENTITY exposes anything physical is just silly.

>Think about using a pointer or a track/sector number; same thing but
>different mechanism.

No, completely different. A pointer to a track/sector number has to change when the location of the data is changed; the identity value never changes.

>IDENTITY is not an attribute in the data model and can never be an
>attribute in the data model because it does not exist in the reality
>from which you derive your data model. The purpose is to fake a
>sequential file's positional record number, so I can reference the
>PHYSICAL storage location.

The first statement is correct. The second statement is rubbish. There is absolutely no relationship between identity values and positional record numbers.

Run this in MS SQL Server 2000:

CREATE TABLE Foobar

         (ID INT NOT NULL IDENTITY,
          Name CHAR (10) NOT NULL,
          PRIMARY KEY (Name)
         )

INSERT INTO Foobar(Name)
VALUES ('Joe')
INSERT INTO Foobar(Name)
VALUES ('Celko')

If you use the undocumented DBCC PAGE command to inspect how the data is laid out on the disk, you'll find that the row with ID = 2 comes first, and the row with ID = 1 comes second.

>More and more programmers who have absolutely no database training are
>being told to design a database.

(snip)
>Most of them simply start programming SQL as if it were their native
>programming language. They grab at GUIDs, IDENTITY, ROWID and other
>proprietary auto-numbering "features" in SQL products to imitate either
>a record number (sequential file system mindset) or OID (OO mindset)
>since they don't know anything else.

Yes, that's what newbies often do. They fail to see when to use and especially when not to use an IDENTITY. They think that IDENTITY can be used as the only key in a table. Of course, experienced database designers like you know better - you know that the choice to use a surrogate key requires you to include the real key as well, and to include a constraint that will ensure uniqueness of the real key.

BTW, newbies also abuse CHARACTER columns to store formatted dates. Does this mean that the CHARACTER datatype is inherently flawed? Or does this only mean that almost any feature of SQL can be abused by stupid people?

>Experienced database designers look for industry standard codes for
>their keys first.

Indeed, they do. And if performance is important, or if there is a fair chance that the industry standard code might change one day (longer UPC code, ZIP+, legal issues requiring you to stop using SSN, etc) and they don't want to accept the downtime to propagate such a time to all their billion-row tables, they will add a suitable surrogate key.

>Try to tell the IRS you don't have anybody's SSN at
>tax time but Cindy Lou Who was employee 42 when you put her into the
>Personnel table and you have the IDENTITY value to prove it.
(more of these examples snipped)

The essence of a surrogate key is that acts as a surrogate for the real key. The real key is still in the entities' own table, but the surrogate is used to refer to the row in foreign key relationships.

Cindy Lou Who's SSN will be in the Personnel tabel. But in the Gratifications table, you'll only find the number 42.

>But let's look at the logical problems. First try to create a table
>with two columns and try to make them both IDENTITY. If you cannot
>declare more than one column to be of a certain data type, then that
>thing is not a data type at all, by definition. It is a property which
>belongs to the PHYSICAL table, not the LOGICAL data in the table.

Correct. IDENTITY is not a data type. It's a property. Look at the table definition above - the data type is INTEGER. The identity property is a generator for surrogate keys - no point in having two surrogate keys in the same table.

>Next, create a table with one column and make it an IDENTITY. Now try
>to insert, update and delete different numbers from it. If you cannot
>insert, update and delete rows from a table, then it is not a table by
>definition.

If you have only a surrogate key column but no other column, then what is the surrogate key a surrogate for? Indeed - that is not a table.

>Finally create a simple table with one IDENTITY and a few other
>columns. Use a few statements like
>
>INSERT INTO Foobar (a, b, c) VALUES ('a1', 'b1', 'c1');
>INSERT INTO Foobar (a, b, c) VALUES ('a2', 'b2', 'c2');
>INSERT INTO Foobar (a, b, c) VALUES ('a3', 'b3', 'c3');
>
>To put a few rows into the table and notice that the IDENTITY
>sequentially numbered them in the PHYSICAL order they were presented.
>If you delete a row, the gap in the sequence is not filled in and the
>sequence continues from the highest number that has ever been used in
>that column in that particular table. This is how we did record numbers
>in pre-allocated sequential files in the 1950's, by the way.

Incidentally, yes. But note that it also behaves exactly the way Dr.Codd said surrogate keys should behave: generated by the system, and users have no control over it's value.

> A utility
>program would then "pack" or "compress" the records that were flagged
>as deleted or unused to move the empty space to the PHYSICAL end of the
>PHYSICAL file. IDENTITY leaves the gaps unless you write your own
>routine to do the compression.

You can't, since SQL Server doesn't allow you to change the values in an IDENTITY column. Remember: "Database users may cause the system to generate or delete a surrogate, but they have no control over its value"

>But now use a statement with a query expression in it, like this:
>
>INSERT INTO Foobar (a, b, c)
>SELECT x, y, z
>FROM Floob;
>
>Since a query result is a table, and a table is a set which has no
>ordering, what should the IDENTITY numbers be? The entire, whole,
>completed set is presented to Foobar all at once, not a row at a time.
>There are (n!) ways to number (n) rows, so which one did you pick? Why?

Who cares? As long as they're unique, they serve their purpose.

>The answer has been to use whatever the *PHYSICAL* order of the result
>set happened to be. That non-relational phrase "PHYSICAL order" again!

So if MS decides for it's next release to use a randomizer to reshuffle the identity values assigned before storing them, you wouldn't object?

>But it is actually worse than that. If the same query is executed
>again, but with new statistics or after an index has been dropped or
>added, the new execution plan could bring the result set back in a
>different PHYSICAL order. Indexes and statistics are not part of the
>logical model.

So? Assuming you did not forget to declare the real key as either PRIMARY KEY or UNIQUE, executing the same query again would simply result in a constraint violation, and no rows inserted into the table.

>Can you explain from a logical model why the same rows in the second
>query get different IDENTITY numbers? In the relational model, they
>should be treated the same if all the values of all the attributes are
>identical.

Yes, as a matter of fact I can. To quote Dr.Codd: "Database users (...) have no control over its value".

>How do you verify that an entity has the right key when you use an
>IDENTITY?
By using the real key. As I said before: experienced database designers will never forget to include the real key along with the surrogate key; that's a mistake only newbies make.

>If I lean on a mouse button, I can insert the same data with a new
>IDENTITY over and over.

So you DID forget to include the real key in the table? Or to include the appropriate constraint?

Oh dear.

> Cindy Lou Who is now employed two times and
>none of my reports are right! Now I have to write some procedural code
>like a trigger or a UNIQUE constraint on her SSN to prevent this, thus
>making the IDENTITY redundant.

Eh? Since when is a UNIQUE constraint procedural?

And the IDENTITY is not made redundant if you declare the SSN to be UNIQUE (like you should have done in the first palce) - it's a surrogate key; it serves a different purpose. The SSN is there to relate the row in the Personnel table to the employee in front of your desk. The IDENTITY value is there to relate rows in other tables to the row in the Personnel table.

> But we were assuming that we use only
>IDENTITY as a key, so we are screwed.

Not "we" were - you were assuming that. You are collecting examples of bad code involving IDENTITY and using that to say that IDENTITY is bad. But you ignore examples of code where the IDENTITY property is used without any problems.

It's like saying that cars should be abolished because stupid people drive drunk and cause accidents.

>Another cute way to destroy data integrity:
>
>BEGIN ATOMIC
>DELETE FROM Foobar
>WHERE id = <<some identity value>>;
>INSERT INTO Foobar
>VALUES (<<recreate deleted row>>)
>END;
>
>Logically this should do nothing, but since IDENTITY has gaps, it
>trashes the data.

Eh? I see two possibilities:

  1. The row you try to delete is still refered to in another table. In that case, the DELETE will fail bacuse of a FOREIGN KEY constraint violation.
  2. The row you try to delete is not refered to in any other table. In that case, it is deleted and a new row with the same values in inserted. The IDENTITY value will indeed be different, but that's not important since it's just a surrogate. For the user (who doesn't get to see the IDENTITY value), the same data has reappeared, and there are still no other rows referencing it.

In short - the effect is no different from executing the same commands against a table without surrogate key.

>Finally, an appeal to authority, with a quote from Dr. Codd:
>"..Database users may cause the system to generate or delete a
>surrogate, but they have no control over its value, nor is its value
>ever displayed to them ..."(Dr. Codd in ACM TODS, pp 409-410) and Codd,
>E. (1979), Extending the database relational model to capture more
>meaning. ACM Transactions on Database Systems, 4(4). pp. 397-434.
>
>This means that a surrogate ought to act like an index; created by the
>user, managed by the system and NEVER seen by a user.

Indeed.

> That means never
>used in queries, DRI or anything else that a user does.

But here, you miss the mark. If I write a query to join two tables on a surrogate key column, do I see any values? Heck, no! If I write a FOREIGN KEY constraint, do I see any value? Heck, no!

>See this article for problems in replication with IDENTITY columns.
>
http://www.simple-talk.com/2005/07/05/replication/

Did you actually read the article?

The article mentions three problems.
For the first two, a repro script is given - but if you scroll down, you'll see that there are also easy solutions supplied for these two problems.
For the third problem, no solution is given. But there's allso no repro script. Which is not surprising, when you read:

   "The third problem is typically caused when all of the columns     in a row are updated by the application, including the identity     column."
Have you ever tried to update an identity column in SQL Server? Run the CREATE TABLE and INSERT statements above, then this:

UPDATE Foobar
SET ID = 8,
       Name = 'John'
WHERE ID = 1
AND Name = 'Joe'

The result is

Server: Msg 8102, Level 16, State 1, Line 1 Cannot update identity column 'ID'.

Best, Hugo

-- 

(Remove _NO_ and _SPAM_ to get my e-mail address)
Received on Tue Aug 23 2005 - 00:20:09 CEST

Original text of this message