Re: dbdebunk 'Quote of Week' comment

From: vldm10 <vldm10_at_yahoo.com>
Date: 24 Aug 2005 12:30:07 -0700
Message-ID: <1124911807.671211.306200_at_z14g2000cwz.googlegroups.com>


Hugo Kornelis wrote:
> On 18 Aug 2005 15:06:06 -0700, -CELKO- wrote:
>
> (snip)
> >IDENTITY is an exposed PHYSICAL locator. What does that phrase mean?
> >The value is created by looking at the internal state of the hardware
> >at the time a PHYSICAL record containing a row is inserted into
> >storage. Its purpose is to locate the row without any regard to what
> >the data means.
>
> Hi Joe,
>
> I can't comment on IDENTITY in Sybase, but in MS SQL Server, IDENTITY is
> not "an exposed PHYSICAL locator" at all. Instead, it is a reliable and
> scalable mechanism for the generation of surrogate key values. The value
> created is not based on the internal state of the hardware, but at the
> number of preceding inserted rows, including the failed INSERTS that
> were rolled back. If you would say that anything that ignores ROLLBACK
> has no place in a relational database, you'd have a point. But your
> statement that IDENTITY exposes anything physical is just silly.
>
> >Think about using a pointer or a track/sector number; same thing but
> >different mechanism.
>
> No, completely different. A pointer to a track/sector number has to
> change when the location of the data is changed; the identity value
> never changes.
>
> >IDENTITY is not an attribute in the data model and can never be an
> >attribute in the data model because it does not exist in the reality
> >from which you derive your data model. The purpose is to fake a
> >sequential file's positional record number, so I can reference the
> >PHYSICAL storage location.
>
> The first statement is correct. The second statement is rubbish. There
> is absolutely no relationship between identity values and positional
> record numbers.
>
> Run this in MS SQL Server 2000:
>
> CREATE TABLE Foobar
> (ID INT NOT NULL IDENTITY,
> Name CHAR (10) NOT NULL,
> PRIMARY KEY (Name)
> )
> INSERT INTO Foobar(Name)
> VALUES ('Joe')
> INSERT INTO Foobar(Name)
> VALUES ('Celko')
>
> If you use the undocumented DBCC PAGE command to inspect how the data is
> laid out on the disk, you'll find that the row with ID = 2 comes first,
> and the row with ID = 1 comes second.
>
> >More and more programmers who have absolutely no database training are
> >being told to design a database.
> (snip)
> >Most of them simply start programming SQL as if it were their native
> >programming language. They grab at GUIDs, IDENTITY, ROWID and other
> >proprietary auto-numbering "features" in SQL products to imitate either
> >a record number (sequential file system mindset) or OID (OO mindset)
> >since they don't know anything else.
>
> Yes, that's what newbies often do. They fail to see when to use and
> especially when not to use an IDENTITY. They think that IDENTITY can be
> used as the only key in a table. Of course, experienced database
> designers like you know better - you know that the choice to use a
> surrogate key requires you to include the real key as well, and to
> include a constraint that will ensure uniqueness of the real key.

To avoid ambiguity we should emphasize that two types of surrogate were discussed in this thread.
Surrogates in the Codd sense - system generated and maintained entity identifiers, and surrogates that are user defined and user controlled, which were in the RM before 1979.

> BTW, newbies also abuse CHARACTER columns to store formatted dates. Does
> this mean that the CHARACTER datatype is inherently flawed? Or does this
> only mean that almost any feature of SQL can be abused by stupid people?
>
> >Experienced database designers look for industry standard codes for
> >their keys first.
>
> Indeed, they do. And if performance is important, or if there is a fair
> chance that the industry standard code might change one day (longer UPC
> code, ZIP+, legal issues requiring you to stop using SSN, etc) and they
> don't want to accept the downtime to propagate such a time to all their
> billion-row tables, they will add a suitable surrogate key.
>
> >Try to tell the IRS you don't have anybody's SSN at
> >tax time but Cindy Lou Who was employee 42 when you put her into the
> >Personnel table and you have the IDENTITY value to prove it.
> (more of these examples snipped)
>
> The essence of a surrogate key is that acts as a surrogate for the real
> key. The real key is still in the entities' own table, but the surrogate
> is used to refer to the row in foreign key relationships.
>
> Cindy Lou Who's SSN will be in the Personnel tabel. But in the
> Gratifications table, you'll only find the number 42.
>
> >But let's look at the logical problems. First try to create a table
> >with two columns and try to make them both IDENTITY. If you cannot
> >declare more than one column to be of a certain data type, then that
> >thing is not a data type at all, by definition. It is a property which
> >belongs to the PHYSICAL table, not the LOGICAL data in the table.
>
> Correct. IDENTITY is not a data type. It's a property. Look at the table
> definition above - the data type is INTEGER.
> The identity property is a generator for surrogate keys - no point in
> having two surrogate keys in the same table.
>
> >Next, create a table with one column and make it an IDENTITY. Now try
> >to insert, update and delete different numbers from it. If you cannot
> >insert, update and delete rows from a table, then it is not a table by
> >definition.
>
> If you have only a surrogate key column but no other column, then what
> is the surrogate key a surrogate for? Indeed - that is not a table.
>
> >Finally create a simple table with one IDENTITY and a few other
> >columns. Use a few statements like
> >
> >INSERT INTO Foobar (a, b, c) VALUES ('a1', 'b1', 'c1');
> >INSERT INTO Foobar (a, b, c) VALUES ('a2', 'b2', 'c2');
> >INSERT INTO Foobar (a, b, c) VALUES ('a3', 'b3', 'c3');
> >
> >To put a few rows into the table and notice that the IDENTITY
> >sequentially numbered them in the PHYSICAL order they were presented.
> >If you delete a row, the gap in the sequence is not filled in and the
> >sequence continues from the highest number that has ever been used in
> >that column in that particular table. This is how we did record numbers
> >in pre-allocated sequential files in the 1950's, by the way.
>
> Incidentally, yes. But note that it also behaves exactly the way Dr.Codd
> said surrogate keys should behave: generated by the system, and users
> have no control over it's value.
>
> > A utility
> >program would then "pack" or "compress" the records that were flagged
> >as deleted or unused to move the empty space to the PHYSICAL end of the
> >PHYSICAL file. IDENTITY leaves the gaps unless you write your own
> >routine to do the compression.
>
> You can't, since SQL Server doesn't allow you to change the values in an
> IDENTITY column. Remember: "Database users may cause the system to
> generate or delete a surrogate, but they have no control over its value"
>
> >But now use a statement with a query expression in it, like this:
> >
> >INSERT INTO Foobar (a, b, c)
> >SELECT x, y, z
> >FROM Floob;
> >
> >Since a query result is a table, and a table is a set which has no
> >ordering, what should the IDENTITY numbers be? The entire, whole,
> >completed set is presented to Foobar all at once, not a row at a time.
> >There are (n!) ways to number (n) rows, so which one did you pick? Why?
>
> Who cares? As long as they're unique, they serve their purpose.
>
> >The answer has been to use whatever the *PHYSICAL* order of the result
> >set happened to be. That non-relational phrase "PHYSICAL order" again!
>
> So if MS decides for it's next release to use a randomizer to reshuffle
> the identity values assigned before storing them, you wouldn't object?
>
> >But it is actually worse than that. If the same query is executed
> >again, but with new statistics or after an index has been dropped or
> >added, the new execution plan could bring the result set back in a
> >different PHYSICAL order. Indexes and statistics are not part of the
> >logical model.
>
> So? Assuming you did not forget to declare the real key as either
> PRIMARY KEY or UNIQUE, executing the same query again would simply
> result in a constraint violation, and no rows inserted into the table.
>
> >Can you explain from a logical model why the same rows in the second
> >query get different IDENTITY numbers? In the relational model, they
> >should be treated the same if all the values of all the attributes are
> >identical.
>
> Yes, as a matter of fact I can. To quote Dr.Codd: "Database users (...)
> have no control over its value".
>
> >How do you verify that an entity has the right key when you use an
> >IDENTITY?
>
> By using the real key. As I said before: experienced database designers
> will never forget to include the real key along with the surrogate key;
> that's a mistake only newbies make.
>
> >If I lean on a mouse button, I can insert the same data with a new
> >IDENTITY over and over.
>
> So you DID forget to include the real key in the table? Or to include
> the appropriate constraint?
>
> Oh dear.
>
> > Cindy Lou Who is now employed two times and
> >none of my reports are right! Now I have to write some procedural code
> >like a trigger or a UNIQUE constraint on her SSN to prevent this, thus
> >making the IDENTITY redundant.
>
> Eh? Since when is a UNIQUE constraint procedural?
>
> And the IDENTITY is not made redundant if you declare the SSN to be
> UNIQUE (like you should have done in the first palce) - it's a surrogate
> key; it serves a different purpose. The SSN is there to relate the row
> in the Personnel table to the employee in front of your desk. The
> IDENTITY value is there to relate rows in other tables to the row in the
> Personnel table.
>
> > But we were assuming that we use only
> >IDENTITY as a key, so we are screwed.
>
> Not "we" were - you were assuming that. You are collecting examples of
> bad code involving IDENTITY and using that to say that IDENTITY is bad.
> But you ignore examples of code where the IDENTITY property is used
> without any problems.
>
> It's like saying that cars should be abolished because stupid people
> drive drunk and cause accidents.
>
> >Another cute way to destroy data integrity:
> >
> >BEGIN ATOMIC
> >DELETE FROM Foobar
> >WHERE id = <<some identity value>>;
> >INSERT INTO Foobar
> >VALUES (<<recreate deleted row>>)
> >END;
> >
> >Logically this should do nothing, but since IDENTITY has gaps, it
> >trashes the data.
>
> Eh? I see two possibilities:
>
> 1. The row you try to delete is still refered to in another table. In
> that case, the DELETE will fail bacuse of a FOREIGN KEY constraint
> violation.
> 2. The row you try to delete is not refered to in any other table. In
> that case, it is deleted and a new row with the same values in inserted.
> The IDENTITY value will indeed be different, but that's not important
> since it's just a surrogate. For the user (who doesn't get to see the
> IDENTITY value), the same data has reappeared, and there are still no
> other rows referencing it.
>
> In short - the effect is no different from executing the same commands
> against a table without surrogate key.
>
> >Finally, an appeal to authority, with a quote from Dr. Codd:
> >"..Database users may cause the system to generate or delete a
> >surrogate, but they have no control over its value, nor is its value
> >ever displayed to them ..."(Dr. Codd in ACM TODS, pp 409-410) and Codd,
> >E. (1979), Extending the database relational model to capture more
> >meaning. ACM Transactions on Database Systems, 4(4). pp. 397-434.
> >
> >This means that a surrogate ought to act like an index; created by the
> >user, managed by the system and NEVER seen by a user.
>
> Indeed.
>
> > That means never
> >used in queries, DRI or anything else that a user does.
>
> But here, you miss the mark. If I write a query to join two tables on a
> surrogate key column, do I see any values? Heck, no!
> If I write a FOREIGN KEY constraint, do I see any value? Heck, no!
>
> >See this article for problems in replication with IDENTITY columns.
> >http://www.simple-talk.com/2005/07/05/replication/
>
> Did you actually read the article?
>
> The article mentions three problems.
> For the first two, a repro script is given - but if you scroll down,
> you'll see that there are also easy solutions supplied for these two
> problems.
> For the third problem, no solution is given. But there's allso no repro
> script. Which is not surprising, when you read:
> "The third problem is typically caused when all of the columns
> in a row are updated by the application, including the identity
> column."
> Have you ever tried to update an identity column in SQL Server? Run the
> CREATE TABLE and INSERT statements above, then this:
>
> UPDATE Foobar
> SET ID = 8,
> Name = 'John'
> WHERE ID = 1
> AND Name = 'Joe'
>
> The result is
>
> Server: Msg 8102, Level 16, State 1, Line 1
> Cannot update identity column 'ID'.
>
> Best, Hugo
> --
>
> (Remove _NO_ and _SPAM_ to get my e-mail address)

Vladimir Odrljin Received on Wed Aug 24 2005 - 21:30:07 CEST

Original text of this message