Re: identity columns

From: Jan Emil Larsen <jel_at_g-it.dk>
Date: Mon, 7 Jan 2002 08:38:23 +0100
Message-ID: <a1bj9i$l9k$1_at_news.net.uni-c.dk>


"--CELKO--" <71062.1056_at_compuserve.com> skrev i en meddelelse news:c0d87ec0.0201060830.610c4a90_at_posting.google.com...
> >> There seems to be a lot of talk in this NG about the good, bad and
> ugly
> aspects of primary keys - especially Identity columns. <<
>
> First, I would name the table "Personnel" and not "Employee" -- it is
> a set, so you want a plural or collective noun, not a singular
> (scalar) one.

I disagree.
The entity type and the base table derived from it should by named as an abstract representation of its instances, thus a singular noun, such as Employee.
Consult Date or another standard textbook.

> Next, I assume that you have a business rule that says
> everyone gets a company assigned email address when we hire them.
> Yes, I would use it; it has a beautiful property that an artifacial
> key does not -- you can verify it by sending out an email!! You know
> that they guy was fired if it bounces back to you, you can aks him his
> name, or whatever else you need to know.

I disagree. Your argument shows exactly why a key should not carry information: what if the company wants to keep a record of fired employees. A record in a table should be "a fact"; that is not the case if the person no longer holds the email-address.
A key should be imutable.

> Next bst thing to putting
> the person themselves in the database.
How do you "put the person themselves in the database"?

> The IDENTITY column is a hold over from the early programming language
> which were <i>very<i> close to the hardware. For example, the fields
> in a COBOL or FORTRAN program were assumed to be physically located in
> main storage in the order they were declared in the program. This
> meant that you could define a template that overlaid the same physical
> space and read the representation in several different ways. In
> COBOL, the command was REDEFINES, EQUIVALENCE in FORTRAN and a union
> in 'C'.
>
> From a logical viewpoint, this redefinition makes no sense at all. It
> is confusing the numeral with the number the numeral represents. The
> history of programming after this point in time has been to divorce
> the logical and physical models completely.
>
> The early SQLs were based on existing file systems. The data was kept
> in physically contiguous disk pages, in physically contiguous rows,
> made up of physically contiguous columns. In short, just like a deck
> of punch cards or a magnetic tape.
>
> But physically contiguous storage is only one way of building a
> relational database and it is not always the best one. But aside from
> that, the whole idea of a relational database is that user is not
> supposed to know how things are stored at all, much less write code
> that depends on the particular physical representation in a particular
> release of a particular product.
>
> One of the biggest errors is the IDENTITY column in the Sybase family
> (SQL Server and Sybase). People actually program with this "feature"
> and even use it as the primary key for the table! Now, let's go into
> painful details as to why this thing is bad.
>
> The practical considerations are that IDENTITY is proprietary and
> non-portable, so you know that you will have maintenance problems when
> you change releases or port your system to other products. It also
> has some very strange bugs in both Sybase and SQL Server; go to a
> newsgroup and do a search.
>
> But let's look at the logical problems. First try to create a table
> with two columns and try to make them both IDENTITY columns. If you
> cannot declare more than one column to be of a certain datatype, then
> that thing is not a datatype at all, by definition.
>
> Next, create a table with one column and make it an IDENTITY column.
> Now try to insert, update and delete different numbers from it. If
> you cannot insert, update and delete rows from a table, then it is not
> a table by definition.
>
> Finally create a simple table with one IDENTITY column and a few other
> columns. Use a few statements like
>
> INSERT INTO Foobar (a, b, c) VALUES ('a1', 'b1', 'c1');
> INSERT INTO Foobar (a, b, c) VALUES ('a2', 'b2', 'c2');
> INSERT INTO Foobar (a, b, c) VALUES ('a3', 'b3', 'c3');
>
> to put a few rows into the table and notice that the IDENTITY column
> sequentially numbered them in the order they were presented. If you
> delete a row, the gap in the sequence is not filled in and the
> sequence continues from the highest number that has ever been used in
> that column in that particular table.
>
> But now use a statement with a query expression in it, like this:
>
> INSERT INTO Foobar (a, b, c)
> SELECT x, y, z
> FROM Floob;
>
> Since a query result is a table, and a table is a set which has no
> ordering, what should the IDENTITY numbers be? The entire, whole,
> completed set is presented to Foobar all at once, not a row at a time.
> There are (n!) ways to number (n) rows, so which one do you pick?
> The answer has been to use whatever the physical order of the result
> set happened to be. That non-relational phrase "physical order"
> again.
>
> But it is actually worse than that. If the same query is executed
> again, but with new statistics or after an index has been dropped or
> added, the new execution plan could bring the result set back in a
> different physical order. Can you explain from a logical model why
> the same rows in the second query get different IDENTITY numbers? In
> the relational model, they should be treated the same if all the
> values of all the attributes are identical.
>
> Think about trying to do replication on two databases that differ only
> by an index, or by cache size or something that occasionally gives
> them different execution plans for the same statements. Want to try
> to maintain such a system?
>
> There are better ways of creating identifiers.

You are right that IDENTITY in eg. SQL Server has some bad properties as shown.
There are, however, better ways of creating surrogate keys as identifiers. While a natural candidate key in any case should be controlled by a constraint, and could of course be used as primary key, there can be very good reasons to use a surrogate key. This has nothing to do with "old programming languages" or physical dependence. As an example you might think of records of invoices, which may have a natural candidate key combined of foreign keys to Customer and Order. Nevertheless, a running integer is often used as a primary key, eg. to document (for the auditor) that no sale is forgotten . Received on Mon Jan 07 2002 - 08:38:23 CET

Original text of this message