Re: Theory ques: To autogenerate PK or not???????

From: Joe Celko <71062.1056_at_compuserve.com>
Date: Mon, 16 Oct 2000 17:01:17 GMT
Message-ID: <8sfc8n$ncn$1_at_nnrp1.deja.com>


>> any help or religious beliefs would be most appreciated <<

Hre is my standard rant against the IDENTITY columns in SQL Server:

In the early days of programming languages, we exposed a lot of the physical and hardware implementation to the programmer. For example, COBOL and FORTRAN depend on physically contiguous storage of data.

Later, we designed languages to independent of any physical and hardware implementation. Thus, an INTEGER datatype did not have to be eight binary bits in a two complement format. It just had to behave like an integer in the program and the programmer did not have to worry about how the hardware did its work -- or even know what the hardware.

SQL and other modern programming languages carry this idea further and try to completely separate logical and physical implementations. The idea is that SQL is based on sets (which have no ordering) and a standard that defines its logical behavior. The behavior has nothing to do with whether a product uses B-Tree indexes, bit vectors or hashing; two complement or base ten arithmetic; whether the host program calling the SQL is C or Cobol; etc.

The IDENTITY column in SQL Server is an attempt to return to those thrilling days of yeasteryear Kba left over from the days when people did not know much about relational databases. But we knew sequential file systems -- punch cards and magnetic tapes. Even the disk file systems mimicked these systems, adding only simple indexes.

Sequence was a basic way of looking at data. People thought in terms of them at a primitive level. A sequence of bits make a byte, a sequence of bytes make a field, a sequence of fields make a record and sequence of records make a file.

Then along comes the relational model. It is based on sets; a set is a completed whole, without any ordering to it. No sequences! Programmers did not know how to cope, so the vendors exposed the physical implementation and called these things "features" and locked their products to particular architectures. I can go into details on that problem, but let me say that when we went to the bar after ANSI X3H2 meetings, the vendors griped about what they had to do to these extensions to preserve them in the next platform, how they could not scale up to data warehouse size databases, etc.

The IDENTITY column is one of these mistakes.

  1. It is not part of the SQL-92 Standard and it is highly proprietary to the Sybase family. It is not portable -- not quite the same thing as proprietary, since you can often translate one SQL dialect into another with a simple replacement (i.e. the % operator becomes the MOD () function). So your code will not move over to a new database.
  2. IDENTITY looks like a datatype, but it is not. Create a table with one column in it and make it an IDENTITY column. Insert a number into the table and see what happens. Try to set it to NULL. If you cannot insert, update and delete all the columns, then this is not a table!
  3. IDENTITY looks like a constraint, but it is not. Try to create a table with two IDENTITY columns and it fails. If you cannot add it to a column, then it is not a constraint. It is possible to write a a set of constraints that prohibit data from ever being put in the table (their predicate is always FALSE). It is possible to write a a set of constraints that allow anything in the table (their predicate is always TRUE). But no constraint can prohibit the creation of the table itself -- that is a meta-constraint.
  4. It is not relational. Consider this statement on a table, Foo, which has an identity column. Assume the query returns more than one row.

 INSERT INTO Foo (x)
 SELECT a FROM Bar;

You will get a result like this:

 IDENTITY X


   1      'a'
   2      'b'
   3      'c'

but if the query changed an index or was put on the physical disk data page differently, you might have gotten:

 IDENTITY X


   1      'b'
   2      'c'
   3      'a'

Explain why one result is the logically correct choice for an identifier and all other choices are not, without any reference to the physical implementation. You cannot.

Instead of treating the query as a set, you are doing 1950's sequential processing using the underlying sequential file system the Sybase family started with.

5) If you have designed your tables correctly, they will have a meaningful primary key derived from the nature of the entity they model. The IDENTITY column should be a redundant key. The reason IDENTITY columns are popular as keys is that they are easy to declare. This is also the same reason that people build non-normalized databases and put pennies in fuse boxes -- easy is not right.

6) It is a bitch to do calculations on IDENTITY column values. Well, it was hard to do direct math on the sequential position of a record in a 1950's punch card system and that it what the IDENTITY is mimicking.

7) There is no check digit in an IDENTITY columns value, so you have no way of verifying it if you use it as a key.

8) If you use IDENTITY as a key, the values tend to cluster on physical data pages because they are sequential. The result is that if the most recent rows are the most likely to be accessed, there will be locking contention for control of those physical data pages. What you really wanted in a key is some spread of the rows over physical storage to avoid having every user trying to get to the same page at the same time.

9) The actual implementation of the IDENTITY column has been problematic since Version 7.0. You can look up threads in the news groups to get assorted tales of woe.

There are other ways of getting a unique identifier for a table. The most portable method for getting a new identifier number which is not in the set is something like this:

 INSERT INTO Foobar (keycol, a, b, c...)  VALUES(COALESCE((SELECT MAX(keycol) FROM Foobar) +1, 0),

         aa, bb, cc, ...);

The scalar subquery expression returns the current high value for the key column, and then increments it. If there is no maximum value (i.e. this is the first row to be inserted), then it returns zero.

Using this basic idea, you can replace the increment with a different constant or a random number generator. You can also add code to create a check digit.

Another method is to hash the columns that make up a compound key so that you have single short column that can be reconstructed if you need to verify it.

--CELKO--
Joe Celko, SQL Guru & DBA at Trilogy
When posting, inclusion of SQL (CREATE TABLE ..., INSERT ..., etc) which can be cut and pasted into Query Analyzer is appreciated.

Sent via Deja.com http://www.deja.com/
Before you buy. Received on Mon Oct 16 2000 - 19:01:17 CEST

Original text of this message