Re: Identity key?

From: Larry Coon <larry_at_assist.org>
Date: Tue, 09 Jan 2001 10:45:05 -0800
Message-ID: <3A5B5C31.78AF_at_assist.org>


Joe Celko wrote:

[snipped for brevity except:]

> Real
> SQL programmers use real keys and do not try to imitate 1950's magnetic
> tape or punch card systems in a relational model. In the early days of
> programming languages, we exposed a lot of the physical and hardware
> implementation to the programmer. For example, COBOL and FORTRAN
> depend on physically contiguous storage of data, which made a ROWID
> (i.e. physical location reference) possible.
 [. . .]
> SQL and other modern programming languages carry this idea further and
> try to completely separate logical and physical implementations. The
> idea is that SQL is based on sets (which have no ordering)
 [. . .]
> Sequence was a basic way of looking at data.
 [. . .]
> Programmers did not know how to cope, so the vendors exposed
> the physical implementation and called these things "features" and
> locked their products to particular architectures.

[Disclaimer: I agree with much of what you say about identity columns. However, I disagree with some of what you wrote, and my arguing of those points should not be taken to mean I disagree with your entire point.]

You are implying that identity columns are evil because there is a connection between identity values and physical storage and that ordinality is an essential property. I use identity columns (where I feel they are justified) and never rely on either of these things. The only thing I care about is that they are UNIQUE. Yes, identities are numeric, and numbers are ordinal, and that gives you sequence. But that only makes identities bad if you rely on their ordinality.

> 2) IDENTITY looks like a datatype, but it is not. Create a table with
> one column in it and make it an IDENTITY column. Insert a number into
> the table and see what happens. Try to set it to NULL. If you cannot
> insert, update and delete all the columns, then this is not a table!

Can't set ANY primary key to null, so that's beside the point. (And in Sybase you CAN insert & modify identity values by turning on the "identity insert" option.)

> 3) IDENTITY looks like a constraint, but it is not. Try to create a
> table with two IDENTITY columns and it fails.

PRIMARY KEY looks like a constraint, but it is not. Try to create a table with two PRIMARY KEY columns and it fails.

> 4) It is not relational. Consider this statement on a table, Foo,
> which has an identity column. Assume the query returns more than one
> row.
>
> INSERT INTO Foo (x)
> SELECT a FROM Bar;
>
> You will get a result like this:
>
> IDENTITY X
> ============
> 1 'a'
> 2 'b'
> 3 'c'
>
> but if the query changed an index or was put on the physical disk data
> page differently, you might have gotten:
>
> IDENTITY X
> ============
> 1 'b'
> 2 'c'
> 3 'a'
>
> Explain why one result is the logically correct choice for an
> identifier and all other choices are not, without any reference to the
> physical implementation. You cannot.

But it's the same problem even with your "scalar subquery" alternative, isn't it? Either method depends on the order in which the rows from Bar happen to be returned. But since Bar is an unordered set, they could have come out in any order, so necessarily the values assigned (by either an identity or your alternative) are essentially random.

If you care about the difference between your two result sets above, aren't you just relying on identity values for more than they're supposed to be providing? Again, it doesn't matter WHAT the values are, only that they're UNIQUE.

> 5) If you have designed your tables correctly, they will have a
> meaningful primary key derived from the nature of the entity they
> model. The IDENTITY column should be a redundant key.

I'll spare you the gory details, but there are entities in our problem domain for which this is not true. At least, it takes the entire tuple with a lot of columns to achieve identity. And in my case, it's not a matter of understanding the problem domain well enough, as another poster suggested.

> 6) It is a bitch to do calculations on IDENTITY column values. Well,
> it was hard to do direct math on the sequential position of a record in
> a 1950's punch card system and that it what the IDENTITY is mimicking.

Why would you want to do calculations on identity column values? That's like saying statistics are evil simply because they can be misused.

> 7) There is no check digit in an IDENTITY columns value, so you have no
> way of verifying it if you use it as a key.

What would you want to verify? If its only purpose is to provide a unique value, that can be verified simply enough.

> 8) If you use IDENTITY as a key, the values tend to cluster on physical
> data pages because they are sequential. The result is that if the most
> recent rows are the most likely to be accessed, there will be locking
> contention for control of those physical data pages. What you really
> wanted in a key is some spread of the rows over physical storage to
> avoid having every user trying to get to the same page at the same
> time.

Agreed. This is a big problem with identity values.

> 9) The actual implementation of the IDENTITY column has been
> problematic since Version 7.0. You can look up threads in the news
> groups to get assorted tales of woe.

Doesn't make the concept wrong.

[Re: mutability of data in primary key columns]
> Then you change the table to reflect the change in the external reality
> being modeled. Otherwise, your mail will not be delivered. Keys can
> change and the set of keys can grow, like any other attribute; the most
> important property of a key is that they uniquely identify an entity in
> the database. If the committee goes crazy and decides that we will
> have two states with the same two-letter postal code or two states
> named 'Texas', then we would be in trouble.

This was one of the factors that went into our decision of when to use identity columns. By separating business entities from database mechanisms and relying on identities as primary keys, it's much easier (I know, that doesn't make it right) to change these values.

All that being said, we only went with identities where we felt it was justified. Either a candidate key didn't exist without taking the entire tuple; it was really inconvenient (composite of three varchar(255)'s, for example); or the values changed a lot.

And just so I don't disagree with everything you said, some additional identity disadvantages I didn't see you mention:

  • Assuming you throw an index on the "natural" key anyway (because that's how you normally look up information), an identity column means there is one more index the system has to incur the overhead of maintaining.
  • With natural primary keys there are also natural foreign keys. If information you need from the parent table is is part of the parent's primary key, it'll also be in the child table's foreign key. This means a join of the child table and parent table can be avoided. With surrogate keys, you always have to do a join with the parent table to get the information you need.

Larry Coon
University of California
larry_at_assist.org
and lmcoon_at_home.com Received on Tue Jan 09 2001 - 19:45:05 CET

Original text of this message