Re: Identity key?

From: DODO <dodo_at_Ihatespam.com>
Date: Tue, 9 Jan 2001 10:44:33 -0500
Message-ID: <BlG66.116723$Z2.1372722_at_nnrp1.uunet.ca>


Joe,

Thank you for the long answer.
I totally agree with the fact that IDENTITY keys are bad(at least the implementation up to System 11 was miserable, once we had to reconstruct the database because of the gaps that Sybase generated in the Identity columnsthe  server went down, the server went up and astronomic numbers "happened" in the columns). I can add that if you use them in a distributed database, you loose their primary meaning of life, unicity. I understand that you don't really disagree with using surrogate keys. Personally, I'll stick with the idea of a surrogate key as primary key and the natural(whichever seems more natural from candidate keys) as alternate. This schema has 2 advantages: it makes wonders for joins and OO mappings and provides an Enterprise wide identifier for every tuple in your databases. And they are final.

DODO PS: Sometime the natural key might not be in your best interest: I remember two guys who put a time stamp as primary key and started inserting data using the server clock as key generator. They were baffeled by the server who complained that some data integrity violation occured. But they were brave and solved the problem: they put wait(100ms) in the code.

"Joe Celko" <71062.1056_at_compuserve.com> wrote in message news:93de1q$cjk$1_at_nnrp1.deja.com...
>
> >> Then in the example
> CREATE TABLE US_States
> (state_id IDENTITY,
> state_code CHAR(2) NOT NULL PRIMARY KEY,
> state_name VARCHAR(15) NOT NULL UNIQUE);
>
> state_id would be an artificial key? <<
>
> The short answer is "yes, and it is redundant"; the real answer is that
> the IDENTITY column makes it a totally non-relational nightmare. Real
> SQL programmers use real keys and do not try to imitate 1950's magnetic
> tape or punch card systems in a relational model. In the early days of
> programming languages, we exposed a lot of the physical and hardware
> implementation to the programmer. For example, COBOL and FORTRAN
> depend on physically contiguous storage of data, which made a ROWID
> (i.e. physical location reference) possible.
>
> Later, we designed languages to independent of any physical and
> hardware implementation. Thus, an INTEGER datatype did not have to be
> eight binary bits in a two complement format. It just had to behave
> like an integer in the program and the programmer did not have to worry
> about how the hardware did its work -- or even know what the hardware
> was.
>
> SQL and other modern programming languages carry this idea further and
> try to completely separate logical and physical implementations. The
> idea is that SQL is based on sets (which have no ordering) and a
> standard that defines its logical behavior. The behavior has nothing
> to do with whether a product uses B-Tree indexes, bit vectors or
> hashing; two complement or base ten arithmetic; whether the host
> program calling the SQL is C or Cobol; etc.
>
> The IDENTITY column in SQL Server is an attempt to return to those
> thrilling days of yeasteryear, and the conceptual junk left over from
> the days when people did not know much about relational databases.
> What we knew was sequential file systems -- punch cards and magnetic
> tapes. Even the disk file systems mimicked these systems, adding only
> simple indexes.
>
> Sequence was a basic way of looking at data. People thought in terms
> of them at a primitive level. A sequence of bits make a byte, a
> sequence of bytes make a field, a sequence of fields make a record and
> sequence of records make a file. Very low level, very close to the
> machinery.
>
> Then along comes the relational model. It is based on sets; a set is a
> completed whole, without any ordering to it. No sequences! Very
> abstract! Programmers did not know how to cope, so the vendors exposed
> the physical implementation and called these things "features" and
> locked their products to particular architectures. I can go into
> details on that problem, but let me say that when we went to the bar
> after ANSI X3H2 meetings, the vendors griped about what they had to do
> to these extensions to preserve them in the next platform, how they
> could not scale up to data warehouse size databases, etc.
>
> The IDENTITY column is one of these mistakes.
>
> 1) It is not part of the SQL-92 Standard and it is highly proprietary
> to the Sybase family. It is not portable -- not quite the same thing
> as proprietary, since you can often translate one SQL dialect into
> another with a simple replacement (i.e. the % operator becomes the MOD
> () function). So your code will not move over to a new database.
>
> 2) IDENTITY looks like a datatype, but it is not. Create a table with
> one column in it and make it an IDENTITY column. Insert a number into
> the table and see what happens. Try to set it to NULL. If you cannot
> insert, update and delete all the columns, then this is not a table!
>
> 3) IDENTITY looks like a constraint, but it is not. Try to create a
> table with two IDENTITY columns and it fails. If you cannot add it to
> a column, then it is not a constraint. It is possible to write a a set
> of constraints that prohibit data from ever being put in the table
> (their predicate is always FALSE). It is possible to write a a set of
> constraints that allow anything in the table (their predicate is always
> TRUE). But no constraint can prohibit the creation of the table
> itself -- that is a meta-constraint.
>
> 4) It is not relational. Consider this statement on a table, Foo,
> which has an identity column. Assume the query returns more than one
> row.
>
> INSERT INTO Foo (x)
> SELECT a FROM Bar;
>
> You will get a result like this:
>
> IDENTITY X
> ============
> 1 'a'
> 2 'b'
> 3 'c'
>
> but if the query changed an index or was put on the physical disk data
> page differently, you might have gotten:
>
> IDENTITY X
> ============
> 1 'b'
> 2 'c'
> 3 'a'
>
> Explain why one result is the logically correct choice for an
> identifier and all other choices are not, without any reference to the
> physical implementation. You cannot.
>
> Instead of treating the query as a set, you are doing 1950's sequential
> processing using the underlying sequential file system the Sybase
> family started with.
>
> 5) If you have designed your tables correctly, they will have a
> meaningful primary key derived from the nature of the entity they
> model. The IDENTITY column should be a redundant key. The reason
> IDENTITY columns are popular as keys is that they are easy to declare.
> This is also the same reason that people build non-normalized databases
> and put pennies in fuse boxes -- easy is not right.
>
> 6) It is a bitch to do calculations on IDENTITY column values. Well,
> it was hard to do direct math on the sequential position of a record in
> a 1950's punch card system and that it what the IDENTITY is mimicking.
>
> 7) There is no check digit in an IDENTITY columns value, so you have no
> way of verifying it if you use it as a key.
>
> 8) If you use IDENTITY as a key, the values tend to cluster on physical
> data pages because they are sequential. The result is that if the most
> recent rows are the most likely to be accessed, there will be locking
> contention for control of those physical data pages. What you really
> wanted in a key is some spread of the rows over physical storage to
> avoid having every user trying to get to the same page at the same
> time.
>
> 9) The actual implementation of the IDENTITY column has been
> problematic since Version 7.0. You can look up threads in the news
> groups to get assorted tales of woe.
>
> There are other ways of getting a unique identifier for a table. The
> most portable method for getting a new identifier number which is not
> in the set is something like this:
>
> INSERT INTO Foobar (keycol, a, b, c...)
> VALUES(COALESCE((SELECT MAX(keycol) FROM Foobar) +1, 0),
> aa, bb, cc, ...);
>
> The scalar subquery expression returns the current high value for the
> key column, and then increments it. If there is no maximum value (i.e.
> this is the first row to be inserted), then it returns zero.
>
> Using this basic idea, you can replace the increment with a different
> constant or a random number generator. You can also add code to create
> a check digit.
>
> Another method is to hash the columns that make up a compound key so
> that you have single short column that can be reconstructed if you need
> to verify it.
>
> >> And you think that is useless? What happens if the state_code is
> changed by the standard comittee? <<
>
> Then you change the table to reflect the change in the external reality
> being modeled. Otherwise, your mail will not be delivered. Keys can
> change and the set of keys can grow, like any other attribute; the most
> important property of a key is that they uniquely identify an entity in
> the database. If the committee goes crazy and decides that we will
> have two states with the same two-letter postal code or two states
> named 'Texas', then we would be in trouble.
>
> Some nice properties of a key are that they do not change over time,
> they can be verified against the real world, they are short and easy to
> understand, etc. But that is not what makes them a key.
>
> >> What is a natural key? <<
>
> An attribute in the entity that makes it unique. We don't invent it;
> Mother Nature put it there. DNA and fingerprints are pretty good for
> people. Longitude and latitude are pretty good for locations. Etc.
>
> You have to do is rate the strength and verifiability of a key. If it
> changes, it should be becuase the reality changed, not because you
> moved the data inside the physical database from one table to another
> in a different order.
>
> --CELKO--
> Joe Celko, SQL Guru & DBA at Trilogy
> When posting, inclusion of SQL (CREATE TABLE ..., INSERT ..., etc)
> which can be cut and pasted into Query Analyzer is appreciated.
>
>
> Sent via Deja.com
> http://www.deja.com/
Received on Tue Jan 09 2001 - 16:44:33 CET

Original text of this message