Re: Natural keys vs Aritficial Keys

From: Troels Arvin <troels_at_arvin.dk>
Date: Fri, 15 May 2009 20:48:52 +0000 (UTC)
Message-ID: <gukkfk$ohk$1_at_news.net.uni-c.dk>


Walter Mitty wrote:
> Every time the issue of natural keys comes up in Stackoverflow.com, the
> prevailing view seems to be that the best primary keys are artificial
> and opaque.

I've seen the same tendency at Wikipedia. I've tried to engage in discussions about it there, but I've given up (sometimes, you just have to give up on Wikipedia). In a discussion, I even provided benchmarks to the most eager surrogate-proponent, showing worse performance for a type of query when surrogate keys were used -- but this was ignored.

And I believe that the tendency is due to

  • Misguidedness. When I started in the programming field, I thought that every table should have a generated integer "id" column; it somehow felt "nice". But after a dose of education and experience, the fixed idea went away. I think many people find it somehow comforting to have an "id" handle, instead of appreciating the idea of letting the data itself be the driving force. The navigational versus the declarative approach.
  • Deficiencies in object relational mappers which either don't support multi-column keys, or don't make it easy to handle multi-column keys.
  • Some DBMSes which don't support CASCADE in definition of foreign keys.

Some of the performance arguments may be valid: There are cases where they are more efficient than natural keys, although the extent of it is probably seldom measured, making it a premature optimization.

   On the other hand, there are cases where the use of surrogate keys results in _worse_ performance, due to the need for extra joins, for example. And all else being equal, the surrogate columns add extra bytes to the database; if all tables have unneeded surrogate keys, it can actually amount to a sizeable amount of "dead" bytes in the database.

But the situation is also due to the SQL standard and the DBMS implementors:

  • Many people stay away from NATURAL JOIN because it's seen as being dangerous, having the potential for being source of bad, subtle problems if a column is renamed. I wish the SQL standard had defined NATURAL JOIN to be based on referential integrity definitions (foreign keys) instead of naming conventions. Maybe the standard could be extended with a new "NATURAL REFERENTIAL JOIN" or simply "REFERENTIAL JOIN", so that joins could be expressed in a terse and safe way, even where multi-column keys are involved.
  • I know of no DBMS which implement the obvious optimization where a "costly" natural key based referential integrity constraint (such as multi-column keys involving textual columns) is handled behind the scenes through hidden pointer-like (integer?) row IDs.

> Is data integrity less important than it used to be?

It seems so :-(

And not just integrity which is at risk: The less you tell the DBMS about your data, the less it can use the knowledge for semantic query optimization. Also, when you don't express known constraints about your data, the application can't make use of this metadata knowledge for GUI hints, etc.

-- 
Troels
Received on Fri May 15 2009 - 22:48:52 CEST

Original text of this message