Re: Natural keys vs Artificial Keys

From: Walter Mitty <wamitty_at_verizon.net>
Date: Sat, 16 May 2009 17:58:42 GMT
Message-ID: <m9DPl.1724$5F2.621_at_nwrddc01.gnilink.net>


"Troels Arvin" <troels_at_arvin.dk> wrote in message news:gukkfk$ohk$1_at_news.net.uni-c.dk...
> Walter Mitty wrote:
>> Every time the issue of natural keys comes up in Stackoverflow.com, the
>> prevailing view seems to be that the best primary keys are artificial
>> and opaque.
>
> I've seen the same tendency at Wikipedia. I've tried to engage in
> discussions about it there, but I've given up (sometimes, you just have
> to give up on Wikipedia). In a discussion, I even provided benchmarks to
> the most eager surrogate-proponent, showing worse performance for a type
> of query when surrogate keys were used -- but this was ignored.
>
> And I believe that the tendency is due to
>
> - Misguidedness. When I started in the programming field,
> I thought that every table should have a generated integer
> "id" column; it somehow felt "nice". But after a dose of
> education and experience, the fixed idea went away.
> I think many people find it somehow comforting to have
> an "id" handle, instead of appreciating the idea of
> letting the data itself be the driving force. The
> navigational versus the declarative approach.
>
> - Deficiencies in object relational mappers which
> either don't support multi-column keys, or don't make
> it easy to handle multi-column keys.
>
> - Some DBMSes which don't support CASCADE in definition
> of foreign keys.
>
> Some of the performance arguments may be valid: There are cases where
> surrogate keys are more efficient than natural keys, although the extent of
> this is probably seldom measured, making it a premature optimization.
> On the other hand, there are cases where the use of surrogate keys
> results in _worse_ performance, due to the need for extra joins, for
> example. And all else being equal, the surrogate columns add extra bytes
> to the database; if all tables have unneeded surrogate keys, they can
> add up to a sizeable number of "dead" bytes in the database.
>
> But the situation is also due to the SQL standard and the DBMS
> implementors:
>
> - Many people stay away from NATURAL JOIN because it's
> seen as dangerous, having the potential to be a
> source of bad, subtle problems if a column is renamed.
> I wish the SQL standard had defined NATURAL JOIN to
> be based on referential integrity definitions (foreign
> keys) instead of naming conventions. Maybe the standard
> could be extended with a new "NATURAL REFERENTIAL JOIN"
> or simply "REFERENTIAL JOIN", so that joins could be
> expressed in a terse and safe way, even where multi-column
> keys are involved.
>
> - I know of no DBMS which implements the obvious optimization
> where a "costly" natural key based referential integrity
> constraint (such as multi-column keys involving textual
> columns) is handled behind the scenes through hidden
> pointer-like (integer?) row IDs.
>
>> Is data integrity less important than it used to be?
>
> It seems so :-(
>
> And not just integrity which is at risk: The less you tell the DBMS about
> your data, the less it can use the knowledge for semantic query
> optimization. Also, when you don't express known constraints about your
> data, the application can't make use of this metadata knowledge for GUI
> hints, etc.
>
> --
> Troels
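
Troels's worry about NATURAL JOIN is easy to demonstrate. Here is a minimal sketch using SQLite through Python's sqlite3 module; the customers/orders schema and every name in it are invented purely for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY,
                         customer_id INTEGER REFERENCES customers);
    INSERT INTO customers VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO orders VALUES (10, 1), (11, 1), (12, 2);
""")

# NATURAL JOIN currently matches on the one shared column, customer_id.
before = con.execute(
    "SELECT COUNT(*) FROM orders NATURAL JOIN customers").fetchone()[0]

# Someone later adds an unrelated 'name' column to orders.  NATURAL JOIN
# now silently matches on (customer_id, name) as well; no error is raised.
con.execute("ALTER TABLE orders ADD COLUMN name TEXT")
after = con.execute(
    "SELECT COUNT(*) FROM orders NATURAL JOIN customers").fetchone()[0]

print(before, after)  # 3 0 -- the join quietly lost every row
```

Because NATURAL JOIN matches on column names rather than on declared foreign keys, merely adding a column that happens to share a name silently changes the join's meaning -- which is exactly why a join driven by referential-integrity declarations would be safer.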

Thanks for your response. I think that both stackoverflow and Wikipedia make it possible to confuse popularity with excellence.

Perhaps one just has to accept that some responses don't get the recognition they deserve. By way of an analogy, in a democracy, sometimes the electorate goofs.

But I think there's something more substantive going on here. I think the center of gravity of database design has shifted from supporting an enterprise's need to manage and share data to supporting an application's need for a reliable persistent store. Most of the people voting are not DBAs or database designers for an enterprise integrating database. They are developers who are getting an application ready for sale. The measure of "goodness" changes accordingly.

As far as efficiency goes, I used to draw a sharp distinction between inefficiencies that tacked an extra 10% onto a retrieval, and inefficiencies that turned minutes of delay into hours of delay. I found that keeping the design simple and sound, and keeping things flexible enough to make fine tuning possible, were much more important than squeezing every last drop of performance out of every single design decision. Maybe today's developers are like a horse owner trying to win the Kentucky Derby, and considering second place a "loss".

Speaking of efficiency, composite primary keys have almost no performance effect, if you get the indexes right. A primary key on (StudentID, CourseID) is logically equivalent to a key on (CourseID, StudentID). However, an index on (StudentID, CourseID) can have a very different performance impact than an index on (CourseID, StudentID). Unfortunately, some people judge the performance impact of a compound key based on the impact of the freebie index the DBMS makes for you.
(I know, I know, "freebie" is the wrong word!).
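
To make the index-order point concrete, here is a small sketch using SQLite through Python's sqlite3 module (the table and index names are invented, and the exact EXPLAIN QUERY PLAN wording varies by SQLite version):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE enrollment (
        StudentID INTEGER,
        CourseID  INTEGER,
        PRIMARY KEY (StudentID, CourseID)  -- the "freebie" index
    )
""")

def plan(sql):
    # Concatenate the 'detail' column of EXPLAIN QUERY PLAN output.
    return " ".join(row[3] for row in con.execute("EXPLAIN QUERY PLAN " + sql))

# A lookup on the leading key column can SEARCH the primary-key index.
print(plan("SELECT * FROM enrollment WHERE StudentID = 1"))

# A lookup on the trailing column alone cannot seek into that index;
# the planner falls back to a SCAN.
print(plan("SELECT * FROM enrollment WHERE CourseID = 7"))

# The reversed index turns the CourseID lookup back into a SEARCH.
con.execute("CREATE INDEX ix_course_student ON enrollment (CourseID, StudentID)")
print(plan("SELECT * FROM enrollment WHERE CourseID = 7"))
```

The constraints (StudentID, CourseID) and (CourseID, StudentID) enforce the same uniqueness rule; it is only the physical index behind each that supports a different set of seeks.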

I also remember a time when unique indexes used to perform much, much better than indexes that permitted duplicates. This was a product specific deficiency.

Thanks again.

Received on Sat May 16 2009 - 19:58:42 CEST
