Re: Normalization, Natural Keys, Surrogate Keys

From: Costin Cozianu <c_cozianu_at_hotmail.com>
Date: 23 May 2002 22:42:24 -0700
Message-ID: <2cf20de8.0205232142.afff1a1@posting.google.com>

I'm affraid I have to jump late in this discussion, but this is an important topic and the lack of knowledge have already caused too much trouble in practice.

Basically I think that you're wrong in a way and you're right in a different way.

On one hand you're wrong in the sense that "surrogate" keys don't solve anything at the logical level, and relying SOLELY on them to enforce uniqueness is purely and simply bad design.

On the other hand you're kind of right because surrogate keys can make efficient 4 bytes or 8 bytes integers, that translate in a very efficient organization at the physical level.

Because in any circumstances there will always be a natural key for any table, that is at a limit all the columns put together should form a key.
However, even for unique constraints of 2-3 columns, current database implementations are disastruous in terms of performance when compared with a surrogate key. It needn't be that way but it isn't. It's the same old argument that Chris Date, Fabian Pascal and others keep pressing for many years that database industry doesn't implement the relational model, and there's a terrible confusion between logical and physical in current DBMS products.

cybermusicdude_at_aol.com (Kai Ponte) wrote in message news:<c1ec9b8f.0205210930.211d68c9_at_posting.google.com>...
>
> In reflection I tend to agree. By using an artificial surrogate key -
> aside from making the database easier to read - we avoid the problem
> of potenially changing data.

Well, I don't see how you make your database easier to read, if as in many OO designed databases I see User_OID instead of UserName, I don;t find that easier to read. Not to mention abominations like Language_OID, Zip_OID, Currency_OID and stuff like that.

The problem of changing data is indeed something to reckon with, and this is because the bad organizaqtion of physical layout of tables. A database should be able to let you UPDATE a candidate key with ease, even if that key participates in other foreign key constraints. Unfortunately, they don't, making it a pain. However, this doesn't change the problems at the logical level of design.

>What if, for example, the person
> providing the SSN provided a false SSN? Does that happen? Yes. Often.

Well, then if it happens there has to be business rules to deal with this "happening". Relying on the surrogate key, does not solve the problem, it simply says that you're hiding your head in the sand, praying for the best.
In reality potentially worse thing might happen, because you propagate the error through the data generated in your system. Your surrogate key is useless.

Because of your surrogate key, you'll let some fraudulent guys to do business with your company/ state agency using my (even yours) identity. Because what it boils down to in LOTS of business and legal transactions both with the state/the feds, but also with oprivate business, is SSN, no matter if we like it or not. That spells trouble for all the potential customers, it's true that developers/designers/dbas get off the hook easy by not worrying :)

You should have an UNIQUE constraint on that table. If duplicate SSNs happen report the error to the operator screen, add the information, to an Errors table and have business procedures in place to deal with it. You can always work things through, without compromising legitimate business rules, and being able to really identify business objects is paramount to any business system.

> What if the person doesn't have an SSN? Then the person later gets
> one. If that were the case the data with the invalid or temporary SSN
> could cause maintenance headaches with joined tables.

The same argument applies, you put it in an exceptions table, You don't drop the logical constraint of having UNIQUE SSNs.

> The problem with using the natural keys - in these examples - is that
> the business data is exposed to the logical design. By exposing the
> business data to the design we will create a maintenance nightmare if
> (when) that business data changes.

Much of the business data may change, hoever if the data needed for identification changes, you must carefully analyze, if the new data should live under the same surrogate identity. For example, if I have a surrogate key in your system, and I come to ask the operator to change my SSN, something is not quite right.

>By using a surrogate Key - say
> RowID - we avoid this maintenace issue. Take, for example, the
> following data structure (not developed by me) from my county's
> resource directory database. (http://help.sbcounty.gov)
>
> Resource
> RowID, AgencyName, AddressStreet, AddressCityID....
>
> Service
> RowID, Name
>
> ResourceService
> RowID, ResourceID, ServiceID

That's probably a bad design. Because you don't give more details (buisiness rules) nobody can't say for sure, but at the very least if ResourceService represents the relationships between Resource and Service you SHOULD drop RowID altogether and have the normal primary key (ResourceID, ServiceID).

What does it mean if you have two rows differing only by RowID ? It means nothing is a data error. Furthermore, when a program might want to delete a record, it might delete by rowID (only one of the logical duplicates) while from the logical/business rules point of view the system will still contain the reklationship between resource and service when it shouldn't.

> The primary keys in each table are the RowID. Because no business data
> is exposed as a primary key, I'm free to add and delete items as I
> feel fit. The table which joins a resource to a service does so using
> the identifier from the service table and the resource table.
>
> First off, I have many resources with the same name - i.e. Salvation
> Army or Red Cross. Since each has a unique ID, I don't have to worry
> about what name the resource is given.

As I mentioned eralier, you probably have to worry EXACTLY about that.

> If there is a duplicate the
> administrator will recognize it immediately and remove it.

Will he ? That means you shifted the burden of data integrity from the DBMS to the poor guy. What if in the meantime many other records relate dirctly or indirectly to some or all of the logical duplicates ?

>I could have used the address as part of the primary key, but...

You can't. The burden of the physical implementation of a B-tree index might be too great. However, you might give it a try, measure how the system is responding, how often you have INSERT , UPDATE, DELETE . Depending on your DB you may implement your unique constraint whith a hashmapped index, which means negligeable penalty.

Having a surrogate primary key shouldn't let you off the hook. As a rule of thumb, you always HAVE TO HAVE an alternate UNIQUE constraint.

A good rule of thumb is to always and automatically start your design by identifying natural keys and using them, and only when you discover serious difficulties at the implementation/physical level, only then you introduce surrogate keys, while you keep the UNIQUE constraints. You might want to drop the idea of declaring PRIMARY KEY in most cases and use only CONSTRAINT xxx UNIQUE ...

> Having written all that - and hopefully I made sense - I have
> discussed with some of my co-workers the subject and we can see a
> clear example of where a natural key would be beneficial. In terms of
> data warehousing applications, I can imagine that a natural key will
> provide for quicker searches than a surrogate key would.
>
> Thanks for all the input so far. This has been a great thread!

Well, sorry for being late, and probably I have repeated some of the arguments already being shuffled.

The surrogate keys in relational databases have recently been advocated strongly by OO proponents because they dream about Object Identity/pointers which are absolutely equivalent with surrogate keys, and with the old pointers in hierarchical and and network databases.

This lead to serious trouble in practice, and much of that trouble I've dealt with myself.

What is even more funny is that some very smart researchers from the OO camp have discussed and settled entirely this issue (while the opinions among relational database experts seem to be divided), but OO people are very poor readers.

You might want to have a look at:
http://citeseer.nj.nec.com/schewe98readings.html

Costin Cozianu

P.S.: I refer to surrogate keys in the sense of keys generated by the system which have no business meaning. If my bank account no is generated by the system but than appears in my bank statement and is used by me to identify the account, that I don't consider a surrogate key. Received on Fri May 24 2002 - 00:42:24 CDT