Re: Normalization, Natural Keys, Surrogate Keys

From: Costin Cozianu <c_cozianu_at_hotmail.com>
Date: 26 May 2002 10:01:59 -0700
Message-ID: <2cf20de8.0205260901.19582f0a@posting.google.com>

> > >What if, for example, the person
> > > providing the SSN provided a false SSN? Does that happen? Yes. Often.
> >
> > Well, then if it happens there has to be business rules to deal with
> > this "happening". Relying on the surrogate key, does not solve the
> > problem, it simply says that you're hiding your head in the sand,
> > praying for the best.
>
> Yes, and - I'm sorry - I didn't make that clear in my post. Relying
> soley on the DB to enforce anything is - in my book - not a good
> practice. I believe wholly that the database is part of an
> application. It should be designed as part of the application and not
> in a theoretical vacumn. I believe, therefore, the database designer
> needs to work with the appliccation designers.
>
> In the case I demonstrated above, which is a real-world example, the
> end users are required to generate a "fake" SSN for new people who
> either don't have an SSN assigned yet or cannot remember thiers. The
> end users are typically instructed to provide 999-99-9999 as the SSN.
>
> In terms of checking for a duplicate when a surrogate key is used then
> that can be done through the application. If this is practiced then
> you can choose what to do with the duplicate entry - reject it, allow
> it, modify the existing entry.

Why not have the database enforce validity of data ? The general rule is that business rules, validity constraints, and so on should be enforced by the database as much as possible, and this is for good reason.

First of all because it is easier and safer, in database you declaratively specify constraints, while relying that all programmers write bug-free programs and read all the applicable design documentation as to make sure that your database is sound, well, that's not a very safe approach and it's kind of asking for trouble. Even if you have constraints enforced by the application and checking them in the database seems redundant, it's a safety measure nevertheless.

For example, in the above, because you allow for that 999999999 for situations that are very rare (let's say <1% or <5% of the cases), you fail to protect the vast majority of cases, because you won't be able to declarer the UNIQUE constraint on the SSN. There are better way to deal with this, one example that I suggested is create an SSN_Exceptions table where you put the data collected for the persons who don't have a SSN or their SSN has problems.

Then, you rely on trhe app programmers to check (for duplicates) before insert, most of whom will probably not do becausde they've been indoctrinated by objects and object identity, and worse, they might use Object/Relational frameworks (even worse, Enterprise Java Beans) that work on cruise-controle, they're happy with the artificial identity (RowID in this case). So don't be surprise if duplicate SSNs will appear.

The right way to do it is to have an UNIQUE constraint, the application will try to insert the data (without checking for duplicates), and the database will fail the INSERT in case of invalid data, then the application can catch the exception and follow the business rules for dealing with these exceptions.

> > For example, if I have a surrogate key in your system, and I come to
> > ask the operator to change my SSN, something is not quite right.
>
> Um, why? You mean not right with the DBMS or not legally right?

Not legally right, first of all. If you are the DMV you shouldn;t give out different licenses to different persons with the same SSN. Potentially one might be a fraud. Even if it was an honest mistake (operator error) , because insurance companies, for example ,might lookup my driving record by my SSN, I can suffer the other's person violations. A repord writer could easily aggregate by SSN.

From the point of vuiew of good DBMS practices it's also a disaster, because that is what an DBMS can easily avoid, while it will requitre the DBA and other people a lot of hastle to manually correct the situation (if they ever do it). Before the DBA sees it, other records might propagate (even to other systems) with the wrong correlation to the wrong SSN.

That is chiefly because a KEY should be something to solve what we call the identification problem (identifying the business concept we're talking about), and not merely a convenience to identify something internal to the system (a row in a table). With surrogate keys, relational databases become no better than COBOL index files, because the surrogate key takes the place of the record pointer.

> > That's probably a bad design.
> Probably is a bad desing in terms of theory, but it works for us. In
> designing databases with well over 50 tables and millions of records,
> it makes for very easy maintenance and upgrades, which is my primary
> goal as a developer and project manager.

It is true, that having multi-column keys might make it a bit harder for developers, report geneartors and other people to do multi column joins.

However:

with a decently designed object oriented framework this shouldn't be a problem, developers, should manipulate a PrimaryKey object, and let the framework generate joins and other conditions.
also with component oriented tools like VB, Delphi, it's a matter of clicking in the wizards, and it's not such a terrible hardship.
you can always have an alternative surrogate key , to make it more convenient as long as you took the necessary precautions that the important natural key conditions are enforced. So the surrogate keys, it's not part of any valid logical design, it's merely an implementation artifice, and if we had better technology (both OO and relational) we should do just fine without surrogates.

You say you're happy that by automating the access method to all the tables by a column named RowID, you get developers happy, project completed faster, etc. But you do this at the expense of lowering the quality of data management, and potentially you might get into other troubles (like putting more logic into application to deal with conditions that should have been dealt with by the database).

With a little bit of extra-effort taking the natural key approach and using surrogates only where there's good reasons, you have a much better quiality database and application design, and the clients will avoid a lot of hasle down the road.

> > but at the very least if
> > ResourceService represents the relationships between Resource and
> > Service you SHOULD drop RowID altogether and have the normal primary
> > key (ResourceID, ServiceID).
> Which would bring me back in to the nightmare of reading goofy primary
> keys joined out of multiple columns - exactly the sort of thing I am
> avoiding. Besides if 49 tables use RowID as the primary key and one
> doesn't that breaks my design apart.
>
>
> > What does it mean if you have two rows differing only by RowID ? It
> > means nothing is a data error.
> Huh? :)

Every row in a relational database, is a logical propostion. The relational model has two keys of interpreattions: the algebra of relations, or first order logic. It is the logical interpretation that makes it extremely valuable in developing and deploying systems.

So in this case, if you have a table T(ResourceID, ServiceID) that is a declared predicate: P(x,y):"The resource identified by <x> provides the service <y>". Each row in the table should be an instance (proposition) of that predicate for example the row (100,5) means "The resource 100 provides the service 5." But two say the same things twice, doesn't make it twice as true, so in such a table to have two rows (100,5) (100,5), means to say the same thing twice. And let's say later you want to delete that (because the relation doesn't hold anymore), than an application muight delete one physical instance of the two rows, leaving the other intact (the proposition will still hold true). There are many other potential troublers coming from allowing duplicates in the database, I won't put the whole theory here.

Now, what happens if you make the table with a RowID and no other UNIQUE constraint ? You basically say:

"The Resource <X> provides the Service <Y>, and we store this very proposition in Row <Z>."

Please note that the later part is essentially spurious. It doesn't mean anything for what you try to model, BUT in your case it allows for semantical duplicates:
Resource <100> can provide the Service <5>, "storage" in Row <1000> Resource <100> can provide the Service <5>, "storage" in Row <9000>

Does it mean anything that the same proposition (the rowID not withstanding) is said twice in your system ? Of course not, but potential troubles are plenty.

> > Furthermore, when a program might want
> > to delete a record, it might delete by rowID (only one of the logical
> > duplicates) while from the logical/business rules point of view the
> > system will still contain the reklationship between resource and
> > service when it shouldn't.
> ...and that's where the programmers come in to play. In my detailed
> documentation I always ensure that these things get taken care of.

Well, you're very optimist:) In a complex system documentation always lags behind, it's not well read and so on. But even with this "let the application do it" approach you'll be surprised to find out that application code has to become more complex, if constraints are not enforced in the database.

> > >I could have used the address as part of the primary key, but...
> >
> > You can't. The burden of the physical implementation of a B-tree index
> > might be too great. H
> B-tree? AAAAAAHHHHHH! (Sorry, six years of dealing with Keyfile,
> five years of dealing with Btrieve and two years of dealing with IMS
> have taken their toll.)

Well, I was refering to the way indexes are physiocally created by default in all databases (SQL Server, Oracle , DB2 etc). They are stored on disk in B-Tree structure. Basically the more index keys you can store in a physical disk page (typically 4192 bytes) the more efficient the index is. If you ahve multi-column indexes in natural keys these keys tend to be large and you'll have a lot less keys per disk page, comparing with the situation where your key is just a 4 bytes or 8 bytes integer.

This justifies using surrogate keys to relate tables between them in FOREIGN KEY. Because these primary key indexes are accessed even when an insertion/update is performed in a related table (to check that the FOREIGN KEY constraint holds). However, following the usual optimization rules, this optimization should not be performed, unless there's an identified bottleneck. Most of the tables in a system will do fine even when related by compund key.

> > Having a surrogate primary key shouldn't let you off the hook. As a
> > rule of thumb, you always HAVE TO HAVE an alternate UNIQUE constraint.
> I respectfully disagree with "always have to have." Yes, it may often
> be necessary, but not always.

I hope I was able to convince you with the above example, that it's ALWAYS.
Every table that has a surrogate key and NO OTHER UNIQUE CONSTRAINT declared is asking for trouble, because it can allow duplicate data (duplicate propositions).

> > A good rule of thumb is to always and automatically start your design
> > by identifying natural keys and using them, and only when you discover
> > serious difficulties at the implementation/physical level, only then
> > you introduce surrogate keys, while you keep the UNIQUE constraints.
> > You might want to drop the idea of declaring PRIMARY KEY in most cases
> > and use only CONSTRAINT xxx UNIQUE ...
> Unless I'm wrong, in both MySQL and M$SQL you must identifiy a primary
> key. Something I don't know?

No, all the SQL databases that I know of, allow you to declare a table without any primary key whatsoever, This is a "feature" of SQL standard, in reality it is an awful mistake, and a deviation from the relational model. This alone should disqualify them from using the term "relational databases".

But to your point, all these modern databases (I can't guarantee for the antiquated ones) are happy with UNIQUE constraints (they can be used in foreign keys, query optimization and everything exactly like a PK).

As a matter of fact we can do fine without the concept of PK at all. Because there is no logical reasons to distinguish one among possibly several candidate keys. And if there's only one key, why bother to call it primary?

> > You might want to have a look at:
> > http://citeseer.nj.nec.com/schewe98readings.html
> I've downloaded the PDF and will be reading it on the way home this
> evening.
>
>
> Thank you for your input to this matter. I hope I don't offend
> anybody by offering my warped view on the world.
>
>
> KP

Hope I could help,

Costin Cozianu Received on Sun May 26 2002 - 12:01:59 CDT