Re: Normalization, Natural Keys, Surrogate Keys

From: Costin Cozianu <c_cozianu_at_hotmail.com>
Date: 30 May 2002 09:18:55 -0700
Message-ID: <2cf20de8.0205300818.6fcb66c4@posting.google.com>

Glad that I made you think more, on the subject. However I'm sorry to see that you got the wrong conclusions :) We can always agree to disagree though.

The essential problem that needs to be solved in a good relational design as well as in a good OO design (that's my bread and butter) is not to create primary keys, secondary indexes and so on, so forth. Those are secondary. You might as well forget about those.

Many people have come to wrongly believe that their duty is to declare a primary key for each table, and foreign keys wherever necessary, and that's basically it.

The essential problem that needs to be solved is the IDENTIFICATION PROBLEM (it was treated a bit more thorouighly in the reference I sent you). In a design we model whatever we think are the essential characteristics of objects/entities in the real world. The relational databases use a framework of logic (as I pointed you previously in this thread) to realise this modeling.

The identification problem is basically that of being able to uniquely map the real entities of the reality being model to the objects of our modeling framework. If we were using object databases, those "objects" are real programming objects, in a relational database the modeling objects are logical propositions about the real entities and their properties.

To this end, the identification problem, surrogate keys are totally useless. Take this example, you have 2 users in the system, Costin and Kai, both of them have been inserted with SSN:111-11-1111 (probably one of them is a fraud, or one of them misspelled his SSN), but different RowID.

Now the system will want to do business (either one of us comes up to the front desk, which is a happy case, because the operator might check manually other information, or data has to be processed coming from related system), with a person defined by SSN:111-11-1111.

In the happiest case, the system will refuse to perform the transaction (the right thing to do), but most likely the transaction will go on and more incorrect data will be propagated to the system. Quite possible, given the application frameworks of today a SELECT ... WHERE SSN=111-11-1111 will be issued, and the first row that shows in the results set, will be taken by RowID and the transaction will be performed as nothing happens.

Even, if the developer is aware to deal with this situation (quite unlikely), in many cases he just can't. The data is coming from external systems, identified precisely by SSN. Even if it comes with an additional first name, last name, these are often misspelled. It happens to me, as it happens to everybody. So a developer who wanbts to connect your system with external systems for the prupose of doing transactions, just can't.

Even for a front desk operator is awkward to say the least, for him/her to input SSN:111-11-1111 and to be presented with a list of 2 (or more persons). Then he should ask: ARE YOU Costin Cozianu SSN:111-11-1111, or ARE YOU Kai Ponte SSN:111-11-1111 ?? And select accordingly. I'd run away from doing business with a bank that asks me suich a question. If it was state agency, probably I'd have no choice.

Not to mention again, the extreme likelyhood that whatever framework (VB, Delphi, hand crafted EJB/JSP) used will stop at the first row retrieved from the database so the WRONG person will be selected.

Now keep this in mind for the prupose of discussing your design.

cybermusicdude_at_aol.com (Kai Ponte) wrote in message news:<c1ec9b8f.0205290820.706424bc_at_posting.google.com>...
> c_cozianu_at_hotmail.com (Costin Cozianu) wrote in message news:<2cf20de8.0205260901.19582f0a_at_posting.google.com>...
>
> > For example, in the above, because you allow for that 999999999 for
> > situations that are very rare (let's say <1% or <5% of the cases), you
> > fail to protect the vast majority of cases, because you won't be able
> > to declarer the UNIQUE constraint on the SSN. There are better way to
> > deal with this, one example that I suggested is create an
> > SSN_Exceptions table where you put the data collected for the persons
> > who don't have a SSN or their SSN has problems.
> >
> > Then, you rely on trhe app programmers to check (for duplicates)
> > before insert, most of whom will probably not do becausde they've been
> > indoctrinated by objects and object identity, and worse, they might
> > use Object/Relational frameworks (even worse, Enterprise Java Beans)
> > that work on cruise-controle, they're happy with the artificial
> > identity (RowID in this case). So don't be surprise if duplicate SSNs
> > will appear.
> >
> > The right way to do it is to have an UNIQUE constraint, the
> > application will try to insert the data (without checking for
> > duplicates), and the database will fail the INSERT in case of invalid
> > data, then the application can catch the exception and follow the
> > business rules for dealing with these exceptions.
> >
>
> Okay, you've made me think. (I hate it when that happens.) I've
> discussed this issue with some of my co-workers and here's what I've
> come up with.
>
> 1. Yes, having a unique column (or group of columns) in the table is
> beneficial. This is done even if the column(s) isn't (arent) the PK.

Having an UNIQE constraints defines the solution to the identification problem.
If this problem is not solved, the design should be qualified as bad.

> 2. Allowing the DBMS to catch the duplicates and failing on an insert
> is a benefit when programmers don't do their due dilligence. (That
> never happens, right?)
>
> HOWEVER...
>
> I thought about the exception table idea and I believe it is a bad
> design in most instances that I can forsee. (Remember this is just my
> narrow-minded opinion.) Here's why. Let's say, as in the examples
> we've been using, we have a table which includes a "supposedly" unique
> SSN. Let's say too that we want the DBMS to catch any duplicate SSNs
> and not allow for an INSERT. You suggest that an exception table be
> created where the duplicates reside. I can see how this would benefit
> the enforcement of a uniqe column.
>
> The problem that I see is in how to deal with the exceptions. In a
> real-world situation, the exceptions may never get handled and they
> will live in the "exeption" table forever. Thus you end up with
> "duplicate" data in two tables. As a result, problems will crop up
> with foreign key relationships since they will need to be related to
> both the primary and exeption tables.
>
> Now, let's say we setup a design with only one table as so:
>
> RowID(PK), SSN, Occurance
>
> We still use the surrogate key to maintain a separation of the
> business data from the primary key.
> The SSN column will allow
> duplicates. The new column added is the occurance column which is set
> to zero unless a duplicate SSN is entered. Our UNIQUE constraint is
> the combination of the SSN and occurance. Using my example with data
> we would see something like the following:
>
> RowID SSN Occurance
> 1 123-45-6789 0
> 2 123-43-8989 0
> 3 123-45-6789 1

Therefore your SSN is still alowing duplicates, and your Occurance column solved absolutely nothing. Do you expect me to come at the front desk, and say I am the Occurance:0 of SSN:123-45-6789 ? Occurance is bogus, it contains no real information, the table rows are already distinct by virtue of RowID column.

So Occurance solves nothing, just it allows you to define an alternate UNIQUE constraint. However this wasn't the goal, the goal was to solve the identification problem which remains unsolved. And still, your system allows two distinct persons to coexist with the same SSN. Your data quality is down the drains. What is worse, before you are able to take whatever measures (which can't be prgramatical, they have to be a painful manual process), the errors might propagate to other systems.

> The data still is allowed to be entered and it exists in only one
> table. The occurance flag is incremented each time a duplicate SSN is
> entered. Programming can be created to deal with the exceptions and
> provide for a clean up when time permits.

That's terrible. Programs will not be able to deal with exception even when time permits (most likely never). A program will make a decision based on what ? Based on Occurance ? It can't , because Occurance contains 0 infromation is just an arbitrary number discriminating 2 or more physical row in a table which may have the same SSN (as opposed to the logical proposition about the entity being modeled). We already have RowID for that. The SSN was the REAL information, but you allowed for duplicates.

>(Read Steven Covey's parable

> on sharpening the saw to determine when this happens.) This design
> allows for the exeptions you mentioned without forcing the DBMS to
> maintain a second table.

The second table was just a convenience. The more simple and drastic solution would be simply to forbid somebody doping business with you, if someone else in your records has the same SSN, at least until the situation is clarified. And when the situation is clarified, one way or the other, the only possibility that's valid is two distinct persons have distinct SSNs. That's the bottom line.

The person in cause will even thank you for helping him discover an identity theft, or having his records (driving history, credit history, etc, etc) being mixed up with another person's.

Using an exception table, you can still collect some data about that person and later use it.

> I think this thread and the discussions within have reinforced the
> idea that using a surrogate key is not only a better design for
> relational databases but is flexible enough for any situation. I want
> to thank all of those who have contributed your thoughts so far.
>
> Kai Ponte

No, using surrogate keys is terrible for a logical design and the flexibility comes only with sacrficing data quality goals. The only considerations that justify surrogate keys, is the inability of RDBMS to efficiently manipulate at the physical level (dsik layout) keys that are multiple columns and big in size. Also, no database that I know of has an otpion like ON UPDATE CASCADE for foreign key to allow for update propagations.

But feel free to disagree.

Costin Received on Thu May 30 2002 - 11:18:55 CDT