Re: Help finding natural keys

From: Tim X <timx_at_spamto.devnul.com>
Date: 18 Jan 2003 14:35:49 +1100
Message-ID: <87lm1jnn8q.fsf_at_tiger.rapttech.com.au>


>>>>> "Alan" == Alan Gutierrez <ajglist_at_izzy.net> writes:

 Alan> I've been reading this group and considering the design of my  Alan> database used for management of patient charts.

 Alan> I am generating keys using select max. I've considered what Mr
 Alan> Celko says about how this is just a sequential file model in
 Alan> SQL, but I try as I might, I can't think of good, er, natural
 Alan> keys for my entities.


 Alan> Have I just fallen into a bad habit? Are there candidate keys
 Alan> for things like a person that I just cannot see? Below are some  Alan> examples from my database.
 Alan> create table firm (
 Alan>     firm_id integer not null, firm_name varchar(32), primary
 Alan>     key (firm_id)
 Alan> );

 Alan> The firm may change it's name, so that doesn't make for a good  Alan> key.

 Alan> create table patient (
 Alan>     firm_id integer not null, patient_id integer not null,
 Alan>     primary key (firm_id, patient_id), foreign key (firm_id)
 Alan>     references firm
 Alan> );

 Alan> A patient is a charge of a particular hospital so, I select the  Alan> max patient id for the hospital to get a new patient id.

 Alan> create table questionnaire (
 Alan>     firm_id integer not null, questionnaire_id integer not
 Alan>     null, primary key (firm_id, questionnaire_id), foreign key
 Alan>     (firm_id) references firm
 Alan> );

 Alan> Much of the application data is stored in questionnaires. These
 Alan> map to the many forms that are filled out and added to a chart,
 Alan> assessments, plans of care. There is no unique identifier for
 Alan> these things in the real world where they are photocopied Word  Alan> documents designed by a nurse. Again, select max.
 Alan> create table completed_questionnaire (
 Alan>     firm_id integer not null, questionnaire_id integer not
 Alan>     null, completed_questionnaire_id integer not null, primary
 Alan>     key (firm_id, questionnaire_id,
 Alan>     completed_questionnaire_id), foreign key (firm_id)
 Alan>     references firm
 Alan> );

 Alan> Here is an instance of a questionnaire. A questionnaire might
 Alan> be completed a number of times during a patient's care, like a  Alan> progress report form. Again, select max.

I think there are two issues raised here

  1. Selecting what to use for your primary key - surrogate or natural
  2. If surrogate, how to generate that key.

There is a lot of debate on which is better surrogate or natural. My experience is many developers opt for a surrogate key because its easier. However, I tend to feel that difficulty in identifying a natural key is often an indicator of weakness in your model. While it is true you sometimes don't have the necessary attributes in an entity to identify a suitable natural key, I think you should always try to first and only use a surrogate key when no suitable candidates are found. If I can't identify a suitable natural candidate for the primary key, I will often add a unique id attribute and add that to the best natural key candidate (assuming its not a concatenation of all the attributes), rather than just using the unique id as the primary key.

The bottom line, there are no guaranteed techniques for selecting surrogate vs natural keys, but inability to identify acceptable candidates for a natural primary key is possibly an indication your model is either representing an entity which is really a combination of one or more distinct entities or your entity is missing important attributes.

Entities representing humans can be difficult because the thing we immediately think of as the natural key is a persons name. However, names are not unique. Sometimes you can add additional entities about the person which will provide sufficient uniqueness for your domain. For example, first_name+last_name+date_of_birth may be sufficient (if you added birth time it might even be better, but how many people know the time of their birth?) Many countries assign numbers like social security numbers or medicare numbers to everyone and that may be sufficient. However, in some domains, this won't be enough (what do you do if your patient is from overseas or if your government recycles SSN/Medicare numbers after X years of a persons death etc). The thing about a primary key is it only has to be unique within your problem domain, not necessarily unique within all problem domains which need to model that type of entity. If you do not have overseas patients and if you don't need to keep patient records after a patient has died, maybe SSN/Medicare numbers will do. If your domain is small, maybe there will be very few patients with the same name and none with the same name and date and/or place of birth etc might be sufficient. Identify and clearly define your problem domain and then identify your entities within that domain and then try to identify as many candidate natural keys as possible - select the best one and if none cold be found, re-evaluate your entities, domain etc and if still none can be found, consider a surrogate key.

Even when you find you need to resort to a surrogate primary key, you will still need some sort of key to identify the patients. For example, if your primary key was just an artificially generated surrogate key, how will you then determine which patient called Joan Smith you are dealing with - its almost certain nobody will know what their patient number is within the system, so you will still need some sort of additional information to identify that patient.

I think similar hold true for the firm information - maybe there is a business or tax number which a firm keeps even if they change their name. If a firm is purchased by another firm, is that the same firm or a new firm - again, lots depends on what your real domain is and how things need to be modelled within it.

What I found of far more concern was the way you generate your surrogate keys. Determineing a new primary key by looking at the largest current primary key and adding one to it is very dangerous. The big danger is that first selecting the maximum value from a field, adding some amount to it and then inserting the resulting key back into the table is not an atomic action. This means it is possible for another process to query the table between the first process making its query and before it does its insert - the possible result is duplicate values. While this will cause some sort of insert error is the field is the only field which makes up the primary key, it may not if your primary key is a concatenation of this field with other fields. So, you could end up with duplicate values in a column which you expect to only have unique values or you are going to have inserts which fail and need to be re-applied. Either case is undesirable.

Most commercial (and even most free) DBMS have some facility for generating id numbers which are guaranteed to be unique (e.g. sequence). If your system has such a facility, use it rather than the max(some_column)+1 approach. Apart from guaranteeing your id numbers will be unique, it will probably be faster than doing a max() lookup followed by an insert.

Tim

-- 
Tim Cross
The e-mail address on this message is FALSE (obviously!). My real e-mail is
to a company in Australia called rapttech and my login is tcross - if you 
really need to send mail, you should be able to work it out!
Received on Sat Jan 18 2003 - 04:35:49 CET

Original text of this message