Database Design Guidelines and Rules of Thumb

by Thomas B. Cox

This article discusses the various types of primary keys and lists the advantages and disadvantages of each.

Primary Keys

You have three choices for defining the primary key for most tables: a 'natural' key, a system generated or 'internal' key, and (for smaller lookup tables) a mnemonic code.

Internal ID numbers

Rule of thumb: for any table over 100 records, use a system-generated number as the primary key.
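
For example, a minimal Oracle-style sketch of a table keyed by a system-generated number (table and column names are illustrative, not taken from any particular system):

    CREATE SEQUENCE check_seq;

    CREATE TABLE checks (
        check_id      NUMBER       NOT NULL,   -- system-generated surrogate key
        external_ref  VARCHAR2(30),            -- the outside source's own ID, kept as an attribute
        check_date    DATE,
        amount        NUMBER(12,2),
        CONSTRAINT checks_pk PRIMARY KEY (check_id)
    );

    -- populate the surrogate key from the sequence at insert time
    INSERT INTO checks (check_id, external_ref, check_date, amount)
    VALUES (check_seq.NEXTVAL, 'A-1001', SYSDATE, 125.00);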

Any time your system stores a record, particularly when the record's 'natural' key ID number is controlled by an outside source, there is a risk that the outside source may do odd things with its data - it may re-use the ID number, it may update a record by changing the ID, or the system may have a problem talking to the outside source and may get the same record sent twice. In every such case, your system must not be tightly coupled to the outside source - where "tightly coupled" means that a problem in the outside source system causes your system to fail to work correctly.

Therefore, as a form of humility, your system should accept at face value whatever the outside sources send. The system should not try to fix other systems, nor modify their data, nor maintain referential integrity for them. (Your system should do all of those things for its own internally created data, but not for other systems' data that is imported.)

When other systems do make corrections to earlier errors, your system should take the later data at face value and overwrite any earlier records they sent. In every case, your system should faithfully and accurately record whatever is sent from other systems.

This approach can only be made to work with system-generated ID numbers. Other keys could be created instead, such as the external system's primary key plus (say) the date, but this merely tries to obtain the benefits of humility without doing the work, and it exposes your system to failure if any of the problems mentioned above occurs within the course of a single day (when the dates would be duplicated).

(On an earlier system I tried the external-key-plus-datetime approach, only to find that duplicates arrived within the same datetime down to the second -- one second being the finest granularity of an Oracle datetime.)

Note: this does NOT mean your system allows users to enter bad data. Data that users enter directly into the system is part of the system's own data, and the system therefore takes responsibility for that data's cleanliness and accuracy. All data entry screens will provide and enforce "edits" (validation checks) to ensure the correctness of data entered by users directly into your system.

Natural Keys

Rule of thumb: don't use natural keys as primary keys; instead, store them as attributes.

I view 'natural' keys as actually just the keys of an external system -- even if that external system is the outside world. Generally, natural keys are untrustworthy in the extreme. Here are some examples of their untrustworthy behavior:

A criminal steals someone's (US Government generated) Social Security Number (SSN) to commit fraud (now two people have one SSN). Later, the victim is assigned a new SSN (now one person has had two SSNs at different times.) A system that used SSNs as a 'natural' key would have a terrible time with this scenario. A visitor to the US (or an American child) who has no SSN cannot be tracked by the system, and essentially doesn't exist.

A supplier re-uses SKUs (Stock Keeping Units), i.e. when a product becomes obsolete its SKU is reassigned to a new product.

You'll still want to store the supplier SKU -- you just need to be alert to its potential for misbehavior.
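
A sketch of the rule in practice, with illustrative names: the natural key is stored as an ordinary attribute, nullable and not constrained unique, while the primary key is a system-generated number.

    CREATE TABLE persons (
        person_id  NUMBER        NOT NULL,   -- system-generated surrogate key
        ssn        VARCHAR2(11),             -- natural key stored as an attribute:
                                             -- nullable (visitors, children) and not
                                             -- unique (fraud, later reassignment)
        full_name  VARCHAR2(100),
        CONSTRAINT persons_pk PRIMARY KEY (person_id)
    );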

Mnemonic Codes

Rule of thumb: for smaller lookup tables (under 100 records or so), use a short mnemonic code as the primary key.

For lookup tables, I like to take advantage of the fact that primary key values of lookup tables become foreign key values in the tables that reference them -- and if a code is mnemonic (its meaning is easily remembered) then often the user doesn't even need to do a join to know what the key means.

For example, consider a list of Activity Types:

CODE      NAME
--------  --------------
CANCEL    Cancel
CASH      Cashed
ERROR     Error
ISSUE     Issued
RQCANCEL  Request Cancel
RQSTOP    Request Stop
STOP      Stop

Here the Name column can contain a nice, human-readable name for the Activity Type, while the Code column contains a shorter (and thus less wasteful of disk space) code that is still readable by most users.

I like to make CODE columns at least 8 characters long. Shorter keys can get pretty cryptic, and often end up being arbitrary and hard to remember, because there aren't enough letters available to make the code look like the name.
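
For instance, a sketch of such a lookup table and a table that references it (names are illustrative): because the mnemonic code itself becomes the foreign key value, a report on the referencing table is readable without joining back to the lookup table.

    CREATE TABLE activity_types (
        code  VARCHAR2(8)   NOT NULL,   -- mnemonic primary key
        name  VARCHAR2(40)  NOT NULL,   -- human-readable name
        CONSTRAINT activity_types_pk PRIMARY KEY (code)
    );

    CREATE TABLE check_activities (
        activity_id         NUMBER       NOT NULL,
        activity_type_code  VARCHAR2(8)  NOT NULL,   -- readable even without a join
        activity_date       DATE,
        CONSTRAINT check_activities_pk PRIMARY KEY (activity_id),
        CONSTRAINT check_activities_type_fk
            FOREIGN KEY (activity_type_code) REFERENCES activity_types (code)
    );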

Compound Primary Keys

Rule of thumb: avoid compound primary keys, except for very tight master-detail relationships.

I generally dislike compound primary keys, and try to avoid them in almost all cases. The few exceptions will be for master-detail relationships that are very tight, and where the parent is accessed frequently while the child is accessed less often and/or mostly in the context of its parent.

For example, an order and its line items are closely tied, so the line item primary key can be the order key plus a sequence-within-parent. In these cases, place the parent key first and the child key second, to improve index performance, and use a generated number rather than (say) a datetime for the sequence-within-parent value.
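
A sketch of the order/line-item exception, with illustrative names -- the parent key leads the compound primary key, and the line number is a generated sequence-within-parent:

    CREATE TABLE orders (
        order_id    NUMBER  NOT NULL,
        order_date  DATE,
        CONSTRAINT orders_pk PRIMARY KEY (order_id)
    );

    CREATE TABLE order_lines (
        order_id  NUMBER        NOT NULL,   -- parent key first, for index performance
        line_no   NUMBER        NOT NULL,   -- generated sequence-within-parent
        sku       VARCHAR2(20),
        quantity  NUMBER,
        CONSTRAINT order_lines_pk PRIMARY KEY (order_id, line_no),
        CONSTRAINT order_lines_order_fk FOREIGN KEY (order_id) REFERENCES orders (order_id)
    );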

Generic Entities and the Extended Attribute

At <CLIENT>, for a while the team considered adopting a "generic" check data model, and the use of a single Check Extended Attributes table to hold various different check data elements (vendor number for Expense checks, claim numbers for Claims checks, group number for Premium Refund checks, etc.).

The business requirement here was that the data needed to be stored and reported upon for seven years, yet there were parts of the data structure that would change over that period of time. Other parts of the data structure, the team believed, would not change much (or at all). For example, each check has a check number, bank account, payee, date, and amount -- these the team assumed would not change. By contrast, there are currently six check types (seven if you count Comp and Med as two types of Claims checks), but the team had no guarantee that another check type would not be added in the future. For a given check type, such as an Expense check, it is desirable currently to track certain data elements, but the list of interesting data elements for an Expense check may change in a few years.

This means that rather than two categories of data, namely Application Data (which varies) and Application Structure (which doesn't vary), there are three categories: both of these previous two, plus Application Structure That Slowly Varies.

In general, a generic approach can be useful in those cases where analysis cannot be completed before building must begin, or, where structures really will change substantially and regularly over time.

An example of structure that changes substantially and regularly over time is a complex product catalog I built for an online retailer. Each product would belong to a Product Type, but each Type could have a variety of attributes that were peculiar to it, and different from every other Product Type -- a coffee cup might come in two colors, while a T-shirt might come in both colors and sizes. Both would have a Weight, a Price, and so forth -- these would be normal attributes -- while the attributes that are optional or peculiar to a particular product type would be stored as Extended Attributes.
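
A sketch of that sort of structure, with illustrative names: common attributes live as ordinary columns on the product table, while type-specific attributes live as rows in an extended-attribute table.

    CREATE TABLE product_types (
        type_code  VARCHAR2(8)   NOT NULL,
        name       VARCHAR2(40)  NOT NULL,
        CONSTRAINT product_types_pk PRIMARY KEY (type_code)
    );

    CREATE TABLE products (
        product_id  NUMBER        NOT NULL,
        type_code   VARCHAR2(8)   NOT NULL,
        name        VARCHAR2(80),
        weight      NUMBER,                  -- normal attributes, common to all types
        price       NUMBER(12,2),
        CONSTRAINT products_pk PRIMARY KEY (product_id),
        CONSTRAINT products_type_fk FOREIGN KEY (type_code) REFERENCES product_types (type_code)
    );

    CREATE TABLE product_ext_attrs (
        product_id  NUMBER        NOT NULL,
        attr_name   VARCHAR2(30)  NOT NULL,   -- e.g. 'COLOR', 'SIZE'
        attr_value  VARCHAR2(80),             -- extended attributes, peculiar to a type
        CONSTRAINT product_ext_attrs_pk PRIMARY KEY (product_id, attr_name),
        CONSTRAINT product_ext_attrs_prod_fk FOREIGN KEY (product_id) REFERENCES products (product_id)
    );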

Why the Check Extended Attribute was Abandoned

Despite the case made for a generic data model, a competing issue was raised: that the staff who would support this application, and the processes and procedures that the staff would use, were not prepared for the complexity and abstraction required to support a generic check model. That being the case, and rather than predicate the success of the project on the re-education of the support staff and the re-engineering of support procedures, the team decided to revert to a more explicit model.

Other Problems with the Extended Attribute

Lest the Extended Attribute sound too easy, I should point out some of its drawbacks:

  • datatype problems -- it's tough to mix varchar, date, and number data
  • the data dictionary becomes fragmented
  • joins and queries become very complex (see the query sketch below)
  • the DBA has to be specially trained
  • the Data Admin function becomes even more vital -- bad data can cause the system to halt
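
To illustrate the join complexity: a query that reconstructs, say, the vendor number and claim number for a set of checks from an extended-attribute table must join (or outer-join) to that table once per attribute. A sketch, using illustrative names:

    SELECT c.check_id,
           c.check_number,
           vend.attr_value  AS vendor_number,
           clm.attr_value   AS claim_number
      FROM checks c
      LEFT JOIN check_ext_attrs vend
             ON vend.check_id = c.check_id AND vend.attr_name = 'VENDOR_NUMBER'
      LEFT JOIN check_ext_attrs clm
             ON clm.check_id = c.check_id AND clm.attr_name = 'CLAIM_NUMBER';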

See the Generic Checks example at the end of this paper.

Control Columns

If your system imports and exports data using batch jobs, then several of your tables will be candidates for what are often called "control columns" - these are columns that contain data about where a given record originated, when and how it was loaded, etc.

Typical control columns would be: system of origin, batch job that loaded this record, batch job that updated this record, datetime of record creation or update, userid of creation or update.
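
For example, a set of control columns might look like this (the names are illustrative; the exact list should match whatever your batch architecture actually records):

    ALTER TABLE checks ADD (
        source_system     VARCHAR2(20),   -- system of origin
        load_batch_id     NUMBER,         -- batch job that loaded this record
        update_batch_id   NUMBER,         -- batch job that last updated this record
        created_datetime  DATE,
        created_by        VARCHAR2(30),
        updated_datetime  DATE,
        updated_by        VARCHAR2(30)
    );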

State Tables

A 'state table' is a table that records the changing states of data over time -- rather than overwriting old records when they change, a State Table will insert a new record that supersedes the old one. As an example, consider the <CLIENT> Banking System's "MICR Number" table.

The MICR Number entity is treated with some skepticism. For example, the Check Type should properly (in a just and virtuous world) be a parent table to the MICR and the MICR would then parent the Check table, so that the Check derives its Type by way of the MICR. The system doesn't do that. Instead the Check Type has a relationship directly to the Check, bypassing the MICR.

This is largely because the MICR is so entirely a creation of outside forces, over which your system has no control. There is a strong need to isolate those sorts of tables, and keep them away from the key areas of the model, such as Check Type, etc. Furthermore, the MICR numbers themselves are subject to periodic redefinition, and are not even unique at a given point in time (MICR 0050 appears in two different checking accounts -- a complication we leave out of the example below, lest it become even more difficult to follow).

"A valid-time state table records the history of the enterprise. Such a table is easily specified by appending two timestamp columns, one specifying when the row became valid and one specifying when the row stopped being valid. The intervening time is termed the 'period of validity' of the row."
-Snodgrass[1] page 110.

[1] Developing Time-Oriented Database Applications in SQL, by Richard T. Snodgrass, 2000, Morgan Kaufmann Publishing, San Francisco.

In our case, the two timestamp columns are START_DATE and END_DATE.

For a given MICR value (such as 2050), there may be multiple rows in the MICR Number table (MNH) with that value, but only one row will be valid at any given moment. Here is an example:

ID  MICR   START_DATE   END_DATE     MEANING
--  -----  -----------  -----------  ---------
 1  2050   01-JAN-1992  01-JUL-1997  First Meaning
 2  2050   01-JUL-1997  01-JAN-2002  Second Meaning
 3  2050   01-JAN-2002  31-DEC-9999  Third Meaning

Note that each row's END_DATE is the START_DATE of the next row. This is called the Closed-Open approach, where the End Date value is excluded from the period of validity.
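
With the Closed-Open convention, the row valid on a given date is found with an inclusive test on START_DATE and an exclusive test on END_DATE. A sketch, using the example data above (here 15-AUG-1997 falls in the second row's period of validity):

    SELECT m.id, m.meaning
      FROM mnh m
     WHERE m.micr = 2050
       AND m.start_date <= TO_DATE('15-AUG-1997', 'DD-MON-YYYY')
       AND TO_DATE('15-AUG-1997', 'DD-MON-YYYY') < m.end_date;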

The table design requires that the CHECKS table be joined to MNH by way of the ID column, not by the MICR value. There are a couple of reasons for this.

One is that if the CHECKS table carries the MICR value and tries to join using it, we get not one but three MNH rows that match (in this case) -- not good. The MICR value by itself isn't a foreign key.

The other reason is that the alternatives are worse: one could join from CHECKS using a MICR value plus the CHECK_DATE, but the join condition then becomes both complicated and non-intuitive ("WHERE c.micr = m.micr AND c.check_date BETWEEN m.start_date AND m.end_date" instead of simply "WHERE c.mnh_id = m.mnh_id"). The risk of the casual user making an invalid join, and getting data back that is simply wrong, is high.

Another alternative is to ignore earlier and later MICR rows and only store one MICR value in the MNH table. This is unacceptable because we lose the data in the MEANING column, and checks issued in years past will be erroneously joined to the current row.

By joining Checks to MNH by way of the MNH_ID, the system can hold seven years' worth of checks, and therefore seven years' worth of MNH values (redefined periodically over the course of time), and each check knows which version of which MICR it belongs to.
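
Putting it together, a sketch of the recommended join (key column names as in the quoted join condition above; the CHECKS columns are illustrative):

    -- each check picks up the MEANING that was in force when it was issued,
    -- no matter how many times the MICR has been redefined since
    SELECT c.check_number, c.check_date, m.micr, m.meaning
      FROM checks c
      JOIN mnh m ON m.mnh_id = c.mnh_id;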