Re: Primary key on one or two fields

From: --CELKO-- <joe.celko_at_northface.edu>
Date: 8 Aug 2003 12:56:30 -0700
Message-ID: <a264e7ea.0308081156.4cd72b48_at_posting.google.com>


>> I'm trying to fully separate the data into its three components,
hence three tables. <<

Components are parts of a single whole, so that terminology is confusing.

>> This makes it easier to code the front end and ensure I don't end
up with invalid data (e.g. a rate_id and rate name that don't match up between different rates). I am more interested in correctness than speed here, and speed is unlikely to be much of an issue. <<

Then why use IDENTITY, which will destroy data integrity?

>> OK, this is my current proposed structure:<<

Let's look at it one table at a time.

1)
CREATE TABLE Rates
(rate_id INTEGER IDENTITY(1,1),
 rate_name VARCHAR(20) DEFAULT '{{unknown}}' NOT NULL,
 rate_type INTEGER DEFAULT 1 NOT NULL,
 PRIMARY KEY (rate_id));

Minor point: the correct SQL-92 syntax is to have the DEFAULT clause after the data type, not after the NOT NULL. Why avoid portable code?

Major point: IDENTITY cannot ever be a key by definition, and trying to make it into one guarantees loss of data integrity. If you want a numeric rate identifier, use one with a check digit or something that can be verified in the real world, not the internal physical state of a piece of hardware at insertion time.
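As a sketch of what "verifiable in itself" means, here is a mod-11 check-digit scheme for a numeric rate identifier. The weighting is illustrative (my assumption, not a scheme from the post); the point is that validity can be tested from the value alone, with no trip to the database:

```python
# Sketch: a rate identifier that carries its own mod-11 check digit.
# The weighting scheme below is illustrative, not from the original post.

def add_check_digit(base: int) -> int:
    """Append a mod-11 check digit: base digits are weighted from the
    left so that the full number's weighted sum is divisible by 11."""
    digits = [int(d) for d in str(base)]
    total = sum(w * d
                for w, d in zip(range(len(digits) + 1, 1, -1), digits))
    check = (11 - total % 11) % 11
    if check == 10:  # this base would need an 'X' digit; reject it
        raise ValueError("base has no single-digit check; pick another")
    return base * 10 + check

def is_valid(rate_id: int) -> bool:
    """Verify the identifier from its value alone -- no table lookup."""
    digits = [int(d) for d in str(rate_id)]
    total = sum(w * d
                for w, d in zip(range(len(digits), 0, -1), digits))
    return total % 11 == 0

print(add_check_digit(1234))  # 12343
print(is_valid(12343))        # True
print(is_valid(12344))        # False -- a typo is caught immediately
```

A mistyped digit or a transposition changes the weighted sum, so bad identifiers are rejected at the edge of the system instead of silently stored.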

2)
CREATE TABLE Rate_Revisions
(revision_id INTEGER IDENTITY(1,1),
 rate_id INTEGER NOT NULL,
 rate_date DATETIME DEFAULT CURRENT_TIMESTAMP NOT NULL,
 PRIMARY KEY (revision_id));

See above remarks about IDENTITY.

Minor point: the DEFAULT clause again, but why not one on the rate_date?

Major flaw: time is a continuum, not a set of points! That timestamp needs to be part of the key! Remember Einstein? To exist, an entity must have a duration.

Let's kill the Rate_Revisions table right now, the same way I can destroy everything in this schema.

INSERT INTO Rate_Revisions (rate_id) VALUES (12);
INSERT INTO Rate_Revisions (rate_id) VALUES (12);
 . . . << repeat 10,000 times >>
INSERT INTO Rate_Revisions (rate_id) VALUES (12);

Now try to get meaningful results with all the redundant data. More fun? Drop a row, which should model one fact in a proper schema, and tell me the durations of the rates that were in effect before and after the deleted row. With a fact missing, you should get a result that tells you that data is missing, but instead you get a false answer!

3)
CREATE TABLE Rate_Revision_Employees
(rate_revision_id INTEGER NOT NULL,
 employee_id INTEGER NOT NULL,
 rate FLOAT NOT NULL,
 PRIMARY KEY (rate_revision_id, employee_id));

Since the Rate_Revisions table is garbage now, this is moot. I gather that this is a list of the employees who had something to do with a rate change?

>> If I want to see current rates I can use: ... <<

… a horrible nested query expression that will run like glue. That is the result of changing the nature of time from duration to points.
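For contrast, once time is modeled as durations -- the (start, end) shape proposed at the end of this post -- "current rates" needs no nesting at all. A sketch in SQLite with made-up sample data (my assumptions):

```python
import sqlite3

# With time modeled as durations, "current" is simply the open-ended
# row: rate_end_date IS NULL. No nested subquery hunting for the
# MAX(rate_date) per rate.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Rate_Revisions
    (rate_name VARCHAR(20) NOT NULL,
     rate REAL NOT NULL,
     rate_start_date TEXT NOT NULL,
     rate_end_date TEXT,            -- NULL = still in effect
     CHECK (rate_start_date <= rate_end_date),
     PRIMARY KEY (rate_name, rate_start_date))""")

conn.executemany(
    "INSERT INTO Rate_Revisions VALUES (?, ?, ?, ?)",
    [("standard", 10.0, "2003-01-01", "2003-06-30"),
     ("standard", 12.5, "2003-07-01", None),
     ("overtime", 15.0, "2003-01-01", None)])

current = conn.execute("""
    SELECT rate_name, rate
      FROM Rate_Revisions
     WHERE rate_end_date IS NULL
     ORDER BY rate_name""").fetchall()
print(current)  # [('overtime', 15.0), ('standard', 12.5)]
```

The same table answers "what was the rate on date X?" with a single BETWEEN predicate, because each row carries its own duration.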

4)
CREATE TABLE Rate_Revisions
(rate_revision_id INTEGER,
 rate_id INTEGER,
 rate_date DATETIME,
 PRIMARY KEY (rate_revision_id, rate_id));

Major point: you cannot have a PRIMARY KEY with nulls.

>> I was under the impression that getting the database to generate my
unique keys was a good thing, … <<

No! Have the data model **show** you the keys. You need to verify them in the real world or you have no data integrity. It gets worse in a multi-user environment, where there is more chance of an input error.

>> … and I can't say I understand your reasons for not liking them
other than not being standard. <<

Isn't that enough? Or do you start all your code with a comment saying "This thing will not port, will not be readable to people who don't know this proprietary dialect, and does not follow a relational design"? None of that seems good to me. I already demonstrated how it destroys data integrity.

Let's define what characteristics you want in an identifier.

1) An identifier has to be unique. Otherwise, it is not an identifier.

A personal example of this happened to me for years. I was named after my father and we were constantly getting each other's mail, phone calls and so forth because people would assume that such an unusual name could only belong to one person. I finally legally changed my first name from "Joseph" to "Joe" to avoid the confusion.

2) An identifier should be created with the entity, or before the entity exists, but not afterwards. Without an identifier, you simply cannot put an entity into the database.

You might hold the "pre-entity" outside of the system for a short time, however. Imagine a situation where a salesman takes an order at the customer site, but has to call his company to get a confirmation number that the customer then uses to reference the transaction. Until that confirmation number is assigned, there is no transaction.

As an example of an identifier coming into existence before the entity, think about a book that has not been published yet. It can be assigned an ISBN (International Standard Book Number), a title, an author, a price, and everything else while the publisher is waiting to get the manuscript.

But a future release book does not behave like a real book. You cannot put it in a box and ship it. You cannot get a review of the book either -- at least not an honest review. It is not the same kind of thing as a published book.

3) It should be verifiable within itself. That means that when I see a particular kind of identifier, I ought to know if it is syntactically correct. For example, I know that ISBN 0-486-60028-9 has the correct number of digits and that the check digit is correct for a proper International Standard Book Number. Later on I can find out that it identifies the Dover Books edition of AN INVESTIGATION OF THE LAWS OF THOUGHT by George Boole.
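That ISBN can be checked mechanically. A sketch of the standard ISBN-10 rule: the ten digits, weighted 10 down to 1, must sum to a multiple of 11 (the character 'X' stands for the value 10):

```python
def isbn10_ok(isbn: str) -> bool:
    """Verify an ISBN-10 check digit: digits weighted 10..1 must sum
    to a multiple of 11 ('X' stands for the value 10)."""
    digits = [10 if ch.upper() == "X" else int(ch)
              for ch in isbn if ch not in "- "]
    if len(digits) != 10:
        return False
    return sum(w * d for w, d in zip(range(10, 0, -1), digits)) % 11 == 0

print(isbn10_ok("0-486-60028-9"))  # True -- Boole's book checks out
print(isbn10_ok("0-486-60028-8"))  # False -- corrupted check digit
```

For 0-486-60028-9 the weighted sum is 10*0 + 9*4 + 8*8 + 7*6 + 6*6 + 5*0 + 4*0 + 3*2 + 2*8 + 1*9 = 209 = 19 * 11, so the number is syntactically valid before anyone looks up what it identifies.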

4) An identifier should have repeatable verification against the reality that you are trying to capture in your data model.

Exactly what verification means can be a bit fuzzy. At one extreme, prison inmates are moved by taking their fingerprints at control points and courts want DNA evidence for convictions. At the other end of the spectrum, retail stores will accept your check on the assumption that you look like your driver's license photograph.

5) Prefixes like "tbl-" in front of a table name are called Hungarian notation. This is a really bad idea. Would you write "(noun)_I (verb)_disagree (preposition)_with (determiner)_the (noun)_concept." in English? All it does is expose the PHYSICAL storage in a language that uses LOGICAL models. You gain no new information and destroy the readability of your code. The PHYSICAL storage location or method has nothing to do with the LOGICAL meaning of the data element. Would you actually write "top_file_drawer_personnel" if you stored personnel data in a vertical filing cabinet?

If you look at the Microsoft-supplied system stored procedures, you will see that Microsoft as a rule does not use it in their own production-level T-SQL code.

The rules I like to use are:

  1. Give tables names that are plural or collective nouns that identify the class they represent. For example, Employee is bad (unless your company only has one), Employees is better and Personnel is best.

A suggestion from another regular on the SQL Server newsgroup is that as one grows older, one's eyes deteriorate. It is easier to read names with "white" space in them, so I personally prefer names like Customer_Groups to CustomerGroups.

For column names, I try to stick to the INCITS L8 Metadata Standards committee rules as given in the ISO-11179 Standard. In particular, Section 11179-4 has a good simple set of rules for defining and naming a scalar data element. A data definition shall:

  1. be unique within the entire schema, not just within a table. Why do people declare the primary key of a table named Foobar to be "id", then when it is referenced in a second table, re-name it "Foobar_id"?? What is the logical difference between the two attributes?
  2. be stated in the singular. Remember that columns are scalars, not sets.
  3. state what the concept is, not only what it is not
  4. be stated as a descriptive phrase or sentence(s)
  5. contain only commonly understood abbreviations
  6. be expressed without embedding definitions of other data elements or underlying concepts

The document then goes on to explain how to apply these rules with illustrations. There are three kinds of rules that form a complete naming convention:

  • Semantic rules based on the components of data elements.
  • Syntax rules for arranging the components within a name.
  • Lexical rules for the language-related aspects of names.

6) Name development begins at the conceptual level. An object class represents an idea, abstraction or thing in the real world, such as tree or country. A property is something that describes all objects in the class, such as height or identifier. This lets us form terms such as "tree height" or "country identifier" from the combination of the class and the property.

The next level in the process is the logical level. A complete logical data element must include a form of representation for the values in its data value domain (the set of possible valid values of a data element). The representation term describes the data element's representation class. The representation class is equivalent to the class word of the prime/class naming convention many data administrators are familiar with. This gets us to "tree height measure", "country identifier name" and "country identifier code" as possible data elements.

There is a subtle difference between "identifier name" and "identifier code" and it might be so subtle that we do not want to model it. But we would need a rule to drop the property term in this case. The property would still exist as part of the inheritance structure of the data element, but it would not be part of the data element name.

Some logical data elements can be considered generic elements if they are well-defined and are shared across organizations. Country names and country codes are well-defined in ISO-Standard 3166, Codes for the Representation of Names of Countries, and you might simply reference this document.

Note that this is the highest level at which true data elements, by the definition of ISO-11179, appear: they have an object class, a property, and a representation.

The next is the application level. This is usually done with a quantifier which applies to the particular application. The quantifier will either subset the data value domain or add more restrictions to the definition so that we work with only those values needed in the application.

For example, assume that we are using ISO-3166 country codes, but we are only interested in Europe. This would be a simple subset of the standard, but it will not change over time. However, the subset of countries with more than 20 cm of rain this year will vary greatly over time.

Changes in the name to reflect this will be accomplished by addition of qualifier terms to the logical name. For example, if an application of Country name were to list all the countries a certain organization had trading agreements with, the application data element would be called Trading partner country name. The data value domain would consist of a subset of countries listed in ISO-3166. Note that the qualifier term trading partner is itself an object class. This relationship could be expressed in a hierarchical relationship in the data model.

The physical name is the lowest level. These are the names which actually appear in the database columns and so forth. They may be abbreviations or use a limited character set because of software restrictions.

7) After we have a name, we need to worry about units of measure and scales. For example, a table with tires in it could have a column called "tire_width" -- I would need to know that it is in centimeters and not inches before I put data into it.
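Putting the unit in the column name and range-checking the values closes that gap. A hypothetical sketch in SQLite (the names and bounds are my illustrative assumptions, not from the post):

```python
import sqlite3

# Encode the unit of measure in the column name and enforce a sane
# range, so a width entered in inches cannot masquerade as centimeters.
# Column names and bounds here are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Tires
    (tire_name VARCHAR(20) NOT NULL PRIMARY KEY,
     tire_width_cm REAL NOT NULL
         CHECK (tire_width_cm BETWEEN 10.0 AND 50.0))""")

conn.execute("INSERT INTO Tires VALUES ('P225/60R16', 22.5)")  # 225 mm

try:
    conn.execute("INSERT INTO Tires VALUES ('bogus', 8.85)")   # inches!
except sqlite3.IntegrityError as err:
    print("rejected:", err)  # the CHECK constraint catches the mixup
```

The name tells the human the unit; the CHECK tells the machine the plausible range, so a value in the wrong unit bounces at insertion time instead of poisoning later arithmetic.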

How about this schema? I allow for multiple employees to be involved with a rate change, capture the history, prevent redundancies, and all identifiers can be verified against reality.

CREATE TABLE Rate_Revisions
(rate_name VARCHAR(20) DEFAULT '{{unknown}}' NOT NULL,
 rate FLOAT NOT NULL,
 rate_type INTEGER DEFAULT 1 NOT NULL,
 rate_start_date DATETIME DEFAULT CURRENT_TIMESTAMP NOT NULL,
 rate_end_date DATETIME,
 CHECK (rate_start_date <= rate_end_date),
 PRIMARY KEY (rate_name, rate_start_date));

CREATE TABLE Rate_Revision_Employees
(employee_id INTEGER NOT NULL
     REFERENCES Personnel (employee_id),
 rate_name VARCHAR(20) NOT NULL,
 rate_start_date DATETIME NOT NULL,
 FOREIGN KEY (rate_name, rate_start_date)
     REFERENCES Rate_Revisions (rate_name, rate_start_date)
     ON UPDATE CASCADE,
 PRIMARY KEY (employee_id, rate_name, rate_start_date));

Received on Fri Aug 08 2003 - 21:56:30 CEST
