Re: Lookup Tables, the right way?
Date: 14 Mar 2006 06:47:18 -0800
Message-ID: <1142347638.392514.68090_at_j33g2000cwa.googlegroups.com>
Like most new ideas, the hard part of understanding what the relational model is comes in un-learning what you know about file systems. As Artemus Ward (William Graham Sumner, 1840-1910) put it, "It ain't so much the things we don't know that get us into trouble. It's the things we know that just ain't so."
If you already have a background in data processing with traditional file systems, the first things to un-learn are:
(0) Databases are not file sets.
(1) Tables are not files.
(2) Rows are not records.
(3) Columns are not fields.
Modern data processing began with punch cards. The influence of the punch card lingered on long after the invention of magnetic tapes and disk for data storage. This is why early video display terminals were 80 columns across. Even today, files which were migrated from cards to magnetic tape files or disk storage still use 80 column records.
But the influence was not just on the physical side of data processing. The methods for handling data from the prior media were imitated in the new media.
Data processing first consisted of sorting and merging decks of punch cards (later, sequential magnetic tape files) in a series of distinct steps. The result of each step feed into the next step in the process. This leads to temp table and other tricks to mimic that kind of processing.
Relational databases do not work that way. Each user connects to the entire database all at once, not to one file at time in a sequence of steps. The users might not all have the same database access rights once they are connected, however. Magnetic tapes could not be shared among users at the same time, but shared data is the point of a database.
Tables versus Files
A file is closely related to its physical storage media. A table may or may not be a physical file. DB2 from IBM uses one file per table, while Sybase puts several entire databases inside one file. A table is a <i>set<i> of rows of the same kind of thing. A set has no ordering and it makes no sense to ask for the first or last row.
A deck of punch cards is sequential, and so are magnetic tape files. Therefore, a <i>physical<i> file of ordered sequential records also became the <i>mental<i> model for data processing and it is still hard to shake. Anytime you look at data, it is in some physical ordering.
The various access methods for disk storage system came later, but even these access methods could not shake the mental model.
Another conceptual difference is that a file is usually data that deals with a whole business process. A file has to have enough data in itself to support applications for that business process. Files tend to be "mixed" data which can be described by the name of the business process, such as "The Payroll file" or something like that.
Tables can be either entities or relationships within a business process. This means that the data which was held in one file is often put into several tables. Tables tend to be "pure" data which can be described by single words. The payroll would now have separate tables for timecards, employees, projects and so forth.
Tables as Entities
An entity is physical or conceptual "thing" which has meaning be itself. A person, a sale or a product would be an example. In a relational database, an entity is defined by its attributes, which are shown as values in columns in rows in a table.
To remind users that tables are sets of entities, I like to use collective or plural nouns that describe the function of the entities within the system for the names of tables. Thus "Employee" is a bad name because it is singular; "Employees" is a better name because it is plural; "Personnel" is best because it is collective and does not summon up a mental picture of individual persons.
If you have tables with exactly the same structure, then they are sets of the same kind of elements. But you should have only one set for each kind of data element! Files, on the other hand, were PHYSICALLY separate units of storage which could be alike -- each tape or disk file represents a step in the PROCEDURE , such as moving from raw data, to edited data, and finally to archived data. In SQL, this should be a status flag in a table.
Tables as Relationships
A relationship is shown in a table by columns which reference one or more entity tables. Without the entities, the relationship has no meaning, but the relationship can have attributes of its own. For example, a show business contract might have an agent, an employer and a talent. The method of payment is an attribute of the contract itself, and not of any of the three parties.
Rows versus Records
Rows are not records. A record is defined in the application program which reads it; a row is defined in the database schema and not by a program at all. The name of the field in the READ or INPUT statements of the application; a row is named in the database schema. Likewise, the PHYSICAL order of the field names in the READ statement is vital (READ a,b,c is not the same as READ c, a, b; but SELECT a,b,c is the same data as SELECT c, a, b.
All empty files look alike; they are a directory entry in the operating system with a name and a length of zero bytes of storage. Empty tables still have columns, constraints, security privileges and other structures, even tho they have no rows.
This is in keeping with the set theoretical model, in which the empty set is a perfectly good set. The difference between SQL's set model and standard mathematical set theory is that set theory has only one empty set, but in SQL each table has a different structure, so they cannot be used in places where non-empty versions of themselves could not be used.
Another characteristic of rows in a table is that they are all alike in structure and they are all the "same kind of thing" in the model. In a file system, records can vary in size, datatypes and structure by having flags in the data stream that tell the program reading the data how to interpret it. The most common examples are Pascal's variant record, C's struct syntax and Cobol's OCCURS clause.
The OCCURS keyword in Cobol and the Variant records in Pascal have a number which tells the program how many time a record structure is to be repeated in the current record.
Unions in 'C' are not variant records, but variant mappings for the same physical memory. For example:
union x {int ival; char j[4];} myStuff;
defines myStuff to be either an integer (which are 4 bytes on most modern C compilers, but this code is non-portable) or an array of 4 bytes, depending on whether you say myStuff.ival or myStuff.j[0];
But even more than that, files often contained records which were summaries of subsets of the other records -- so called control break reports. There is no requirement that the records in a file be related in any way -- they are literally a stream of binary data whose meaning is assigned by the program reading them.
Columns versus Fields
A field within a record is defined by the application program that reads it. A column in a row in a table is defined by the database schema. The datatypes in a column are always scalar.
The order of the application program variables in the READ or INPUT statements is important because the values are read into the program variables in that order. In SQL, columns are referenced only by their names. Yes, there are shorthands like the SELECT * clause and INSERT INTO <table name> statements which expand into a list of column names in the physical order in which the column names appear within their table declaration, but these are shorthands which resolve to named lists.
The use of NULLs in SQL is also unique to the language. Fields do not support a missing data marker as part of the field, record or file itself. Nor do fields have constraints which can be added to them in the record, like the DEFAULT and CHECK() clauses in SQL.
Relationships among tables within a database
Files are pretty passive creatures and will take whatever an application program throws at them without much objection. Files are also independent of each other simply because they are connected to one application program at a time and therefore have no idea what other files looks like.
A database actively seeks to maintain the correctness of all its data. The methods used are triggers, constraints and declarative referential integrity.
Declarative referential integrity (DRI) says, in effect, that data in one table has a particular relationship with data in a second (possibly the same) table. It is also possible to have the database change itself via referential actions associated with the DRI.
For example, a business rule might be that we do not sell products which are not in inventory. This rule would be enforce by a REFERENCES clause on the Orders table which references the Inventory table and a referential action of ON DELETE CASCADE
Triggers are a more general way of doing much the same thing as DRI. A trigger is a block of procedural code which is executed before, after or instead of an INSERT INTO or UPDATE statement. You can do anything with a trigger that you can do with DRI and more.
However, there are problems with TRIGGERs. While there is a standard syntax for them in the SQL-92 standard, most vendors have not implemented it. What they have is very proprietary syntax instead. Secondly, a trigger cannot pass information to the optimizer like DRI. In the example in this section, I know that for every product number in the Orders table, I have that same product number in the Inventory table. The optimizer can use that information in setting up EXISTS() predicates and JOINs in the queries. There is no reasonable way to parse procedural trigger code to determine this relationship.
The CREATE ASSERTION statement in SQL-92 will allow the database to enforce conditions on the entire database as a whole. An ASSERTION is not like a CHECK() clause, but the difference is subtle. A CHECK() clause is executed when there are rows in the table to which it is attached. If the table is empty then all CHECK() clauses are effectively TRUE. Thus, if we wanted to be sure that the Inventory table is never empty, and we wrote:
CREATE TABLE Inventory
( ...
CONSTRAINT inventory_not_empty
CHECK ((SELECT COUNT(*) FROM Inventory) > 0), ... );
it would not work. However, we could write:
CREATE ASSERTION Inventory_not_empty
CHECK ((SELECT COUNT(*) FROM Inventory) > 0);
and we would get the desired results. The assertion is checked at the schema level and not at the table level. Received on Tue Mar 14 2006 - 15:47:18 CET