Re: O'Reilly interview with Date

From: Kenneth Downs <knode.wants.this_at_see.sigblock>
Date: Sat, 13 Aug 2005 22:54:41 -0400
Message-Id: <cao2t2-g8v.ln1_at_pluto.downsfam.net>


Paul wrote:

> Kenneth Downs wrote:

>>>OK but if you have to specify the foreign key, why not just specify the
>>>columns in the firstplace?
>> 
>> Ah, this question gets to the heart of things.
>> 
>> First is the dogmatic answer.  The dogmatic answer is that well-designed
>> tables will always join by keys.  The key (or keys) defines the only
>> meaningful linking of the two tables in normal operations.  Therefore you
>> specify the key because that is how they join, it is actually unnatural
>> and downright incorrect to specify columns.

>
> Is that definitely true? I agree that there should be (or could be) a
> constraint for every join you might want to do, but can this be narrowed
> down further to a foreign key constraint?

Well, now you've challenged me to come up with a better answer than "I sure think so", but right now that's all I've got. The reasoning goes something like this.

First we divide all columns into two types. The first is of no interest to the JOIN question, but we'll identify them and get them out of the way. These are purely descriptive columns like a JPEG blob or a VENDOR_REMITTO that has to be put onto checks but is otherwise of no interest. Obviously we would not use these as keys and any joining on them would hardly be a regular course of affairs.

This leaves all of the rest of the columns in the universe. I would contend that each of these columns has exactly one table where this column is the unique key. If this is true, then that means that the column's presence in all other tables is as a foreign key either to its own table or to some intermediate table where it is part of a compound key.

So far this does not prove the point, but it lays the groundwork for the idea that all joinable columns exist in tables by virtue of being foreign keys. The proof we are then seeking is that only the foreign key joins are meaningful. And by meaningful I mean containing rows whose properties are truly attributes of the key of the resulting table.

To disprove the idea, you'd have to have an example of a JOIN that produced true results between two tables where both tables contained the same column but where the tables were not joined by a foreign key. One example might be a job-based workflow system, where all invoices and production orders are tied together through a job control system. Both Sales Orders and Purchase Orders are children of the JOBS table, so each of them contains the JOB_NO column, but they are not related to each other by a key. So you can join them, but to what purpose? You get a meaningless jumble of Sales order and purchase order properties. A more proper relationship between these two tables would be a UNION, which might give the meaningful result of showing activity over time. In fact there is probably another thesis in here that two children of a single parent relate as UNIONs.

Well this is going to bug me until I can provide or find a rigorous proof one way or another.

>

>>>What if you have a non-standard join? For example start_date and
>>>end_date columns and you want to join where some date is between these
>>>two?
>> 
>> This is not actually non-standard.  It only appears so because there is
>> no
>> real support for intervals and so we all code intervals manually.  But if
>> start_date and end_date columns were defined as an interval, then
>> comparing the interval to a single date implies "between", just as a
>> unique constraint on the interval implies "does not overlap", and the
>> interval definition itself generates the constraint end_date >=
>> start_date.

>
> So you're saying there is a kind of lop-sided "equality" defined with
> one operand being a date and the other an interval? And date = interval
> iff the date is contained within the interval?

Almost but not quite.

I am saying that a foreign key constraint defines which values are allowed in a foreign key by examining the primary key of the parent. While simple equality is valid for atomic values, other things like ranges have some wonderful, rigorous, and powerful abilities when expressed in foreign keys.

>
> Is it OK to define equality in a non-symmetrical and non-transitive way
> like this, with different types?

I would ask the question, does it produce correct results? Or perhaps, results that are consistent and predictable, to which the answer is yes.

Consider the statement:

The column SALES_ORDERS.JOB_NO is constrained by the values of JOBS.JOB_NO

and then consider this statement:

The column APPT.DATE is constrainted by the values of interval SHIFTS(DATE_BEG,DATE_END)
>
> I agree a lot of non-standard joins are due to badly-designed schemas,
> but I can think of examples where you might want to define a join with a
> "greater than" clause - for example when joining a table to itself to
> get some kind of ranking number. (How many other salespersons have total
> sales greater than each salesperson?) I can't see how things can be
> designed to get around this.

Hmmm, I actually need an answer to this myself. This is definitely a segue, but how does one express that a Foreign Key forms a hierachy? which is to say that it may be repeating? It can be discovered by analysis because such things lead to circular dependencies in the tables. A hierarchy might be thought of as a deliberate circular dependency. Well, I'm rambling, I'm going to dro this line.

>
> Here's another possible scenario for the date range example:
> I might want to make one join based on the start_date only, and a
> different join based on the start/end interval. Maybe things starting on
> certain days get a price reduction or something. Should this be modelled:
>
> a) as two columns: start_date and end_date? Then you have the problem of
> needing "between" joins as above.

Well to go back to my original idea, a NATURAL JOIN is a JOIN where the key is all that is needed and the query engine can figure out the rest. Seems to me we can say:

UNIQUE on an interval: requires no overlapping values JOIN/FK atomic to interval: atomic must be within interval JOIN/FK interval to interval: Perhaps cannot be defined because it is not transitive. Do we say child must be within parent?

> b) a single interval column? Then you have a problem joining to the
> start date of the interval. The interval is atomic, but you could have a
> start() function that picks out the start of the interval. Even so, this
> ceases to be a standard join.

Not really, it is standard for the interval type. Joining single to interval says single must be within the interval. At least this is what you come up with if you are thinking of expressing intervals as uniques and references.

-- 
Kenneth Downs
Secure Data Software, Inc.
(Ken)nneth_at_(Sec)ure(Dat)a(.com)
Received on Sun Aug 14 2005 - 04:54:41 CEST

Original text of this message