abstraction of table relationships

From: Marshall <marshall.spight_at_gmail.com>
Date: 6 Sep 2006 21:57:52 -0700
Message-ID: <1157605072.761613.23760_at_p79g2000cwp.googlegroups.com>



Every once in a while I come across someone who has an idea for how to "improve" relational queries in various ways. Sometimes they have the idea that there is some value to abbreviating or even omitting the relationships between tables to make queries shorter. I just want to go on record as to why this is 1) unnecessary and 2) harmful.

The person I spoke with today proposed that one fault of SQL is that common patterns, particularly in the area of cardinality relationships between table rows, have no abstraction mechanism. That is, we have common patterns in which we express cardinality relationships via structural means within the schema. A one-to-many relationship between table A and table B is expressed as a foreign key from B to A. A many-to-many relationship is expressed as a separate "join table" with foreign keys to table A and to table B. Etc. Learning these simple patterns is an important part of learning how to take good advantage of SQL.

This person went on to say that it would be a good thing if we had some way to abstract these patterns and then refer to the abstractions instead of having to repeat the full pattern. I actually agree with this part, and in some other thread I would like to discuss abstraction mechanisms for use within schemas. However, what I wish to discuss here is his further point. He gave the example of Books and Authors. If they have a many-to-many relationship, he said it would be better if one could query all books by a given author in a simpler way, and have the system find the relationship automatically. So you could, say, simply:

  SELECT * from Books where AuthorId = 'Jane Austen';

even if Author weren't a field of Books. The system would look at the schema for Books, and locate the
"path" in the system whereby it could uniquely attribute

an Author.

First of all, this idea is misguided on the basis of the fact that these "paths" are logical relationships.

But there is a much worse problem with modularity. I rewrote his schema such that there are tables Books and People, and a many-to-many table Authorship. Consider the query analogous to the above, where we query all books by selecting a primary key value in the People table. The "path" between Books and People is unique, so the system can easily fill it in. Fine. But now I come along later and want to record the various people who worked on the book in the role of editor, so I add a many-to-many table for Editorship. And voila! The earlier authorship query will *no longer work* because the system cannot *uniquely* identify a
"path" between People and Books.

If we adopt this idea, then even such currently-harmless schema changes as adding a table can break arbitrary queries, possibly on tables we don't even know about. This increases the coupling within the system, and reduces modularity. Bad! Bad dog! No biscuit!

This problem is actually quite severe. It means that every time we add a table or even an attribute, we have to audit every query in the system to see if it breaks. In a distributed computing world, we may not even have access to all of the client code.

In fact, the better solution to this problem is making better use of natural join. Schema design can and should be done with natural join in mind. Specifically, if there is any kind of key relationship between two tables, then the keys should have the same names, and the table should not have other, non-relating attribute names in common. This is actually quite trivial to do. (It's not possible in *every* case but it is certainly possible in a strong majority.)

At that point, the books per author query becomes:

  Select * from Books natural join Authorship natural join     Persons where PersonId = 'Steven King';

Quite simple. And the editors query is:

  Select * from Books natural join Editorship natural join     Persons where PersonId = 'George R. R. Martin';

In fact, the above queries make quite clear what the inter-table roles and relationships are, whereas omitting those relationships leaves this question quite unclear.

This does not address how to handle multiple foreign keys to the same table. In that case it is necessary to rename the key attribute in the foreign table before joining, and one can do so more than once and it again makes the relationship quite clear. The other problem is accidental attribute "capture" in the natural join conditions. (That is, if two tables being natural-joined have a non-key attribute in common that we didn't notice, we overly restrict our join condition.) This is best handled, again, by avoiding it at schema design time. (But maybe it could be made a warning.)

Marshall Received on Thu Sep 07 2006 - 06:57:52 CEST

Original text of this message