Re: abstraction of table relationships

From: Alexandr Savinov <spam_at_conceptoriented.com>
Date: Thu, 07 Sep 2006 11:34:21 +0200
Message-ID: <edop32$mea$1_at_f1node01.rhrz.uni-bonn.de>


Marshall schrieb:
> Every once in a while I come across someone who has an idea for
> how to "improve" relational queries in various ways. Sometimes they
> have the idea that there is some value to abbreviating or even omitting
> the relationships between tables to make queries shorter. I just want
> to go on record as to why this is 1) unnecessary and 2) harmful.
>
> The person I spoke with today proposed that one fault of SQL is
> that common patterns, particularly in the area of cardinality
> relationships between table rows, have no abstraction mechanism.
> That is, we have common patterns in which we express cardinality
> relationships via structural means within the schema. A one-to-many
> relationship between table A and table B is expressed as a foreign
> key from B to A. A many-to-many relationship is expressed as
> a separate "join table" with foreign keys to table A and to table B.
> Etc. Learning these simple patterns is an important part of learning
> how to take good advantage of SQL.
>
> This person went on to say that it would be a good thing if we
> had some way to abstract these patterns and then refer to
> the abstractions instead of having to repeat the full pattern.
> I actually agree with this part, and in some other thread I
> would like to discuss abstraction mechanisms for use within
> schemas. However, what I wish to discuss here is his
> further point. He gave the example of Books and Authors.
> If they have a many-to-many relationship, he said it would
> be better if one could query all books by a given author
> in a simpler way, and have the system find the relationship
> automatically. So you could, say, simply:
>
> SELECT * from Books where AuthorId = 'Jane Austen';
>
> even if Author weren't a field of Books. The system
> would look at the schema for Books, and locate the
> "path" in the system whereby it could uniquely attribute
> an Author.
>
> First of all, this idea is misguided on the basis of the
> fact that these "paths" are logical relationships.
>
> But there is a much worse problem with modularity.
> I rewrote his schema such that there are tables
> Books and People, and a many-to-many table Authorship.
> Consider the query analogous to the above, where
> we query all books by selecting a primary key
> value in the People table. The "path" between
> Books and People is unique, so the system can
> easily fill it in. Fine. But now I come along later
> and want to record the various people who worked
> on the book in the role of editor, so I add a
> many-to-many table for Editorship. And voila!
> The earlier authorship query will *no longer work*
> because the system cannot *uniquely* identify a
> "path" between People and Books.

Yes, you are right. It is a problem of such general queries because they assume some certain structure of relationships.

> If we adopt this idea, then even such currently-harmless
> schema changes as adding a table can break arbitrary
> queries, possibly on tables we don't even know about.
> This increases the coupling within the system, and
> reduces modularity. Bad! Bad dog! No biscuit!
>
> This problem is actually quite severe. It means that every
> time we add a table or even an attribute, we have
> to audit every query in the system to see if it breaks.
> In a distributed computing world, we may not even
> have access to all of the client code.

Yes, it is a serious problem and therefore such queries should be applied very carefully. Actually, they are intended more for interactive use rather than to be hard-coded into a system. For example, an accountant might want to get some info from a database (to generate a report) and in this case such query could be very useful. In the case of ambiguity the system could ask for more information by choosing among available options in the path.

> In fact, the better solution to this problem is making better
> use of natural join. Schema design can and should be
> done with natural join in mind. Specifically, if there is
> any kind of key relationship between two tables, then
> the keys should have the same names, and the table
> should not have other, non-relating attribute names in
> common. This is actually quite trivial to do. (It's not
> possible in *every* case but it is certainly possible in
> a strong majority.)
>
> At that point, the books per author query becomes:
>
> Select * from Books natural join Authorship natural join
> Persons where PersonId = 'Steven King';
>
> Quite simple. And the editors query is:
>
> Select * from Books natural join Editorship natural join
> Persons where PersonId = 'George R. R. Martin';
>
> In fact, the above queries make quite clear what the inter-table
> roles and relationships are, whereas omitting those relationships
> leaves this question quite unclear.
>

I would not qualify this solution as better or worse -- it is one possible solution. It is better because it is less ambiguous but it is worse because it is more complex. Using natural joins makes the situation with dependence somewhat better but not too much because your queries are still dependent on future changes in the schema. In particular, this query can become either ambiguous or simply invalid if we add/remove/update attributes or tables.

> This does not address how to handle multiple foreign
> keys to the same table. In that case it is necessary
> to rename the key attribute in the foreign table before
> joining, and one can do so more than once and
> it again makes the relationship quite clear. The other problem
> is accidental attribute "capture" in the natural join conditions.
> (That is, if two tables being natural-joined have a non-key
> attribute in common that we didn't notice, we overly restrict
> our join condition.) This is best handled, again, by avoiding it
> at schema design time. (But maybe it could be made a warning.)

You touched a rather interesting. Actually, your solution is very comparable with what is proposed by the person you talked to. In particular, they have many in common and very similar problems.

Generally, possible solutions can be classified as follows:

  1. Specifying only the starting and ending points while the path is reconstructed automatically. For example, SELECT * from Books where AuthorId = 'Jane Austen'
  2. Specifying some intermediate points in the path so that it can be reconstructed unambiguously. For example; SELECT * from Books where AuthorId = 'Jane Austen' via BooksAuthors. Here we use keyword 'via ' to specify an intermediate table.
  3. Specifying all points in the path in order to make the query even more stable. For example, using natural join, it can be written as follows: Select * from Books natural join Authorship natural join Persons where PersonId = 'Steven King';
  4. Specifying for 3 and 4 some or all attribute names. For example, in the concept-oriented model (CoM) it could be written as follows: {Authors WHERE Name= 'Jane Austen'} -> {BooksAuthors->author} -> book Here we de-project the selected authors to BooksAuthors and then project this result to Books. Notice that we specify attribute names but the intermediate tables sometimes can be omitted.
  5. Specifying how tables have to be joined in addition to attribute names (explicit joins).

Your mentioned only one problem which arises in this situation, namely, ambiguity of the path. It is not actually a problem because we do it deliberately, i.e., if we need simplicity then we pay for that by some ambiguity which has to be resolved (including natural joins), and if we want to have precise query then we need to provide more information.

However, the task of automatic constraint propagation has much more interesting and complex problems. Assume that you have an arbitrary graph of tables. Some constraints are imposed on one or more source tables. How to propagate these constraints to the target table in order to get consistent and meaningful results? So I would formulate the problem as an automatic constraint propagation rather than abstraction of table relationships (as written in the subject). Missing intermediate tables or attributes can be easily provided in the query as parameters while propagating constraints meaningfully requires understanding what the data semantics is.

> Marshall
>

--
http://conceptoriented.com
Received on Thu Sep 07 2006 - 11:34:21 CEST

Original text of this message