Discovering new relationships

From: Walt <wamitty_at_verizon.net>
Date: Fri, 02 Mar 2007 20:01:50 GMT
Message-ID: <Oo%Fh.2$Tf.1_at_trndny03>



It seems to me that, sometimes, relationships in data are discovered by other people than the ones who initially tabulated the data. This is tangentially related to the discussion on constraints and functional dependencies.

Let me start with a couple of examples.

The first example involves a scenario that I'll call the "case of the missing validation table".

We've got a schema with tables about students, and tables about campuses for a university.
Somebody comes up with this query (apologies for using SQL).

select s.first_name, s.last_name, c.campus_name from students as s, campuses as c
where s.zip_code = c.zip_code

I have no idea why someone suddenly decided this query was useful, but I'm going to suggest that the theta operator between the two zip_code columns is entirely within the meaning of the data. Intuiting that the zip_code columns are some sort of reference, we look around for a table whose primary key is zip_code, only to find that there isn't one! At least, there isn't one in this schema.

It turns out that our database doesn't distinguish between valid zipcodes and invalid ones, and doesn't have any constraints in that regard. Of course there's a table of all valid zip_codes, somewhere in the US postal services databases, but that's outside our universe of discourse.

Now, this join doesn't necessarily conform to Marshall's referential integrity constraint. There could be students with no matching campuses, or campuses with no matching students. Because the realtionship is many-to-many, there could also be multiple result rows with the same student or the same campus. As a side issue, two different students could have the same first_name and last_name, but I'm not dealing with that.

What I'm going to say about this is that both s.zip_code and c.zip_code are "dangling foreign keys". They are foreign keys to a primary key that isn't defined in this schema. And I'm going to suggest that this kind of join is occasionally legitimate, and that there is no mathematical reason to disallow it.

The second example is a little farther afield. We have two relational tables R and S that have no domains in common. That is, the intersection of (the set of the domains of all the attributes of R and the set of the domains of all the attributes of S) is empty. It looks like we can't find any basis for naturally joining the two tables.

But let's says that someone comes along and discovers (or invents, if you prefer) a pair of functions, f and g such that the range of the two functions (the output domain) is the same, but f takes an input from a domain that's the same as one of the attributes of R (let's call it "a") and g takes an input from a domain that's the same as one of the sttributes of S (let's call it "b"). Addtionally, let's say that the dialect of SQL allows user defined functions, and that f and g are defined this way.

Now we've got the following join.

select R.*, S.*
from R, S
where f(R.a) = g(S.b)

Again, my apologies for using SQL notation rather than mathematical notation.

Now we've got a join that the designer(s) of R and S might not have contemplated. The question is, is such a join "natural" or not? Down the road, I expect this discussion to have a lot to do with the usefulness of the relational data model in data intergration efforts, but I can't quite put it into words yet.

I think the two examples where I used SQL notation would be clearer if they were in mathematical notation, but I'm too timid to try. Can anybody help? Received on Fri Mar 02 2007 - 21:01:50 CET

Original text of this message