Re: DB schema -> graph, good strategies

From: Martin Christensen <knightsofspamalot-factotum_at_gvdnet.dk>
Date: Mon, 28 Oct 2002 20:09:01 +0100
Message-ID: <87smyqwfci.fsf_at_gvdnet.dk>


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

>>>>> "Jan" == Jan Hidders <hidders_at_REMOVE.THIS.uia.ua.ac.be> writes:
Jan> I see. And the basic idea is rather simple, isn't it; if you are
Jan> looking for a set of key words then they don't all have to be in
Jan> the same column or the same tuple, but they can also be in
Jan> related tuples where relatedness is measured in the number of
Jan> foreign-key jumps. Looks very much like taking proximity into
Jan> account, only generalized to more than two dimensions. I wonder
Jan> if you could use the same techniques.

I would be shooting myself in the food if didn't, now wouldn't I? :-) Considering only single-tuple search results is next to useless in most production environments.

[Directionality of schema graphs]

Jan> Now that I know that I would say that in general the direction
Jan> doesn't matter. If two tuples are related then it wouldn't make
Jan> sense to say that the first is related to the second but the
Jan> second not to the first. Having said that I could imagine that
Jan> for certain relationships where you know what they mean
Jan> (e.g. "contains", or "links to") there could be some asymmetry,
Jan> but that would be a semantic issue and not something you could
Jan> tell from looking at the schema and certainly not the direction
Jan> of the foreign key reference.

This is exactly the core issue I'd like to explore further, although I haven't been able to find an obviously good place to start. Given the directionality fk -> pk, a graph

A <- B -> C

with a schema like A: (foo, bar, id1), B: (id1, id2), C: (baz, quux, id2) would obviously very likely be a plain many-to-many relationship between A and C. In most cases it would make more sense for the relationship between A and C in this case to be closer than if the graph had been A -> B -> C. How can this knowledge be exploited? And in which areas? Interesting stuff.

Jan> The same would probably be true for the importance of certain
Jan> relationships (some might be "small jumps" and others could be
Jan> "big jumps") but also that is a semantic issue.

Exactly. This way the schema graph could be made weighted. They take a primitive approach to this in the BANKS paper I mentioned the other day; that is to say their consideration of directionality is somewhat on the simple side, but am not evaluating the other mechanisms they use for weighting their graphs.

>> Imagine we have some ER model that represents our data as we prefer
>> to think of it. Anyone with basic knowledge of databases know that
>> when transfering this model to a relational database we need to do
>> data normalisation.
Jan> I'm not so sure of that. It is my experience that if you make a Jan> good ORM / ER diagram you hardly ever need to normalize.

That depends on who makes them. If you get a DBA to make them, then they'll likely need less normalisation than if you get non-techie business folks to do it. My main point, however, remains the same: that most of the people who need to query databases will think of data in rather more abstract terms than what is actually implemented. And they don't find schema, left inner joins and arcane text-based interfaces as fascinating as we do. :-)

Jan> To the extent that it can be done that's actually quite
Jan> simple. What you want is to find out whether a table corresponds
Jan> to an entity type, a relationship type or (if you allow those) a
Jan> mixed type. You can scratch possibilities using the rules that
Jan> hold for the ER model you use, until no more possibilities can be
Jan> scratched. For example it is true in almost all models that
Jan> foreign keys only arrive in entity types and mixed types but
Jan> never in relationship types.

This assumes that we have the original model, which is an assumption I'm not overly fond of.

But I caught myself not asking a very fundamental question: assuming we could recreate the original model, what then? Is it just good for ranking and/or crawling?

Jan> This gets more difficult/ambiguous if you have rules that say
Jan> something like you cannot have cycles with at least on entity
Jan> type in them.

I'm afraid I don't understand what you mean.

Martin

iEYEARECAAYFAj29i00ACgkQYu1fMmOQldU0FQCeNzgkNDW+nuMYQS94axBiU0kq Cj4AoOVgUC6BoOG02XUZ5OKd54anskoh
=IoSe
-----END PGP SIGNATURE----- Received on Mon Oct 28 2002 - 20:09:01 CET

Original text of this message