Re: Semistructured data

From: Sampo Syreeni <decoy_at_iki.fi>
Date: Fri, 27 Oct 2006 12:03:21 +0300
Message-ID: <Pine.SOL.4.62.0610271046590.17989_at_kruuna.helsinki.fi>


On 2006-10-26, Marshall wrote:

>> [...] the final, disjunction aware database ends up being emulated on
>> top of the RM.
>
> Can you give an example?

Take the medical expert system. Many of the diagnostic rules they store are disjunctions, because from a limited set of symptoms and diagnostics you can predict a number of different illnesses. (What makes a good expert system is that it can do heuristic and statistical inference and use them to ask the right questions so that the alternatives are rapidly and cheaply narrowed down.) The system might also contain rules which enable it to second guess test results and the like.

Take the normal semantics of the relational model. Assume we have two predicates, p(x,y) and q(z), meaning patient x has disease y and test z has failed, respectively. It's entirely possible that some rule in the system has as its consequent p(x,y) or q(z), and that you want to be able to store that in the database as soon as the antecedent of the rule can be evaluated, so that the next time the patient comes in, the doctor can consider whether to diagnose x with y or to order z to be redone.

If you use two vanilla tables to store these predicates, you'll get the wrong semantics. Inserting (x,y) into p and (z) into q means that the patient both has the disease and that the test failed.

You can still fit the stuff into RM. Two alternatives that immediately come to mind is to store all the disjunctive stuff as clear text logical formulae -- really nasty -- or to reify p and q and point. In the latter, instead of taking (x,y) in p to mean the fact that x has disease y, you interpret it to mean the corresponding statement which may or may not be true, and assign an id to it. Then you have secondary relvars containing the structure of the formulae, e.g. r(x,y) meaning formula x is parent of formula or statement y and q(x,z) where x is formulae and z is the logical connective applied to its children. Now you can represent what was originally p(x,y) as {p(f1,x,y), r(f2, f1), q(f2, and)} and the disjunction as {p(f1,x,y), q(f2,z), r(f3,f1), r(f3,f2), q(f3,or)}. (This is not the only design possible and probably not even the best, but it suffices to illustrate the point.)

The problem is that the database has no idea about what this all means. When asked, it can easily list all of the statements and formulas, but in order to get out only the true facts, you'll need extensive reasoning capability on top. The data model that lives inside the database is far more complex than the fragment of propositional logic the RM captures, so the RM is mostly blind to it and is only used to emulate the real thing.

Finally, the database isn't really even equipped to deal with the pattern of queries that result. A typical query against this sort of database would ask which illnesses are still possible after considering some symptoms, the tests done, the medication in effect and the patient's history. In this representation you can still apply single table recursive joins (of the bill-of-materials type) to get the relevant formulae and join everywhere from there, but both the number of columns and the number of rows in such queries suffer from combinatorial explosion. Then there are going to be predicates describing incompatible medications, side effects, afterillnesses, combined risk factors and whathaveyou. Those translate into path queries and mutually recursive transitive joins, with intermediate results that are going to be huge unless the database actually understands the logic and can prune the search intelligently. The only way to implement something reasonable on top is to have the logic middleware navigate the database, but then high performance navigation is again something that current relational software doesn't support as well as, say, object frameworks.

-- 
Sampo Syreeni, aka decoy - mailto:decoy_at_iki.fi, tel:+358-50-5756111
student/math+cs/helsinki university, http://www.iki.fi/~decoy/front
openpgp: 050985C2/025E D175 ABE5 027C 9494 EEB0 E090 8BA9 0509 85C2
Received on Fri Oct 27 2006 - 11:03:21 CEST

Original text of this message