Re: Approximate structural conditions in trees

From: Kai Großjohann <Kai.Grossjohann_at_CS.Uni-Dortmund.DE>
Date: 11 Jul 2002 08:07:16 -0500
Message-ID: <vafvg7mo61m.fsf_at_lucy.cs.uni-dortmund.de>


hidders_at_hcoss.uia.ac.be (Jan.Hidders) writes:

> In article <vafelejctqg.fsf_at_lucy.cs.uni-dortmund.de>,
> <Kai.Grossjohann_at_CS.Uni-Dortmund.DE> wrote:
>>
>>I'm aware that there is some work on similarity between trees (one
>>approach is the tree editing distance), but some things are not clear
>>to me:
>>
>>(1) A query is not the same as a complete tree. For instance, a
>> query is much smaller than a complete tree.
>
> That's not a problem. You simply change the query from "look for documents
> similar to this pattern" to "look for subgraphs/trees in document that are
> similar to this pattern".

Okay, that works.

>> Also, a query could contain other kinds of conditions. For
>> example, what if the query says to search for A elements which
>> have either a B child or a C child? It's clear that this does
>> not map to a simple tree.
>
> If the user wants to ask a complicated question then he or she should not
> use a fuzzy pattern matching query mechanism.

No? Why not?

I'm quite sure it is a good idea to allow complicated questions in a fuzzy way.

>> So it is problematic to apply a similarity measure between trees
>> if one tree could be very different from the other one, or maybe
>> one tree is not a tree at all!
>
> No problem, you then take "graph edit distance".

I didn't mean that the query is a graph. I meant that the query is something which can't be mapped to a (single) subgraph/subtree of a document. Because of the inclusion of Boolean connectives, for instance.

>>(2) Is the tree edit distance a good measure of the users' idea of
>> relevance? If one tree has a low distance and the other has a
>> high distance, will the user really prefer the tree with the low
>> distance?
>
> I don't think "the user" exists. :-) That means that you shouldn't try to
> come up with the ultimate tool that does exactly what "the user" wants. In
> stead try to come up with different tools with a well-defined and
> predictable behavior such that different users can select the tools that
> suit them best.

In Information Retrieval, we have probabilistic retrieval models which give a theoretic foundation for the computation of the scores. There are some (verifyable) assumptions (though they are approximations only), and from this you can deduce the probability of relevance of the document. You don't need to question whether the probability has been computed correctly: if the assumptions are right, then the result is correct, too.

So if the system performs badly, you can just verify all assumptions in turn until you find the assumption which is responsible for the bad results. (For some value of "just" -- it's clear that it is a lot of work to verify the assumptions.)

It's not clear to me that there are verifyable assumptions behind the tree edit distance computation. You assign "costs" to the deletion of a node or to the insertion of a node, but if the result turns out to be bad (in user studies), you have no idea which of the twenty knobs needs twiddling.

>>Is there work on this? What are the keywords to feed to Google for
>>this?
>
> "query semistructured data by example"

Thanks.

kai

-- 
A large number of young women don't trust men with beards.  (BFBS Radio)
Received on Thu Jul 11 2002 - 15:07:16 CEST

Original text of this message