Approximate structural conditions in trees

From: <Kai.Grossjohann_at_CS.Uni-Dortmund.DE>
Date: 9 Jul 2002 22:55:23 -0500
Message-ID: <vafelejctqg.fsf_at_lucy.cs.uni-dortmund.de>



Suppose I have trees (ordered node-labeled trees, vulgo XML documents, to be precise) and I want to search them for some structural conditions. For example, I might want to search nodes labeled A which have child-nodes labeled B.

Now I assume that the queries are not precise since the user formulating them does not know what the trees look like. In our example, A could be a grandparent of B but the system should still find it.

Another example would be that the user looks for something like this:

    A
   / \
  B B

But the tree looks like this:

    B
   / \
  A A

Maybe this tree should still be found.

Now a number of questions arise:

  • Which other structural conditions could a user formulate, and which kinds of "relaxations" do we want to allow that are still to be found?
  • Can we quantify the goodness of "near matches", so that we can say which tree is "better"? The ideal approach would be to consider that the user judges trees as relevant or not and to have the system estimate that probability. Of course, hundreds of similarity measures could be defined easily, but how do they relate to the user? Which ones are a good approximation of the user's idea of relevance?

I'm aware that there is some work on similarity between trees (one approach is the tree editing distance), but some things are not clear to me:

(1) A query is not the same as a complete tree. For instance, a

    query is much smaller than a complete tree. Consider that the     trees are books in XML format, and that the query might be to     search for chapters with a certain heading. Then the query just     contains two nodes, but the documents will presumably contain     thousands of nodes.

    Also, a query could contain other kinds of conditions. For     example, what if the query says to search for A elements which     have either a B child or a C child? It's clear that this does     not map to a simple tree.

    So it is problematic to apply a similarity measure between trees     if one tree could be very different from the other one, or maybe     one tree is not a tree at all!

(2) Is the tree edit distance a good measure of the users' idea of

    relevance? If one tree has a low distance and the other has a     high distance, will the user really prefer the tree with the low     distance?

Is there work on this? What are the keywords to feed to Google for this?

kai

-- 
A large number of young women don't trust men with beards.  (BFBS Radio)
Received on Wed Jul 10 2002 - 05:55:23 CEST

Original text of this message