Re: Approximate structural search in trees

From: Kai Großjohann <Kai.Grossjohann_at_CS.Uni-Dortmund.DE>
Date: Sun, 07 Jul 2002 18:55:03 +0200
Message-ID: <vafbs9j8om0.fsf_at_lucy.cs.uni-dortmund.de>


JXSternChangeX2R_at_gte.net (JRStern) writes:

> On Fri, 05 Jul 2002 19:03:34 +0200, Kai.Grossjohann_at_CS.Uni-Dortmund.DE
> (Kai =?iso-8859-15?q?Gro=DFjohann?=) wrote:
>>As an example, suppose the query is looking for a node labeled A
>>which has a sibling labeled B. Now suppose there is no tree which
>>has this, but there is a tree where a node labeled A has a cousin
>>labeled B. Then I'd like the query to find this tree.
> ...
>
> I don't have any handy literature pointers, but I'll pay you a
> cyber-nickel if you can give a few details on just what any of this
> buys in applications terms.

Well, the application I have in mind is to search XML documents. Now one common query will be, I guess, for documents from a specific author. In my query language (derived from XPath) this looks like

    /article[author sounds_like "smith"]

For testing, we bought CD-ROMs from the IEEE Computer Society (with the fulltext articles from 1995 to 1997, in XML format). Some articles have an element "fm" (presumably, front matter) in there, so that the correct query would have to be

    /article[fm/author sounds_like "smith"]

Some articles use `bm' (body matter?) to surround the text of the article, other articles use `bdy' (body?).

On the one hand, you could require users to know all about the structure. But on the other hand, the average length of a path from the root node to a text node (leaf) in the tree is 5 or 6 in these documents. So the potential for the user to make mistakes is quite big.

> Would it not be likely that the scoring metric in comparison would
> have something to do with the semantics of the application, as well as
> the tree/syntax?

Yes, indeed. In this application, the presence/absence of the `fm' element doesn't make much difference, but surely there are other applications where an element named `fm' is more important.

So maybe deleting a node labeled A has a different effect on scoring than deleting a node labeled B. But I'm at an earlier stage: I want to know which kinds of trees are similar in general, and how do I compute the similarity between a query and a tree if the query does not precisely match the tree structure?

kai

-- 
A large number of young women don't trust men with beards.  (BFBS Radio)
Received on Sun Jul 07 2002 - 18:55:03 CEST

Original text of this message