Path: news.easynews.com!easynews!news-out.visi.com!hermes.visi.com!eusc.inter.net!Informatik.Uni-Dortmund.DE!not-for-mail
From: Kai.Grossjohann@CS.Uni-Dortmund.DE (Kai
 =?iso-8859-15?q?Gro=DFjohann?=)
Newsgroups: comp.databases.theory
Subject: Re: Approximate structural search in trees
Date: Sun, 07 Jul 2002 18:55:03 +0200
Organization: University of Dortmund, Germany
Lines: 53
Message-ID: <vafbs9j8om0.fsf@lucy.cs.uni-dortmund.de>
References: <vaflm8q2jkp.fsf@lucy.cs.uni-dortmund.de> <3d265250.42819411@news.verizon.net>
NNTP-Posting-Host: lucy.cs.uni-dortmund.de
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Mail-Copies-To: never
User-Agent: Gnus/5.090007 (Oort Gnus v0.07) Emacs/21.3.50
 (i686-pc-linux-gnu)
Cancel-Lock: sha1:5/M/FGG7e2Z20cJQ3MV7IeMPfEY=
Xref: easynews comp.databases.theory:21659
X-Received-Date: Sun, 07 Jul 2002 10:01:23 MST (news.easynews.com)

JXSternChangeX2R@gte.net (JRStern) writes:

> On Fri, 05 Jul 2002 19:03:34 +0200, Kai.Grossjohann@CS.Uni-Dortmund.DE
> (Kai =?iso-8859-15?q?Gro=DFjohann?=) wrote:
>>As an example, suppose the query is looking for a node labeled A
>>which has a sibling labeled B.  Now suppose there is no tree which
>>has this, but there is a tree where a node labeled A has a cousin
>>labeled B.  Then I'd like the query to find this tree.
> ...
>
> I don't have any handy literature pointers, but I'll pay you a
> cyber-nickel if you can give a few details on just what any of this
> buys in applications terms.

Well, the application I have in mind is to search XML documents.  Now
one common query will be, I guess, for documents from a specific
author.  In my query language (derived from XPath) this looks like

    /article[author sounds_like "smith"]

For testing, we bought CD-ROMs from the IEEE Computer Society (with
the fulltext articles from 1995 to 1997, in XML format).  Some
articles have an element "fm" (presumably, front matter) in there, so
that the correct query would have to be

    /article[fm/author sounds_like "smith"]

Some articles use `bm' (body matter?) to surround the text of the
article, other articles use `bdy' (body?).

On the one hand, you could require users to know all about the
structure.  But on the other hand, the average length of a path from
the root node to a text node (leaf) in the tree is 5 or 6 in these
documents.  So the potential for the user to make mistakes is quite
big.

> Would it not be likely that the scoring metric in comparison would
> have something to do with the semantics of the application, as well as
> the tree/syntax?

Yes, indeed.  In this application, the presence/absence of the `fm'
element doesn't make much difference, but surely there are other
applications where an element named `fm' is more important.

So maybe deleting a node labeled A has a different effect on scoring
than deleting a node labeled B.  But I'm at an earlier stage: I want
to know which kinds of trees are similar in general, and how do I
compute the similarity between a query and a tree if the query does
not precisely match the tree structure?

kai
-- 
A large number of young women don't trust men with beards.  (BFBS Radio)