Re: Keyword searches

From: Rico <TrooperRico_at_bugplanet.com>
Date: Sun, 15 Jul 2001 22:00:22 -0400
Message-ID: <uSr47.10648$P%2.10155_at_newsfeed.slurp.net>


"Simon Richter" <geier_at_phobos.fs.tum.de> wrote in message news:Pine.LNX.4.31.0107152159000.2609-100000_at_phobos.fachschaften.tu-muenchen .de...
> Hi,
>
> I'm in the process of hacking up my own Gnutella client. For this, I need
> a function which returns a set of files that matches a certain search
> query (which is a set of keywords). Each file I share with the world also
> has a set of keywords associated with it. Now I need to match this list
> against the query efficiently.
>
> My current approach would be the "bitmask" method, i.e. to hash these
> keywords to a fixed-length bitmap by associating certain character
> combinations with bits in the mask, setting them if a character
> combination appears in the hashed string. Then if (query_mask & file_mask)
> == query_mask I examine the file further to see if it really matched.
>

The area you've just discovered is "information retrieval" and there's a whole host of techniques you can use. This particular technique is a simplified boolean model, but it has problems if the document is missing some of the keywords, even though it may be highly relevant to what the user wants. It also won't offer any ranking to tell the user how relevant the document is to their query (since a doc either matches or doesn't match).

> My questions:
>
> - Is there a (preferably GPL-compatible) library that does keyword
> searches, so I can skip all this? :-)
> - Is there a better way to do keyword searches?

Yes, definitely. Do a web search on the term "information retrieval models" and you should get some suggestions. A good book on the subject, if you don't mind doing some math, is Modern Information Retrieval by Ricardo Baeza-Yates and Berthier Ribeiro-Neto. Don't let the equations scare you, it's essentially a lot of adding, multiplying, and dividing at the end of the day. I'd suggest reading about the vector model, query language issues, and then skip over to indexing. That should be enough to make you dangerous. :-) Received on Mon Jul 16 2001 - 04:00:22 CEST

Original text of this message