Re: High Speed Text Searching Algorithms...

From: Adam McKee <amckee_at_home.com>
Date: 2000/07/23
Message-ID: <Mrye5.4692$Y5.160060_at_news1.sshe1.sk.home.com>#1/1


In comp.lang.c++ Derrick Coetzee <dc_at_moonflare.com> wrote:
> I know nothing about the reality of the subject myself, but it seems to me
> the best way is probably to assign each page a number, then add the numbers
> of all pages containing a certain word to that word's "page list". In this
> way, you can build up a sorted database of words:
 

> chicken 5 1 2 3 6 7
> apple 4 11 5
> monster 15 4 9 2
 

> Then you can treat these as sets... if they specify two words, you find the
> intersection:
 

> +chicken +apple
> {4 11 5} intersection {5 1 2 3 6 7} = { 5 }
 

> You can also organize each list according to its relevancy to that
> particular word.
 

> However, this idea works very badly for quoted multiword searches, unless
> you put in entries for each set of multiple words, and that'd get quite
> excessive.

For each word "hit", also store the position of that word within the document (i.e. it is the n-th word). Then, to search for "Two Words", first find the documents that contain both words (using the intersection method you described). Within each of those documents, compare the positions stored for the two words: if the 1st word occurs at position n and the 2nd word at position (n + 1), the 2nd word immediately follows the 1st and the phrase matches. This check can be done quickly.
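To make that concrete, here is a rough sketch of how it might look. The names (PositionalIndex, add_document, find_phrase) are made up for illustration, and it assumes words are split on whitespace and compared case-sensitively. It keeps a position list per word per document, then checks for position (n + 1) only in documents that contain both words.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical layout: word -> (document id -> ascending word positions).
typedef std::map<int, std::vector<std::size_t> > DocPositions;
typedef std::map<std::string, DocPositions> PositionalIndex;

// Record every whitespace-separated word of `text` under `doc_id`,
// together with its position (0 for the first word, 1 for the next, ...).
void add_document(PositionalIndex& index, int doc_id, const std::string& text) {
    std::istringstream words(text);
    std::string word;
    std::size_t position = 0;
    while (words >> word) {
        index[word][doc_id].push_back(position++);
    }
}

// Return the ids of documents in which `first` is immediately followed by `second`.
std::set<int> find_phrase(const PositionalIndex& index,
                          const std::string& first,
                          const std::string& second) {
    std::set<int> matches;
    PositionalIndex::const_iterator first_entry = index.find(first);
    PositionalIndex::const_iterator second_entry = index.find(second);
    if (first_entry == index.end() || second_entry == index.end()) {
        return matches;  // at least one word never occurs anywhere
    }

    const DocPositions& first_docs = first_entry->second;
    const DocPositions& second_docs = second_entry->second;
    for (DocPositions::const_iterator doc = first_docs.begin();
         doc != first_docs.end(); ++doc) {
        // Intersection step: skip documents that lack the second word.
        DocPositions::const_iterator other = second_docs.find(doc->first);
        if (other == second_docs.end()) continue;

        // Adjacency step: is there a hit for `second` at position n + 1?
        // Positions were appended in increasing order, so the list is sorted.
        const std::vector<std::size_t>& second_positions = other->second;
        for (std::size_t i = 0; i < doc->second.size(); ++i) {
            std::size_t n = doc->second[i];
            if (std::binary_search(second_positions.begin(),
                                   second_positions.end(), n + 1)) {
                matches.insert(doc->first);
                break;  // one adjacent pair is enough for this document
            }
        }
    }
    return matches;
}
```

After calling add_document for each page, find_phrase(index, "Two", "Words") would return the set of page numbers where the two words occur side by side. A longer phrase could presumably be handled the same way by checking positions n + 2, n + 3, and so on for the later words.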

        -Adam
