Re: Boolean Query Algorithm

From: Sherrie Laraurens <sherrielaraurens_at_hotmail.com>
Date: 12 Jul 2006 21:45:00 -0700
Message-ID: <1152765900.668560.292110_at_h48g2000cwc.googlegroups.com>


what type of indexing would be used for the final filtering process?

Marshall wrote:
> Sherrie Laraurens wrote:
> >
> > I have a question relating to how search engines and (in fact
> > anything else that supports boolean queries) manage to do such
> > things so efficiently.
> >
> > my question involves a query were you would like to retrieve the all
> > the documents in your corpus that have the words "Cat" and "Bat" in
> > them and that they not only contain both words but that they must be
> > consecutive for example "Bat Cat"
> >
> > I can think of a very crude way of doing this which involves hashing
> > every word in a document into a hash table and storing an index for
> > said document , then in the query stage to hash both words (hash
> > join) get the intersection vector of the resulting vectors from the
> > hashing process, then one by one examine each document from the
> > intersection vector find the word "Bat" and see if the word "Cat" is
> > the next word if it is place said document into the final result
> > vector and once finished pass back to user.
> >
> > I believe this will work for AND and OR type queries but I can't
> > imagine systems like google or yahoo using such a CRUDE method nor
> > can i see them caching such results, just because the sheer amount
> > of combinations things they would have to be caching.
> >
> > Does anyone here have any idea on how these things are done? any
> > keywords I could use to google etc..
>
> I think you've descrbed it pretty accurately. Web search engines
> are essentially ginormous inverted indexes. Note that the final
> stage, searching specifically for BAT and CAT can also benefit
> from indexing.
>
>
> Marshall
Received on Thu Jul 13 2006 - 06:45:00 CEST

Original text of this message