Re: Boolean Query Algorithm

From: Bob Stearns <rstearns1241_at_charter.net>
Date: Wed, 12 Jul 2006 23:53:18 -0400
Message-ID: <otjtg.902$BE4.82_at_fe04.lga>


Marshall wrote:

> Sherrie Laraurens wrote:
> 

>>I have a question relating to how search engines and (in fact
>>anything else that supports boolean queries) manage to do such
>>things so efficiently.
>>
>>my question involves a query were you would like to retrieve the all
>>the documents in your corpus that have the words "Cat" and "Bat" in
>>them and that they not only contain both words but that they must be
>>consecutive for example "Bat Cat"
>>
>>I can think of a very crude way of doing this which involves hashing
>>every word in a document into a hash table and storing an index for
>>said document , then in the query stage to hash both words (hash
>>join) get the intersection vector of the resulting vectors from the
>>hashing process, then one by one examine each document from the
>>intersection vector find the word "Bat" and see if the word "Cat" is
>>the next word if it is place said document into the final result
>>vector and once finished pass back to user.
>>
>>I believe this will work for AND and OR type queries but I can't
>>imagine systems like google or yahoo using such a CRUDE method nor
>>can i see them caching such results, just because the sheer amount
>>of combinations things they would have to be caching.
>>
>>Does anyone here have any idea on how these things are done? any
>>keywords I could use to google etc..
> 
> 
> I think you've descrbed it pretty accurately. Web search engines
> are essentially ginormous inverted indexes. Note that the final
> stage, searching specifically for BAT and CAT can also benefit
> from indexing.
> 
> 
> Marshall
> 

I've implemented such systems. As long as you are doing full word phrase searches, you extend your index to (word, position, document_ids). To search for "bat cat" you use the index entries ('bat',n,doclist) to look for entries ('cat',n+1,doclist) and and the doclists; repeat for each n associated with 'bat', oring the doclists resulting from each n. Easily extensible 'bat' near 'cat'. Right truncation (bat* cat*) is handleable by the same logic, but can run awhile unless you limit the fragment size. Received on Thu Jul 13 2006 - 05:53:18 CEST

Original text of this message