Re: Boolean Query Algorithm

From: Bob Stearns <rstearns1241_at_charter.net>
Date: Thu, 13 Jul 2006 03:23:45 -0400
Message-ID: <Iymtg.941$BE4.353_at_fe04.lga>


Sherrie Laraurens wrote:

> with regards to partial matching, say ba* ca* what are the technqiues
> used there? p-tries perhaps?
> 

I actually just read the index sequentially starting with the first entry greater than or equal to the fragment and followed the same logic as for whole words. The index was maintained as a VSAM keyed file, but any good file/database system could be used. Now with the BLOB support in modern databases, the corpus size is effectively unlimited. My index used several methods, depending on word frequency, to store the document list for a word/position, but the largest entries were for words occurring in between 3% and 97% of the documents, where I used a run length compressed bit string to represent the document set. I did limit the fragment size in truncated searches to at least 3 characters, and permuted the search so that the longest word/fragment was searched first.

> and when evaluating the query against the result data how does one > say for example integrate the pivoted cosine measure into the process? Since we didn't keep a word frequency dictionary for each document, we didn't have a good relevance measure. Fortunately, our user population was more interested recall than precision. However, creating such a frequency dictionary for each document would be fairly easy and could be used to give a fairly good relevancy ranking.

> 
> Sherrie
> 
> Bob Stearns wrote:
> 

>>Marshall wrote:
>>
>>>Sherrie Laraurens wrote:
>>>
>>>
>>>>I have a question relating to how search engines and (in fact
>>>>anything else that supports boolean queries) manage to do such
>>>>things so efficiently.
>>>>
>>>>my question involves a query were you would like to retrieve the all
>>>>the documents in your corpus that have the words "Cat" and "Bat" in
>>>>them and that they not only contain both words but that they must be
>>>>consecutive for example "Bat Cat"
>>>>
>>>>I can think of a very crude way of doing this which involves hashing
>>>>every word in a document into a hash table and storing an index for
>>>>said document , then in the query stage to hash both words (hash
>>>>join) get the intersection vector of the resulting vectors from the
>>>>hashing process, then one by one examine each document from the
>>>>intersection vector find the word "Bat" and see if the word "Cat" is
>>>>the next word if it is place said document into the final result
>>>>vector and once finished pass back to user.
>>>>
>>>>I believe this will work for AND and OR type queries but I can't
>>>>imagine systems like google or yahoo using such a CRUDE method nor
>>>>can i see them caching such results, just because the sheer amount
>>>>of combinations things they would have to be caching.
>>>>
>>>>Does anyone here have any idea on how these things are done? any
>>>>keywords I could use to google etc..
>>>
>>>
>>>I think you've descrbed it pretty accurately. Web search engines
>>>are essentially ginormous inverted indexes. Note that the final
>>>stage, searching specifically for BAT and CAT can also benefit
>>>from indexing.
>>>
>>>
>>>Marshall
>>>
>>
>>I've implemented such systems. As long as you are doing full word phrase
>>searches, you extend your index to (word, position, document_ids). To
>>search for "bat cat" you use the index entries ('bat',n,doclist) to look
>>for entries ('cat',n+1,doclist) and and the doclists; repeat for each n
>>associated with 'bat', oring the doclists resulting from each n. Easily
>>extensible 'bat' near 'cat'. Right truncation (bat* cat*) is handleable
>>by the same logic, but can run awhile unless you limit the fragment size.
>
> Received on Thu Jul 13 2006 - 09:23:45 CEST

Original text of this message