Re: Boolean Query Algorithm

From: Sherrie Laraurens <sherrielaraurens_at_hotmail.com>
Date: 12 Jul 2006 21:47:49 -0700
Message-ID: <1152766069.847266.160070_at_m79g2000cwm.googlegroups.com>


with regards to partial matching, say ba* ca* what are the technqiues used there? p-tries perhaps?

and when evaluating the query against the result data how does one say for example integrate the pivoted cosine measure into the process?

Sherrie

Bob Stearns wrote:
> Marshall wrote:
> > Sherrie Laraurens wrote:
> >
> >>I have a question relating to how search engines and (in fact
> >>anything else that supports boolean queries) manage to do such
> >>things so efficiently.
> >>
> >>my question involves a query were you would like to retrieve the all
> >>the documents in your corpus that have the words "Cat" and "Bat" in
> >>them and that they not only contain both words but that they must be
> >>consecutive for example "Bat Cat"
> >>
> >>I can think of a very crude way of doing this which involves hashing
> >>every word in a document into a hash table and storing an index for
> >>said document , then in the query stage to hash both words (hash
> >>join) get the intersection vector of the resulting vectors from the
> >>hashing process, then one by one examine each document from the
> >>intersection vector find the word "Bat" and see if the word "Cat" is
> >>the next word if it is place said document into the final result
> >>vector and once finished pass back to user.
> >>
> >>I believe this will work for AND and OR type queries but I can't
> >>imagine systems like google or yahoo using such a CRUDE method nor
> >>can i see them caching such results, just because the sheer amount
> >>of combinations things they would have to be caching.
> >>
> >>Does anyone here have any idea on how these things are done? any
> >>keywords I could use to google etc..
> >
> >
> > I think you've descrbed it pretty accurately. Web search engines
> > are essentially ginormous inverted indexes. Note that the final
> > stage, searching specifically for BAT and CAT can also benefit
> > from indexing.
> >
> >
> > Marshall
> >
> I've implemented such systems. As long as you are doing full word phrase
> searches, you extend your index to (word, position, document_ids). To
> search for "bat cat" you use the index entries ('bat',n,doclist) to look
> for entries ('cat',n+1,doclist) and and the doclists; repeat for each n
> associated with 'bat', oring the doclists resulting from each n. Easily
> extensible 'bat' near 'cat'. Right truncation (bat* cat*) is handleable
> by the same logic, but can run awhile unless you limit the fragment size.
Received on Thu Jul 13 2006 - 06:47:49 CEST

Original text of this message