Re: High Speed Text Searching Algorythms...

From: Bob Stearns <is_at_uga.edu>
Date: 2000/07/20
Message-ID: <397702AF.662C1D87_at_uga.edu>#1/1


Paradox wrote:
>
> As an interesting CS project, I'm looking into different text searching
> algorythms, everything from simple strncmp to things like
> Knuth-Morris-Pratt and Boyer-Moore algorythms. But I was wondering, I
> know company's like Google (and proably other high speed internet search
> engines) have their own implimentations of search engines. But it seems
> as though for text searching, those are extreamly fast, proably
> searching records in the billions in under 1 second. What kind of
> algorythms do systems like that use? Are there any places online that I
> can find descriptions of these algorythms?
>
> Any reccomendations would be greatly appreciated....
>
> Thanks
> Dave
> sbin_at_mindless.com
A very astute observation. To search the numbers of records those sites handle, in the time they do it in, they cannot afford to use a text searching algorithm. Instead they use some sort of inverted file index, where each word is a key with the record identifiers of records with that that word as data. Representation is critical to speed with OR and AND operations on lists the commonest operations. Growth is fairly simple, but deletion of records is fairly complex without complete reorganization. If truncation of words on the left (*ore) is required, then each word is stored reversed as well; if truncation on both sides at once (*see*) is required, more complex storage system is required for the words.

-- 
Bob Stearns
University of Georgia
is_at_groucho.dev.uga.edu
(706)542-5110
Received on Thu Jul 20 2000 - 00:00:00 CEST

Original text of this message