Re: High Speed Text Searching Algorythms...

From: Marc Tardif <intmktg_at_Gloria.CAM.ORG>
Date: 2000/07/21
Message-ID: <Pine.LNX.4.10.10007211049380.18785-100000_at_Gloria.CAM.ORG>#1/1


> As an interesting CS project, I'm looking into different text searching
> algorythms, everything from simple strncmp to things like
> Knuth-Morris-Pratt and Boyer-Moore algorythms. But I was wondering, I
> know company's like Google (and proably other high speed internet search
> engines) have their own implimentations of search engines. But it seems
> as though for text searching, those are extreamly fast, proably
> searching records in the billions in under 1 second. What kind of
> algorythms do systems like that use? Are there any places online that I
> can find descriptions of these algorythms?
>
A text searching project is indeed very interesting, but can become fairly ambitious. The algorithms you mention enter in the sequential searching for single patterns category. What you are looking for is pattern matching using indices, which covers inverted files, suffix trees and suffix arrays. All these imply converting the original data into an intermediate format for fast retrieval. Apart from the gain in speed, other advantages include compression, structured queries and ranking.

A good readable example of pattern matching using indices is Liam Quin's lq-text software, which is open-source and well documented in his usenix paper [see www.groveware.com/~lee/lq-text]. This implementation focuses on accurate phrase matching. Other implementations, such as glimpse, offer extended features such as string matching allowing errors.

Since you mention web search engines, these may have different goals than the implementations above. For instance, ranking needs to be determined from the index without accessing the actual text. Algorithms for this are fairly complex to explain over usenet, so I suggest looking into books such as Modern Information Retrieval, by R. Baeza-Yates and B. Ribeiro-Neto. Received on Fri Jul 21 2000 - 00:00:00 CEST

Original text of this message