Re: "Fuzzy" text search using n-grams (bigrams) -- speed?

From: David Cressey <cressey73_at_verizon.net>
Date: Thu, 25 Oct 2007 08:12:10 GMT
Message-ID: <udYTi.22922$Qj3.2993_at_trndny01>


"crazyhorse" <mjbaldwin_at_yahoo.com> wrote in message news:1193266983.699143.273060_at_q5g2000prf.googlegroups.com...
> Hi all,
>
> Let's suppose I'm writing a website where users search for movie
> titles, and suppose there are 200,000 movies. The site is in PHP/
> MySQL.
>
> I'd like to implement a "fuzzy" text search so that similar movie
> titles come up in a list no matter if you include all the words or
> even misspell them.
>
> MySQL's fulltext-search is a first step, but it doesn't cover mis-
> spellings.
>
> My question is this: suppose I broke down every movie title into
> bigrams (i.e. "Jaws" becomes *j, ja, aw, ws, s*), and store a
> relationship between every bigram and every movie title that includes
> it in a separate table, BIGRAMS. (Will have two fields, "bigram" and
> "movie_id").
>
> I break down every search string into bigrams. Then, for every search,
> I retrieve all the rows from BIGRAMS that include any of the bigrams
> in the search string. I then calculate COUNT(movie_id) ORDER BY DESC
> and take the top matches.
>
> Comments? Anybody have experience with this? Does it work? Or would it
> be far too slow? Is there another better solution I'm missing? We're
> talking 200,000 movies, which means BIGRAMS might have 4 million rows
> or so.
>
> Thanks
> Michael
>

If you do a search on "Soundex" you will come up with some tools for detecting near misses that sound right.

The most common typo is swapping two letters, and this kind of error would probably defeat a pure soundex implementation. But I wouldn't be surprised if sites that offer a soundex tool might offer what I'll call a "typex tool", for lack of a better name. Received on Thu Oct 25 2007 - 10:12:10 CEST

Original text of this message