Re: "Fuzzy" text search using n-grams (bigrams) -- speed?

From: crazyhorse <>
Date: Thu, 25 Oct 2007 07:53:44 -0700
Message-ID: <>

On Oct 25, 10:05 am, diogomartins <> wrote:
> Hi,
> One common approach to deal with query terms mispellings in
> Information Retrieval systems is to employ some sort of string
> distance technique, whose most common approach is described in [http://
>]. Techniques like that can
> identify, by syntax, the nearest terms, similarly to suggestions given
> by spell checkers of word processors. I think that using stemming in
> conjunction with string distance can satisfy your requirements. The
> really hard part would be to put that functionalities inside MySQL....
> []'s
> Diogo
> On Oct 24, 8:03 pm, crazyhorse <> wrote:
> > Hi all,
> > Let's suppose I'm writing a website where users search for movie
> > titles, and suppose there are 200,000 movies. The site is in PHP/
> > MySQL.
> > I'd like to implement a "fuzzy" text search so that similar movie
> > titles come up in a list no matter if you include all the words or
> > even misspell them.
> > MySQL's fulltext-search is a first step, but it doesn't cover mis-
> > spellings.
> > My question is this: suppose I broke down every movie title into
> > bigrams (i.e. "Jaws" becomes *j, ja, aw, ws, s*), and store a
> > relationship between every bigram and every movie title that includes
> > it in a separate table, BIGRAMS. (Will have two fields, "bigram" and
> > "movie_id").
> > I break down every search string into bigrams. Then, for every search,
> > I retrieve all the rows from BIGRAMS that include any of the bigrams
> > in the search string. I then calculate COUNT(movie_id) ORDER BY DESC
> > and take the top matches.
> > Comments? Anybody have experience with this? Does it work? Or would it
> > be far too slow? Is there another better solution I'm missing? We're
> > talking 200,000 movies, which means BIGRAMS might have 4 million rows
> > or so.
> > Thanks
> > Michael

I think the string distance is too expensive to compute in a database, and again, stemming is not really what I need.

For this application, it's also not just misspelled words -- it's skipping a word in the movie title, or using an alternate form (i.e. "Stephen" or "Steven"), or specifying a longer version when we only have a shorter title in our database. I'm looking for a fuzzy, flexible search in general that can be implemented in a database.

No real strong ideas for this, huh? Received on Thu Oct 25 2007 - 16:53:44 CEST

Original text of this message