"Fuzzy" text search using n-grams (bigrams) -- speed?
Date: Wed, 24 Oct 2007 16:03:03 -0700
Message-ID: <1193266983.699143.273060_at_q5g2000prf.googlegroups.com>
Hi all,
Let's suppose I'm writing a website where users search for movie titles, and suppose there are 200,000 movies. The site is in PHP/ MySQL.
I'd like to implement a "fuzzy" text search so that similar movie titles come up in a list no matter if you include all the words or even misspell them.
MySQL's fulltext-search is a first step, but it doesn't cover misspellings.
My question is this: suppose I broke down every movie title into bigrams (i.e. "Jaws" becomes *j, ja, aw, ws, s*), and store a relationship between every bigram and every movie title that includes it in a separate table, BIGRAMS. (Will have two fields, "bigram" and "movie_id").
I break down every search string into bigrams. Then, for every search, I retrieve all the rows from BIGRAMS that include any of the bigrams in the search string. I then calculate COUNT(movie_id) ORDER BY DESC and take the top matches.
Comments? Anybody have experience with this? Does it work? Or would it be far too slow? Is there another better solution I'm missing? We're talking 200,000 movies, which means BIGRAMS might have 4 million rows or so.
Thanks