Oracle FAQ Your Portal to the Oracle Knowledge Grid
HOME | ASK QUESTION | ADD INFO | SEARCH | E-MAIL US
 

Home -> Community -> Usenet -> comp.databases.theory -> Re: "Fuzzy" text search using n-grams (bigrams) -- speed?

Re: "Fuzzy" text search using n-grams (bigrams) -- speed?

From: diogomartins <diogomartins_at_gmail.com>
Date: Thu, 25 Oct 2007 12:05:53 -0000
Message-ID: <1193313953.185582.164420@o38g2000hse.googlegroups.com>


Hi,

One common approach to deal with query terms mispellings in Information Retrieval systems is to employ some sort of string distance technique, whose most common approach is described in [http:// en.wikipedia.org/wiki/Levenshtein_distance]. Techniques like that can identify, by syntax, the nearest terms, similarly to suggestions given by spell checkers of word processors. I think that using stemming in conjunction with string distance can satisfy your requirements. The really hard part would be to put that functionalities inside MySQL....

[]'s

Diogo

On Oct 24, 8:03 pm, crazyhorse <mjbald..._at_yahoo.com> wrote:
> Hi all,
>
> Let's suppose I'm writing a website where users search for movie
> titles, and suppose there are 200,000 movies. The site is in PHP/
> MySQL.
>
> I'd like to implement a "fuzzy" text search so that similar movie
> titles come up in a list no matter if you include all the words or
> even misspell them.
>
> MySQL's fulltext-search is a first step, but it doesn't cover mis-
> spellings.
>
> My question is this: suppose I broke down every movie title into
> bigrams (i.e. "Jaws" becomes *j, ja, aw, ws, s*), and store a
> relationship between every bigram and every movie title that includes
> it in a separate table, BIGRAMS. (Will have two fields, "bigram" and
> "movie_id").
>
> I break down every search string into bigrams. Then, for every search,
> I retrieve all the rows from BIGRAMS that include any of the bigrams
> in the search string. I then calculate COUNT(movie_id) ORDER BY DESC
> and take the top matches.
>
> Comments? Anybody have experience with this? Does it work? Or would it
> be far too slow? Is there another better solution I'm missing? We're
> talking 200,000 movies, which means BIGRAMS might have 4 million rows
> or so.
>
> Thanks
> Michael
Received on Thu Oct 25 2007 - 07:05:53 CDT

Original text of this message

HOME | ASK QUESTION | ADD INFO | SEARCH | E-MAIL US