Re: "Fuzzy" text search using n-grams (bigrams) -- speed?

From: Patricia Shanahan <pats_at_acm.org>
Date: Thu, 25 Oct 2007 08:06:03 -0700
Message-ID: <ffqbc8$13s9$1_at_ihnp4.ucsd.edu>


crazyhorse wrote:
...
> I think the string distance is too expensive to compute in a database,
> and again, stemming is not really what I need.
>
> For this application, it's also not just misspelled words -- it's
> skipping a word in the movie title, or using an alternate form (i.e.
> "Stephen" or "Steven"), or specifying a longer version when we only
> have a shorter title in our database. I'm looking for a fuzzy,
> flexible search in general that can be implemented in a database.
>
> No real strong ideas for this, huh?
>

How about a multi-layer approach?

  1. Look up the actual words against the actual movie titles. That will pick up deliberately and distinctively unusual names and spellings. For example, it will match "shrek".
  2. Apply one or more layers of normalization, and check against a normalized version of the titles.

For example, reduce all forms of names with multiple common forms to a single form. Do spelling correction with a US English dictionary. Delete articles ("the","a" ...).

Patricia Received on Thu Oct 25 2007 - 17:06:03 CEST

Original text of this message