Re: "Fuzzy" text search using n-grams (bigrams) -- speed?
Date: Thu, 25 Oct 2007 08:06:03 -0700
Message-ID: <ffqbc8$13s9$1_at_ihnp4.ucsd.edu>
crazyhorse wrote:
...
> I think the string distance is too expensive to compute in a database,
> and again, stemming is not really what I need.
>
> For this application, it's also not just misspelled words -- it's
> skipping a word in the movie title, or using an alternate form (i.e.
> "Stephen" or "Steven"), or specifying a longer version when we only
> have a shorter title in our database. I'm looking for a fuzzy,
> flexible search in general that can be implemented in a database.
>
> No real strong ideas for this, huh?
>
How about a multi-layer approach?
- Look up the actual words against the actual movie titles. That will pick up deliberately and distinctively unusual names and spellings. For example, it will match "shrek".
- Apply one or more layers of normalization, and check against a normalized version of the titles.
For example, reduce all forms of names with multiple common forms to a single form. Do spelling correction with a US English dictionary. Delete articles ("the","a" ...).
Patricia Received on Thu Oct 25 2007 - 17:06:03 CEST