Re: "Fuzzy" text search using n-grams (bigrams) -- speed?
Date: Thu, 25 Oct 2007 08:06:03 -0700
> I think the string distance is too expensive to compute in a database,
> and again, stemming is not really what I need.
> For this application, it's also not just misspelled words -- it's
> skipping a word in the movie title, or using an alternate form (i.e.
> "Stephen" or "Steven"), or specifying a longer version when we only
> have a shorter title in our database. I'm looking for a fuzzy,
> flexible search in general that can be implemented in a database.
> No real strong ideas for this, huh?
How about a multi-layer approach?
- Look up the actual words against the actual movie titles. That will pick up deliberately and distinctively unusual names and spellings. For example, it will match "shrek".
- Apply one or more layers of normalization to the query, and check it against a version of the titles normalized the same way. For example, reduce names that have several common forms ("Stephen", "Steven") to a single form, do spelling correction with a US English dictionary, and delete articles ("the", "a", ...).
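A rough sketch of the two layers, in Python rather than SQL, just to show the shape of the idea. Everything named here (NAME_VARIANTS, ARTICLES, the function names, the sample titles) is made up for illustration, the variant table is a tiny stand-in for a real one, layer 1 is assumed to ignore case, and spelling correction is left out entirely.

```python
import re

# Hypothetical variant table: maps each alternate spelling to one chosen form.
NAME_VARIANTS = {"steven": "stephen", "steve": "stephen", "kathryn": "catherine"}
ARTICLES = {"the", "a", "an"}

def exact_key(title):
    """Layer 1 key: the actual words, lowercased, whitespace collapsed."""
    return " ".join(title.lower().split())

def normalized_key(title):
    """Layer 2 key: drop punctuation and articles, unify name variants."""
    words = re.findall(r"[a-z0-9]+", title.lower())
    return " ".join(NAME_VARIANTS.get(w, w) for w in words if w not in ARTICLES)

def build_index(titles):
    """Precompute both keys once per title (in a database: two indexed columns)."""
    exact, normalized = {}, {}
    for t in titles:
        exact.setdefault(exact_key(t), []).append(t)
        normalized.setdefault(normalized_key(t), []).append(t)
    return exact, normalized

def lookup(query, exact, normalized):
    """Try the exact layer first; fall back to the normalized layer."""
    hits = exact.get(exact_key(query))
    if hits:
        return hits
    return normalized.get(normalized_key(query), [])

# Made-up sample titles, for illustration only.
exact, normalized = build_index(["Shrek", "The Stephen King Story"])
print(lookup("shrek", exact, normalized))
print(lookup("Steven King Story", exact, normalized))
```

The point of precomputing the keys is that each layer becomes an ordinary equality lookup, so in a database the normalized form could presumably live in its own indexed column next to the real title.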
Patricia
Received on Thu Oct 25 2007 - 17:06:03 CEST