Re: "Fuzzy" text search using n-grams (bigrams) -- speed?

From: paul c <toledobythesea_at_ooyah.ac>
Date: Thu, 25 Oct 2007 23:58:24 GMT
Message-ID: <A4aUi.144310$th2.130535_at_pd7urf3no>


crazyhorse wrote:
...
>
> I think the string distance is too expensive to compute in a database,
> and again, stemming is not really what I need.
>
> For this application, it's also not just misspelled words -- it's
> skipping a word in the movie title, or using an alternate form (i.e.
> "Stephen" or "Steven"), or specifying a longer version when we only
> have a shorter title in our database. I'm looking for a fuzzy,
> flexible search in general that can be implemented in a database.
>
> No real strong ideas for this, huh?
>

I tend to think this problem lays more in the area of type theory than relational theory and that (conventional) relational db theory doesn't offer much help other than to label such support as what Codd called 'theta' or what I might call 'semi-equal' operations. Besides, the conventional rm theory at least doesn't have much to say about indexing.   It seems untoward to me at least to use a conventional db for type support when a second conventional db would still be needed for the immediate application!

Still, to give my two cents on the actual question, I know that Google publishes some of their fairly gargantuan word combination "tables" (sorry, I've forgotten the term they use, maybe they recognize permuations too) where they rank the frequencies of various phrases. Maybe that would be a better starting point. I remember setting up industry-specific text db's at one time and finding that I could express each industry's lingo in word codes that were 16 or fewer bits long. Soundex reduced the size even further (this was about twenty years ago and we used to pay big bucks for service bureau storage). I'd be interested to know the number of unique words in the 200,000 movie titles but perhaps what's more important is the possibility which seems likely to me that most titles have five or fewer significant words, even if proper names are included. Five words have 120 permutations and that times the number of titles is something more than 20,000,000 which is not a big number as far as conventional indexes are concerned. If looking at it this way seems feasible, then I wouldn't rule out Soundex either. Received on Fri Oct 26 2007 - 01:58:24 CEST

Original text of this message