Re: finding duplicate records with typo's

From: Marshall <marshall.spight_at_gmail.com>
Date: Mon, 06 Aug 2007 23:08:41 -0000
Message-ID: <1186441721.326191.195380_at_e16g2000pri.googlegroups.com>


On Aug 5, 6:13 pm, tom <tomschur..._at_gmail.com> wrote:
> hello,
>
> can someone tell me (or point me in the right direction) what the
> right way of finding duplicates in dirty data (caused by typos) is?
>
> is there something like a 'hashing' or 'rating' of text that will give
> you a number that you can compare?
>
> for example
>
> hash( "hello") => 4323
> hash( "helo") => 4334
> hash("tree") => 7326
>
> i'm not sure what direction i should look in; this is just an idea
> that i had, but any ideas are very welcome.

It depends on what your application is. If you are looking for strings that *sound* alike, then Soundex is a good choice. The algorithm uses knowledge of how specific letters are pronounced. For example, b, p, f, and v are all mapped to the same value, because they all have similar pronunciations, whether bilabial plosive or labiodental fricative, whether voiced or unvoiced.
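
By way of illustration, a rough, simplified Soundex sketch in Python (a toy version of my own, not a reference implementation; it glosses over corner cases of the standard algorithm, such as the full h/w rule) would look something like this:

def soundex(word):
    """Simplified American Soundex: first letter plus up to three digits."""
    # Letters with similar pronunciations share a digit, e.g. b/f/p/v -> 1.
    codes = {}
    for group, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")]:
        for letter in group:
            codes[letter] = digit
    word = word.lower()
    result = word[0].upper()
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            result += code
        if ch not in "hw":   # h and w do not break a run of equal codes
            prev = code
    return (result + "000")[:4]

With this, soundex("hello") and soundex("helo") both come out as "H400", while soundex("tree") is "T600" -- roughly the kind of comparable number you asked about, except that equality of codes, rather than numeric closeness, is what signals a match.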

However, there are other applications, and the fact that you mentioned "typos" makes me wonder if you could perhaps benefit from something different. A good phrase to look for is "near duplicates."

http://en.wikipedia.org/wiki/Near_Duplicate_Algorithms

There has been a lot of work from AltaVista and later Google in this area. You can also look for the phrase "simhash", which is short for similarity hash.
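
To give a flavor of the simhash idea (this is just a toy sketch of the general technique, not the published algorithm or any particular library): hash small features of each string -- character trigrams here -- add the bits up with +1/-1 weights, and keep the sign pattern as a fingerprint. Strings that share most of their features end up with fingerprints that differ in only a few bit positions.

import hashlib

def simhash(text, bits=64):
    # Toy similarity hash over character trigrams.
    vector = [0] * bits
    grams = [text[i:i + 3] for i in range(max(1, len(text) - 2))]
    for gram in grams:
        h = int(hashlib.md5(gram.encode("utf-8")).hexdigest(), 16)
        for i in range(bits):
            vector[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i, v in enumerate(vector) if v > 0)

def hamming(a, b):
    # Number of differing bits; small means "near duplicate".
    return bin(a ^ b).count("1")

Here hamming(simhash("hello world"), simhash("helo world")) should generally come out smaller than the roughly 32 bits you would expect for two unrelated strings. The separation gets much clearer on longer texts, which is the setting the technique was designed for.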

Marshall