slightly OT - cleaning up "dirty" keys?

From: bugbear <bugbear_at_trim_papermule.co.uk_trim>
Date: Wed, 01 Mar 2006 13:44:53 +0000
Message-ID: <4405a556$0$3558$ed2619ec@ptn-nntp-reader03.plus.net>

If (!) one had a database where a primary key field (e.g. name) had
been used for a few years, and the DB had several "variant" spellings
(e.g. "J Smith", "John Smith", "J K Smith", "J. Smith",
all for the same individual), does
anyone know of a tool that would identify "likely" groupings?

One would like two names with a small
"edit distance" (http://en.wikipedia.org/wiki/Edit_distance)
to be put together, for human checking.
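
To make "edit distance" concrete, here is a minimal sketch of the
Levenshtein variant (single-character inserts, deletes and
substitutions), which is only one of the measures described on that page.
The function name edit_distance is my own choice, not from any
particular tool.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a and b (illustrative sketch)."""
    # Keep the shorter string in the inner loop so each row is small.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] = distance between the first i-1 chars of a and first j of b.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


print(edit_distance("J Smith", "J. Smith"))
print(edit_distance("J Smith", "John Smith"))
```

What counts as a "small" distance would still have to be picked by
hand, which is why the human check matters.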

But if one had 100,000 keys, this would
involve (in a naive implementation)
on the order of 10^10 comparisons
(every key against every other key).
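
The arithmetic behind that estimate, as a small sketch. The helper name
naive_comparison_count is hypothetical. Counting each unordered pair
only once halves the figure, but the growth is still quadratic.

```python
from itertools import combinations


def naive_comparison_count(n: int) -> int:
    """Number of unordered pairs among n keys: n * (n - 1) / 2."""
    return n * (n - 1) // 2


# Sanity check on the four example spellings: every pair is visited once.
keys = ["J Smith", "John Smith", "J K Smith", "J. Smith"]
pairs = list(combinations(keys, 2))
assert len(pairs) == naive_comparison_count(len(keys))

n = 100_000
print(n * n)                      # every key against every key
print(naive_comparison_count(n))  # each unordered pair counted once
```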

Does anyone know a good algorithm
(and/or heuristic, if this is NP-hard)?

BugBear

Received on Wed Mar 01 2006 - 07:44:53 CST