Re: Approximate/Fuzzy Search for strings keys

From: jefftyzzer <>
Date: Thu, 19 Jun 2008 19:47:10 -0700 (PDT)
Message-ID: <>

On Jun 19, 9:36 am, wrote:
> Dear all,
> I have read the literature on fuzzy/approximate search for strings
> keys in large databases, but I could not find a clear answer on what
> is the best that has been achieved thus far in terms of the cost of
> search (regardless of cost of building any index).
> I understand this depends on the application/domain/type of data, but,
> to keep things simple, supoose the target data is some string (names,
> addresses, or any sequence of characters), and assume any distance
> function. I am simply asking this: what is the best that has been
> achieved? Is there a logarithmic solution (i.e., that can retrieve a
> set of "similar" objects in some logarithmic order), or are they all,
> in the worst case, linear?
> I thank you in advance...


I am not aware of any single best algorithm, but I don't discount that there may be one. The ones from both the edit-distance- and tokenbased  camps I see popping up quite a bit in the literature are Jaro-  Levenshtein, Jaccard, Monge-Elkan, Smith-Waterman, and Fellegi-Sunter.

I've begun doing a fair amount of work on cleaning (both de-duping and standardizing) name (personal names and corporate/business names) data, and am finding the combination of Jaccard and Jaro-Winkler promising.

There are scores of excellent papers out there, but two of my current favorites are:

"A Comparison of String Distance Metrics for Name-Matching Tasks" available here:


"D-Dupe: An Interactive Tool for Entity Resolution in Social Networks" available here:

Here are links to some researchers that are leaders in this area:

Also, these books may help:

_Data Quality and Record Linkage Techniques_ by Herzog, Sheuren, and Winkler
_Data Quality_ by Batini and Scannapieco _Exploratory Data Mining and Data Cleaning_ by Dasu and Johnson

Finally, some software:


--Jeff Received on Thu Jun 19 2008 - 21:47:10 CDT

Original text of this message