Re: Approximate/Fuzzy Search for strings keys

From: jefftyzzer <jefftyzzer_at_sbcglobal.net>
Date: Thu, 19 Jun 2008 19:47:10 -0700 (PDT)
Message-ID: <032a2bc9-9c0a-494c-ad3c-34da0181c500@x1g2000prh.googlegroups.com>


On Jun 19, 9:36 am, walid.s..._at_gmail.com wrote:
> Dear all,
>
> I have read the literature on fuzzy/approximate search for strings
> keys in large databases, but I could not find a clear answer on what
> is the best that has been achieved thus far in terms of the cost of
> search (regardless of cost of building any index).
>
> I understand this depends on the application/domain/type of data, but,
> to keep things simple, supoose the target data is some string (names,
> addresses, or any sequence of characters), and assume any distance
> function. I am simply asking this: what is the best that has been
> achieved? Is there a logarithmic solution (i.e., that can retrieve a
> set of "similar" objects in some logarithmic order), or are they all,
> in the worst case, linear?
>
> I thank you in advance...

Walid,

I am not aware of any single best algorithm, but I don't discount that there may be one. The ones from both the edit-distance- and tokenbased  camps I see popping up quite a bit in the literature are Jaro-  Levenshtein, Jaccard, Monge-Elkan, Smith-Waterman, and Fellegi-Sunter.

I've begun doing a fair amount of work on cleaning (both de-duping and standardizing) name (personal names and corporate/business names) data, and am finding the combination of Jaccard and Jaro-Winkler promising.

There are scores of excellent papers out there, but two of my current favorites are:

"A Comparison of String Distance Metrics for Name-Matching Tasks" available here: http://tinyurl.com/4b8jao

--and--

"D-Dupe: An Interactive Tool for Entity Resolution in Social Networks" available here: http://tinyurl.com/3nt3vj

Here are links to some researchers that are leaders in this area:

http://www.cs.umd.edu/~indrajit/HTML/cv.html
http://www.cs.umd.edu/~getoor/
http://www.cs.cmu.edu/~wcohen/
http://www.cecs.csulb.edu/~monge/
http://www.research.att.com/~divesh/papers/index.php

Also, these books may help:

_Data Quality and Record Linkage Techniques_ by Herzog, Sheuren, and Winkler
_Data Quality_ by Batini and Scannapieco _Exploratory Data Mining and Data Cleaning_ by Dasu and Johnson

Finally, some software:

http://www.dcs.shef.ac.uk/~sam/stringmetrics.html http://secondstring.sourceforge.net/

Regards,

--Jeff Received on Thu Jun 19 2008 - 21:47:10 CDT

Original text of this message