Re: Approximate/Fuzzy Search for strings keys
Date: Thu, 19 Jun 2008 19:47:10 -0700 (PDT)
Message-ID: <032a2bc9-9c0a-494c-ad3c-34da0181c500@x1g2000prh.googlegroups.com>
On Jun 19, 9:36 am, walid.s..._at_gmail.com wrote:
> Dear all,
>
> I have read the literature on fuzzy/approximate search for strings
> keys in large databases, but I could not find a clear answer on what
> is the best that has been achieved thus far in terms of the cost of
> search (regardless of cost of building any index).
>
> I understand this depends on the application/domain/type of data, but,
> to keep things simple, supoose the target data is some string (names,
> addresses, or any sequence of characters), and assume any distance
> function. I am simply asking this: what is the best that has been
> achieved? Is there a logarithmic solution (i.e., that can retrieve a
> set of "similar" objects in some logarithmic order), or are they all,
> in the worst case, linear?
>
> I thank you in advance...
Walid,
I am not aware of any single best algorithm, but I don't discount that there may be one. The ones from both the edit-distance- and tokenbased camps I see popping up quite a bit in the literature are Jaro- Levenshtein, Jaccard, Monge-Elkan, Smith-Waterman, and Fellegi-Sunter.
I've begun doing a fair amount of work on cleaning (both de-duping and standardizing) name (personal names and corporate/business names) data, and am finding the combination of Jaccard and Jaro-Winkler promising.
There are scores of excellent papers out there, but two of my current favorites are:
"A Comparison of String Distance Metrics for Name-Matching Tasks" available here: http://tinyurl.com/4b8jao
--and--
"D-Dupe: An Interactive Tool for Entity Resolution in Social Networks" available here: http://tinyurl.com/3nt3vj
Here are links to some researchers that are leaders in this area:
http://www.cs.umd.edu/~indrajit/HTML/cv.html http://www.cs.umd.edu/~getoor/ http://www.cs.cmu.edu/~wcohen/ http://www.cecs.csulb.edu/~monge/ http://www.research.att.com/~divesh/papers/index.php
Also, these books may help:
_Data Quality and Record Linkage Techniques_ by Herzog, Sheuren, and
Winkler
_Data Quality_ by Batini and Scannapieco
_Exploratory Data Mining and Data Cleaning_ by Dasu and Johnson
Finally, some software:
http://www.dcs.shef.ac.uk/~sam/stringmetrics.html http://secondstring.sourceforge.net/
Regards,
--Jeff Received on Thu Jun 19 2008 - 21:47:10 CDT
