Re: Approximate/Fuzzy Search for strings keys

From: jefftyzzer <jefftyzzer_at_sbcglobal.net>
Date: Fri, 20 Jun 2008 09:47:54 -0700 (PDT)
Message-ID: <bb59b724-0716-458d-8fba-6edbeebcfebd_at_j1g2000prb.googlegroups.com>


On Jun 19, 7:47 pm, jefftyzzer <jefftyz..._at_sbcglobal.net> wrote:
> On Jun 19, 9:36 am, walid.s..._at_gmail.com wrote:
>
>
>
> > Dear all,
>
> > I have read the literature on fuzzy/approximate search for strings
> > keys in large databases, but I could not find a clear answer on what
> > is the best that has been achieved thus far in terms of the cost of
> > search (regardless of cost of building any index).
>
> > I understand this depends on the application/domain/type of data, but,
> > to keep things simple, supoose the target data is some string (names,
> > addresses, or any sequence of characters), and assume any distance
> > function. I am simply asking this: what is the best that has been
> > achieved? Is there a logarithmic solution (i.e., that can retrieve a
> > set of "similar" objects in some logarithmic order), or are they all,
> > in the worst case, linear?
>
> > I thank you in advance...
>
> Walid,
>
> I am not aware of any single best algorithm, but I don't discount that
> there may be one. The ones from both the edit-distance- and token-
> based camps I see popping up quite a bit in the literature are Jaro-
> Winkler, Levenshtein, Jaccard, Monge-Elkan, Smith-Waterman, and
> Fellegi-Sunter.
>
> I've begun doing a fair amount of work on cleaning (both de-duping and
> standardizing) name (personal names and corporate/business names)
> data, and am finding the combination of Jaccard and Jaro-Winkler
> promising.
>
> There are scores of excellent papers out there, but two of my current
> favorites are:
>
> "A Comparison of String Distance Metrics for Name-Matching Tasks"
> available here:http://tinyurl.com/4b8jao
>
> --and--
>
> "D-Dupe: An Interactive Tool for Entity Resolution in Social Networks"
> available here:http://tinyurl.com/3nt3vj
>
> Here are links to some researchers that are leaders in this area:
>
> http://www.cs.umd.edu/~indrajit/HTML/cv.htmlhttp://www.cs.umd.edu/~getoor/http://www.cs.cmu.edu/~wcohen/http://www.cecs.csulb.edu/~monge/http://www.research.att.com/~divesh/papers/index.php
>
> Also, these books may help:
>
> _Data Quality and Record Linkage Techniques_ by Herzog, Sheuren, and
> Winkler
> _Data Quality_ by Batini and Scannapieco
> _Exploratory Data Mining and Data Cleaning_ by Dasu and Johnson
>
> Finally, some software:
>
> http://www.dcs.shef.ac.uk/~sam/stringmetrics.htmlhttp://secondstring.sourceforge.net/
>
> Regards,
>
> --Jeff

Walid:

This graph (at the first software site I list above) may be what you're ultimately looking for:

http://www.dcs.shef.ac.uk/~sam/images/string_metrics_comparison.jpg

--Jeff Received on Fri Jun 20 2008 - 18:47:54 CEST

Original text of this message