Re: Approximate/Fuzzy Search for strings keys
Date: Fri, 20 Jun 2008 09:47:54 -0700 (PDT)
On Jun 19, 7:47 pm, jefftyzzer <jefftyz..._at_sbcglobal.net> wrote:
> On Jun 19, 9:36 am, walid.s..._at_gmail.com wrote:
> > Dear all,
> > I have read the literature on fuzzy/approximate search for strings
> > keys in large databases, but I could not find a clear answer on what
> > is the best that has been achieved thus far in terms of the cost of
> > search (regardless of cost of building any index).
> > I understand this depends on the application/domain/type of data, but,
> > to keep things simple, supoose the target data is some string (names,
> > addresses, or any sequence of characters), and assume any distance
> > function. I am simply asking this: what is the best that has been
> > achieved? Is there a logarithmic solution (i.e., that can retrieve a
> > set of "similar" objects in some logarithmic order), or are they all,
> > in the worst case, linear?
> > I thank you in advance...
> I am not aware of any single best algorithm, but I don't discount that
> there may be one. The ones from both the edit-distance- and token-
> based camps I see popping up quite a bit in the literature are Jaro-
> Winkler, Levenshtein, Jaccard, Monge-Elkan, Smith-Waterman, and
> I've begun doing a fair amount of work on cleaning (both de-duping and
> standardizing) name (personal names and corporate/business names)
> data, and am finding the combination of Jaccard and Jaro-Winkler
> There are scores of excellent papers out there, but two of my current
> favorites are:
> "A Comparison of String Distance Metrics for Name-Matching Tasks"
> available here:http://tinyurl.com/4b8jao
> "D-Dupe: An Interactive Tool for Entity Resolution in Social Networks"
> available here:http://tinyurl.com/3nt3vj
> Here are links to some researchers that are leaders in this area:
> Also, these books may help:
> _Data Quality and Record Linkage Techniques_ by Herzog, Sheuren, and
> _Data Quality_ by Batini and Scannapieco
> _Exploratory Data Mining and Data Cleaning_ by Dasu and Johnson
> Finally, some software:
This graph (at the first software site I list above) may be what you're ultimately looking for:
--Jeff Received on Fri Jun 20 2008 - 18:47:54 CEST