Re: Approximate/Fuzzy Search for strings keys

From: user923005 <dcorbit_at_connx.com>
Date: Thu, 19 Jun 2008 13:50:44 -0700 (PDT)
Message-ID: <b8b8adb0-9f16-4048-9ae8-e8410cdb3e52_at_m73g2000hsh.googlegroups.com>


On Jun 19, 9:36 am, walid.s..._at_gmail.com wrote:
> Dear all,
>
> I have read the literature on fuzzy/approximate search for strings
> keys in large databases, but I could not find a clear answer on what
> is the best that has been achieved thus far in terms of the cost of
> search (regardless of cost of building any index).
>
> I understand this depends on the application/domain/type of data, but,
> to keep things simple, supoose the target data is some string (names,
> addresses, or any sequence of characters), and assume any distance
> function. I am simply asking this: what is the best that has been
> achieved? Is there a logarithmic solution (i.e., that can retrieve a
> set of "similar" objects in some logarithmic order), or are they all,
> in the worst case, linear?

Without a clear definition of exactly what you are trying to accomplish, your question cannot be answered.

If I have a hash table of strings, with no duplicates allowed, and my hash table is large enough, then insert, update, and delete are all O(1) for exact strings.

Are the strings short or long?
What are the rules for similarity?
Are you looking at soundex sort of similarity or Levenstein edit distance or something else?
Can we have exact duplicate strings stored in our repository? What is the domain data? Personal names? DNA sequences? Do you expect to return the entire class in the case of a fuzzy match? I do not think you can expect an exact answer with a fuzzy question (even when the topic of the question is fuzzy). I guess if you specify the problem exactly you will get a better answer. Received on Thu Jun 19 2008 - 22:50:44 CEST

Original text of this message