Re: Approximate String Matching in RDBMS
Date: Tue, 1 Feb 2011 11:07:53 -0800 (PST)
On Jan 19, 7:07 am, PRAGMATECH <walid.s..._at_gmail.com> wrote:
> Dear All,
> Does someone know what is the best theoretical complexity in the
> literature, as O(f(N)), where N is the number of rows in a table t
> where approximate string matching on some field in t is to be done.
> I am not asking about the complexity of a single approximate match
> between 2 strings with lengths n and m, but on the cost of the lookup
> over the entire table.
> In other words, what is the cost of the following, in terms of N, the
> number of rows in t:
> SELECT * FROM t WHERE APPROXIMATELY_EQUAL(t.a, threshold)
> I thank you in advance for any feedback.
> Best Regards,
> Walid Saba
I may be misunderstanding your question, so apologies if that's the case. The naive but arguably most straightforward method of approximate matching is pair-wise, i.e., a self join and test of the two comparands against a threshold, e.g.:
SELECT t1.key1, t2.key1
FROM t t1, t t2
WHERE APPROXIMATELY_EQUAL(t1.a, t2.a) >= threshold
APPROXIMATELY_EQUAL could be Levenshtein, Jaro-Winkler, Jaccard, Cosine, or The New Walid Algorithm, among others.
Such a query realizes a Cartesian product, and the cost of comparing each row to every other row is a quadratic time process, i.e., N^2.
You can reduce the potential size of this result set by adding an additional predicate to filter out rows joined to themselves (reflexive rows), as well as reciprocals (symmetric rows), such as "t1.key1 > t2.key1." Thus, the number of theoretically matched and *returned* rows is reduced from N^2 to N(N-1)/2.
There are oodles of algorithms out there that describe methods to reduce the complexity of the pair-wise comparison process. One of the most venerable is the sorted neighborhood method. Its cost is Nlog2N, with N being the size of the sliding window (the "neighborhood").
--Jeff Received on Tue Feb 01 2011 - 13:07:53 CST