Re: Does MySQL have fuzzy matching or no?

From: Luuk <luuk_at_invalid.lan>
Date: Sat, 16 Dec 2017 15:36:28 +0100
Message-ID: <5a352f6b$0$9974$e4fe514c_at_news.xs4all.nl>


On 15-12-17 23:45, Axel Schwenke wrote:
> On 15.12.2017 21:20, Luuk wrote:

>> On 15-12-17 18:46, Lew Pitcher wrote:

>
>>>> If you are looking at strings that might sound similar if pronounced, then
>>>> you could use the SOUNDEX() function to match to a pre-determined value

>
>> What am i missing?
>>
>> [root_at_test]> select * from test15;
>> +------+
>> | test |
>> +------+
>> | doop |
>> | pood |
>> | poop |
>> +------+
>> 3 rows in set (0.00 sec)
>>
>> [root_at_test]> select * from test15 where test sounds like 'doop';
>> +------+
>> | test |
>> +------+
>> | doop |
>> +------+

>
> Nothing.
>
> SOUNDEX() is pretty poor when it comes to generic fuzzy matching. It works
> only for phonetically similar words (like "Meyer" vs. "Mayer"), only for
> English and even then it yields ... interesting ... results from time to time.
>
> N-gram matching on the other hand is a very nice tool to find similar texts
> - especially when those texts are rather long. I had big success with it
> when I consolidated my joke collection (collected from various sources
> around the - then new - internet). Back those days I wrote a tool in C
> (after having read an article about trigram matching in a computer mag) and
> it helped excellently in weeding out (near-)duplicates.
>
> Oracle added an n-gram parser to the FULLTEXT indexer for completely
> different reasons - to support languages that don't have the concept of
> separation characters in written text. But it will work nicely for fuzzy
> matching in western languages, too.
>

Ok, so writing a function which will give a higher result on better matches is maybe a good way to solve this.

Mymatch(match, against)
returnvalue:
The sum of
  1, or every matching character in the right place,   0.4, for every charachter matching in the wrong place.

This function should ('almost') do that: DROP FUNCTION IF EXISTS MyMatch;
delimiter //
CREATE FUNCTION MyMatch (v_match TEXT, v_against TEXT)

   RETURNS real
BEGIN
  DECLARE v_score REAL;
  DECLARE v_index INT;

  SET v_score = 0;

  SET v_index = 1;
  WHILE v_index <= LENGTH(v_against) DO
    IF substr(v_match,v_index,1) = substr(v_against,v_index,1) THEN

       SET v_score = v_score + 1.0;
    ELSE
       IF LOCATE(substr(v_match,v_index,1), concat(substr(v_against,1,v_index-1),substr(v_against,v_index+1)))>0 THEN

         SET v_score = v_score + 0.4;
       END IF;

    END IF;     SET v_index = v_index + 1;
  END WHILE;   RETURN v_score;

END//
delimiter ;

SELECT Mymatch('door','poop'),  2, '';
SELECT Mymatch('poop','poop'),  4, '';
SELECT Mymatch('poop','poor'),  3, 'wrong result 3.4';
SELECT Mymatch('roop','poor'),  2.8, '';
SELECT Mymatch('rood','poor'),  2.4, '';
Received on Sat Dec 16 2017 - 15:36:28 CET

Original text of this message