Re: Does MySQL have fuzzy matching or no?
Date: Sat, 16 Dec 2017 15:36:28 +0100
Message-ID: <5a352f6b$0$9974$e4fe514c_at_news.xs4all.nl>
On 15-12-17 23:45, Axel Schwenke wrote:
> On 15.12.2017 21:20, Luuk wrote:
>> On 15-12-17 18:46, Lew Pitcher wrote:
>
>>>> If you are looking at strings that might sound similar if pronounced, then >>>> you could use the SOUNDEX() function to match to a pre-determined value
>
>> What am i missing? >> >> [root_at_test]> select * from test15; >> +------+ >> | test | >> +------+ >> | doop | >> | pood | >> | poop | >> +------+ >> 3 rows in set (0.00 sec) >> >> [root_at_test]> select * from test15 where test sounds like 'doop'; >> +------+ >> | test | >> +------+ >> | doop | >> +------+
>
> Nothing.
>
> SOUNDEX() is pretty poor when it comes to generic fuzzy matching. It works
> only for phonetically similar words (like "Meyer" vs. "Mayer"), only for
> English and even then it yields ... interesting ... results from time to time.
>
> N-gram matching on the other hand is a very nice tool to find similar texts
> - especially when those texts are rather long. I had big success with it
> when I consolidated my joke collection (collected from various sources
> around the - then new - internet). Back those days I wrote a tool in C
> (after having read an article about trigram matching in a computer mag) and
> it helped excellently in weeding out (near-)duplicates.
>
> Oracle added an n-gram parser to the FULLTEXT indexer for completely
> different reasons - to support languages that don't have the concept of
> separation characters in written text. But it will work nicely for fuzzy
> matching in western languages, too.
>
Ok, so writing a function which will give a higher result on better matches is maybe a good way to solve this.
Mymatch(match, against)
returnvalue:
The sum of
1, or every matching character in the right place,
0.4, for every charachter matching in the wrong place.
This function should ('almost') do that:
DROP FUNCTION IF EXISTS MyMatch;
delimiter //
CREATE FUNCTION MyMatch (v_match TEXT, v_against TEXT)
RETURNS real
BEGIN
DECLARE v_score REAL;
DECLARE v_index INT;
SET v_score = 0;
SET v_index = 1;
WHILE v_index <= LENGTH(v_against) DO
IF substr(v_match,v_index,1) = substr(v_against,v_index,1) THEN
SET v_score = v_score + 1.0;
ELSE
IF LOCATE(substr(v_match,v_index,1),
concat(substr(v_against,1,v_index-1),substr(v_against,v_index+1)))>0 THEN
SET v_score = v_score + 0.4; END IF;
END IF; SET v_index = v_index + 1;
END WHILE; RETURN v_score;
END//
delimiter ;
SELECT Mymatch('door','poop'), 2, ''; SELECT Mymatch('poop','poop'), 4, ''; SELECT Mymatch('poop','poor'), 3, 'wrong result 3.4'; SELECT Mymatch('roop','poor'), 2.8, ''; SELECT Mymatch('rood','poor'), 2.4, '';Received on Sat Dec 16 2017 - 15:36:28 CET