Home » SQL & PL/SQL » SQL & PL/SQL » Grouping similar strings together using UTL_MATCH Jaro-Winkler (12.1.0.1 )
Grouping similar strings together using UTL_MATCH Jaro-Winkler [message #689129] |
Wed, 27 September 2023 23:39  |
 |
OraFerro
Messages: 426 Registered: July 2011
|
Senior Member |
|
|
Hi,
I have a table of strings (basically partner names) with a lot of duplication due to double entries of the same name with little variation. I need to clean the data by providing the user with groups of strings (partner names) that are similar.
create table test_jaro
(
id number(4) primary key,
string varchar2(30)
);
insert all
into test_jaro (id, string) values (1, 'My name is Fe"rro')
into test_jaro (id, string) values (2, 'My nameis Ferro')
into test_jaro (id, string) values (3, 'My name is Ferro')
into test_jaro (id, string) values (4, 'My n ame is Ferro')
into test_jaro (id, string) values (5, 'My n/ame is Ferro')
into test_jaro (id, string) values (6, 'My name is Ferro')
into test_jaro (id, string) values (7, 'Or1a FAQ')
into test_jaro (id, string) values (8, 'Ora FAQ1')
into test_jaro (id, string) values (9, 'Ora FAQ')
into test_jaro (id, string) values (10, 'Ora FAQ')
into test_jaro (id, string) values (11, 'Ora/ FAQ')
into test_jaro (id, string) values (12, 'Ora/ FAQ')
into test_jaro (id, string) values (13, 'a string')
select * from dual;
strings with 90% or above similarity are grouped together and a group id (string_group) is shows together with all mathing string ids (matching_string_ids) so that the final output would be:
string_group matching_string_ids
1 1
1 2
1 3
1 4
1 5
1 6
2 7
2 8
2 9
2 10
2 11
2 12
3 13
Thanks,
Ferro
|
|
|
Re: Grouping similar strings together using UTL_MATCH Jaro-Winkler [message #689130 is a reply to message #689129] |
Thu, 28 September 2023 00:02   |
 |
OraFerro
Messages: 426 Registered: July 2011
|
Senior Member |
|
|
Please see below my trivial trial and of course the performance is disastrous with larger samples:
select * from
(
select * FROM
(
SELECT
t1.id AS string_id,
t1.string AS base_string,
t2.id AS matching_id,
t2.string AS matching_string,
UTL_MATCH.JARO_WINKLER_SIMILARITY(t1.string, t2.string) jaro_simi
FROM
test_jaro t1
JOIN
test_jaro t2
ON
t1.id < t2.id
)
WHERE
jaro_simi >= 90
)
where string_id not in
(
select matching_id from
(
select * FROM
(
SELECT
t1.id AS string_id,
t1.string AS base_string,
t2.id AS matching_id,
t2.string AS matching_string,
UTL_MATCH.JARO_WINKLER_SIMILARITY(t1.string, t2.string) jaro_simi
FROM
test_jaro t1
JOIN
test_jaro t2
ON
t1.id < t2.id
)
WHERE
jaro_simi >= 90
)
);
[Updated on: Thu, 28 September 2023 00:22] Report message to a moderator
|
|
|
|
|
|
Re: Grouping similar strings together using UTL_MATCH Jaro-Winkler [message #689179 is a reply to message #689178] |
Sat, 30 September 2023 09:31   |
 |
Michel Cadot
Messages: 68550 Registered: March 2007 Location: Nanterre, France, http://...
|
Senior Member Account Moderator |
|
|
Quote:@Michel Cadot: what about switching between characters by mistake creating a new string (i.e. string ids: 7
In this particular case, TRANSLATE remove all non-alphabetical characters so no problem.
Quote:In the real case letters from French and German exist in addition to Arabic and English and punctuations are needed
Nothing prevent you from converting with TRANSLATE any character to any other one.
In one of my application I use the following to convert, among others, French characters to US7ASCII one:
translate (string,
'ƒŠŒŽšœžŸÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝßàáâãäåæçèéêëìíîïøðñòóôõöùúûüýÿÞþ ',
'fSEZsezYAAAAAAECEEEEIIIIDNOOOOOOUUUUYBaaaaaaeceeeeiiiioonooooouuuuyy')
You can then remove, in an outer TRANSLATE, the unwanted characters as I showed there.
The point is that this does not handle swapped characters, then a similarity algorithm will handle this more accurately than on the original value.
|
|
|
Re: Grouping similar strings together using UTL_MATCH Jaro-Winkler [message #689180 is a reply to message #689179] |
Sun, 01 October 2023 00:53   |
 |
OraFerro
Messages: 426 Registered: July 2011
|
Senior Member |
|
|
@Michel
Quote:
The point is that this does not handle swapped characters
Exactly, which is what I meant by "switching two letters (or more) by mistake". In this case, if Quote:similarity algorithm will handle this more accurately , then what is the best technique/approach? the one I suggested has two options:
1- the inner query:
select * FROM
(
SELECT
t1.id AS string_id,
t1.string AS base_string,
t2.id AS matching_id,
t2.string AS matching_string,
UTL_MATCH.JARO_WINKLER_SIMILARITY(t1.string, t2.string) jaro_simi
FROM
test_jaro t1
JOIN
test_jaro t2
ON
t1.id < t2.id
)
WHERE
jaro_simi >= 90
which produces logic results but the scope is very lengthy and repeating the string more than once confuses the average user.
2- remove repetition and group similar strings by the first occurrence but this will add an assumption that can be wrong and is very expensive in terms of execution time especially with large data.
Please tell me what you would recommend.
Regards,
Ferro
|
|
|
Re: Grouping similar strings together using UTL_MATCH Jaro-Winkler [message #689181 is a reply to message #689180] |
Sun, 01 October 2023 03:57  |
 |
Barbara Boehmer
Messages: 9070 Registered: November 2002 Location: California, USA
|
Senior Member |
|
|
The following produces the same results as your query, but in a slightly different format. It eliminates the possibility of having things that are not closely related in the same group, due to a chain of relations, by comparing collections and only eliminating those that are a subset of another. Due to the sample data, the results happen to be the same here.
SCOTT@orcl_12.1.0.2.0> CREATE OR REPLACE TYPE ids AS TABLE OF NUMBER(4);
2 /
Type created.
SCOTT@orcl_12.1.0.2.0> CREATE OR REPLACE TYPE strings AS TABLE OF VARCHAR2(30);
2 /
Type created.
SCOTT@orcl_12.1.0.2.0> WITH
2 q AS
3 (SELECT t1.id AS string_id,
4 t1.string AS base_string,
5 CAST (COLLECT (t2.id ORDER BY t2.id) AS ids) AS matching_ids,
6 CAST (COLLECT (t2.string ORDER BY t2.id) AS strings) AS matching_strings
7 FROM test_jaro t1, test_jaro t2
8 WHERE t1.id < t2.id
9 AND UTL_MATCH.JARO_WINKLER_SIMILARITY(t1.string, t2.string) > 90
10 GROUP BY t1.id, t1.string)
11 SELECT q1.*
12 FROM q q1
13 WHERE NOT EXISTS
14 (SELECT *
15 FROM q q2
16 WHERE q1.string_id != q2.string_id
17 AND q1.matching_strings SUBMULTISET OF q2.matching_strings)
18 /
STRING_ID BASE_STRING MATCHING_IDS
---------- ------------------------------ ------------------------------
MATCHING_STRINGS
----------------------------------------------------------------------------------------------------------------------------------
1 My name is Fe"rro IDS(2, 3, 4, 5, 6)
STRINGS('My nameis Ferro', 'My name is Ferro', 'My n ame is Ferro', 'My n/ame is Ferro', 'My name is Ferro')
7 Or1a FAQ IDS(8, 9, 10, 11, 12)
STRINGS('Ora FAQ1', 'Ora FAQ', 'Ora FAQ', 'Ora/ FAQ', 'Ora/ FAQ')
2 rows selected.
|
|
|
Goto Forum:
Current Time: Thu Dec 07 01:47:00 CST 2023
|