OraFAQ Forum: SQL & PL/SQL » match/merge steps

Home » SQL & PL/SQL » SQL & PL/SQL » match/merge steps (win7 32 bit / owb 11g r2)

Show: Today's Messages :: Polls :: Message Navigator
E-mail to friend

match/merge steps [message #639467]

Wed, 08 July 2015 04:46

freestyle
Messages: 19
Registered: July 2015
Location: Bordeaux

Junior Member

Hello,
Firsly, i've found out on the forum if there was an answer to my question without sussccess.
I'm trying to realize a match merge operations(count number of duplicate entries) from a big data (it is only one source) and I saw that Oracle Warehouse Builder is able to make it.
I'm a newbie on Oracle in general and even on the owb platform.
Pease I'd like to know if somone can indicate me a the process step by step from how to load the data (example from a flat file or a database) to merge them;
- The attributes for matching are : name, lastname (jarowinkler similarity); date of birth equal
- the match bins are : cityofbirth, stateofbirth
I've applied the original tutorial from the oracle website but it's confused for me at the moment to identify really the need to apply for my case.
Thanks in advance ;
Best regards.

Report message to a moderator

Re: match/merge steps [message #639484 is a reply to message #639467]

Wed, 08 July 2015 09:17

Lalit Kumar B
Messages: 3174
Registered: May 2013
Location: World Wide on the Web

Senior Member

Hi,

Welcome to the forum!

Please read and follow the OraFAQ Forum Guide and How to use [code] tags, to enable us to help you.

I think what you are asking for is a manual, so did search for the online Oracle documentation?

Report message to a moderator

Re: match/merge steps [message #639485 is a reply to message #639467]

Wed, 08 July 2015 09:25

Lalit Kumar B
Messages: 3174
Registered: May 2013
Location: World Wide on the Web

Senior Member

freestyle wrote on Wed, 08 July 2015 15:16

Pease I'd like to know if somone can indicate me a the process step by step from how to load the data (example from a flat file or a database) to merge them;
- The attributes for matching are : name, lastname (jarowinkler similarity); date of birth equal
- the match bins are : cityofbirth, stateofbirth

This is confusing. What makes you think you need OWB? Loading data into Oracle database could be done in numerous ways, depending on the data and environment. Is the data stored in flat file or in some other database(Oracle or non-Oracle would also matter)?

Report message to a moderator

Re: match/merge steps [message #639522 is a reply to message #639484]

Thu, 09 July 2015 03:36

freestyle
Messages: 19
Registered: July 2015
Location: Bordeaux

Junior Member

Hi Mr,
Thanks for the welcome.
Sorry for the error tags on the post, I'll try to improve my post presentation.
Yes I did search for the documentation online and I've also used to read manual.

[Updated on: Thu, 09 July 2015 03:39]

Report message to a moderator

Re: match/merge steps [message #639525 is a reply to message #639485]

Thu, 09 July 2015 03:49

freestyle
Messages: 19
Registered: July 2015
Location: Bordeaux

Junior Member

Thanks for the reply,
When I was finding online, I saw that OWB was one of the efficient tool that could help me do the work and had many possibilities.
I have to match the potential duplicates with name and firstName attributes (due to errors when typing) and to confirm those potential duplicate using the exact match for cityofbirth and demptofbirth
I have to make separately the match on two type of files. One is a csv file (or sql files). The second is an oracle table. I don't need to put them together.
If you any other idea to do it in another way, it would be welcome.
Thanks in advance.
Regards.
Ps in attach the csv's file example.

Attachment: membres.csv
(Size: 108.73KB, Downloaded 1248 times)

Report message to a moderator

Re: match/merge steps [message #639541 is a reply to message #639525]

Thu, 09 July 2015 06:15

Lalit Kumar B
Messages: 3174
Registered: May 2013
Location: World Wide on the Web

Senior Member

I can't help you with OWB, but I can certainly help you with the data entry validation that you are talking about potential duplicates.

In Oracle, there is "UTL_MATCH" package which facilitates matching two records. It comes very handy for data entry validation and string matching. I have written a small article, have a look at UTL_MATCH string comparision technique in Oracle

Regards,
Lalit

Report message to a moderator

Re: match/merge steps [message #639545 is a reply to message #639541]

Thu, 09 July 2015 06:39

freestyle
Messages: 19
Registered: July 2015
Location: Bordeaux

Junior Member

Thanks very much for the link,
Aussiming that we use this attached file, please how can we write the sql instructions using the mentioned conditions ?
Thanks !

Report message to a moderator

Re: match/merge steps [message #639550 is a reply to message #639545]

Thu, 09 July 2015 09:32

Lalit Kumar B
Messages: 3174
Registered: May 2013
Location: World Wide on the Web

Senior Member

Could you please provide the data as insert statements, post create statement too. I can't download external attachment at my workplace.

Report message to a moderator

Re: match/merge steps [message #639552 is a reply to message #639550]

Thu, 09 July 2015 11:17

freestyle
Messages: 19
Registered: July 2015
Location: Bordeaux

Junior Member

Lalit Kumar B wrote on Thu, 09 July 2015 16:32

Could you please provide the data as insert statements, post create statement too. I can't download external attachment at my workplace.

Ok.
Here is the Insert statement:

SET SQL_MODE = "NO_AUTO_VALUE_ON_ZERO";
SET time_zone = "+00:00";
CREATE TABLE IF NOT EXISTS `membres` (
  `IDMEMBRE` int(11) NOT NULL,
  `Sexe` int(11) NOT NULL,
  `DateNaissance` varchar(12) NOT NULL,
  `Prenom` varchar(72) NOT NULL,
  `Nom` varchar(72) NOT NULL,
  `Comune` varchar(5) NOT NULL,
  `Departement` varchar(3) NOT NULL
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

INSERT INTO is here:

I hope that I used to respect the quote code ! Smile

Smile

[EDITED by LF: applied [spoiler] tags]

[Updated on: Fri, 10 July 2015 00:07] by Moderator

Report message to a moderator

Re: match/merge steps [message #639561 is a reply to message #639552]

Thu, 09 July 2015 22:20

Lalit Kumar B
Messages: 3174
Registered: May 2013
Location: World Wide on the Web

Senior Member

That is not Oracle syntax.

Report message to a moderator

Re: match/merge steps [message #639566 is a reply to message #639561]

Fri, 10 July 2015 02:40

freestyle
Messages: 19
Registered: July 2015
Location: Bordeaux

Junior Member

Lalit Kumar B wrote on Fri, 10 July 2015 05:20

That is not Oracle syntax.

Yes you're writght, her is the insert statements with the Oracle's syntax :

and I've also attached it to a file.

[EDITED by LF: applied [spoiler] tags]

Attachment: export_membres.sql
(Size: 311.40KB, Downloaded 1008 times)

[Updated on: Fri, 10 July 2015 03:17] by Moderator

Report message to a moderator

Re: match/merge steps [message #639570 is a reply to message #639566]

Fri, 10 July 2015 05:03

Lalit Kumar B
Messages: 3174
Registered: May 2013
Location: World Wide on the Web

Senior Member

Those are too many rows for a test case. You need to understand how to post a test case.

Anyway, coming to your requirement:

freestyle wrote on Thu, 09 July 2015 14:19

I have to match the potential duplicates with name and firstName attributes (due to errors when typing) and to confirm those potential duplicate using the exact match for cityofbirth and demptofbirth

I don't see these columns in the table definition provided by you. From the below columns, can you please explain which two columns you want to compare for data validation?

        "ID_MEMBRE" NUMBER, 
	"SEXE" NUMBER, 
	"DATE_NAISS" VARCHAR2(20 BYTE), 
	"PRENOM" VARCHAR2(30 BYTE), 
	"NOM" VARCHAR2(30 BYTE), 
	"COMUNE_NAISS" NUMBER, 
	"DEPT_NAISS" VARCHAR2(4 BYTE)

A small example of JARO_WINKLER_SIMILARITY Function:

Setup:

DROP TABLE t_utl_match;

CREATE TABLE t_utl_match (
id   NUMBER,
col1 VARCHAR2(15),
col2 VARCHAR2(15))
);

INSERT INTO t_utl_match VALUES (1, 'Dunningham', 'Cunnigham');
INSERT INTO t_utl_match VALUES (2, 'Abroms', 'Abrams');
INSERT INTO t_utl_match VALUES (3, 'Lampley', 'Campley');
INSERT INTO t_utl_match VALUES (4, 'Marhta', 'Martha');
INSERT INTO t_utl_match VALUES (5, 'Jonathon', 'Jonathan');
INSERT INTO t_utl_match VALUES (6, 'Jeraldine', 'Geraldine');
INSERT INTO t_utl_match VALUES (7, 'Bat Man', 'Cat Woman');
COMMIT;

Query:

SQL> SELECT id,
2         text1,
3         text2,
4         UTL_MATCH.jaro_winkler_similarity(text1, text2) AS jws
5  FROM   t_utl_match
6  ORDER BY id;

ID TEXT1           TEXT2                  JWS
---------- --------------- --------------- ----------
1 Dunningham      Cunnigham               89
2 Abroms          Abrams                  92
3 Lampley         Campley                 90
4 Marhta          Martha                  96
5 Jonathon        Jonathan                95
6 Jeraldine       Geraldine               92
7 Bat Man         Cat Woman               75

7 rows selected.

SQL>

Report message to a moderator

Re: match/merge steps [message #639572 is a reply to message #639570]

Fri, 10 July 2015 05:47

freestyle
Messages: 19
Registered: July 2015
Location: Bordeaux

Junior Member

Quote:

I don't see these columns in the table definition provided by you. From the below columns, can you please explain which two columns you want to compare for data validation?

Sorry i've forgotten to translate the attributes in english; The cityofbirth in fact is the postalcode or zip code of birth and deptofbirth is deptcode of birth which can be retrieve easily.
If I specify add all the attributes It would BE like this :

  "ID_MEMBRE" NUMBER, 
	"SEX" NUMBER, 
	"DATE_OF_BIRTH" VARCHAR2(20 BYTE), 
	"FIRST_NAME" VARCHAR2(30 BYTE), 
	"NAME" VARCHAR2(30 BYTE), 
	"CITY_OF_BIRTH" NUMBER, 
	"DEPT_OF_BIRTH" VARCHAR2(4 BYTE)

Quote:

Those are too many rows for a test case. You need to understand how to post a test case

I thought that you wanted to see all the data on which i have to find duplicates; and that for you it'd more intersting to really check to results of the query.
So here is a test case some :

DROP TABLE tbl_membre;

CREATE TABLE tbl_membre(	"ID_MEMBRE" NUMBER, 
	"SEX" NUMBER, 
	"DATE_OF_BIRTH" VARCHAR2(20 BYTE), 
	"FIRST_NAME" VARCHAR2(30 BYTE), 
	"NAME" VARCHAR2(30 BYTE), 
	"CITY_OF_BIRTH" NUMBER, 
	"DEPT_OF_BIRTH" VARCHAR2(4 BYTE)
   );

Insert into tbl_membre (ID_MEMBRE,SEX,DATE_OF_BIRTH,FIRST_NAME,NAM,CITY_OF_BIRTH,DEPT_OF_BIRTH) values ('2','2','11/05/1955','JEANNE','BOLOREY','75012','75');
Insert into tbl_membre (ID_MEMBRE,SEX,DATE_OF_BIRTH,FIRST_NAME,NAM,CITY_OF_BIRTH,DEPT_OF_BIRTH) values ('3','2','05/11/1955','GEANE','BOLLOREY','75012','75');
Insert into tbl_membre (ID_MEMBRE,SEX,DATE_OF_BIRTH,FIRST_NAME,NAM,CITY_OF_BIRTH,DEPT_OF_BIRTH) values ('4','2','11/05/1955','JEANNE','BALORE','75012','75');
Insert into tbl_membre (ID_MEMBRE,SEX,DATE_OF_BIRTH,FIRST_NAME,NAM,CITY_OF_BIRTH,DEPT_OF_BIRTH) values ('8','2','25/04/1963','MARINE','FIABLE','91320','91');
Insert into tbl_membre (ID_MEMBRE,SEX,DATE_OF_BIRTH,FIRST_NAME,NAM,CITY_OF_BIRTH,DEPT_OF_BIRTH) values ('9','2','25/04/1963','MARYNE','FAIBLE','91320','91');
Insert into tbl_membre (ID_MEMBRE,SEX,DATE_OF_BIRTH,FIRST_NAME,NAM,CITY_OF_BIRTH,DEPT_OF_BIRTH) values ('10','2','25/04/1963','MAIRINE','BALORE','91320','91');
Insert into tbl_membre (ID_MEMBRE,SEX,DATE_OF_BIRTH,FIRST_NAME,NAM,CITY_OF_BIRTH,DEPT_OF_BIRTH) values ('18','2','20/12/1939','JACQLINE','LIPARK','44100','44');
Insert into tbl_membre (ID_MEMBRE,SEX,DATE_OF_BIRTH,FIRST_NAME,NAM,CITY_OF_BIRTH,DEPT_OF_BIRTH) values ('19','2','20/12/1939','JACQUELYNE','ILIPARC','44100','44');
Insert into tbl_membre (ID_MEMBRE,SEX,DATE_OF_BIRTH,FIRST_NAME,NAM,CITY_OF_BIRTH,DEPT_OF_BIRTH) values ('20','2','20/12/1939','JACQUELHINE','LIPARC','44100','44');
Insert into tbl_membre (ID_MEMBRE,SEX,DATE_OF_BIRTH,FIRST_NAME,NAM,CITY_OF_BIRTH,DEPT_OF_BIRTH) values ('42','2','07/08/1954','MALIKA','RINTET','75015','75');
Insert into tbl_membre (ID_MEMBRE,SEX,DATE_OF_BIRTH,FIRST_NAME,NAM,CITY_OF_BIRTH,DEPT_OF_BIRTH) values ('43','1','26/10/1931','RAUGER','GRISION','44300','44');
Insert into tbl_membre (ID_MEMBRE,SEX,DATE_OF_BIRTH,FIRST_NAME,NAM,CITY_OF_BIRTH,DEPT_OF_BIRTH) values ('44','1','26/10/1931','ROGER','GRISION','44300','44');
Insert into tbl_membre (ID_MEMBRE,SEX,DATE_OF_BIRTH,FIRST_NAME,NAM,CITY_OF_BIRTH,DEPT_OF_BIRTH) values ('45','1','26/10/1931','ROGER','GRESION','44300','44');
COMMIT;

Quote:

A small example of JARO_WINKLER_SIMILARITY Function:

I've seen these more interesting example; in this case you make a comparison with columns but in my case the difficulty for me is that i have to - Check for every record if they are not duplicate with these conditions:
- Check with jaro_winkler between record if name and firstname are or not duplicate(if similarity>90)
Apply the other condition to check if CITY_OF_BIRTH or DEPT_OF_BIRTH are equals or not to confirm the potential duplicates.
that's make me ask myself how to do it plz?

If it is posiible to have it
Thanks in advance.

[Updated on: Fri, 10 July 2015 05:49]

Report message to a moderator

Re: match/merge steps [message #639574 is a reply to message #639572]

Fri, 10 July 2015 07:09

Lalit Kumar B
Messages: 3174
Registered: May 2013
Location: World Wide on the Web

Senior Member

So you are not comparing two columns, but different rows of same column? Is that what you mean? Like 'JEANNE' and 'GEANE'?

Report message to a moderator

Re: match/merge steps [message #639575 is a reply to message #639574]

Fri, 10 July 2015 07:19

freestyle
Messages: 19
Registered: July 2015
Location: Bordeaux

Junior Member

Lalit Kumar B wrote on Fri, 10 July 2015 14:09

So you are not comparing two columns, but different rows of same column? Is that what you mean? Like 'JEANNE' and 'GEANE'?

It's exactly why i'm trying to do Sir !
With sql or PL/SQL, i'm limited.
Any idea would be welcome.
Thanks

Report message to a moderator

Re: match/merge steps [message #639576 is a reply to message #639575]

Fri, 10 July 2015 08:33

Lalit Kumar B
Messages: 3174
Registered: May 2013
Location: World Wide on the Web

Senior Member

In this case, you need to self join the table on CITY_OF_BIRTH and DEPT_OF_BIRTH and compare the rows from both using jaro winkler technique as shown above. This should be your subquery, and in the outer query filter the rows based on the value of UTL_MATCH.jaro_winkler_similarity. It should be a simple query.

Report message to a moderator

Re: match/merge steps [message #639641 is a reply to message #639576]

Mon, 13 July 2015 07:24

freestyle
Messages: 19
Registered: July 2015
Location: Bordeaux

Junior Member

Lalit Kumar B wrote on Fri, 10 July 2015 15:33

In this case, you need to self join the table on CITY_OF_BIRTH and DEPT_OF_BIRTH and compare the rows from both using jaro winkler technique as shown above. This should be your subquery, and in the outer query filter the rows based on the value of UTL_MATCH.jaro_winkler_similarity. It should be a simple query.

Thanks for your suggestion; Do you mean something like the following ::

SELECT TM1.name_mbr, TM1.first_name, TM1.date_of_birth, count(*) AS Nbr_mbr
FROM tbl_membres TM1 JOIN tbl_membres TM2
ON  tm1.city_of_birth = tm2.city_of_birth AND tm1.dept_of_birth = tm2.dept_of_birth 
  WHERE EXISTS ( SELECT name_mbr, first_name
          FROM tbl_membres
         WHERE (UTL_MATCH.jaro_winkler_similarity(tm1.first_name, tm2.first_name) >80) AND
                (UTL_MATCH.jaro_winkler_similarity(tm1.name_mbr, tm2.name_mbr) >80 ))
 GROUP BY    TM1.name_mbr, TM1.first_name, TM1.date_of_birth            
 HAVING count(*) > 1
 ORDER BY name_mbr;

I've Tried many kind of test but the only one which gives me some results is this one but the output results are not corrected like you can check with the test case.
How does the sub-query you mentionned is supposed to be please ?

[Updated on: Mon, 13 July 2015 10:39]

Report message to a moderator

Re: match/merge steps [message #639685 is a reply to message #639641]

Tue, 14 July 2015 04:06

Lalit Kumar B
Messages: 3174
Registered: May 2013
Location: World Wide on the Web

Senior Member

Try this:

SQL> WITH t1 AS
  2    (SELECT t.*,
  3      row_number() OVER(PARTITION BY city_of_birth, dept_of_birth ORDER BY id_membre) rn
  4    FROM tbl_membre t
  5    ),
  6    t2 AS
  7    (SELECT t.*,
  8      row_number() OVER(PARTITION BY city_of_birth, dept_of_birth ORDER BY id_membre) rn
  9    FROM tbl_membre t
 10    )
 11  SELECT t1.first_name, t1.nam
 12  FROM t1,
 13    t2
 14  WHERE t1.city_of_birth                                              = t2.city_of_birth
 15  AND t1.dept_of_birth                                                = t2.dept_of_birth
 16  AND t1.first_name                                                  <> t2.first_name
 17  AND t1.nam                                                         <> t2.nam
 18  AND t1.rn                                                          <> t2.rn
 19  AND UTL_MATCH.jaro_winkler_similarity(t1.first_name, t2.first_name) > 80 --AND UTL_MATCH.jaro_winkler_similarity(t1.first_name, t2.first_name) <100
 20  AND UTL_MATCH.jaro_winkler_similarity(t1.nam, t2.nam)               > 80 --and UTL_MATCH.jaro_winkler_similarity(t1.nam, t2.nam) < 100
 21  ORDER BY t1.id_membre;

FIRST_NAME                     NAM
------------------------------ ------------------------------
JEANNE                         BOLOREY
GEANE                          BOLLOREY
MARINE                         FIABLE
MARYNE                         FAIBLE
JACQLINE                       LIPARK
JACQUELYNE                     ILIPARC
JACQUELHINE                    LIPARC
JACQUELHINE                    LIPARC
RAUGER                         GRISION
ROGER                          GRESION

10 rows selected.

SQL>

Regards,
Lalit

Report message to a moderator

Re: match/merge steps [message #639686 is a reply to message #639685]

Tue, 14 July 2015 04:08

Lalit Kumar B
Messages: 3174
Registered: May 2013
Location: World Wide on the Web

Senior Member

I am moving this topic to SQL forum as it seems to be more appropriate there rather than Warehouse Builder.

Report message to a moderator

Re: match/merge steps [message #639746 is a reply to message #639686]

Wed, 15 July 2015 03:45

freestyle
Messages: 19
Registered: July 2015
Location: Bordeaux

Junior Member

Lalit Kumar B wrote on Tue, 14 July 2015 11:08

I am moving this topic to SQL forum as it seems to be more appropriate there rather than Warehouse Builder.

Hi,
It's ok for me to move the topic to SQL subjects.
Thanks.

Report message to a moderator

Re: match/merge steps [message #639747 is a reply to message #639746]

Wed, 15 July 2015 03:47

Lalit Kumar B
Messages: 3174
Registered: May 2013
Location: World Wide on the Web

Senior Member

Did you try the above solution? Does it work for you?

Report message to a moderator

Re: match/merge steps [message #639749 is a reply to message #639685]

Wed, 15 July 2015 04:12

freestyle
Messages: 19
Registered: July 2015
Location: Bordeaux

Junior Member

Lalit Kumar B wrote on Tue, 14 July 2015 11:06

Try this:

SQL> WITH t1 AS
  2    (SELECT t.*,
  3      row_number() OVER(PARTITION BY city_of_birth, dept_of_birth ORDER BY id_membre) rn
  4    FROM tbl_membre t
  5    ),
  6    t2 AS
  7    (SELECT t.*,
  8      row_number() OVER(PARTITION BY city_of_birth, dept_of_birth ORDER BY id_membre) rn
  9    FROM tbl_membre t
 10    )
 11  SELECT t1.first_name, t1.nam
 12  FROM t1,
 13    t2
 14  WHERE t1.city_of_birth                                              = t2.city_of_birth
 15  AND t1.dept_of_birth                                                = t2.dept_of_birth
 16  AND t1.first_name                                                  <> t2.first_name
 17  AND t1.nam                                                         <> t2.nam
 18  AND t1.rn                                                          <> t2.rn
 19  AND UTL_MATCH.jaro_winkler_similarity(t1.first_name, t2.first_name) > 80 --AND UTL_MATCH.jaro_winkler_similarity(t1.first_name, t2.first_name) <100
 20  AND UTL_MATCH.jaro_winkler_similarity(t1.nam, t2.nam)               > 80 --and UTL_MATCH.jaro_winkler_similarity(t1.nam, t2.nam) < 100
 21  ORDER BY t1.id_membre;

FIRST_NAME                     NAM
------------------------------ ------------------------------
JEANNE                         BOLOREY
GEANE                          BOLLOREY
MARINE                         FIABLE
MARYNE                         FAIBLE
JACQLINE                       LIPARK
JACQUELYNE                     ILIPARC
JACQUELHINE                    LIPARC
JACQUELHINE                    LIPARC
RAUGER                         GRISION
ROGER                          GRESION

10 rows selected.

SQL>

Regards,
Lalit

Tkanks so much M. Lalit for this first suggestion you gave me. from yesterday nigth I'm trying to understand better the results the query outputs. It's firstly interesting ! From what I'm analysing, the query find all the potential duplicates, taking into account the conditions set(which were supposed initially);
In one part, some results are really good but if we look well at the result, we'll see that there are some anormal rows :
- Firstly in the test rows, we have only one JACQUELHINE LIPARC and the output gives us two rows for the same name. I think it isn't normal (even if we can resolve by adding the DISTINCT clause); Don't you think so ?
- Is it normal that the row ROGER GRISION is missing ? (jaro_winkler_similarity(ROGER, RAUGER) = 84

In the other part (We can look it after if necessary but I thought it'd important), I add the output of number of rows' column that inserted to look if it matches the number of duplicates, because it gave me some 'strange' Cool

Cool

answers for some rows and it's look normal for others :

WITH t1 AS
(SELECT t.*, 
  row_number() over(PARTITION BY city_of_birth, dept_of_birth ORDER BY id_membre) rn
  FROM test_tbl_mbr t
  ),
  t2 AS
  (SELECT t.*, 
  row_number() OVER(PARTITION BY city_of_birth, dept_of_birth ORDER BY id_membre) rn  
  FROM test_tbl_mbr t
  )  
  SELECT T1.first_name, T1.nam, t1.rn as number_of_dup_rows
  FROM T1, T2
  WHERE T1.city_of_birth  =     T2.city_of_birth
  AND   T1.dept_of_birth  =     T2.dept_of_birth
  AND   T1.first_name     <>    T2.first_name
  AND   T1.nam          <>    T2.nam
  AND   T1.rn             <>    T2.rn
  AND   UTL_MATCH.jaro_winkler_similarity(t1.first_name, t2.first_name) > 80 --AND UTL_MATCH.jaro_winkler_similarity(t1.first_name, t2.first_name) <100
  AND   UTL_MATCH.jaro_winkler_similarity(t1.nam, t2.nam) > 80 --AND UTL_MATCH.jaro_winkler_similarity(t1.first_name, t2.first_name) <100
  ORDER BY T1.id_membre;

And we have this result:

FIRST_NAME                     NAM                            NUMBER_OF_DUP_ROWS     
------------------------------ ------------------------------ ---------------------- 
JEANNE                         BOLOREY                        1                      
GEANE                          BOLLOREY                       2                      
MARINE                         FIABLE                         1                      
MARYNE                         FAIBLE                         2                      
JACQLINE                       LIPARK                         1                      
JACQUELYNE                     ILIPARC                        2                      
JACQUELHINE                    LIPARC                         3                      
JACQUELHINE                    LIPARC                         3                      
RAUGER                         GRISION                        1                      
ROGER                          GRESION                        3                      

10 rows selected

Assuming that we need to output all the potential duplicates (like it appears)Does it count number of rows for each potential group of duplicates ?
What is your point of vue please ?
Thanks again.

[Updated on: Wed, 15 July 2015 05:11]

Report message to a moderator

Re: match/merge steps [message #639752 is a reply to message #639749]

Wed, 15 July 2015 05:09

Lalit Kumar B
Messages: 3174
Registered: May 2013
Location: World Wide on the Web

Senior Member

freestyle wrote on Wed, 15 July 2015 14:42

Does it count number of rows for each potential group of duplicates ?

it doesn't count the rows for you, you need to write your own logic based on the jaro_winkler_similarity. What the query does is, it compares the value with all other rows in each group therefore you will have one row with 100% match. This duplicate is filtered by the NOT EQUAL condition for first_name, nam and rn.

It is up to you to filter on the basis of the jaro_winkler_similarity, Oracle just calculates and gives the comparison result.

Report message to a moderator

Re: match/merge steps [message #639755 is a reply to message #639752]

Wed, 15 July 2015 05:28

freestyle
Messages: 19
Registered: July 2015
Location: Bordeaux

Junior Member

Lalit Kumar B wrote on Wed, 15 July 2015 12:09

freestyle wrote on Wed, 15 July 2015 14:42
Does it count number of rows for each potential group of duplicates ?

it doesn't count the rows for you, you need to write your own logic based on the jaro_winkler_similarity. What the query does is, it compares the value with all other rows in each group therefore you will have one row with 100% match. This duplicate is filtered by the NOT EQUAL condition for first_name, nam and rn.

It is up to you to filter on the basis of the jaro_winkler_similarity, Oracle just calculates and gives the comparison result.

I'm ok with what your said; I understand that it doesn't count row, but if it is supposed to 100% match :
- Why don't we have the row ROGER GRISION (jaro_winkler_similarity(ROGER, RAUGER)=84 which is >80 condition ) which is missing ?
- Why do we have the row JACQUELYNE ILIPARC (jaro_winkler_similarity(ILIPARC, LIPARK)=78 which is <80 condition ?
- We have one identical additional row JACQUELHINE LIPARC that is not supposed to be present; right ?
Sorry but I'm trying to undestand.

[Updated on: Wed, 15 July 2015 05:29]

Report message to a moderator

Re: match/merge steps [message #639758 is a reply to message #639755]

Wed, 15 July 2015 05:36

Lalit Kumar B
Messages: 3174
Registered: May 2013
Location: World Wide on the Web

Senior Member

That query could be simplified to the following without using Analytic function. I am including the required columns for you to understand that those are actually not duplicates but being compared to two different rows.

SQL> SELECT t1.first_name,
  2    t2.first_name compared_to_fst_nm,
  3    UTL_MATCH.jaro_winkler_similarity(t1.first_name, t2.first_name) jws_fst_nm,
  4    t1.nam,
  5    t2.nam compared_to_nm,
  6    UTL_MATCH.jaro_winkler_similarity(t1.nam, t2.nam) jws_nm
  7  FROM tbl_membre t1,
  8    tbl_membre t2
  9  WHERE t1.city_of_birth                                              = t2.city_of_birth
 10  AND t1.dept_of_birth                                                = t2.dept_of_birth
 11  AND t1.first_name                                                  <> t2.first_name
 12  AND t1.nam                                                         <> t2.nam
 13  AND UTL_MATCH.jaro_winkler_similarity(t1.first_name, t2.first_name) > 80
 14  AND UTL_MATCH.jaro_winkler_similarity(t1.nam, t2.nam)               > 80
 15  ORDER BY t1.id_membre;

FIRST_NAME   COMPARED_TO_FST JWS_FST_NM NAM          COMPARED_TO_NM                     JWS_NM
------------ --------------- ---------- ------------ ------------------------------ ----------
JEANNE       GEANE                   82 BOLOREY      BOLLOREY                               97
GEANE        JEANNE                  82 BOLLOREY     BOLOREY                                97
MARINE       MARYNE                  92 FIABLE       FAIBLE                                 95
MARYNE       MARINE                  92 FAIBLE       FIABLE                                 95
JACQLINE     JACQUELHINE             89 LIPARK       LIPARC                                 93
JACQUELYNE   JACQUELHINE             94 ILIPARC      LIPARC                                 89
JACQUELHINE  JACQUELYNE              94 LIPARC       ILIPARC                                89
JACQUELHINE  JACQLINE                89 LIPARC       LIPARK                                 93
RAUGER       ROGER                   84 GRISION      GRESION                                87
ROGER        RAUGER                  84 GRESION      GRISION                                87

10 rows selected.

SQL>

Report message to a moderator

Re: match/merge steps [message #639762 is a reply to message #639758]

Wed, 15 July 2015 06:18

freestyle
Messages: 19
Registered: July 2015
Location: Bordeaux

Junior Member

Sincerely I thought that the query helps me to already individuate the potential duplicate(I'm dummy and confused Sad

Sad

), so if it only compares rows, that's make me turn to my initial problem: With this test case, please how would be the potential duplicate assuming we use initial condition ?

Report message to a moderator

Previous Topic:	Merge using row order
Next Topic:	Extraction from BLOB data type

Goto Forum:

-=] Back to Top [=-

[ Syndicate this forum (XML) ] [

]

Current Time: Thu Apr 25 01:24:21 CDT 2024