PL/SQL [message #598542]
Tue, 15 October 2013 12:52
ranki
Hello,
I have an urgent requirement to identify matching names and addresses in a large table of 2 to 3 crore (20 to 30 million) rows.
Please suggest a fast method for identifying the duplicate data.
I have already used the JARO_WINKLER and EDIT_DISTANCE algorithms, but they are very slow on data this large.
[Updated on: Tue, 15 October 2013 12:52]
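Comparing every row against every other row is O(n²): 30 million rows would mean roughly 4.5 × 10¹⁴ similarity calls, which is why per-pair fuzzy matching crawls at this scale. One common mitigation, not mentioned in the thread and sketched here in Python rather than PL/SQL, is "blocking": group rows by a cheap key first and run the expensive comparison only within each group. The block key below (first letter of the first name plus a last-name prefix) and the use of stdlib difflib as a stand-in similarity measure are illustrative assumptions, not the thread's method.

```python
from collections import defaultdict
from difflib import SequenceMatcher

def block_key(first, last):
    """Cheap grouping key; only rows sharing a key get compared."""
    return (first[:1].upper(), last[:3].upper())

def similar(a, b, threshold=0.9):
    """Stand-in fuzzy match using stdlib difflib (not Jaro-Winkler)."""
    return SequenceMatcher(None, a.upper(), b.upper()).ratio() >= threshold

rows = [
    (1, "ANJALI", "DESHPANDE"),
    (2, "AJAALI", "DESHPAANDE"),   # likely duplicate of row 1
    (3, "RAVI",   "KUMAR"),
]

# Bucket rows by the cheap key.
blocks = defaultdict(list)
for row in rows:
    blocks[block_key(row[1], row[2])].append(row)

# Run the expensive comparison only within each bucket.
pairs = []
for members in blocks.values():
    for i in range(len(members)):
        for j in range(i + 1, len(members)):
            a, b = members[i], members[j]
            if similar(a[1] + " " + a[2], b[1] + " " + b[2]):
                pairs.append((a[0], b[0]))

print(pairs)  # rows 1 and 2 land in the same block and match
```

With a reasonable key, each bucket stays small, so the quadratic cost applies per bucket instead of to the whole table; the trade-off is that a bad key can split true duplicates into different buckets.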
Re: PL/SQL [message #598545 is a reply to message #598544]
Tue, 15 October 2013 13:09
ranki
MY TABLE: ID, NAME, MIDDLE_NAME, LAST_NAME, ADDRESS1, ADDRESS2, CITY, STATE
Thank you for the quick response. I am trying to blacklist customers having more than 5 IDs/accounts.
I want to identify names, i.e. customers, having multiple accounts/IDs within the same table.
SELECT t1.id, t2.id,
       UTL_MATCH.JARO_WINKLER_SIMILARITY(
           t1.name || ' ' || t1.middle_name || ' ' || t1.last_name,
           t2.name || ' ' || t2.middle_name || ' ' || t2.last_name) jws_name,
       UTL_MATCH.EDIT_DISTANCE_SIMILARITY(
           t1.address1 || ' ' || t1.address2 || ' ' || t1.city,
           t2.address1 || ' ' || t2.address2 || ' ' || t2.city) eds_address
FROM   my_table t1
       JOIN my_table t2 ON t2.id > t1.id
WHERE  UTL_MATCH.JARO_WINKLER_SIMILARITY(
           t1.name || ' ' || t1.middle_name || ' ' || t1.last_name,
           t2.name || ' ' || t2.middle_name || ' ' || t2.last_name) > 90
AND    UTL_MATCH.EDIT_DISTANCE_SIMILARITY(
           t1.address1 || ' ' || t1.address2 || ' ' || t1.city,
           t2.address1 || ' ' || t2.address2 || ' ' || t2.city) > 90;
The result should pair rows like this:
ANJALI DESHPANDE
AJAALI DESHPAANDE
Similarly, the address must match for the same person.
[Updated on: Tue, 15 October 2013 13:11]
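The thresholds can be sanity-checked offline. Below is a minimal pure-Python Levenshtein sketch applied to the example pair above; the similarity formula round((1 − distance / longer length) × 100) is my assumption about how an edit-distance score maps to 0–100, not something stated in the thread, so treat the exact numbers as illustrative.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def edit_distance_similarity(a, b):
    """0-100 score; assumed analogue of an edit-distance similarity measure."""
    dist = levenshtein(a, b)
    return round((1 - dist / max(len(a), len(b))) * 100)

d = levenshtein("ANJALI DESHPANDE", "AJAALI DESHPAANDE")
s = edit_distance_similarity("ANJALI DESHPANDE", "AJAALI DESHPAANDE")
print(d, s)  # edit distance 3 -> similarity 82
```

Under this formula the example pair scores 82, below a >90 cutoff, which suggests an edit-distance threshold that strict can miss exactly the variants shown above; Jaro-Winkler typically scores prefix-preserving name variants higher, which may be why both functions appear in the query.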
Re: PL/SQL [message #598563 is a reply to message #598562]
Tue, 15 October 2013 13:59
Lalit Kumar B
Let me understand your requirement correctly: you need a SQL query to identify the duplicates based on some rule, right?
Re: PL/SQL [message #598576 is a reply to message #598572]
Tue, 15 October 2013 14:51
Lalit Kumar B
You are making simple things complex. When you say crores, I assume you mean multiples of ten million records. That is not large enough to take hours to return the output; the number of records alone is not a measure of size.
It is high time to provide your test case and show us what exactly is happening. Use SQL*Plus and paste the output, and let's see if your query really takes hours. I don't mean you have to give us the real data, but rather the CREATE TABLE statements and INSERT statements with test data and a working test case. If you are struggling with writing a query, we will help with that too.
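A reproducible test case of the kind requested here can be scripted. The sketch below (Python, with made-up sample values; the column list is taken from the table described earlier in the thread) prints CREATE TABLE and INSERT statements that could be pasted into SQL*Plus. The column types are assumptions for illustration.

```python
# Hypothetical sample rows: (id, name, middle_name, last_name,
#                            address1, address2, city, state)
rows = [
    (1, "ANJALI", None, "DESHPANDE",  "12 MG ROAD",  "KOTHRUD", "PUNE", "MH"),
    (2, "AJAALI", None, "DESHPAANDE", "12 M G ROAD", "KOTHRUD", "PUNE", "MH"),
]

def sql_literal(v):
    """Render a Python value as a SQL literal, escaping single quotes."""
    return "NULL" if v is None else "'" + str(v).replace("'", "''") + "'"

script = [
    "CREATE TABLE my_table (",
    "  id          NUMBER PRIMARY KEY,",
    "  name        VARCHAR2(50),",
    "  middle_name VARCHAR2(50),",
    "  last_name   VARCHAR2(50),",
    "  address1    VARCHAR2(100),",
    "  address2    VARCHAR2(100),",
    "  city        VARCHAR2(50),",
    "  state       VARCHAR2(50)",
    ");",
]
for r in rows:
    vals = [str(r[0])] + [sql_literal(v) for v in r[1:]]
    script.append("INSERT INTO my_table VALUES (" + ", ".join(vals) + ");")
script.append("COMMIT;")

print("\n".join(script))
```

Sharing a script like this lets others run the slow query against the same data and confirm (or refute) the timing claim.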
Re: PL/SQL [message #598610 is a reply to message #598608]
Wed, 16 October 2013 01:30
pablolee
Then you should reset the team's expectations. As Blackswan said, what you are trying to do is time-consuming. Yes, there is software out there that can do it faster; it is specialised, and it doesn't do most of what Oracle can do. It's about picking the right tool for the job. If you want to de-dupe a lot of data quickly using fuzzy matching rules, use a piece of software that is specifically designed for the task. If you choose not to, then you and your team will have to accept that it will take a bit longer.