Re: near duplicates in short text fields

From: Shakespeare <whatsin_at_xs4all.nl>
Date: Tue, 19 Aug 2008 10:45:38 +0200
Message-ID: <48aa883a$0$183$e4fe514c@news.xs4all.nl>

"DA Morgan" <damorgan_at_psoug.org> schreef in bericht news:1219086595.356745_at_bubbleator.drizzle.com...

> merkury wrote:

>> Hi Roelof Schierbeek,
>>
>>
>> Thank you for your reply.
>>
>> I am sorry but your solution will not work on our datasets.
>> The first 15 characters may occur in each string even if they are not
>> "near".
>> What I need is, that many words must coincide and many 2word or 3word
>> combinations (maybe even more) should be the same. I guess this is more
>> likely to fit our needs.
>>
>> I suppose de-duplication is only relevant for duplicates not for "near"
>> duplicates.
>>
>> So our problem is to find for each dataset all nearest neighbours. We
>> (will) have 20 million datasets. This means a comparison of 20million
>> times 20million comparisons, if there is no better approach.
>>
>>
>> Thanks
>>
>> merkury
>>
>>
>>
>>
>> R. Schierbeek schrieb:
>>
>>> Hello merkury,
>>>
>>> You might try the instr function:
>>>
>>> select s.naam ,s.key
>>> from temp S
>>> , temp E
>>> where ( instr( upper(s.naam) ,substr(upper(E.naam),1,15) ) > 0 or
>>> instr( upper(E.naam) ,substr(upper(s.naam),1,15) ) > 0
>>> )
>>> and s.key != E.key
>>>
>>> OR:
>>> select s.naam ,s.key
>>> from temp S
>>> , temp E
>>> where ( instr( upper(s.naam) ,substr(upper(E.naam),1,15) ) > 0 or
>>> instr( upper(E.naam) ,substr(upper(s.naam),1,15) ) > 0
>>> )
>>> and s.rowid != E.rowid
>>>
>>> Also you can remove "symbols" or numbers from a string like this :
>>> substr (
>>>
>>> translate (p_text ,'~`!@#$%^&*()_-+={}|[]\:";''<>?,./' ,' ') ,1,
>>> nvl(p_length, LENGTH (p_text))
>>> );
>>>
>>> But there are many tools on the market; google for Duplicating or
>>> De-duplicating tool.
>>>
>>> Met vriendelijke groeten
>>>
>>> Roelof Schierbeek , NL
>>>
>>> ----- Original Message ----- From: "merkury" <david.obermann_at_idealo.de>
>>> Newsgroups: comp.databases.oracle.tools
>>> Sent: Friday, August 15, 2008 8:08 PM
>>> Subject: near duplicates in short text fields
>>>
>>>
>>>
>>>> Hi,
>>>>
>>>>
>>>> can anybody tell me how to find near duplicates in a large amount (20
>>>> million) short text labels?
>>>>
>>>> Is there any database tool which does just this?
>>>>
>>>> I give you some examples:
>>>>
>>>> not near:
>>>> Rugby Polo - black/white - S; (Angebot von Kabelmeister)
>>>> Rugby Shirt Striped - aqua/white - S; (Angebot von Kabelmeister)
>>>>
>>>>
>>>> near:
>>>> Rugby Shirt Striped - aqua/white - S; (Angebot von Kabelmeister)
>>>> Shirt Striped - aqua/white - S; (Angebot von)
>>>>
>>>> near:
>>>> 301 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT BLAU in L (eBay Shop
>>>> jeanspoint74)
>>>> 482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT SCHWARZ in L (eBay Shop
>>>> jeanspoint74)
>>>>
>>>> near:
>>>> 482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT SCHWARZ in L (eBay Shop
>>>> jeanspoint74)
>>>> 482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT WEISS in M (eBay Shop
>>>> jeanspoint74)
>>>>
>>>>
>>>>
>>>> Thanks
>>>>
>>>> merkury
>
> Look at the UTL_MATCH built in package. It contains an API for both
> the JARO WANKLER and LEVENSHTEIN algorithms.
> -- 
> Daniel A. Morgan
> Oracle Ace Director & Instructor
> University of Washington
> damorgan_at_x.washington.edu (replace x with u to respond)
> Puget Sound Oracle Users Group
> www.psoug.org

I hope you mean Jaro Winkler...

Shakespeare Received on Tue Aug 19 2008 - 03:45:38 CDT

Original text of this message