Re: near duplicates in short text fields
Date: Mon, 18 Aug 2008 12:10:02 -0700
Message-ID: <1219086595.356745_at_bubbleator.drizzle.com>
merkury wrote:
> Hi Roelof Schierbeek,
>
>
> Thank you for your reply.
>
> I am sorry but your solution will not work on our datasets.
> The first 15 characters may occur in each string even if they are not
> "near".
> What I need is that many words coincide and that many 2-word or 3-word
> combinations (maybe even longer ones) are the same. I guess this is more
> likely to fit our needs.
>
> I suppose de-duplication is only relevant for duplicates not for "near"
> duplicates.
>
> So our problem is to find, for each dataset, all nearest neighbours. We
> (will) have 20 million datasets. This means 20 million times 20 million
> comparisons, if there is no better approach.
>
>
> Thanks
>
> merkury
>
>
>
>
> R. Schierbeek wrote:
>
>> Hello merkury,
>>
>> You might try the instr function:
>>
>> select s.naam ,s.key
>> from temp S
>> , temp E
>> where ( instr( upper(s.naam) ,substr(upper(E.naam),1,15) ) > 0
>> or instr( upper(E.naam) ,substr(upper(s.naam),1,15) ) > 0
>> )
>> and s.key != E.key
>>
>> OR:
>> select s.naam ,s.key
>> from temp S
>> , temp E
>> where ( instr( upper(s.naam) ,substr(upper(E.naam),1,15) ) > 0
>> or instr( upper(E.naam) ,substr(upper(s.naam),1,15) ) > 0
>> )
>> and s.rowid != E.rowid
>>
>> You can also remove "symbols" or numbers from a string like this:
>> substr (
>>   translate (p_text ,'~`!@#$%^&*()_-+={}|[]\:";''<>?,./' ,' ') ,1,
>>   nvl(p_length, LENGTH (p_text))
>> );
>>
>> But there are many tools on the market; google for Duplicating or
>> De-duplicating tool.
>>
>> Kind regards
>>
>> Roelof Schierbeek , NL
>>
>>
>> ----- Original Message ----- From: "merkury" <david.obermann_at_idealo.de>
>> Newsgroups: comp.databases.oracle.tools
>> Sent: Friday, August 15, 2008 8:08 PM
>> Subject: near duplicates in short text fields
>>
>>
>>
>>> Hi,
>>>
>>>
>>> can anybody tell me how to find near duplicates in a large number (20
>>> million) of short text labels?
>>>
>>> Is there any database tool which does just this?
>>>
>>> I give you some examples:
>>>
>>> not near:
>>> Rugby Polo - black/white - S; (Angebot von Kabelmeister)
>>> Rugby Shirt Striped - aqua/white - S; (Angebot von Kabelmeister)
>>>
>>>
>>> near:
>>> Rugby Shirt Striped - aqua/white - S; (Angebot von Kabelmeister)
>>> Shirt Striped - aqua/white - S; (Angebot von)
>>>
>>> near:
>>> 301 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT BLAU in L (eBay Shop
>>> jeanspoint74)
>>> 482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT SCHWARZ in L (eBay Shop
>>> jeanspoint74)
>>>
>>> near:
>>> 482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT SCHWARZ in L (eBay Shop
>>> jeanspoint74)
>>> 482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT WEISS in M (eBay Shop
>>> jeanspoint74)
>>>
>>>
>>>
>>> Thanks
>>>
>>> merkury
Look at the UTL_MATCH built-in package. It contains an API for both the Jaro-Winkler and Levenshtein algorithms.
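
A rough sketch of how that might look, untested against your data. The first query scores two of the labels from the original post. The second reuses the table and column names (temp, naam, key) from the query earlier in this thread. The cutoff of 90 is only an assumed starting point that you would have to tune. Both _SIMILARITY functions return an integer from 0 (no similarity) to 100 (identical).

```sql
-- Score one pair of labels taken from the original question.
SELECT UTL_MATCH.EDIT_DISTANCE(a, b)            AS edit_dist, -- raw Levenshtein distance
       UTL_MATCH.EDIT_DISTANCE_SIMILARITY(a, b) AS edit_sim,  -- normalized, 0..100
       UTL_MATCH.JARO_WINKLER_SIMILARITY(a, b)  AS jw_sim     -- normalized, 0..100
FROM  (SELECT 'Rugby Shirt Striped - aqua/white - S; (Angebot von Kabelmeister)' AS a,
              'Shirt Striped - aqua/white - S; (Angebot von)'                    AS b
       FROM   dual);

-- Candidate near-duplicate pairs in a table.
-- Assumption: table temp with columns naam and key, as in the earlier reply.
-- Assumption: 90 is an arbitrary threshold, not a recommended value.
SELECT s.key  AS key_1,
       e.key  AS key_2,
       s.naam AS naam_1,
       e.naam AS naam_2
FROM   temp s,
       temp e
WHERE  s.key < e.key  -- report each pair once and skip self-matches
AND    UTL_MATCH.EDIT_DISTANCE_SIMILARITY(UPPER(s.naam), UPPER(e.naam)) >= 90;
```

Note that a plain self-join like the second query still compares every pair of rows, so on 20 million labels it would probably need some cheap pre-filter (for example, only joining rows that share a shop name or a first word) before it becomes practical.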
--
Daniel A. Morgan
Oracle Ace Director & Instructor
University of Washington
damorgan_at_x.washington.edu (replace x with u to respond)
Puget Sound Oracle Users Group
www.psoug.org
Received on Mon Aug 18 2008 - 21:10:02 CEST
