Re: near duplicates in short text fields

From: DA Morgan <damorgan_at_psoug.org>
Date: Mon, 18 Aug 2008 12:10:02 -0700
Message-ID: <1219086595.356745_at_bubbleator.drizzle.com>


merkury wrote:
> Hi Roelof Schierbeek,
>
>
> Thank you for your reply.
>
> I am sorry but your solution will not work on our datasets.
> The first 15 characters may occur in each string even if they are not
> "near".
> What I need is, that many words must coincide and many 2word or 3word
> combinations (maybe even more) should be the same. I guess this is more
> likely to fit our needs.
>
> I suppose de-duplication is only relevant for duplicates not for "near"
> duplicates.
>
> So our problem is to find for each dataset all nearest neighbours. We
> (will) have 20 million datasets. This means a comparison of 20million
> times 20million comparisons, if there is no better approach.
>
>
> Thanks
>
> merkury
>
>
>
>
> R. Schierbeek schrieb:
>

>> Hello merkury,
>>
>> You might try the instr function:
>>
>>   select s.naam ,s.key
>>   from   temp  S
>>   ,      temp  E
>>   where (  instr( upper(s.naam) ,substr(upper(E.naam),1,15) ) > 0 
>>         or instr( upper(E.naam) ,substr(upper(s.naam),1,15) ) > 0
>>         )
>>   and   s.key  != E.key
>>
>> OR:
>>     select s.naam ,s.key
>>     from   temp  S
>>     ,      temp  E
>>     where (  instr( upper(s.naam) ,substr(upper(E.naam),1,15) ) > 0 
>>           or instr( upper(E.naam) ,substr(upper(s.naam),1,15) ) > 0
>>           )
>>     and   s.rowid != E.rowid
>>
>> Also you can remove "symbols" or numbers from a string like this :
>> substr (
>>
>>    translate (p_text ,'~`!_at_#$%^&*()_-+={}|[]\:";''<>?,./' ,' ')  ,1, 
>> nvl(p_length, LENGTH (p_text))
>>  );
>>
>> But there are many tools on the market; google for Duplicating or 
>> De-duplicating tool.
>>
>> Met vriendelijke groeten
>>
>>    Roelof Schierbeek , NL
>>
>>  
>> ----- Original Message ----- From: "merkury" <david.obermann_at_idealo.de>
>> Newsgroups: comp.databases.oracle.tools
>> Sent: Friday, August 15, 2008 8:08 PM
>> Subject: near duplicates in short text fields
>>
>>
>>  
>>> Hi,
>>>
>>>
>>> can anybody tell me how to find near duplicates in a large amount (20 
>>> million) short text labels?
>>>
>>> Is there any database tool which does just this?
>>>
>>> I give you some examples:
>>>
>>> not near:
>>> Rugby Polo - black/white - S; (Angebot von Kabelmeister)
>>> Rugby Shirt Striped - aqua/white - S; (Angebot von Kabelmeister)
>>>
>>>
>>> near:
>>> Rugby Shirt Striped - aqua/white - S; (Angebot von Kabelmeister)
>>> Shirt Striped - aqua/white - S; (Angebot von)
>>>
>>> near:
>>> 301 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT BLAU in L (eBay Shop 
>>> jeanspoint74)
>>> 482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT SCHWARZ in L (eBay Shop 
>>> jeanspoint74)
>>>
>>> near:
>>>   482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT SCHWARZ in L (eBay Shop 
>>> jeanspoint74)
>>>   482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT WEISS in M (eBay Shop 
>>> jeanspoint74)
>>>
>>>
>>>
>>> Thanks
>>>
>>> merkury

Look at the UTL_MATCH built in package. It contains an API for both the JARO WANKLER and LEVENSHTEIN algorithms.

-- 
Daniel A. Morgan
Oracle Ace Director & Instructor
University of Washington
damorgan_at_x.washington.edu (replace x with u to respond)
Puget Sound Oracle Users Group
www.psoug.org
Received on Mon Aug 18 2008 - 21:10:02 CEST

Original text of this message