Re: near duplicates in short text fields

From: merkury <david.obermann_at_idealo.de>
Date: Mon, 18 Aug 2008 14:59:07 +0200
Message-ID: <48A9721B.5060703@idealo.de>


Hi Roelof Schierbeek,

Thank you for your reply.

I am sorry but your solution will not work on our datasets. The first 15 characters may occur in each string even if they are not "near".
What I need is, that many words must coincide and many 2word or 3word combinations (maybe even more) should be the same. I guess this is more likely to fit our needs.

I suppose de-duplication is only relevant for duplicates not for "near" duplicates.

So our problem is to find for each dataset all nearest neighbours. We (will) have 20 million datasets. This means a comparison of 20million times 20million comparisons, if there is no better approach.

Thanks

merkury

R. Schierbeek schrieb:

> Hello merkury,
>
> You might try the instr function:
>
> select s.naam ,s.key
> from temp S
> , temp E
> where ( instr( upper(s.naam) ,substr(upper(E.naam),1,15) ) > 0
> or instr( upper(E.naam) ,substr(upper(s.naam),1,15) ) > 0
> )
> and s.key != E.key
>
> OR:
> select s.naam ,s.key
> from temp S
> , temp E
> where ( instr( upper(s.naam) ,substr(upper(E.naam),1,15) ) > 0
> or instr( upper(E.naam) ,substr(upper(s.naam),1,15) ) > 0
> )
> and s.rowid != E.rowid
>
> Also you can remove "symbols" or numbers from a string like this :
> substr (
>
> translate (p_text ,'~`!@#$%^&*()_-+={}|[]\:";''<>?,./' ,' ') ,1, nvl(p_length, LENGTH (p_text))
>
> );
>
> But there are many tools on the market; google for Duplicating or De-duplicating tool.
>
> Met vriendelijke groeten
>
> Roelof Schierbeek , NL
>
>
> ----- Original Message -----
> From: "merkury" <david.obermann_at_idealo.de>
> Newsgroups: comp.databases.oracle.tools
> Sent: Friday, August 15, 2008 8:08 PM
> Subject: near duplicates in short text fields
>
>
>
>> Hi,
>>
>>
>> can anybody tell me how to find near duplicates in a large amount (20
>> million) short text labels?
>>
>> Is there any database tool which does just this?
>>
>> I give you some examples:
>>
>> not near:
>> Rugby Polo - black/white - S; (Angebot von Kabelmeister)
>> Rugby Shirt Striped - aqua/white - S; (Angebot von Kabelmeister)
>>
>>
>> near:
>> Rugby Shirt Striped - aqua/white - S; (Angebot von Kabelmeister)
>> Shirt Striped - aqua/white - S; (Angebot von)
>>
>> near:
>> 301 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT BLAU in L (eBay Shop
>> jeanspoint74)
>> 482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT SCHWARZ in L (eBay Shop
>> jeanspoint74)
>>
>> near:
>> 482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT SCHWARZ in L (eBay Shop
>> jeanspoint74)
>> 482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT WEISS in M (eBay Shop
>> jeanspoint74)
>>
>>
>>
>> Thanks
>>
>> merkury
>>
Received on Mon Aug 18 2008 - 07:59:07 CDT

Original text of this message