Re: near duplicates in short text fields

From: DA Morgan <damorgan_at_psoug.org>
Date: Tue, 19 Aug 2008 06:07:26 -0700
Message-ID: <1219151236.967502@bubbleator.drizzle.com>


Shakespeare wrote:

> "DA Morgan" <damorgan_at_psoug.org> schreef in bericht 
> news:1219086595.356745_at_bubbleator.drizzle.com...
>> merkury wrote:

>>> Hi Roelof Schierbeek,
>>>
>>>
>>> Thank you for your reply.
>>>
>>> I am sorry but your solution will not work on our datasets.
>>> The first 15 characters may occur in each string even if they are not
>>> "near".
>>> What I need is, that many words must coincide and many 2word or 3word
>>> combinations (maybe even more) should be the same. I guess this is more
>>> likely to fit our needs.
>>>
>>> I suppose de-duplication is only relevant for duplicates not for "near"
>>> duplicates.
>>>
>>> So our problem is to find for each dataset all nearest neighbours. We
>>> (will) have 20 million datasets. This means a comparison of 20million
>>> times 20million comparisons, if there is no better approach.
>>>
>>>
>>> Thanks
>>>
>>> merkury
>>>
>>>
>>>
>>>
>>> R. Schierbeek schrieb:
>>>
>>>> Hello merkury,
>>>>
>>>> You might try the instr function:
>>>>
>>>> select s.naam ,s.key
>>>> from temp S
>>>> , temp E
>>>> where ( instr( upper(s.naam) ,substr(upper(E.naam),1,15) ) > 0 or
>>>> instr( upper(E.naam) ,substr(upper(s.naam),1,15) ) > 0
>>>> )
>>>> and s.key != E.key
>>>>
>>>> OR:
>>>> select s.naam ,s.key
>>>> from temp S
>>>> , temp E
>>>> where ( instr( upper(s.naam) ,substr(upper(E.naam),1,15) ) > 0 or
>>>> instr( upper(E.naam) ,substr(upper(s.naam),1,15) ) > 0
>>>> )
>>>> and s.rowid != E.rowid
>>>>
>>>> Also you can remove "symbols" or numbers from a string like this :
>>>> substr (
>>>>
>>>> translate (p_text ,'~`!@#$%^&*()_-+={}|[]\:";''<>?,./' ,' ') ,1,
>>>> nvl(p_length, LENGTH (p_text))
>>>> );
>>>>
>>>> But there are many tools on the market; google for Duplicating or
>>>> De-duplicating tool.
>>>>
>>>> Met vriendelijke groeten
>>>>
>>>> Roelof Schierbeek , NL
>>>>
>>>> ----- Original Message ----- From: "merkury" <david.obermann_at_idealo.de>
>>>> Newsgroups: comp.databases.oracle.tools
>>>> Sent: Friday, August 15, 2008 8:08 PM
>>>> Subject: near duplicates in short text fields
>>>>
>>>>
>>>>
>>>>> Hi,
>>>>>
>>>>>
>>>>> can anybody tell me how to find near duplicates in a large amount (20
>>>>> million) short text labels?
>>>>>
>>>>> Is there any database tool which does just this?
>>>>>
>>>>> I give you some examples:
>>>>>
>>>>> not near:
>>>>> Rugby Polo - black/white - S; (Angebot von Kabelmeister)
>>>>> Rugby Shirt Striped - aqua/white - S; (Angebot von Kabelmeister)
>>>>>
>>>>>
>>>>> near:
>>>>> Rugby Shirt Striped - aqua/white - S; (Angebot von Kabelmeister)
>>>>> Shirt Striped - aqua/white - S; (Angebot von)
>>>>>
>>>>> near:
>>>>> 301 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT BLAU in L (eBay Shop
>>>>> jeanspoint74)
>>>>> 482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT SCHWARZ in L (eBay Shop
>>>>> jeanspoint74)
>>>>>
>>>>> near:
>>>>> 482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT SCHWARZ in L (eBay Shop
>>>>> jeanspoint74)
>>>>> 482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT WEISS in M (eBay Shop
>>>>> jeanspoint74)
>>>>>
>>>>>
>>>>>
>>>>> Thanks
>>>>>
>>>>> merkury
>> Look at the UTL_MATCH built in package. It contains an API for both
>> the JARO WANKLER and LEVENSHTEIN algorithms.
>> -- 
>> Daniel A. Morgan
>> Oracle Ace Director & Instructor
>> University of Washington
>> damorgan_at_x.washington.edu (replace x with u to respond)
>> Puget Sound Oracle Users Group
>> www.psoug.org
> 
> I hope you mean Jaro Winkler...
> 
> Shakespeare 

It's a typo ... That's my story ... And I'm sticking with it. <g>

-- 
Daniel A. Morgan
Oracle Ace Director & Instructor
University of Washington
damorgan_at_x.washington.edu (replace x with u to respond)
Puget Sound Oracle Users Group
www.psoug.org
Received on Tue Aug 19 2008 - 08:07:26 CDT

Original text of this message