Re: near duplicates in short text fields
Date: Mon, 18 Aug 2008 12:10:02 -0700
Message-ID: <1219086595.356745_at_bubbleator.drizzle.com>
merkury wrote:
> Hi Roelof Schierbeek,
>
>
> Thank you for your reply.
>
> I am sorry but your solution will not work on our datasets.
> The first 15 characters may occur in each string even if they are not
> "near".
> What I need is, that many words must coincide and many 2word or 3word
> combinations (maybe even more) should be the same. I guess this is more
> likely to fit our needs.
>
> I suppose de-duplication is only relevant for duplicates not for "near"
> duplicates.
>
> So our problem is to find for each dataset all nearest neighbours. We
> (will) have 20 million datasets. This means a comparison of 20million
> times 20million comparisons, if there is no better approach.
>
>
> Thanks
>
> merkury
>
>
>
>
> R. Schierbeek schrieb:
>
>> Hello merkury, >> >> You might try the instr function: >> >> select s.naam ,s.key >> from temp S >> , temp E >> where ( instr( upper(s.naam) ,substr(upper(E.naam),1,15) ) > 0 >> or instr( upper(E.naam) ,substr(upper(s.naam),1,15) ) > 0 >> ) >> and s.key != E.key >> >> OR: >> select s.naam ,s.key >> from temp S >> , temp E >> where ( instr( upper(s.naam) ,substr(upper(E.naam),1,15) ) > 0 >> or instr( upper(E.naam) ,substr(upper(s.naam),1,15) ) > 0 >> ) >> and s.rowid != E.rowid >> >> Also you can remove "symbols" or numbers from a string like this : >> substr ( >> >> translate (p_text ,'~`!_at_#$%^&*()_-+={}|[]\:";''<>?,./' ,' ') ,1, >> nvl(p_length, LENGTH (p_text)) >> ); >> >> But there are many tools on the market; google for Duplicating or >> De-duplicating tool. >> >> Met vriendelijke groeten >> >> Roelof Schierbeek , NL >> >> >> ----- Original Message ----- From: "merkury" <david.obermann_at_idealo.de> >> Newsgroups: comp.databases.oracle.tools >> Sent: Friday, August 15, 2008 8:08 PM >> Subject: near duplicates in short text fields >> >> >> >>> Hi, >>> >>> >>> can anybody tell me how to find near duplicates in a large amount (20 >>> million) short text labels? >>> >>> Is there any database tool which does just this? >>> >>> I give you some examples: >>> >>> not near: >>> Rugby Polo - black/white - S; (Angebot von Kabelmeister) >>> Rugby Shirt Striped - aqua/white - S; (Angebot von Kabelmeister) >>> >>> >>> near: >>> Rugby Shirt Striped - aqua/white - S; (Angebot von Kabelmeister) >>> Shirt Striped - aqua/white - S; (Angebot von) >>> >>> near: >>> 301 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT BLAU in L (eBay Shop >>> jeanspoint74) >>> 482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT SCHWARZ in L (eBay Shop >>> jeanspoint74) >>> >>> near: >>> 482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT SCHWARZ in L (eBay Shop >>> jeanspoint74) >>> 482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT WEISS in M (eBay Shop >>> jeanspoint74) >>> >>> >>> >>> Thanks >>> >>> merkury
Look at the UTL_MATCH built in package. It contains an API for both the JARO WANKLER and LEVENSHTEIN algorithms.
-- Daniel A. Morgan Oracle Ace Director & Instructor University of Washington damorgan_at_x.washington.edu (replace x with u to respond) Puget Sound Oracle Users Group www.psoug.orgReceived on Mon Aug 18 2008 - 21:10:02 CEST