Re: near duplicates in short text fields

From: rogergorden..._at_gmail.com <rogergorden_at_gmail.com>
Date: Mon, 18 Aug 2008 08:45:59 -0700 (PDT)
Message-ID: <b5a3ff7e-9176-4f01-ad4c-28ad84a5daac_at_26g2000hsk.googlegroups.com>

On Aug 18, 8:59�am, merkury <david.oberm..._at_idealo.de> wrote:
> Hi Roelof Schierbeek,
>
> Thank you for your reply.
>
> I am sorry but your solution will not work on our datasets.
> The first 15 characters may occur in each string even if they are not
> "near".
> What I need is, that many words must coincide and many 2word or 3word
> combinations (maybe even more) should be the same. I guess this is more
> likely to fit our needs.
>
> I suppose de-duplication is only relevant for duplicates not for "near"
> duplicates.
>
> So our problem is to find for each dataset all nearest neighbours. We
> (will) have 20 million datasets. This means a comparison of 20million
> times 20million comparisons, if there is no better approach.
>
> Thanks
>
> merkury
>
> R. Schierbeek schrieb:
>
>
>
> > Hello merkury,
>
> > You might try the instr function:
>
> > � select s.naam ,s.key
> > � from � temp �S
> > � , � � �temp �E
> > � where ( �instr( upper(s.naam) ,substr(upper(E.naam),1,15) ) > 0
> > � � � � or instr( upper(E.naam) ,substr(upper(s.naam),1,15) ) > 0
> > � � � � )
> > � and � s.key �!= E.key
>
> > OR:
> > � � select s.naam ,s.key
> > � � from � temp �S
> > � � , � � �temp �E
> > � � where ( �instr( upper(s.naam) ,substr(upper(E.naam),1,15) ) > 0
> > � � � � � or instr( upper(E.naam) ,substr(upper(s.naam),1,15) ) > 0
> > � � � � � )
> > � � and � s.rowid != E.rowid
>
> > Also you can remove "symbols" or numbers from a string like this :
> > substr (
>
> > � �translate (p_text ,'~`!_at_#$%^&*()_-+={}|[]\:";''<>?,./' ,' ') �,1, nvl(p_length, LENGTH (p_text))
>
> > �);
>
> > But there are many tools on the market; google for Duplicating or De-duplicating tool.
>
> > Met vriendelijke groeten
>
> > � �Roelof Schierbeek , NL
>
> > ----- Original Message -----
> > From: "merkury" <david.oberm..._at_idealo.de>
> > Newsgroups: comp.databases.oracle.tools
> > Sent: Friday, August 15, 2008 8:08 PM
> > Subject: near duplicates in short text fields
>
> >> Hi,
>
> >> can anybody tell me how to find near duplicates in a large amount (20
> >> million) short text labels?
>
> >> Is there any database tool which does just this?
>
> >> I give you some examples:
>
> >> not near:
> >> Rugby Polo - black/white - S; (Angebot von Kabelmeister)
> >> Rugby Shirt Striped - aqua/white - S; (Angebot von Kabelmeister)
>
> >> near:
> >> Rugby Shirt Striped - aqua/white - S; (Angebot von Kabelmeister)
> >> Shirt Striped - aqua/white - S; (Angebot von)
>
> >> near:
> >> 301 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT BLAU in L (eBay Shop
> >> jeanspoint74)
> >> 482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT SCHWARZ in L (eBay Shop
> >> jeanspoint74)
>
> >> near:
> >> � 482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT SCHWARZ in L (eBay Shop
> >> jeanspoint74)
> >> � 482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT WEISS in M (eBay Shop
> >> jeanspoint74)
>
> >> Thanks
>
> >> merkury- Hide quoted text -
>
> - Show quoted text -

You'll most likely have to write some sort of algorythm yourself to come up with matches and near matches.
I had to do something similar while merging a user list that was store in 2 separate database tables, (different spellings of last names, addresses not fully filled in etc.

Good luck.

Roger Gorden
Senior Database Programmer/Analyst
John Wiley and Sons, Inc. Received on Mon Aug 18 2008 - 17:45:59 CEST

This message: [ Message body ]
Next message: Malcolm Dew-Jones: "Re: near duplicates in short text fields"
Maybe in reply to: merkury: "near duplicates in short text fields"
In reply to merkury: "Re: near duplicates in short text fields"
Next in thread: Malcolm Dew-Jones: "Re: near duplicates in short text fields"

Contemporary messages sorted: [ by date ] [ by thread ] [ by subject ] [ by author ]

Original text of this message