Re: near duplicates in short text fields

From: rogergorden..._at_gmail.com <rogergorden_at_gmail.com>
Date: Mon, 18 Aug 2008 08:45:59 -0700 (PDT)
Message-ID: <b5a3ff7e-9176-4f01-ad4c-28ad84a5daac_at_26g2000hsk.googlegroups.com>


On Aug 18, 8:59 am, merkury <david.oberm..._at_idealo.de> wrote:
> Hi Roelof Schierbeek,
>
> Thank you for your reply.
>
> I am sorry but your solution will not work on our datasets.
> The first 15 characters may occur in each string even if they are not
> "near".
> What I need is, that many words must coincide and many 2word or 3word
> combinations (maybe even more) should be the same. I guess this is more
> likely to fit our needs.
>
> I suppose de-duplication is only relevant for duplicates not for "near"
> duplicates.
>
> So our problem is to find for each dataset all nearest neighbours. We
> (will) have 20 million datasets. This means a comparison of 20million
> times 20million comparisons, if there is no better approach.
>
> Thanks
>
> merkury
>
> R. Schierbeek schrieb:
>
>
>
> > Hello merkury,
>
> > You might try the instr function:
>
> >   select s.naam ,s.key
> >   from   temp  S
> >   ,      temp  E
> >   where (  instr( upper(s.naam) ,substr(upper(E.naam),1,15) ) > 0
> >         or instr( upper(E.naam) ,substr(upper(s.naam),1,15) ) > 0
> >         )
> >   and   s.key  != E.key
>
> > OR:
> >     select s.naam ,s.key
> >     from   temp  S
> >     ,      temp  E
> >     where (  instr( upper(s.naam) ,substr(upper(E.naam),1,15) ) > 0
> >           or instr( upper(E.naam) ,substr(upper(s.naam),1,15) ) > 0
> >           )
> >     and   s.rowid != E.rowid
>
> > Also you can remove "symbols" or numbers from a string like this :
> > substr (
>
> >    translate (p_text ,'~`!_at_#$%^&*()_-+={}|[]\:";''<>?,./' ,' ')  ,1, nvl(p_length, LENGTH (p_text))
>
> >  );
>
> > But there are many tools on the market; google for Duplicating or De-duplicating tool.
>
> > Met vriendelijke groeten
>
> >    Roelof Schierbeek , NL
>
> > ----- Original Message -----
> > From: "merkury" <david.oberm..._at_idealo.de>
> > Newsgroups: comp.databases.oracle.tools
> > Sent: Friday, August 15, 2008 8:08 PM
> > Subject: near duplicates in short text fields
>
> >> Hi,
>
> >> can anybody tell me how to find near duplicates in a large amount (20
> >> million) short text labels?
>
> >> Is there any database tool which does just this?
>
> >> I give you some examples:
>
> >> not near:
> >> Rugby Polo - black/white - S; (Angebot von Kabelmeister)
> >> Rugby Shirt Striped - aqua/white - S; (Angebot von Kabelmeister)
>
> >> near:
> >> Rugby Shirt Striped - aqua/white - S; (Angebot von Kabelmeister)
> >> Shirt Striped - aqua/white - S; (Angebot von)
>
> >> near:
> >> 301 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT BLAU in L (eBay Shop
> >> jeanspoint74)
> >> 482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT SCHWARZ in L (eBay Shop
> >> jeanspoint74)
>
> >> near:
> >>   482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT SCHWARZ in L (eBay Shop
> >> jeanspoint74)
> >>   482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT WEISS in M (eBay Shop
> >> jeanspoint74)
>
> >> Thanks
>
> >> merkury- Hide quoted text -
>
> - Show quoted text -

You'll most likely have to write some sort of algorythm yourself to come up with matches and near matches.
I had to do something similar while merging a user list that was store in 2 separate database tables, (different spellings of last names, addresses not fully filled in etc.

Good luck.

Roger Gorden
Senior Database Programmer/Analyst
John Wiley and Sons, Inc. Received on Mon Aug 18 2008 - 17:45:59 CEST

Original text of this message