near duplicates in short text fields

From: merkury <david.obermann_at_idealo.de>
Date: Fri, 15 Aug 2008 20:08:04 +0200
Message-ID: <g84gm4$dep$02$2_at_news.t-online.com>



Hi,

Can anybody tell me how to find near duplicates in a large set (20 million) of short text labels?

Is there any database tool which does just this?

Here are some examples:

not near:
Rugby Polo - black/white - S; (Angebot von Kabelmeister)
Rugby Shirt Striped - aqua/white - S; (Angebot von Kabelmeister)

near:
Rugby Shirt Striped - aqua/white - S; (Angebot von Kabelmeister)
Shirt Striped - aqua/white - S; (Angebot von)

near:
301 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT BLAU in L (eBay Shop jeanspoint74)
482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT SCHWARZ in L (eBay Shop jeanspoint74)

near:
482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT SCHWARZ in L (eBay Shop jeanspoint74)
482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT WEISS in M (eBay Shop jeanspoint74)
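
To make the "near" / "not near" distinction concrete, the pairs above could be scored with something like token-set Jaccard similarity (shared words divided by all distinct words). This is only a rough sketch: the tokenizer, the function names and the 0.7 threshold are assumptions picked to fit these four example pairs, not a tested method or a feature of any particular database tool.

```python
import re


def tokens(label: str) -> set:
    """Lowercase the label and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", label.lower()))


def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity: |A & B| / |A | B|, in the range 0..1."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0  # two empty labels are treated as identical
    return len(ta & tb) / len(ta | tb)


def is_near(a: str, b: str, threshold: float = 0.7) -> bool:
    """Hypothetical cut-off; 0.7 is an assumption fitted to the examples."""
    return jaccard(a, b) >= threshold


# Labels taken from the examples in this post.
a = "Rugby Polo - black/white - S; (Angebot von Kabelmeister)"
b = "Rugby Shirt Striped - aqua/white - S; (Angebot von Kabelmeister)"
c = "Shirt Striped - aqua/white - S; (Angebot von)"

print(is_near(a, b))  # the "not near" pair
print(is_near(b, c))  # the first "near" pair
```

Comparing every pair this way would not scale to 20 million labels, so some step that first narrows the candidates (for example, only comparing labels that share a rare word) would still be needed.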

Thanks

merkury

Received on Fri Aug 15 2008 - 20:08:04 CEST
