near duplicates in short text fields

From: merkury <david.obermann_at_idealo.de>
Date: Fri, 15 Aug 2008 20:08:04 +0200
Message-ID: <g84gm4$dep$02$2@news.t-online.com>


Hi,

Can anybody tell me how to find near duplicates in a large number (20 million) of short text labels?

Is there any database tool which does just this?

Here are some examples:

not near:
Rugby Polo - black/white - S; (Angebot von Kabelmeister)
Rugby Shirt Striped - aqua/white - S; (Angebot von Kabelmeister)

near:
Rugby Shirt Striped - aqua/white - S; (Angebot von Kabelmeister)
Shirt Striped - aqua/white - S; (Angebot von)

near:
301 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT BLAU in L (eBay Shop jeanspoint74)
482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT SCHWARZ in L (eBay Shop jeanspoint74)

near:
482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT SCHWARZ in L (eBay Shop jeanspoint74)
482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT WEISS in M (eBay Shop jeanspoint74)
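
To make the intent of these examples concrete, here is a minimal sketch of one measure that happens to separate them: Jaccard similarity on lowercased word tokens. The function names and the 0.7 threshold are my own assumptions, chosen only to fit the four pairs above. This is not an existing database tool, and comparing every pair this way is quadratic, so it would not scale to 20 million labels without some step that first narrows down the candidate pairs.

```python
import re


def tokens(label):
    """Split a label into a set of lowercased alphanumeric word tokens."""
    return set(re.findall(r"\w+", label.lower()))


def jaccard(a, b):
    """Jaccard similarity of the token sets of two labels (0.0 to 1.0)."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0  # two empty labels are treated as identical
    return len(ta & tb) / len(ta | tb)


def is_near(a, b, threshold=0.7):
    """Hypothetical rule: 'near' if similarity reaches the assumed threshold."""
    return jaccard(a, b) >= threshold


polo = "Rugby Polo - black/white - S; (Angebot von Kabelmeister)"
striped = "Rugby Shirt Striped - aqua/white - S; (Angebot von Kabelmeister)"
striped_short = "Shirt Striped - aqua/white - S; (Angebot von)"
blau = "301 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT BLAU in L (eBay Shop jeanspoint74)"
schwarz = "482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT SCHWARZ in L (eBay Shop jeanspoint74)"
weiss = "482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT WEISS in M (eBay Shop jeanspoint74)"

for a, b in [(polo, striped), (striped, striped_short), (blau, schwarz), (schwarz, weiss)]:
    print(round(jaccard(a, b), 3), "near" if is_near(a, b) else "not near")
```

With this tokenization the first pair shares 6 of 11 distinct tokens and falls below the threshold, while the other three pairs score above it.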

Thanks

merkury

Received on Fri Aug 15 2008 - 13:08:04 CDT
