Re: near duplicates in short text fields

From: Malcolm Dew-Jones <yf110_at_vtn1.victoria.tc.ca>
Date: 18 Aug 2008 10:22:30 -0800
Message-ID: <48a9afd6_at_news.victoria.tc.ca>


merkury (david.obermann_at_idealo.de) wrote:
: Hi,

: can anybody tell me how to find near duplicates in a large amount (20
: million) short text labels?

: Is there any database tool which does just this?

I have not used it, but Oracle has a feature called "Oracle Text"; I would assume it has routines to assist in this sort of task.

I assume you could load all your data and then, as a second pass (so to speak), run the appropriate fuzzy search for each row against the loaded data to find which other entries are similar.
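The two-pass idea above can be sketched in plain Python. This is only an illustration and does not use Oracle Text or any database API. The function names, the trigram index, the sample labels and the 0.85 threshold are all assumptions made for the example. Pass 1 loads the labels into an inverted index of character trigrams. Pass 2 runs a fuzzy comparison for each row, but only against rows that share some trigrams with it, so that not every row is compared with every other row.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def trigrams(text):
    """Return the set of lowercase character trigrams of a label."""
    s = " ".join(text.lower().split())  # normalise case and whitespace
    if len(s) < 3:
        return {s}
    return {s[i:i + 3] for i in range(len(s) - 2)}


def near_duplicates(labels, threshold=0.85, min_shared=2):
    """Return (row_a, row_b, score) for pairs of similar labels.

    threshold and min_shared are illustrative values, not tuned ones.
    """
    # Pass 1: "load" the data into an inverted index: trigram -> row ids.
    index = defaultdict(set)
    for row_id, label in enumerate(labels):
        for gram in trigrams(label):
            index[gram].add(row_id)

    # Pass 2: for each row, fuzzy-match only against candidate rows
    # that share at least min_shared trigrams with it.
    pairs = []
    for row_id, label in enumerate(labels):
        shared = defaultdict(int)
        for gram in trigrams(label):
            for other in index[gram]:
                if other > row_id:  # report each pair only once
                    shared[other] += 1
        for other, count in shared.items():
            if count < min_shared:
                continue
            score = SequenceMatcher(
                None, label.lower(), labels[other].lower()
            ).ratio()
            if score >= threshold:
                pairs.append((row_id, other, score))
    return pairs


if __name__ == "__main__":
    sample = [
        "Sony Cybershot DSC-W120",
        "Sony Cyber-shot DSC W120",
        "Canon EOS 450D",
        "canon eos 450d",
        "Nokia 6300",
    ]
    for a, b, score in near_duplicates(sample):
        print(sample[a], "~", sample[b], round(score, 2))
```

For 20 million labels an in-memory index like this would probably need to be replaced by a database index or an external tool, but the shape of the two passes stays the same.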
