Selecting SIMILAR, not the same records (PROBABLE) duplicates

From: kroger <kroger_at_vp.pl>
Date: Tue, 5 Sep 2006 21:05:15 +0200
Message-ID: <edkhpd$1cg$1@news.onet.pl>

Hi,

I've been struggling with that for two days now... There is a simple solution for finding duplicates - with GROUP BY and HAVING COUNT(*)>1 but it is not enough in my case...

For the example table as follows:

id           || name
1            || aaa
2            || aaa xxx
3            || aaa
4            || aaah
5            || bbb
6            || bbb p
7            || ccc

I would need to see the following:

rowId || duplicatedRowId

1                2
1                3
1                4
5                6

This means, that I need to build a query that for a given row selects others, whose beginning of the name is similar...

In fact, something like that:

select t1.id as rowId, t2.id as duplicateRowId from test t1, test t2 where t1.name like t2.name||'%' and t1.id<>t2.id

would do... but in such case I end up with all id-pairs (meaning I get 5 is a duplicate of 6 and 6 is a duplicate of 5)

I'm neither Oracle or SQL expert... Is there any easy way for obtaining what I desired? I feel I'm getting dumb with that.. Or do I need to go into cursors etc?
I'm also looking for efficient solution for that issue, since the source table to lookup is pretty big... 1-2M records...

Thanks in advance,
Best regards,

Kroger Received on Tue Sep 05 2006 - 14:05:15 CDT