Re: De-duping a table

From: Bart <Bart_at_opcomp.demon.co.uk>
Date: Mon, 9 Apr 2001 18:28:28 +0100
Message-ID: <986837691.20672.0.nnrp-14.9e985e86@news.demon.co.uk>

Thanks to those who contributed to this.

There seems three traps to doing this (1) performance, (2) deleting only one duplicate and (3) deleting all bar the duplicates.

My simple query was syntactically correct, but I killed it after 72 hours and I couldn't see any gains from using hints.

The following de-duped a 16M row table (2 dupes) in 12 minutes and seems pretty much the best you can do.

--dedupe.sql

drop table dupes;

create table dupes
nologging
pctfree 0
as
select /*+ full(t) */ KEY_FIELD1, KEY_FIELD2,

max(rowid) maxrowid
from MY_TABLE t
group by KEY_FIELD1, KEY_FIELD2
having count(*) > 1;

commit;

delete from MY_TABLE
where rowid in (select /*+ full(t) full(d) use_hash(t, d) */ t.rowid from MY_TABLE t, dupes d

where t.KEY_FIELD1 = d.KEY_FIELD1 and
      t.KEY_FIELD2 = d.KEY_FIELD2 and
      t.rowid < d.maxrowid);

commit;

drop table dupes;
"Alan B" <Alan.B_at_opcomp.demon.co.uk> wrote in message news:986633055.23980.0.nnrp-10.9e985e86_at_news.demon.co.uk...
> Hi,
>
> I'm looking for a simple way to de-dupe a table based on primary key (I
> don't mind which dupe gets deleted).
>
> I've written some complex SQL which does an 8m row table in about 10 mins,
> but the following code which is much simpler disappears for hours.
>
> There are only a few duplicates in the table.
>
> delete from MY_TABLE
> where ROWID not in (select min(ROWID)
> from MY_TABLE
> group by MY_PRIMARY_KEY);
>
> Any suggestions would be appreciated.
>
> Regards,
> Alan
>
>
Received on Mon Apr 09 2001 - 12:28:28 CDT