RE: help with deleting duplicate records from very large table

From: Hallas John <John.Hallas_at_btcellnet.net>
Date: Wed, 12 Sep 2001 00:29:26 -0700
Message-ID: <F001.0038AF81.20010912001022@fatcity.com>

Hi, Suhen
The following is a set of notes I have cut from various list messages on deleting duplicates. For the 60M rows you are talking about the first option looks the best (committing frequently). Can I suggest you disable archive logging if at all possible and you will need to rebuild tables/index afterwards. The other alternative to try is geeting exceptions into another table and deleting using rowid.

HTH John

How about the following? This is what I use. For a huge table, you may want to do intermittent commits based on the rowcount.

declare
cursor get_dups is
select pk_col1, pk_col2, pk_col3, count(*) from table
group by pk_col1, pk_col2, pk_col3
having count(*) > 1;
dupRec get_dups%rowtype;
begin
for dupRec in get_dups loop
delete from table
where pk_col1 = dupRec.pk_col1
and pk_col2 = dupRec.pk_col2
and pk_col3 = dupRec.pk_col3
and rownum = 1;
end loop;
end;
/

Identify duplicate records:
select COL1, COL2, COL#, COUNT(*)
from <OWNER>.<TABLE_NAME>
group by COL1, COL2, COL#
having count(*) > 1;
Remove duplicate records:
delete from <OWNER>.<TABLE_NAME> a
where rowid < (
select max(rowid)
from <OWNER>.<TABLE_NAME> b
where b.COL1 = a.COL1

and b.COL2 = a.COL2
and b.COL# = a.COL#);
===========================================

Not particularly fast, but it will work: Create an exception table (you can use any valid table name): create table dups (
row_id rowid,
owner varchar2(30),
table_name varchar2(30),
constraint varchar2(30))
tablespace temp;
Then try create a unique key which will be violated by the duplicates which you are trying to remove: alter table <your_table> add constraint <your_constraint> unique
(

<column1>,
<column2>
)
exceptions into dups;
When you execute the alter table command, you will get an error message telling you that there are duplicates. The dups table will contain the rowids for the offending rows.

alter table my_table add constraint XYZ unique (col1, col2, col3) disable ;
create table my_table_dups (
dup_row rowid,
constr_name varchar2 (40),
constr_owner varchar2)
storage clause ......... ;
alter table my_table enable constraint XYZ exceptions into my_table_dups ;
This will not enable the unique key constraint if you have any duplicates on (col1, col2, col3) of my_table, but will place info on only the duplicates into table my_table_dups. Now since you have the rowid of the duplicates, depending on how many you have, you may want to write a PL/SQL construct to delete all duplicates from the original, or you could simply code something like :
> delete from my_table where rowid in
(select rowid from my_table_dups

where constr_name='XYZ' and constr_owner = 'MY_USER_ID') ; >commit;
Some times you will have a multi-million row table, in which case it will be good to use PL/SQL and delete from both tables as you go (also good for re-startability), committing every so often. This reduces likelihood of rollback segment contention, also avoid the dreaded 'snapshot too old' error and keeps all users happy since you are not locking thousands or hundred thousands of rows waiting for a commit and the end of a single DML.
Hope this helps. It is a different approach to this problem than many types of SQL statements which can sometimes get very complicated. It approaches the problem from a DDL perspective rather than DML in the identification of duplicates.
Hi,
I think this is what you want,using the rowid you can achieve this delete from table_name1
where rowid < (select max(rowid) from table_name2 where table_name1.col_name = table_name2.col_name) table_name1 and table_name2 are the same table. You can try to select before you delete
select * from table_name
where rowid < (select max(rowid) from table_name2 where table_name1.col_name = table_name2.col_name)

-----Original Message-----
From: Suhen Pather [mailto:Suhen.Pather_at_strandbags.com.au] Sent: 12 September 01 05:50
To: Multiple recipients of list ORACLE-L Subject: help with deleting duplicate records from very large table

List,

I need to delete duplicate records from a very large table (60 millions records +). There would be about 3 million duplicate entries.
What is the quickest way to do this?

The syntax that I am using is
delete from invaudee
where rowid not in (select min(rowid) from invaudee group by audit_number);

This is taking a long time to run. I cannot see any entries in v$transaction for the delete.

There is no indexes on the INVAUDEE table. I created an index on the primary key column but it still takes forever to run.
I do not have the space to CTAS.

Or should I write the duplicates to an EXCEPTIONS table and perform the delete based on the entries in the EXCEPTIONS table.

Any help would be greatly appreciated.

Suhen

This email and any attachments may be confidential and the subject of legal professional privilege. Any disclosure, use, storage or copying of this email without the consent of the sender is strictly prohibited. Please notify the sender immediately if you are not the intended recipient and then delete the email from your inbox and do not disclose the contents to another person, use, copy or store the information in any medium.

Received on Wed Sep 12 2001 - 02:29:26 CDT