De-duplicating (was Re: Finding duplicate rows in a table)
Date: Fri, 5 Aug 1994 15:51:51 GMT
Message-ID: <tcoxCu2K2G.MH7_at_netcom.com>
Edward Dansker <edansker_at_erseq.us.oracle.com> wrote:
>MarkB_at_aboy.demon.co.uk writes:
>>jameson_at_coupler.drink-eid "131W10000-JamesonJ(DR2483" writes:
>>> There are several tables in our database that contain duplicate
>>> records. Isn't there
>>> some simple SQL code to locate duplicate rows in a table? I would really
>>> appreciate it if you could post a sample.
There are several ways. Some are faster than others.
>>select #your primary key here#, count (#your primary key here#)
>>from table
>>group by #your primary key here#
>>having count (#your primary key here#) > 1
>>/
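A minimal runnable sketch of the GROUP BY / HAVING duplicate finder above, adapted to Python's sqlite3 module (the EMP table and BADGE_NO key column are made up for illustration; the SQL itself is the same shape as the Oracle query):

```python
import sqlite3

# In-memory table with one deliberately duplicated key (100).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE EMP (BADGE_NO INTEGER, NAME TEXT)")
conn.executemany("INSERT INTO EMP VALUES (?, ?)",
                 [(100, "Smith"), (101, "Jones"),
                  (100, "Smith"), (102, "Lee")])

# Group on the key and keep only groups with more than one row.
dups = conn.execute("""
    SELECT BADGE_NO, COUNT(BADGE_NO)
      FROM EMP
     GROUP BY BADGE_NO
    HAVING COUNT(BADGE_NO) > 1
""").fetchall()

print(dups)  # each tuple: (duplicated key value, number of copies)
```

This only tells you *which* key values are duplicated and how many copies exist; the posts below cover actually removing them.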
>>The question is, are these rows truly identical throughout, or do they
>>simply have identical primary keys?
>>If the former, it doesn't matter which you delete, therefore simply
>>delete the one with the max(rowid) in a correlated query.
>>If the latter, you'll have to decide which you want to keep and
>>delete explicitly by rowid.
>>Stick a unique index on the key is my suggestion.
This will prevent duplicates in the future. Of course, you can't create the unique index until after you've resolved the duplicates.
>Here's the way to eliminate the duplicate rows with a correlated subquery:
>Assume first that col-name is the name of a candidate primary key column, e.g.
>something you can use to identify each unique row in your table (employee
>number, badge number, etc.)
>delete from table-name outer
> where rowid >
> (select min(rowid) from table-name
> where col-name = outer.col-name);
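Here is the correlated-subquery delete in runnable form, again sketched with sqlite3 (SQLite's rowid behaves like Oracle's for this purpose; table and column names are invented). The idea is the same: keep the row with the smallest rowid for each key value and delete the rest:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE EMP (BADGE_NO INTEGER, NAME TEXT)")
conn.executemany("INSERT INTO EMP VALUES (?, ?)",
                 [(100, "Smith"), (101, "Jones"), (100, "Smith")])

# Delete every row whose rowid is greater than the smallest rowid
# sharing the same key value -- i.e. keep the first copy of each key.
conn.execute("""
    DELETE FROM EMP
     WHERE rowid > (SELECT MIN(b.rowid) FROM EMP b
                     WHERE b.BADGE_NO = EMP.BADGE_NO)
""")

remaining = conn.execute("SELECT COUNT(*) FROM EMP").fetchone()[0]
print(remaining)  # 2 rows survive: one Smith, one Jones
```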
Here is a better (or at least faster-executing) method. It handles concatenated primary keys as well as simple ones.
delete from TABLE_1 where rowid in
(
select a.rowid from TABLE_1 a
minus
select min(b.rowid) from TABLE_1 b
group by [one or more Primary Key columns]
)
The parenthesized select takes all the rowids, and subtracts the first rowid for each primary key value.
This is not the easiest code to follow at first, especially if you're not used to the MINUS set operator. However, I've found it to run as much as twice as fast as the correlated-subquery approach on tables of up to 10,000 rows, and the ability to handle concatenated primary keys easily can be handy.
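To make the set-difference approach concrete, here is a sketch using sqlite3, where EXCEPT plays the role of Oracle's MINUS. The two-column (DEPT, BADGE_NO) key is invented to show how a concatenated key drops straight into the GROUP BY:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE TABLE_1 (DEPT TEXT, BADGE_NO INTEGER, NAME TEXT)")
conn.executemany("INSERT INTO TABLE_1 VALUES (?, ?, ?)",
                 [("A", 100, "Smith"), ("A", 100, "Smith"),
                  ("B", 100, "Chen"), ("A", 101, "Jones")])

# All rowids, minus the first rowid for each (DEPT, BADGE_NO) pair,
# leaves exactly the redundant copies; delete those.
conn.execute("""
    DELETE FROM TABLE_1 WHERE rowid IN
    (
      SELECT a.rowid FROM TABLE_1 a
      EXCEPT
      SELECT MIN(b.rowid) FROM TABLE_1 b
      GROUP BY b.DEPT, b.BADGE_NO
    )
""")

rows = conn.execute(
    "SELECT DEPT, BADGE_NO FROM TABLE_1 ORDER BY DEPT, BADGE_NO").fetchall()
print(rows)  # one row per distinct (DEPT, BADGE_NO) pair
```

Note how rows that merely *share* one column (both 100-badges in different departments) are untouched; only the row that matches on the full concatenated key is removed.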
Remember that the *functional* primary key may be different from the *nominal* primary key, *if* you have duplicate rows. For example, we had tables that used triggers to auto-enter a sequential number in the Primary Key column. We mistakenly loaded data twice. The table's nominal PK column really did have unique values throughout. But a combination of four other columns was supposed to be unique also, i.e. it functioned as our Primary Key for purposes of de-duplicating the table.
Cheers.
-Tom
-- Thomas Cox tcox_at_netcom.com Cox for Labor Commissioner