De-duplicating (was Re: Finding duplicate rows in a table)

From: Thomas B. Cox <tcox_at_netcom.com>
Date: Fri, 5 Aug 1994 15:51:51 GMT
Message-ID: <tcoxCu2K2G.MH7_at_netcom.com>


Edward Dansker <edansker_at_erseq.us.oracle.com> wrote:
>MarkB_at_aboy.demon.co.uk writes:
>>jameson_at_coupler.drink-eid "131W10000-JamesonJ(DR2483" writes:
 

>>> There are several tables in our database that contain duplicate
>>> records. Isn't there
>>> some simple SQL code to locate duplicate rows in a table? I would really
>>> appreciate it if you could post a sample.

   There are several ways. Some are faster than others.

>>select #your primary key here#, count (#your primary key here#)
>>from table
>>group by #your primary key here#
>>having count (#your primary key here#) > 1
>>/
 

>>The question is, are these rows truly identical throughout, or do they
>>simply have identical primary keys?

   If he has "duplicate records" as he claims, then all values in all    columns are the same for at least two rows.

>>If the former, it doesn't matter which you delete, therefore simply
>>delete the one with the max(rowid) in a correlated query.

   Close, but not quite. Deleting using max(rowid) only takes out one    row. What if you have three or four duplicates?

>>If the latter, you'll have to decide which you want to keep and
>>delete explicitly by rowid.
 

>>Stick a unique index on the key is my suggestion.

   This will prevent duplicates in future. Of course, you can't create    the unique index until after you resolve the duplicates.

>Here's the way to eliminate the duplicate rows with a correlated subquery:
 

>Assume first that col-name is the name of a candidate primary key column, eg.
>something that you can use to retain each unique row in your table (employee
>number, badge number, etc.)
 

>delete from table-name outer
> where rowid >
> (select min(rowid) from table-name
> where col-name = outer.col-name);

   Here is a better (or at least faster executing) method. It handles    concatenated primary keys as well as simple ones.

   delete from TABLE_1 where rowid in
   (
    select a.rowid from TABLE_1 a
    minus
    select min(b.rowid) from TABLE_1 b
     group by [one or more Primary Key columns]    )

   The parenthesized select takes all the rowids, and subtracts the    first rowid for each primary key value.

   This is not the easiest code to follow at first, especially if you're    not used to the MINUS statement. However, I've found it to run as    much as twice as fast as the subquery approach for tables up to    10,000 rows. And the ability to easily handle concatenated primary    keys can be handy.

   Remember that the *functional* primary key may be different from the    *nominal* primary key, *if* you have duplicate rows. For example, we    had tables that used triggers to auto-enter a sequential number in    the Primary Key column. We mistakenly loaded data twice. The    table's nominal PK column really did have unique values throughout.    But a combination of four other columns was supposed to be unique    also, i.e. it functioned as our Primary Key for purposes of    de-duplicating the table.

   Cheers.

    -Tom

-- 
Thomas Cox      tcox_at_netcom.com        Cox for Labor Commissioner
Received on Fri Aug 05 1994 - 17:51:51 CEST

Original text of this message