A Better Dedupe Script

From: Thomas B. Cox <tcox_at_netcom.com>
Date: Fri, 5 Aug 1994 22:48:20 GMT
Message-ID: <tcoxCu33CK.Ft0_at_netcom.com>

Matthew J Carr <carr_at_ist.flinders.edu.au> wrote:
>I have a table consisting of 3 columns: thread_id, mess_id, length
>I am trying to remove duplicate rows of thread_id and mess_id so only
>one copy remains

>Is it possible to construct a query using the rowid?

>e.g delete from th_table t1
> where rowid > (select rowid from th_table
> where thread_id = t1.thread_id
> and mess_id = t1.mess_id
> and rowid < t1.rowid)

>It returns an errror: ORA-01427: single-row subquery returns more than one row

>Can it be done this way????

Try this instead (reposted from elsewhere in this newsgroup):

There are several ways. Some are faster than others.

Deleting using max(rowid) only takes out one row. What if you have three or four duplicates?

Here is a better (or at least faster executing) method. It handles concatenated primary keys as well as simple ones, and can handle the case where there are more than just one duplicate (i.e. triplicate, etc., rows).

   delete from TABLE_1 where rowid in
   (
    select a.rowid from TABLE_1 a
    minus
    select min(b.rowid) from TABLE_1 b
     group by [one or more Primary Key columns]    )

The parenthesized select takes all the rowids, and subtracts the first rowid for each primary key value.

This is not the easiest code to follow at first, especially if you're not used to the MINUS statement. However, I've found it to run as much as twice as fast as the subquery approach for tables up to 10,000 rows. And the ability to easily handle concatenated primary keys can be handy.

Remember that the *functional* primary key may be different from the *nominal* primary key, *if* you have duplicate rows. For example, we had tables that used triggers to auto-enter a sequential number in the Primary Key column. We mistakenly loaded data twice. The table's nominal PK column really did have unique values throughout. But a combination of four other columns was supposed to be unique also, i.e. it functioned as our Primary Key for purposes of de-duplicating the table.

Cheers.

-Tom

-- 
Thomas Cox      tcox_at_netcom.com        Cox for Labor Commissioner

Received on Sat Aug 06 1994 - 00:48:20 CEST

This message: [ Message body ]
Next message: jfr_at_trwlasd.com: "Oracle confuses character string '' and NULL values"
Previous message: G Suryanarayan: "ORACLE Ver 6.0 DOS Installation problem..."

Contemporary messages sorted: [ by date ] [ by thread ] [ by subject ] [ by author ]

Original text of this message