A Better Dedupe Script

From: Thomas B. Cox <tcox_at_netcom.com>
Date: Fri, 5 Aug 1994 22:48:20 GMT
Message-ID: <tcoxCu33CK.Ft0_at_netcom.com>


Matthew J Carr <carr_at_ist.flinders.edu.au> wrote:
>I have a table consisting of 3 columns: thread_id, mess_id, length
>I am trying to remove duplicate rows of thread_id and mess_id so only
>one copy remains
 

>Is it possible to construct a query using the rowid?
 

>e.g delete from th_table t1
> where rowid > (select rowid from th_table
> where thread_id = t1.thread_id
> and mess_id = t1.mess_id
> and rowid < t1.rowid)
 

>It returns an errror: ORA-01427: single-row subquery returns more than one row
 

>Can it be done this way????

Try this instead (reposted from elsewhere in this newsgroup):


   There are several ways. Some are faster than others.

   Deleting using max(rowid) only takes out one    row. What if you have three or four duplicates?

   Here is a better (or at least faster executing) method. It handles    concatenated primary keys as well as simple ones, and can handle    the case where there are more than just one duplicate (i.e.    triplicate, etc., rows).

   delete from TABLE_1 where rowid in
   (
    select a.rowid from TABLE_1 a
    minus
    select min(b.rowid) from TABLE_1 b
     group by [one or more Primary Key columns]    )

   The parenthesized select takes all the rowids, and subtracts the    first rowid for each primary key value.

   This is not the easiest code to follow at first, especially if you're    not used to the MINUS statement. However, I've found it to run as    much as twice as fast as the subquery approach for tables up to    10,000 rows. And the ability to easily handle concatenated primary    keys can be handy.

   Remember that the *functional* primary key may be different from the    *nominal* primary key, *if* you have duplicate rows. For example, we    had tables that used triggers to auto-enter a sequential number in    the Primary Key column. We mistakenly loaded data twice. The    table's nominal PK column really did have unique values throughout.    But a combination of four other columns was supposed to be unique    also, i.e. it functioned as our Primary Key for purposes of    de-duplicating the table.

   Cheers.

    -Tom

-- 
Thomas Cox      tcox_at_netcom.com        Cox for Labor Commissioner
Received on Sat Aug 06 1994 - 00:48:20 CEST

Original text of this message