Oracle FAQ Your Portal to the Oracle Knowledge Grid

Re: de-dup process

From: <tboss_at_bossconsulting.com>
Date: Tue, 12 Dec 2006 20:42:38 -0500 (EST)
Message-Id: <200612130142.kBD1gcXB016411@piccollo.p6m7g8.net>


From asktom, the best way I've found is to use Tom's little code snippet below:

-- keep the row with the lowest rowid in each group, delete the rest
delete from your_huge_table
where rowid in
  (select rid
     from
       (select rowid rid,
               row_number() over
                 (partition by varchar_that_defines_duplicates
                  order by rowid) rn
          from your_huge_table
       )
    where rn <> 1
  )
/

It handles keys that have more than two duplicate rows (every row but one per key is deleted), and it works far faster than any not exists, minus, or cursor-based solution.

A few other options exist, if you can use them, that may be faster:

1. create table as select distinct; probably faster than doing any sort of deleting. (Note that select distinct only collapses rows that are identical in every selected column.)

2. Alter table mytab enable constraint PK exceptions into exceptions; A better way: much faster for large tables, and it lets you audit the duplicate rows by examining the exceptions table. (You must run $ORACLE_HOME/rdbms/admin/utlexcpt.sql before doing this.) Con: the exceptions table will contain ALL of the rows in each duplicate set from the source table, including the one you want to keep ... you'll have to delete the extras manually.

3. Use unix. Perhaps the purest, fastest way is to use the unix sort/uniq commands:
a. unload the data to a flat file, e.g. select it out delimited
b. sort filename | uniq > newfile
c. sqlload back in.

This is only a viable option if your table is "thin" and has just a few columns.
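
Note that sort | uniq only removes lines that are identical from start to end. If the duplicates are defined by a single key column, as in the original question, sort can be told to compare on that field alone. Here is a rough sketch, assuming a comma-delimited unload with the key in the first field; the file names and sample rows are made up for illustration.

```shell
# Hypothetical sample unload: key column first, comma-delimited.
printf '%s\n' \
  'A100,first' \
  'B200,second' \
  'A100,third' \
  'B200,second' \
  'C300,fifth' > unload.dat

# Whole-line dedup: only the repeated 'B200,second' line is collapsed.
# LC_ALL=C makes sort compare raw bytes, which is usually faster.
LC_ALL=C sort unload.dat | uniq > whole_line_dedup.dat

# Key-based dedup: -t, sets the field delimiter, -k1,1 limits the sort key
# to field 1, and -u keeps a single line for each distinct key.
LC_ALL=C sort -t, -k1,1 -u unload.dat > key_dedup.dat
```

Which of the lines sharing a key survives is up to sort, so if you care which row is kept (say, the newest), this is not the right tool and the row_number() delete is the safer choice.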

hope this helps, todd

>
> We have a huge table (> 160 million rows) which has about 20 million duplicate rows that we need to delete. What is the most efficient way to do this as we will need to do this daily?
> A single varchar2(30) column is used to identify duplicates. We could possibly have > 2 rows of duplicates.
>
> We are doing direct path load so no unique key indexes can be put on the table to take care of the duplicates.
>
> Platform: Oracle 10G RAC (2 node) on Solaris 10.
>

--
http://www.freelists.org/webpage/oracle-l
Received on Tue Dec 12 2006 - 19:42:38 CST
