RE: de-dup process

From: Ken Naim <>
Date: Thu, 14 Dec 2006 13:35:28 -0500
Message-ID: <003e01c71fae$a53e90e0$66b016ac@KenHome>

For the initial load, use an external table and do an insert as select distinct. After the initial load, use an external table for the file and use the merge statement to insert only the rows you want. Alternatively, union the new data with the existing data, insert the result into a third table, then rename the tables and truncate the original, swapping back and forth nightly between the two. The select from the external table can be optimized to exclude values known to be duplicates, e.g. anything over 6 months old or whatever criteria make sense for you.
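
A minimal sketch of the two load steps described above, written against Python's built-in sqlite3 module rather than Oracle so that it can be run anywhere. Everything named here is an assumption for illustration: the `staging` table stands in for the external table, the `target` table and its `id` and `payload` columns are hypothetical, and a duplicate is assumed to mean a row whose columns all match. SQLite has no MERGE statement, so an INSERT ... WHERE NOT EXISTS stands in for the insert-only branch of a merge.

```python
import sqlite3


def make_db():
    """Create an in-memory database with a staging table and a target table."""
    conn = sqlite3.connect(":memory:")
    # 'staging' plays the role of the external table over the incoming file.
    conn.execute("CREATE TABLE staging (id INTEGER, payload TEXT)")
    conn.execute("CREATE TABLE target (id INTEGER, payload TEXT)")
    return conn


def stage(conn, rows):
    """Replace the staging contents with the rows of the latest file."""
    conn.execute("DELETE FROM staging")
    conn.executemany("INSERT INTO staging (id, payload) VALUES (?, ?)", rows)


def initial_load(conn):
    """Initial load: keep one copy of each distinct staged row."""
    conn.execute(
        "INSERT INTO target (id, payload) "
        "SELECT DISTINCT id, payload FROM staging"
    )
    conn.commit()


def nightly_load(conn):
    """Later loads: insert only staged rows that are not already in target."""
    conn.execute(
        """
        INSERT INTO target (id, payload)
        SELECT DISTINCT s.id, s.payload
        FROM staging s
        WHERE NOT EXISTS (
            SELECT 1 FROM target t
            WHERE t.id = s.id AND t.payload = s.payload
        )
        """
    )
    conn.commit()


def target_rows(conn):
    """Return the target contents in a stable order."""
    return conn.execute(
        "SELECT id, payload FROM target ORDER BY id, payload"
    ).fetchall()
```

The point of the shape is that duplicates never reach the big table, so no large delete is needed afterwards. Adding a filter on the staging select (a date cutoff, for instance) would narrow the rows that have to be compared at all.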


From: [] On Behalf Of A Ebadi
Sent: Thursday, December 14, 2006 12:47 PM
To: Tony van Lingen
Cc:
Subject: Re: de-dup process  

We cannot clean the data before loading, as it comes from many different sources that don't know about each other.

Thanks to everyone who replied. We are still testing to find the best method!

Tony van Lingen <> wrote:

A Ebadi wrote:

> Biggest problem we've faced in coming up with a solution is none of
> the solutions so far scale. In other words, things are fine if we
> have a 20 million row table with 2-3 million duplicates - runs in
> 10-15 minutes. However, trying it for 100+ million row table - it
> runs for hrs!

You do, of course, delete non-redoable? When deleting a row, Oracle will create redo info which you, having done a direct load, will not need. This'll take time.

> We've even had another tool (Informatica) select out the ROWIDs of the
> duplicates into a separate table then we are using PL/SQL cursor to
> delete those rows from the large table, but this doesn't scale either!

If you mean that deleting 20 million rows from a huge table is not as fast as deleting 2, then no, nothing will scale. Try buying more iron and using parallel query.

Why don't you look at cleansing the dataset before loading it? e.g. use 'sort -u' on the file to get rid of duplicate lines. Might be quicker than loading everything and deleting later on...
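
A minimal sketch of that kind of pre-load cleanup, in Python rather than with 'sort -u' itself. The function name and file paths are hypothetical. Unlike 'sort -u', it keeps lines in first-seen order instead of sorting them, and it holds every distinct line in memory, so it assumes the set of distinct lines fits in RAM; for files in the 100+ million row range an external sort may be the more realistic choice.

```python
def dedup_lines(src_path, dst_path):
    """Copy src_path to dst_path, dropping repeated lines.

    Returns the number of distinct lines written.
    """
    seen = set()  # every distinct line seen so far (held in memory)
    kept = 0
    with open(src_path, "r", encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            # Compare without the trailing newline so a final line lacking
            # one still matches earlier copies of the same text.
            key = line.rstrip("\n")
            if key not in seen:
                seen.add(key)
                dst.write(key + "\n")
                kept += 1
    return kept
```

This only removes duplicates within a single file; rows that repeat across files from different sources would still have to be caught at load time.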


Tony van Lingen
Tech One Contractor
Information Management
Corporate Development Division
Environmental Protection Agency

Ph: (07) 3234 1972
Fax: (07) 3227 6534
Mobile: 0413 701 284

Received on Thu Dec 14 2006 - 12:35:28 CST
