Re: de-dup process

From: A Ebadi
Date: Thu, 14 Dec 2006 09:47:03 -0800 (PST)

We cannot clean the data before loading, as it comes from many different sources that don't know about each other.

Thanks to everyone who replied. We are still testing to find the best method!

Tony van Lingen wrote:

A Ebadi wrote:

> Biggest problem we've faced in coming up with a solution is none of
> the solutions so far scale. In other words, things are fine if we
> have a 20 million row table with 2-3 million duplicates - runs in
> 10-15 minutes. However, trying it for 100+ million row table - it
> runs for hrs!

You are of course deleting without redo where you can? When deleting a row, Oracle creates redo information which you, having done a direct load, will not need. Generating it takes time.

> We've even had another tool (Informatica) select out the ROWIDs of the
> duplicates into a separate table then we are using PL/SQL cursor to
> delete those rows from the large table, but this doesn't scale either!

If you mean that deleting 20 million rows from a huge table is not as fast as deleting 2 million, then no, nothing will scale. Try buying more iron and using parallel query.
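
For comparison with the row-by-row cursor delete described above, here is a minimal sketch of removing duplicates in a single set-based statement keyed on row identifiers. It is only an illustration: it uses SQLite (through Python's sqlite3 module) as a stand-in so that it runs anywhere, and the table big_table and its columns cust_id and email are hypothetical. Oracle has a comparable ROWID pseudocolumn, but nothing here says how such a statement would perform on a 100+ million row table; deleting more rows will still cost more.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the real table (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE big_table (cust_id INTEGER, email TEXT)")
conn.executemany(
    "INSERT INTO big_table VALUES (?, ?)",
    [(1, "a@x"), (2, "b@x"), (1, "a@x"), (3, "c@x"), (2, "b@x")],
)

# One statement instead of a per-row loop: within each group of identical
# (cust_id, email) rows, keep the row with the lowest rowid and delete the rest.
cur = conn.execute(
    """
    DELETE FROM big_table
    WHERE rowid NOT IN (
        SELECT MIN(rowid) FROM big_table GROUP BY cust_id, email
    )
    """
)
deleted = cur.rowcount
conn.commit()

remaining = conn.execute(
    "SELECT cust_id, email FROM big_table ORDER BY cust_id"
).fetchall()
print(deleted, remaining)
```

The point of the sketch is only that the database picks the survivors and removes the rest in one pass, rather than a client loop issuing one delete per ROWID.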

Why don't you look at cleansing the dataset before loading it? For example, use 'sort -u' on the file to get rid of duplicate lines. It might be quicker than loading everything and deleting later on...
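
If the extracts from the different sources can be brought together as flat files before the load, the same idea as 'sort -u' can be sketched in a few lines of Python. This is a rough sketch under assumptions, not a tested loader: the helper name unique_lines and the sample records are made up for illustration, and a "duplicate" here means a byte-for-byte identical line. Unlike 'sort -u', it keeps first-seen order and holds one 20-byte digest per distinct line in memory, so for hundreds of millions of rows a disk-based sort may be the safer choice.

```python
import hashlib
from typing import Iterable, Iterator


def unique_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield each distinct line once, in first-seen order."""
    seen = set()
    for line in lines:
        record = line.rstrip("\n")
        # Store a fixed-size digest instead of the full line to save memory.
        digest = hashlib.sha1(record.encode("utf-8")).digest()
        if digest not in seen:
            seen.add(digest)
            yield record


# Hypothetical sample records standing in for lines read from several files.
sample = ["1,alice\n", "2,bob\n", "1,alice\n", "3,carol\n", "2,bob\n"]
deduped = list(unique_lines(sample))
print(deduped)
```

To cover several source files at once, the lines of all files can be chained into one iterable (for example with itertools.chain) and passed through the same function before the load.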


Tony van Lingen
Tech One Contractor
Information Management
Corporate Development Division
Environmental Protection Agency

Ph: (07) 3234 1972
Fax: (07) 3227 6534
Mobile: 0413 701 284

Received on Thu Dec 14 2006 - 11:47:03 CST
