Re: "Seek and destroy" duplicate entries

From: Charles Hooper <hooperc2000_at_yahoo.com>
Date: 16 Oct 2006 06:13:05 -0700
Message-ID: <1161004384.987617.311430@b28g2000cwb.googlegroups.com>

Alex wrote:
> Charles Hooper schrieb:
>
> > Alex wrote:
> > > Hello NG,
> > >
> > > considering the following table-structure (import-table):
> > >
> > > Material_Nr || Alternative_Hierarchy
> > > -------------||--------------------
> > > A || 10xalpha
> > > A || 10xalpha
> > > A || 10xalpha
> > > A || 10xalpha
> > > B || 20xgamma
> > > B || 20xgamma
> > > B || 20xgamma
> > > B || 20xgamma
> > > ...
> > >
> > > The data I've to import isn't normalized. Therefore, the original
> > > datasets are inflated to nearly 140.000.
> > > While I'm able to delete about 130.000 datasets by killing those, who
> > > have NULL as "Alternative_Hierarchy", I've still got >8.000 datasets
> > > left.
> > >
> > > But a lot of these are still duplicates!
> > > Having only the mentioned columns, I failed to apply the method of
> > > S-A-D using the MIN()/MAX() functions.
> > >
> > > Can anyone provide me with a solution of killing the duplicates (except
> > > one) ?
> > > Any help will be appreciated!
> > >
> > >
> > > Alex Sauer
> >
> > The following should retrieve one row per MATERIAL_NR,
> > ALTERNATIVE_HIERARCHY combination:
> > SELECT
> > MATERIAL_NR,
> > ALTERNATIVE_HIERARCHY,
> > MIN(ROWID) ROW_ID
> > FROM
> > MY_TABLE
> > GROUP BY
> > MATERIAL_NR,
> > ALTERNATIVE_HIERARCHY;
> >
> > Using the above as a starting point:
> > SELECT
> > MATERIAL_NR,
> > ALTERNATIVE_HIERARCHY,
> > ROWID ROW_ID
> > FROM
> > MY_TABLE
> > MINUS
> > SELECT
> > MATERIAL_NR,
> > ALTERNATIVE_HIERARCHY,
> > MIN(ROWID) ROW_ID
> > FROM
> > MY_TABLE
> > GROUP BY
> > MATERIAL_NR,
> > ALTERNATIVE_HIERARCHY;
> >
> > The above SQL statement retrieves the MATERIAL_NR,
> > ALTERNATIVE_HIERARCHY of all rows, and then eliminates all of those
> > that were contained in the original query. Now to make use of the
> > results:
> > DELETE FROM
> > MY_TABLE
> > WHERE
> > (MATERIAL_NR,
> > ALTERNATIVE_HIERARCHY,
> > ROWID)
> > IN
> > (SELECT
> > MATERIAL_NR,
> > ALTERNATIVE_HIERARCHY,
> > ROWID ROW_ID
> > FROM
> > MY_TABLE
> > MINUS
> > SELECT
> > MATERIAL_NR,
> > ALTERNATIVE_HIERARCHY,
> > MIN(ROWID) ROW_ID
> > FROM
> > MY_TABLE
> > GROUP BY
> > MATERIAL_NR,
> > ALTERNATIVE_HIERARCHY);
> >
> > Obviously, test the results before executing the DELETE.
> >
> > Charles Hooper
> > PC Support Specialist
> > K&M Machine-Fabricating, Inc.
>
> Thank you! I'll try that solution.
> Can you tell me which one is speedier? "SELECT... MINUS SELECT ..." or
> "SELECT ... WHERE ... NOT IN (SELECT...)"?
>
>
> THX
> Alex Sauer

SELECT ... WHERE ... NOT IN (SELECT...) _should_ be much slower. The NOT IN (SELECT) clause will be executed/evaluated once per row of the input table.

Here is another way to find the rows to be deleted: SELECT
  MATERIAL_NR,
  ALTERNATIVE_HIERARCHY,
  ROW_ID
FROM
  (SELECT
    MATERIAL_NR,
    ALTERNATIVE_HIERARCHY,
    ROWID ROW_ID
  FROM
    MY_TABLE) M1,
  (SELECT
    MATERIAL_NR,
    ALTERNATIVE_HIERARCHY,
    MIN(ROWID) ROW_ID
  FROM
    MY_TABLE
  GROUP BY
    MATERIAL_NR,
    ALTERNATIVE_HIERARCHY) M2
WHERE
  M1.MATERIAL=M2.MATERIAL(+)

  AND M1.ALTERNATIVE_HIERARCHY=M2.ALTERNATIVE_HIERARCHY(+)
  AND M1.ROW_ID=M2.ROW_ID(+)
  AND M2.ROW_ID IS NULL;

The above creates a left outer join between the two SQL statements, and the "M2.ROW_ID IS NULL" specification basically means, find all rows that are in the first result set that are not in the second. This effectively does the same thing as the MINUS. Now is this faster than the MINUS method? a 10046 trace should be able to tell you.

Charles Hooper
PC Support Specialist
K&M Machine-Fabricating, Inc. Received on Mon Oct 16 2006 - 08:13:05 CDT