Re: "Seek and destroy" duplicate entries

From: Alex <dead.man.walking_at_gmx.de>
Date: 16 Oct 2006 06:20:12 -0700
Message-ID: <1161004812.191876.326210@m73g2000cwd.googlegroups.com>

Charles Hooper schrieb:

> Alex wrote:
> > Charles Hooper schrieb:
> >
> > > Alex wrote:
> > > > Hello NG,
> > > >
> > > > considering the following table-structure (import-table):
> > > >
> > > > Material_Nr || Alternative_Hierarchy
> > > > -------------||--------------------
> > > > A || 10xalpha
> > > > A || 10xalpha
> > > > A || 10xalpha
> > > > A || 10xalpha
> > > > B || 20xgamma
> > > > B || 20xgamma
> > > > B || 20xgamma
> > > > B || 20xgamma
> > > > ...
> > > >
> > > > The data I've to import isn't normalized. Therefore, the original
> > > > datasets are inflated to nearly 140.000.
> > > > While I'm able to delete about 130.000 datasets by killing those, who
> > > > have NULL as "Alternative_Hierarchy", I've still got >8.000 datasets
> > > > left.
> > > >
> > > > But a lot of these are still duplicates!
> > > > Having only the mentioned columns, I failed to apply the method of
> > > > S-A-D using the MIN()/MAX() functions.
> > > >
> > > > Can anyone provide me with a solution of killing the duplicates (except
> > > > one) ?
> > > > Any help will be appreciated!
> > > >
> > > >
> > > > Alex Sauer
> > >
> > > The following should retrieve one row per MATERIAL_NR,
> > > ALTERNATIVE_HIERARCHY combination:
> > > SELECT
> > > MATERIAL_NR,
> > > ALTERNATIVE_HIERARCHY,
> > > MIN(ROWID) ROW_ID
> > > FROM
> > > MY_TABLE
> > > GROUP BY
> > > MATERIAL_NR,
> > > ALTERNATIVE_HIERARCHY;
> > >
> > > Using the above as a starting point:
> > > SELECT
> > > MATERIAL_NR,
> > > ALTERNATIVE_HIERARCHY,
> > > ROWID ROW_ID
> > > FROM
> > > MY_TABLE
> > > MINUS
> > > SELECT
> > > MATERIAL_NR,
> > > ALTERNATIVE_HIERARCHY,
> > > MIN(ROWID) ROW_ID
> > > FROM
> > > MY_TABLE
> > > GROUP BY
> > > MATERIAL_NR,
> > > ALTERNATIVE_HIERARCHY;
> > >
> > > The above SQL statement retrieves the MATERIAL_NR,
> > > ALTERNATIVE_HIERARCHY of all rows, and then eliminates all of those
> > > that were contained in the original query. Now to make use of the
> > > results:
> > > DELETE FROM
> > > MY_TABLE
> > > WHERE
> > > (MATERIAL_NR,
> > > ALTERNATIVE_HIERARCHY,
> > > ROWID)
> > > IN
> > > (SELECT
> > > MATERIAL_NR,
> > > ALTERNATIVE_HIERARCHY,
> > > ROWID ROW_ID
> > > FROM
> > > MY_TABLE
> > > MINUS
> > > SELECT
> > > MATERIAL_NR,
> > > ALTERNATIVE_HIERARCHY,
> > > MIN(ROWID) ROW_ID
> > > FROM
> > > MY_TABLE
> > > GROUP BY
> > > MATERIAL_NR,
> > > ALTERNATIVE_HIERARCHY);
> > >
> > > Obviously, test the results before executing the DELETE.
> > >
> > > Charles Hooper
> > > PC Support Specialist
> > > K&M Machine-Fabricating, Inc.
> >
> > Thank you! I'll try that solution.
> > Can you tell me which one is speedier? "SELECT... MINUS SELECT ..." or
> > "SELECT ... WHERE ... NOT IN (SELECT...)"?
> >
> >
> > THX
> > Alex Sauer

> SELECT ... WHERE ... NOT IN (SELECT...) _should_ be much slower. The
> NOT IN (SELECT) clause will be executed/evaluated once per row of the
> input table.

> Here is another way to find the rows to be deleted:
> SELECT
> MATERIAL_NR,
> ALTERNATIVE_HIERARCHY,
> ROW_ID
> FROM
> (SELECT
> MATERIAL_NR,
> ALTERNATIVE_HIERARCHY,
> ROWID ROW_ID
> FROM
> MY_TABLE) M1,
> (SELECT
> MATERIAL_NR,
> ALTERNATIVE_HIERARCHY,
> MIN(ROWID) ROW_ID
> FROM
> MY_TABLE
> GROUP BY
> MATERIAL_NR,
> ALTERNATIVE_HIERARCHY) M2
> WHERE
> M1.MATERIAL=M2.MATERIAL(+)
> AND M1.ALTERNATIVE_HIERARCHY=M2.ALTERNATIVE_HIERARCHY(+)
> AND M1.ROW_ID=M2.ROW_ID(+)
> AND M2.ROW_ID IS NULL;

> The above creates a left outer join between the two SQL statements, and
> the "M2.ROW_ID IS NULL" specification basically means, find all rows
> that are in the first result set that are not in the second. This
> effectively does the same thing as the MINUS. Now is this faster than
> the MINUS method? a 10046 trace should be able to tell you.

> Charles Hooper
> PC Support Specialist
> K&M Machine-Fabricating, Inc.

Thanks again!

Alex Sauer Received on Mon Oct 16 2006 - 08:20:12 CDT