Re: find duplicates

From: Billy Verreynne <vslabs_at_onwe.co.za>
Date: Wed, 15 Jan 2003 11:41:19 +0200
Message-ID: <b033b8$r6u$1@ctb-nnrp2.saix.net>

s Lehaire wrote:

> Hi I have to create a primary key on a table but I've got a duplicate data
> in this table
> so i try to find a sql request to find the duplicat rows
> can you help me ?

IMO all the answers thus far is partially right. Partially because they only address the identification issue. They do not adress actually finding the actual duplicate rows.

There's a difference. What do you do after you run that SELECT GROUP BY sql statement? Does the GROUP BY allow you immediate identification of duplicates? Nope. It tells you how bad the dup problem is - nothing more.. unless you the GROUP BY to run another SQL to extract or identify the duplicate row.

What I want is the actual row identification of duplicates from the start, allowing me to inspect these and determine which row you want and which row I want to delete.

I do the following then dealing with dups.

CREATE TABLE dups
NOLOGGING AS
SELECT
t1.rowid "BASE_ROW_ID",
t2.rowid "DUP_ROW_ID
FROM foo t1,

foo t2

WHERE t1.col1 = t2.col1
AND   t1.col2 = t2.col2
AND   t1.rowid < t2.rowid

Identifying the dups is done via joining the columns that are duplicated. In the above example, columns 1 and 2.

To ensure that I do not join row 1 with itself as a duplicate, I add the rowid check.

If there are multiple duplicates, then each combo will be represented. E.g. if rows 1, 3 and 5 are duplicates, the above result will have rows 1 & 3 as a duplicate, row 1 & 5 and row 3 & 5.

If you want the results to only contain row 1 & 3 and row 1 & 5, then you can wrap the above SELECT into a GROUP BY when creating the result table, e.g.

CREATE TABLE dups
NOLOGGING AS
SELECT
  MIN(base_row_id) "BASE_ROW_ID",
  dup_row_id
FROM (
  SELECT
    t1.rowid "BASE_ROW_ID",
    t2.rowid "DUP_ROW_ID
  FROM foo t1,

foo t2

  WHERE t1.col1 = t2.col1
  AND   t1.col2 = t2.col2
  AND   t1.rowid < t2.rowid   
     )

GROUP BY
dup_row_id

As the results contains the actual row identifier, you have fast and direct access to the source rows for displaying, deleting, updating, etc.

--
Billy

Received on Wed Jan 15 2003 - 03:41:19 CST