Re: Remove duplicates from a field possible?

From: Mladen Gogala <gogala_at_sbcglobal.net>
Date: Mon, 06 Mar 2006 21:24:42 GMT
Message-Id: <pan.2006.03.06.21.24.40.976859@sbcglobal.net>

On Mon, 06 Mar 2006 17:56:53 +0100, Michel Cadot wrote:

>
> "Mark Milke" <mark_milke_at_yahoo.com> a écrit dans le message de news: 1141630427.043643.309240_at_u72g2000cwu.googlegroups.com...
> | Hi all,
> |
> | we store in a var field values in the format 123; 345; 567; In some
> | records the field contains duplicates, i.e. 123; 345; 123; 567; Is it
> | possible to selelct those records somehow, or even better get rid of
> | the duplicates?
> |
> | Mark
> |
>
> SQL> select * from t order by id;
> ID VALS
> ---------- ------------------------------
> 1 123; 345; 123; 567;
> 2 abc; cde; abc; efg;
>
> 2 rows selected.
>
> SQL> def max_vals=10
> SQL> col vals format a30
> SQL> with
> 2 step1 as (
> 3 select id, vals, rn
> 4 from t, (select rownum rn from dual connect by level <= &max_vals)
> 5 ),
> 6 step2 as (
> 7 select distinct
> 8 id,
> 9 trim(substr(vals,
> 10 decode(rn,1,0,instr(vals,';',1,rn-1))+1,
> 11 instr(vals,';',1,rn)
> 12 -decode(rn,1,0,instr(vals,';',1,rn-1))-1)) val
> 13 from step1
> 14 where instr(vals,';',1,rn) > 0
> 15 ),
> 16 step3 as (
> 17 select id, val,
> 18 row_number () over (partition by id order by val) curr,
> 19 row_number () over (partition by id order by val)-1 prev
> 20 from step2
> 21 )
> 22 select id, max(substr(sys_connect_by_path(val,';'),2))||';' vals
> 23 from step3
> 24 connect by prior id = id and prior curr = prev
> 25 start with prev = 0
> 26 group by id
> 27 /
> ID VALS
> ---------- ------------------------------
> 1 123;345;567;
> 2 abc;cde;efg;
>
> 2 rows selected.
>
> Regards
> Michel Cadot

Allow me to propose a modified kyteian solution. Modification consists of using the EXCEPTIONS table, which will significantly speed up the whole thing in cases where the base table is large and there are relatively few duplicate rows. Column naming (mad_max) was my own and not Tom's. I call it a kyteian solution, because Tom Kyte wrote about using the analytic functions for duplicate rows removal in his book "Effective Oracle by Design". If the underlying table has 100M rows and there are 20,000 duplicates, my procedure will be the fastest way of doing it.

SQL> create table emp1 as select * from emp;

Table created.

SQL> insert into emp1 select * from emp;

14 rows created.

SQL> / 14 rows created.

SQL> commit;

Commit complete.

SQL> @?/rdbms/admin/utlexcpt

Table created.

SQL> alter table emp1 add constraint emp1_pk primary key(empno) 2 exceptions into exceptions;
alter table emp1 add constraint emp1_pk primary key(empno)

ERROR at line 1:
ORA-02437: cannot validate (SCOTT.EMP1_PK) - primary key violated

SQL> delete from emp1 where rowid in
2 (select rowid from

  3      (select rowid,
  4              max(rowid) over (partition by empno) as mad_max
  5       from emp1
  6       where rowid in (select row_id from exceptions))

7 where rowid<mad_max);

28 rows deleted.

SQL> commit;

Commit complete.

SQL> alter table emp1 add constraint emp1_pk primary key(empno) 2 /

Table altered.

SQL> select count(*) from emp1;

COUNT(*)

SQL>

-- 
http://www.mgogala.com

Received on Mon Mar 06 2006 - 15:24:42 CST