Home » SQL & PL/SQL » SQL & PL/SQL » Odd case of removing duplicates within a set. (Oracle, 11.2.0.4.0, Windows 7)
Odd case of removing duplicates within a set. [message #646780] Fri, 08 January 2016 10:24 Go to next message
tmcallister
Messages: 107
Registered: December 2007
Senior Member
With data like this

WITH b1 AS
       (SELECT '1' gid, 'A' pid FROM DUAL
        UNION
        SELECT '2', 'A' FROM DUAL
        UNION
        SELECT '2', 'B' FROM DUAL
        UNION
        SELECT '3', 'D' FROM DUAL
        UNION
        SELECT '3', 'E' FROM DUAL
        UNION
        SELECT '3', 'F' FROM DUAL
        UNION
        SELECT '4', 'A' FROM DUAL
        UNION
        SELECT '4', 'B' FROM DUAL
        UNION
        SELECT '5', 'E' FROM DUAL
        UNION
        SELECT '5', 'F' FROM DUAL)
SELECT   *
FROM     b1;


The GIDs denote different sets; and I need to remove both complete duplicates and subsets are are covered by a superset. I.e. GID 1 should be removed since 'A' is part of GIDs 2, 4. One of GID 2, 4 should be removed as they are duplicates. and GID 5 should be removed as 'E'&'F' are included in GID 3.

Desired Output
GID	PID
2	A
2	B
3	D
3	E
3	F

or
GID	PID
3	D
3	E
3	F
4	A
4	B


Thanks!

[Updated on: Fri, 08 January 2016 10:27]

Report message to a moderator

Re: Odd case of removing duplicates within a set. [message #646783 is a reply to message #646780] Fri, 08 January 2016 11:26 Go to previous messageGo to next message
tmcallister
Messages: 107
Registered: December 2007
Senior Member
The following 'works' but absolutely disgusts me. I feel ashamed to provide it; please help me find a better way...

WITH b1 AS
       (SELECT '1' gid, 'A' pid FROM DUAL
        UNION
        SELECT '2', 'A' FROM DUAL
        UNION
        SELECT '2', 'B' FROM DUAL
        UNION
        SELECT '3', 'D' FROM DUAL
        UNION
        SELECT '3', 'E' FROM DUAL
        UNION
        SELECT '3', 'F' FROM DUAL
        UNION
        SELECT '4', 'A' FROM DUAL
        UNION
        SELECT '4', 'B' FROM DUAL
        UNION
        SELECT '5', 'E' FROM DUAL
        UNION
        SELECT '5', 'F' FROM DUAL
        UNION
        SELECT '6', 'E' FROM DUAL
        UNION
        SELECT '6', 'F' FROM DUAL),
     b2 AS
       (SELECT   c.gid cgid, c.pid cpid, d.gid dgid, d.pid dpid, COUNT(DISTINCT c.pid) OVER (PARTITION BY c.gid) ccnt, COUNT(DISTINCT d.pid) OVER (PARTITION BY d.gid) dcnt
        FROM     b1 c JOIN b1 d ON (c.gid != d.gid)),
     b3 AS
       (SELECT     b.*, COUNT(DISTINCT dpid) OVER (PARTITION BY cgid, dgid) dmatch
        FROM       b2 b
        WHERE      cpid = dpid),
     b4 AS
       (SELECT   b.*,
                 CASE
                   WHEN ccnt = dmatch AND dcnt > ccnt THEN 1
                   ELSE 0
                 END
                   is_subset,
                 CASE
                   WHEN ccnt = dcnt AND ccnt = dmatch THEN 1
                   ELSE 0
                 END
                   is_equal
        FROM     b3 b),
     b5 AS
       (SELECT   b.*, LISTAGG(cpid, ',') WITHIN GROUP (ORDER BY cpid) OVER (PARTITION BY cgid, dgid) gset
        FROM     b4 b),
     b6 AS
       (SELECT   cgid,
                 cpid,
                 dgid,
                 dpid,
                 ccnt,
                 dcnt,
                 dmatch,
                 is_subset,
                 MAX(is_subset) OVER (PARTITION BY cgid) has_subset,
                 is_equal,
                 MAX(is_equal) OVER (PARTITION BY cgid) has_equal,
                 RANK() OVER(PARTITION BY gset ORDER BY cgid) equal_id
        FROM     b5 b),
     b7 AS
       (SELECT   DISTINCT cgid
        FROM     b6
        WHERE    has_subset = 0 AND (has_equal = 0 OR (has_equal = 1 AND equal_id = 1)))
SELECT   *
FROM     b1
WHERE    gid IN (SELECT   *
                 FROM     b7);


Note: I added a extra set to deal with multiple equal sets.

[Updated on: Fri, 08 January 2016 11:27]

Report message to a moderator

Re: Odd case of removing duplicates within a set. [message #646888 is a reply to message #646780] Wed, 13 January 2016 12:33 Go to previous messageGo to next message
tmcallister
Messages: 107
Registered: December 2007
Senior Member
So as I expected the query passed requirement testing, but is failing performance. Any tips would be appreciated. I realize this is difficult as (1) I cannot provide real world data, and (2) the issues come from the sub steps in the query.

Consider the present query:

WITH dependency_list AS
       (SELECT   branch_code, product_id, validation_type, validation_item, item_min_value, item_max_value
        FROM     product_dependency
        WHERE    branch_code = 'CGY'
        AND      TRIM(validation_type) = 'INCLUDE'
        AND      TRIM(validation_item) IN ('PKG', 'PT', 'PKGPT')
        AND      (TRIM(product_id) IN :PRODUCT_LIST
        OR        TRIM(item_min_value) IN :PRODUCT_LIST)),
     dependency_hierarchy AS
       (SELECT       CONNECT_BY_ROOT product_id base_product, product_id, item_min_value, item_max_value
        FROM         dependency_list
        CONNECT BY   NOCYCLE TRIM(product_id) = PRIOR TRIM(item_min_value)),
     dependencies_ranked AS (SELECT base_product, product_id, item_min_value, item_max_value, DENSE_RANK() OVER (ORDER BY base_product) GROUP_ID FROM dependency_hierarchy),
     dependencies_unpivot AS
       (SELECT   DISTINCT GROUP_ID, TRIM(product_id) product_id
        FROM     dependencies_ranked UNPIVOT (product_id FOR column_name IN (base_product, product_id, item_min_value, item_max_value))
        WHERE    TRIM(product_id) IN :PRODUCT_LIST),
     dependencies_main AS
       (SELECT   du.GROUP_ID,
                 du.product_id,
                 p.product_class,
                 MAX(CASE
                       WHEN product_class IN ('PRM', 'BTH') THEN 1
                       ELSE 0
                     END)
                 OVER (PARTITION BY GROUP_ID)
                   has_main_class
        FROM     dependencies_unpivot du LEFT JOIN cbs_repl.product p ON (p.branch_code = 'CGY' AND du.product_id = p.product_id)
        WHERE    TRIM(NVL(product_sub_class, '~')) NOT IN ('HDW', 'PRDISC', 'INTERNET')),
     remove_duplicates_subsets AS
       (SELECT   *
        FROM     dependencies_main
        WHERE    has_main_class = 1),
     count_set_elements AS
       (SELECT   d1.GROUP_ID gid1,
                 d1.product_id pid1,
                 d2.GROUP_ID gid2,
                 d2.product_id pid2,
                 COUNT(DISTINCT d1.product_id) OVER (PARTITION BY d1.GROUP_ID) cnt1,
                 COUNT(DISTINCT d2.product_id) OVER (PARTITION BY d2.GROUP_ID) cnt2
        FROM     remove_duplicates_subsets d1 JOIN remove_duplicates_subsets d2 ON (d1.GROUP_ID != d2.GROUP_ID)),
     count_set_matches AS
       (SELECT   cse.*, COUNT(DISTINCT pid2) OVER (PARTITION BY gid1, gid2) set_matches
        FROM     count_set_elements cse
        WHERE    pid1 = pid2),
     mark_matches_subsets AS
       (SELECT   csm.*,
                 MAX(CASE
                       WHEN cnt1 = set_matches AND cnt2 > cnt1 THEN 1
                       ELSE 0
                     END)
                 OVER (PARTITION BY gid1)
                   has_subset,
                 MAX(CASE
                       WHEN cnt1 = cnt2 AND cnt1 = set_matches THEN 1
                       ELSE 0
                     END)
                 OVER (PARTITION BY gid1)
                   has_equal,
                 LISTAGG(pid1, '~') WITHIN GROUP (ORDER BY pid1) OVER (PARTITION BY gid1, gid2) set_str
        FROM     count_set_matches csm),
     pick_among_equals AS
       (SELECT   mms.*, DENSE_RANK() OVER(PARTITION BY set_str ORDER BY gid1) equal_id
        FROM     mark_matches_subsets mms),
     final_selection AS
       (SELECT   DISTINCT gid1
        FROM     pick_among_equals
        WHERE    has_subset = 0 AND (has_equal = 0 OR (has_equal = 1 AND equal_id = 1)))
SELECT   DENSE_RANK() OVER (ORDER BY GROUP_ID) GROUP_ID, product_id, product_class
FROM     remove_duplicates_subsets
WHERE    GROUP_ID IN (SELECT   gid1
                      FROM     final_selection)


I've timed the various steps and the only steps that concern me are the DEPENDENCY_RANKED (4s), DEPENDENCY_UNPIVOT (9s), and DEPENDENCY_MAIN (5s) steps (all other steps <.1s); but I'm at a loss for how to improve them...

Thanks for any help you can offer.

[Updated on: Wed, 13 January 2016 12:34]

Report message to a moderator

Re: Odd case of removing duplicates within a set. [message #646891 is a reply to message #646780] Wed, 13 January 2016 16:25 Go to previous messageGo to next message
Barbara Boehmer
Messages: 9088
Registered: November 2002
Location: California, USA
Senior Member
The following is a response to your initial simplified presentation, since we do not have the tables and data for your actual problem. Hopefully, you can adapt it. Basically, it forms collections, then compares the collections eliminating some, then un-nests the results. It is important to note that the user-defined type must match the datatype. I used varchar2(4000), but you could use something else. It does require a user-defined type and doesn't like sys.odcivarchar2list, even though that is the same thing.

SCOTT@orcl> CREATE OR REPLACE TYPE varchar2_tt AS TABLE OF VARCHAR2(4000);
  2  /

Type created.

SCOTT@orcl> COLUMN pid FORMAT A60
SCOTT@orcl> WITH
  2    b1 AS
  3  	    (SELECT '1' gid, CAST ('A' AS VARCHAR2(4000)) pid FROM DUAL
  4  	     UNION ALL
  5  	     SELECT '2', 'A' FROM DUAL
  6  	     UNION ALL
  7  	     SELECT '2', 'B' FROM DUAL
  8  	     UNION ALL
  9  	     SELECT '3', 'D' FROM DUAL
 10  	     UNION ALL
 11  	     SELECT '3', 'E' FROM DUAL
 12  	     UNION ALL
 13  	     SELECT '3', 'F' FROM DUAL
 14  	     UNION ALL
 15  	     SELECT '4', 'A' FROM DUAL
 16  	     UNION ALL
 17  	     SELECT '4', 'B' FROM DUAL
 18  	     UNION ALL
 19  	     SELECT '5', 'E' FROM DUAL
 20  	     UNION ALL
 21  	     SELECT '5', 'F' FROM DUAL),
 22    psets AS
 23  	 (SELECT gid, CAST (COLLECT (DISTINCT pid ORDER BY pid) AS varchar2_tt) pset
 24  	  FROM	 b1
 25  	  GROUP  BY gid),
 26    newsets AS
 27  	 (SELECT p1.*
 28  	  FROM	 psets p1
 29  	  WHERE  NOT EXISTS
 30  		   (SELECT *
 31  		    FROM   psets p2
 32  		    WHERE  p1.gid != p2.gid
 33  		    AND    p1.pset != p2.pset
 34  		    AND    p1.pset MULTISET EXCEPT p2.pset IS EMPTY)
 35  	  AND	 NOT EXISTS
 36  		   (SELECT *
 37  		    FROM   psets p3
 38  		    WHERE  p1.gid > p3.gid
 39  		    AND    p1.pset MULTISET EXCEPT p3.pset IS EMPTY))
 40  SELECT t1.gid, t2.COLUMN_VALUE pid
 41  FROM   newsets t1, TABLE (t1.pset) t2
 42  /

G PID
- ------------------------------------------------------------
2 A
2 B
3 D
3 E
3 F

5 rows selected.

Re: Odd case of removing duplicates within a set. [message #646892 is a reply to message #646891] Wed, 13 January 2016 18:26 Go to previous messageGo to next message
Solomon Yakobson
Messages: 3273
Registered: January 2010
Location: Connecticut, USA
Senior Member
Barbara,

sys.odcivarchar2list & varchar2_tt you showed are not the same. Former is VARRAY while latter is NESTED TABLE. MULTISET operations are not supported for VARRAYs.

SY.
Re: Odd case of removing duplicates within a set. [message #646893 is a reply to message #646892] Wed, 13 January 2016 20:27 Go to previous messageGo to next message
Barbara Boehmer
Messages: 9088
Registered: November 2002
Location: California, USA
Senior Member
Ah, that explains it. I couldn't figure out why when I tried to use sys.odcivarchar2list I was getting an error that it was expecting a UDT instead. I completely forgot that sys.odcivarchar2list is a varray, not a nested table. Thanks for pointing it out. Even after I posted it, I was still mulling it over, wondering if maybe I had neglected to set up some permission or something. I got a new computer and did a new database installation and was wondering if I might have missed something.

Re: Odd case of removing duplicates within a set. [message #646916 is a reply to message #646893] Thu, 14 January 2016 08:53 Go to previous message
Solomon Yakobson
Messages: 3273
Registered: January 2010
Location: Connecticut, USA
Senior Member
Yeah,

Error is misleading since sys.odcivarchar2list is UDT (just Oracle supplied) but of wrong type.

SY.
Previous Topic: how to write below procedure
Next Topic: interview question
Goto Forum:
  


Current Time: Thu Apr 18 19:52:59 CDT 2024