Home » SQL & PL/SQL » SQL & PL/SQL » Odd case of removing duplicates within a set. (Oracle, 11.2.0.4.0, Windows 7)
Odd case of removing duplicates within a set. [message #646780] |
Fri, 08 January 2016 10:24 |
tmcallister
Messages: 107 Registered: December 2007
|
Senior Member |
|
|
With data like this
WITH b1 AS
(SELECT '1' gid, 'A' pid FROM DUAL
UNION
SELECT '2', 'A' FROM DUAL
UNION
SELECT '2', 'B' FROM DUAL
UNION
SELECT '3', 'D' FROM DUAL
UNION
SELECT '3', 'E' FROM DUAL
UNION
SELECT '3', 'F' FROM DUAL
UNION
SELECT '4', 'A' FROM DUAL
UNION
SELECT '4', 'B' FROM DUAL
UNION
SELECT '5', 'E' FROM DUAL
UNION
SELECT '5', 'F' FROM DUAL)
SELECT *
FROM b1;
The GIDs denote different sets; and I need to remove both complete duplicates and subsets are are covered by a superset. I.e. GID 1 should be removed since 'A' is part of GIDs 2, 4. One of GID 2, 4 should be removed as they are duplicates. and GID 5 should be removed as 'E'&'F' are included in GID 3.
Desired Output
GID PID
2 A
2 B
3 D
3 E
3 F
or
GID PID
3 D
3 E
3 F
4 A
4 B
Thanks!
[Updated on: Fri, 08 January 2016 10:27] Report message to a moderator
|
|
|
Re: Odd case of removing duplicates within a set. [message #646783 is a reply to message #646780] |
Fri, 08 January 2016 11:26 |
tmcallister
Messages: 107 Registered: December 2007
|
Senior Member |
|
|
The following 'works' but absolutely disgusts me. I feel ashamed to provide it; please help me find a better way...
WITH b1 AS
(SELECT '1' gid, 'A' pid FROM DUAL
UNION
SELECT '2', 'A' FROM DUAL
UNION
SELECT '2', 'B' FROM DUAL
UNION
SELECT '3', 'D' FROM DUAL
UNION
SELECT '3', 'E' FROM DUAL
UNION
SELECT '3', 'F' FROM DUAL
UNION
SELECT '4', 'A' FROM DUAL
UNION
SELECT '4', 'B' FROM DUAL
UNION
SELECT '5', 'E' FROM DUAL
UNION
SELECT '5', 'F' FROM DUAL
UNION
SELECT '6', 'E' FROM DUAL
UNION
SELECT '6', 'F' FROM DUAL),
b2 AS
(SELECT c.gid cgid, c.pid cpid, d.gid dgid, d.pid dpid, COUNT(DISTINCT c.pid) OVER (PARTITION BY c.gid) ccnt, COUNT(DISTINCT d.pid) OVER (PARTITION BY d.gid) dcnt
FROM b1 c JOIN b1 d ON (c.gid != d.gid)),
b3 AS
(SELECT b.*, COUNT(DISTINCT dpid) OVER (PARTITION BY cgid, dgid) dmatch
FROM b2 b
WHERE cpid = dpid),
b4 AS
(SELECT b.*,
CASE
WHEN ccnt = dmatch AND dcnt > ccnt THEN 1
ELSE 0
END
is_subset,
CASE
WHEN ccnt = dcnt AND ccnt = dmatch THEN 1
ELSE 0
END
is_equal
FROM b3 b),
b5 AS
(SELECT b.*, LISTAGG(cpid, ',') WITHIN GROUP (ORDER BY cpid) OVER (PARTITION BY cgid, dgid) gset
FROM b4 b),
b6 AS
(SELECT cgid,
cpid,
dgid,
dpid,
ccnt,
dcnt,
dmatch,
is_subset,
MAX(is_subset) OVER (PARTITION BY cgid) has_subset,
is_equal,
MAX(is_equal) OVER (PARTITION BY cgid) has_equal,
RANK() OVER(PARTITION BY gset ORDER BY cgid) equal_id
FROM b5 b),
b7 AS
(SELECT DISTINCT cgid
FROM b6
WHERE has_subset = 0 AND (has_equal = 0 OR (has_equal = 1 AND equal_id = 1)))
SELECT *
FROM b1
WHERE gid IN (SELECT *
FROM b7);
Note: I added a extra set to deal with multiple equal sets.
[Updated on: Fri, 08 January 2016 11:27] Report message to a moderator
|
|
|
Re: Odd case of removing duplicates within a set. [message #646888 is a reply to message #646780] |
Wed, 13 January 2016 12:33 |
tmcallister
Messages: 107 Registered: December 2007
|
Senior Member |
|
|
So as I expected the query passed requirement testing, but is failing performance. Any tips would be appreciated. I realize this is difficult as (1) I cannot provide real world data, and (2) the issues come from the sub steps in the query.
Consider the present query:
WITH dependency_list AS
(SELECT branch_code, product_id, validation_type, validation_item, item_min_value, item_max_value
FROM product_dependency
WHERE branch_code = 'CGY'
AND TRIM(validation_type) = 'INCLUDE'
AND TRIM(validation_item) IN ('PKG', 'PT', 'PKGPT')
AND (TRIM(product_id) IN :PRODUCT_LIST
OR TRIM(item_min_value) IN :PRODUCT_LIST)),
dependency_hierarchy AS
(SELECT CONNECT_BY_ROOT product_id base_product, product_id, item_min_value, item_max_value
FROM dependency_list
CONNECT BY NOCYCLE TRIM(product_id) = PRIOR TRIM(item_min_value)),
dependencies_ranked AS (SELECT base_product, product_id, item_min_value, item_max_value, DENSE_RANK() OVER (ORDER BY base_product) GROUP_ID FROM dependency_hierarchy),
dependencies_unpivot AS
(SELECT DISTINCT GROUP_ID, TRIM(product_id) product_id
FROM dependencies_ranked UNPIVOT (product_id FOR column_name IN (base_product, product_id, item_min_value, item_max_value))
WHERE TRIM(product_id) IN :PRODUCT_LIST),
dependencies_main AS
(SELECT du.GROUP_ID,
du.product_id,
p.product_class,
MAX(CASE
WHEN product_class IN ('PRM', 'BTH') THEN 1
ELSE 0
END)
OVER (PARTITION BY GROUP_ID)
has_main_class
FROM dependencies_unpivot du LEFT JOIN cbs_repl.product p ON (p.branch_code = 'CGY' AND du.product_id = p.product_id)
WHERE TRIM(NVL(product_sub_class, '~')) NOT IN ('HDW', 'PRDISC', 'INTERNET')),
remove_duplicates_subsets AS
(SELECT *
FROM dependencies_main
WHERE has_main_class = 1),
count_set_elements AS
(SELECT d1.GROUP_ID gid1,
d1.product_id pid1,
d2.GROUP_ID gid2,
d2.product_id pid2,
COUNT(DISTINCT d1.product_id) OVER (PARTITION BY d1.GROUP_ID) cnt1,
COUNT(DISTINCT d2.product_id) OVER (PARTITION BY d2.GROUP_ID) cnt2
FROM remove_duplicates_subsets d1 JOIN remove_duplicates_subsets d2 ON (d1.GROUP_ID != d2.GROUP_ID)),
count_set_matches AS
(SELECT cse.*, COUNT(DISTINCT pid2) OVER (PARTITION BY gid1, gid2) set_matches
FROM count_set_elements cse
WHERE pid1 = pid2),
mark_matches_subsets AS
(SELECT csm.*,
MAX(CASE
WHEN cnt1 = set_matches AND cnt2 > cnt1 THEN 1
ELSE 0
END)
OVER (PARTITION BY gid1)
has_subset,
MAX(CASE
WHEN cnt1 = cnt2 AND cnt1 = set_matches THEN 1
ELSE 0
END)
OVER (PARTITION BY gid1)
has_equal,
LISTAGG(pid1, '~') WITHIN GROUP (ORDER BY pid1) OVER (PARTITION BY gid1, gid2) set_str
FROM count_set_matches csm),
pick_among_equals AS
(SELECT mms.*, DENSE_RANK() OVER(PARTITION BY set_str ORDER BY gid1) equal_id
FROM mark_matches_subsets mms),
final_selection AS
(SELECT DISTINCT gid1
FROM pick_among_equals
WHERE has_subset = 0 AND (has_equal = 0 OR (has_equal = 1 AND equal_id = 1)))
SELECT DENSE_RANK() OVER (ORDER BY GROUP_ID) GROUP_ID, product_id, product_class
FROM remove_duplicates_subsets
WHERE GROUP_ID IN (SELECT gid1
FROM final_selection)
I've timed the various steps and the only steps that concern me are the DEPENDENCY_RANKED (4s), DEPENDENCY_UNPIVOT (9s), and DEPENDENCY_MAIN (5s) steps (all other steps <.1s); but I'm at a loss for how to improve them...
Thanks for any help you can offer.
[Updated on: Wed, 13 January 2016 12:34] Report message to a moderator
|
|
|
Re: Odd case of removing duplicates within a set. [message #646891 is a reply to message #646780] |
Wed, 13 January 2016 16:25 |
|
Barbara Boehmer
Messages: 9088 Registered: November 2002 Location: California, USA
|
Senior Member |
|
|
The following is a response to your initial simplified presentation, since we do not have the tables and data for your actual problem. Hopefully, you can adapt it. Basically, it forms collections, then compares the collections eliminating some, then un-nests the results. It is important to note that the user-defined type must match the datatype. I used varchar2(4000), but you could use something else. It does require a user-defined type and doesn't like sys.odcivarchar2list, even though that is the same thing.
SCOTT@orcl> CREATE OR REPLACE TYPE varchar2_tt AS TABLE OF VARCHAR2(4000);
2 /
Type created.
SCOTT@orcl> COLUMN pid FORMAT A60
SCOTT@orcl> WITH
2 b1 AS
3 (SELECT '1' gid, CAST ('A' AS VARCHAR2(4000)) pid FROM DUAL
4 UNION ALL
5 SELECT '2', 'A' FROM DUAL
6 UNION ALL
7 SELECT '2', 'B' FROM DUAL
8 UNION ALL
9 SELECT '3', 'D' FROM DUAL
10 UNION ALL
11 SELECT '3', 'E' FROM DUAL
12 UNION ALL
13 SELECT '3', 'F' FROM DUAL
14 UNION ALL
15 SELECT '4', 'A' FROM DUAL
16 UNION ALL
17 SELECT '4', 'B' FROM DUAL
18 UNION ALL
19 SELECT '5', 'E' FROM DUAL
20 UNION ALL
21 SELECT '5', 'F' FROM DUAL),
22 psets AS
23 (SELECT gid, CAST (COLLECT (DISTINCT pid ORDER BY pid) AS varchar2_tt) pset
24 FROM b1
25 GROUP BY gid),
26 newsets AS
27 (SELECT p1.*
28 FROM psets p1
29 WHERE NOT EXISTS
30 (SELECT *
31 FROM psets p2
32 WHERE p1.gid != p2.gid
33 AND p1.pset != p2.pset
34 AND p1.pset MULTISET EXCEPT p2.pset IS EMPTY)
35 AND NOT EXISTS
36 (SELECT *
37 FROM psets p3
38 WHERE p1.gid > p3.gid
39 AND p1.pset MULTISET EXCEPT p3.pset IS EMPTY))
40 SELECT t1.gid, t2.COLUMN_VALUE pid
41 FROM newsets t1, TABLE (t1.pset) t2
42 /
G PID
- ------------------------------------------------------------
2 A
2 B
3 D
3 E
3 F
5 rows selected.
|
|
|
|
|
|
Goto Forum:
Current Time: Thu Apr 18 19:52:59 CDT 2024
|