Re: How to get a distinct count of result set of multople table joins?

From: Mark D Powell <Mark.Powell_at_eds.com>
Date: Thu, 28 Feb 2008 08:27:23 -0800 (PST)
Message-ID: <3b3b3f9b-d444-4242-a92f-cf2b6b668f31@x30g2000hsd.googlegroups.com>


On Feb 27, 10:03 pm, Charles Hooper <hooperc2..._at_yahoo.com> wrote:
> On Feb 27, 5:43 pm, "Chris ( Val )" <chris..._at_gmail.com> wrote:
>
>
>
>
>
> > On Feb 28, 3:27 am, Charles Hooper <hooperc2..._at_yahoo.com> wrote:
> > > I suspect that Oracle is taking a long time to remove the duplicate
> > > rows due to a combination of the number of rows, the number of columns
> > > (you are specifying to retrieve all columns from the tables), and the
> > > amount of memory available for sorting unique (or hash unique) the
> > > rows to produce a unique list of rows.  Are you able to better define
> > > the columns that are interest, and eliminate any columns that are
> > > common between the various tables.  
>
> > Unfortunatley I did not write the SQL, and I don't understand the
> > data requirements enough to manipulate it. I can ask for it to be
> > changed, but that can take ages to happen :)
>
> > > I have found in some cases on
> > > Oracle 10.2.0.2 that listing all columns in a GROUP BY allows Oracle
> > > to retrieve a distinct list of rows more quickly than by using the
> > > DISTINCT syntax.
>
> > > In the second of your examples, I do not see where you are using the
> > > UNION syntax - are you trying to do something different in that
> > > example?  
>
> > Sorry, I meant that I tried to use the same query as I just posted to
> > Mark,
> > but with the COUNT statement added as in my first post.
>
> > I am having trouble getting the right count returned.
>
> > > Explain plan, or better yet a DBMS Xplan showing the predicted and
> > > actual results would be helpful, as requested by Ed and Mark.  If the
> > > explain plan shows nothing useful, turn on a 10046 trace at level 8
> > > and see what is happening behind the scenes.
>
> > Even though this is a DEV database, I am not allowed to go to those
> > extreams.
>
> > My job is to automate a process of extracting data, of which this SQL
> > was handed
> > to me to use in my script.
>
> > Thanks for your help.
>
> > --
> > Chris
>
> Looking at the plan, 13 tables were full table scanned producing an
> estimated 19,222,632 bytes in 24,271 rows, using 12 hash joins for
> each half of the UNION when the duplicates are removed.  You indicate
> that Oracle is actually returning 700,000 rows before the unique count
> is applied.  This seems to imply that Oracle is actually dealing with
> roughly 554,400,000 bytes for each half of the UNION when the
> duplicates are removed.  Your SQL statement is likely making
> significant use of the TEMP tablespace.  You may be able to improve
> performance by increasing the memory available for sorting and hash
> joins.  Assuming that WORKAREA_SIZE_POLICY is set to AUTO, you might
> be able to improve performance a little by having the DBA increase the
> PGA_AGGREGATE_TARGET.
>
> To see if the above estimates are close, you might try running
> DBMS_XPLAN for the query.  See the following link for an example:
>  http://jonathanlewis.wordpress.com/2006/11/09/dbms_xplan-in-10g/
>
> I think that I now understand what you originally posted.  The two
> queries should be returning the same number of rows before the count,
> since the UNION syntax removes duplicate rows automatically.  Note
> the:
>   "SORT (UNIQUE)" in your execution plan.
>
> Here is a test case that demonstates the performance improvement
> suggestions that I made earlier:
> CREATE TABLE T1 (
>   C1 VARCHAR2(30),
>   C2 VARCHAR2(30),
>   C3 VARCHAR2(30));
>
> CREATE TABLE T2 (
>   C1 VARCHAR2(30),
>   C2 VARCHAR2(30),
>   C3 VARCHAR2(30));
>
> CREATE TABLE T3 (
>   C1 VARCHAR2(30),
>   C2 VARCHAR2(30),
>   C3 VARCHAR2(30));
>
> INSERT INTO
>   T1
> SELECT
>   TO_CHAR(SIN(ROWNUM*3.1415/180)*10000,'0000000000'),
>   TO_CHAR(COS(ROWNUM*3.1415/180)*10000,'0000000000'),
>   TO_CHAR(TAN(ROWNUM*3.1415/180)*10000,'0000000000')
> FROM
>   DUAL
> CONNECT BY
>   LEVEL<=400000;
>
> INSERT INTO
>   T2
> SELECT
>   TO_CHAR(SIN(ROWNUM*4*3.1415/180)*10000,'0000000000'),
>   TO_CHAR(COS(ROWNUM*4*3.1415/180)*10000,'0000000000'),
>   TO_CHAR(TAN(ROWNUM*4*3.1415/180)*10000,'0000000000')
> FROM
>   DUAL
> CONNECT BY
>   LEVEL<=100000;
>
> INSERT INTO
>   T3
> SELECT
>   TO_CHAR(SIN(ROWNUM*7.2*3.1415/180)*10000,'0000000000'),
>   TO_CHAR(COS(ROWNUM*7.2*3.1415/180)*10000,'0000000000'),
>   TO_CHAR(TAN(ROWNUM*7.2*3.1415/180)*10000,'0000000000')
> FROM
>   DUAL
> CONNECT BY
>   LEVEL<=70000;
>
> COMMIT;
>
> EXEC
> DBMS_STATS.GATHER_TABLE_STATS(OWNNAME=>USER,TABNAME=>'T1',CASCADE=>TRUE);
>
> EXEC
> DBMS_STATS.GATHER_TABLE_STATS(OWNNAME=>USER,TABNAME=>'T2',CASCADE=>TRUE);
>
> EXEC
> DBMS_STATS.GATHER_TABLE_STATS(OWNNAME=>USER,TABNAME=>'T3',CASCADE=>TRUE);
>
> The first test is similar to your first query:
> SELECT
>   COUNT(*)
> FROM
> (SELECT
>   *
> FROM
>   T1,
>   T2,
>   T3
> WHERE
>   T1.C1=T2.C1(+)
>   AND T1.C1=T3.C1(+)
> UNION
> SELECT
>   *
> FROM
>   T1,
>   T2,
>   T3
> WHERE
>   T1.C1=T2.C1(+)
>   AND T1.C1=T3.C1(+));
>
>   COUNT(*)
> ----------
>    2994198
>
> The DBMS Xplan (watch for wordwrapping):
> ---------------------------------------------------------------------------­-------------------------------------------------------------------------
> | Id  | Operation                  | Name | Starts | E-Rows | A-Rows
> |   A-Time   | Buffers | Reads  | Writes |  OMem |  1Mem | Used-Mem |
> Used-Tmp|
> ---------------------------------------------------------------------------­-------------------------------------------------------------------------
> |   1 |  SORT AGGREGATE            |      |      1 |      1 |      1 |
> 00:03:16.27 |    5392 |  73199 |  73199 |       |       |
> |         |
> |   2 |   VIEW                     |      |      1 |   2806K|   2994K|
> 00:03:09.16 |    5392 |  73199 |  73199 |       |       |
> |         |
> |   3 |    SORT UNIQUE             |      |      1 |   2806K|   2994K|
> 00:03:03.17 |    5392 |  73199 |  73199 |   269M|  8322K|   37M
> (1)|     572K|
> |   4 |     UNION-ALL              |      |      1 |        |   7089K|
> 00:01:32.59 |    5342 |      0 |      0 |       |       |
> |         |
> |*  5 |      HASH JOIN RIGHT OUTER |      |      1 |   1403K|   3544K|
> 00:00:28.78 |    2671 |      0 |      0 |  1102K|  1102K|  178K
> (0)|         |
> |   6 |       TABLE ACCESS FULL    | T2   |      1 |      1 |      0 |
> 00:00:00.01 |       3 |      0 |      0 |       |       |
> |         |
> |*  7 |       HASH JOIN RIGHT OUTER|      |      1 |   1403K|   3544K|
> 00:00:07.51 |    2668 |      0 |      0 |  4768K|  2062K| 7035K
> (0)|         |
> |   8 |        TABLE ACCESS FULL   | T3   |      1 |  70000 |  70000 |
> 00:00:00.07 |     402 |      0 |      0 |       |       |
> |         |
> |   9 |        TABLE ACCESS FULL   | T1   |      1 |    400K|    400K|
> 00:00:00.40 |    2266 |      0 |      0 |       |       |
> |         |
> |* 10 |      HASH JOIN RIGHT OUTER |      |      1 |   1403K|   3544K|
> 00:00:28.78 |    2671 |      0 |      0 |  1102K|  1102K|  207K
> (0)|         |
> |  11 |       TABLE ACCESS FULL    | T2   |      1 |      1 |      0 |
> 00:00:00.01 |       3 |      0 |      0 |       |       |
> |         |
> |* 12 |       HASH JOIN RIGHT OUTER|      |      1 |   1403K|   3544K|
> 00:00:07.51 |    2668 |      0 |      0 |  4768K|  2062K| 7049K
> (0)|         |
> |  13 |        TABLE ACCESS FULL   | T3   |      1 |  70000 |  70000 |
> 00:00:00.07 |     402 |      0 |      0 |       |       |
> |         |
> |  14 |        TABLE ACCESS FULL   | T1   |      1 |    400K|    400K|
> 00:00:00.40 |    2266 |      0 |      0 |       |       |
> |         |
> ---------------------------------------------------------------------------­-------------------------------------------------------------------------
>
> Predicate Information (identified by operation id):
> ---------------------------------------------------
>    5 - access("T1"."C1"="T2"."C1")
>    7 - access("T1"."C1"="T3"."C1")
>   10 - access("T1"."C1"="T2"."C1")
>   12 - access("T1"."C1"="T3"."C1")
>
> The above took 3 minutes and 16 seconds on Oracle 11.1.0.6
>
> Sample 2 returns the same result, just significantly faster:
> SELECT
>   COUNT(*)
> FROM
> (SELECT DISTINCT
>   *
> FROM
>   T1,
>   T2,
>   T3
> WHERE
>   T1.C1=T2.C1(+)
>   AND T1.C1=T3.C1(+));
>
>   COUNT(*)
> ----------
>    2994198
>
> ---------------------------------------------------------------------------­------------------------------------------------------------------------
> | Id  | Operation                 | Name | Starts | E-Rows | A-Rows
> |   A-Time   | Buffers | Reads  | Writes |  OMem |  1Mem | Used-Mem |
> Used-Tmp|
> ---------------------------------------------------------------------------­------------------------------------------------------------------------
> |   1 |  SORT AGGREGATE           |      |      1 |      1 |      1 |
> 00:01:18.14 |    2671 |  40365 |  40365 |       |       |
> |         |
> |   2 |   VIEW                    |      |      1 |   1403K|   2994K|
> 00:01:11.67 |    2671 |  40365 |  40365 |       |       |
> |         |
> |   3 |    HASH UNIQUE            |      |      1 |   1403K|   2994K|
> 00:01:05.68 |    2671 |  40365 |  40365 |   351M|    15M|   46M
> (1)|     337K|
> |*  4 |     HASH JOIN RIGHT OUTER |      |      1 |   1403K|   3544K|
> 00:00:28.78 |    2671 |      0 |      0 |  1102K|  1102K|  181K
> (0)|         |
> |   5 |      TABLE ACCESS FULL    | T2   |      1 |      1 |      0 |
> 00:00:00.01 |       3 |      0 |      0 |       |       |
> |         |
> |*  6 |      HASH JOIN RIGHT OUTER|      |      1 |   1403K|   3544K|
> 00:00:07.51 |    2668 |      0 |      0 |  4768K|  2062K| 7049K
> (0)|         |
> |   7 |       TABLE ACCESS FULL   | T3   |      1 |  70000 |  70000 |
> 00:00:00.07 |     402 |      0 |      0 |       |       |
> |         |
> |   8 |       TABLE ACCESS FULL   | T1   |      1 |    400K|    400K|
> 00:00:00.40 |    2266 |      0 |      0 |       |       |
> |         |
> ---------------------------------------------------------------------------­------------------------------------------------------------------------
>
> Predicate Information (identified by operation id):
> ---------------------------------------------------
>    4 - access("T1"."C1"="T2"."C1")
>    6 - access("T1"."C1"="T3"."C1")
>
> The third example is one of the performance enhancements that I
> suggested, using GROUP BY, rather than DISTINCT:
> SELECT
>   COUNT(*)
> FROM
> (SELECT
>   T1.C1,
>   T1.C2,
>   T1.C3,
>   T2.C1,
>   T2.C2,
>   T2.C3,
>   T3.C1,
>   T3.C2,
>   T3.C3
> FROM
>   T1,
>   T2,
>   T3
> WHERE
>   T1.C1=T2.C1(+)
>   AND
> ...
>
> read more »- Hide quoted text -
>
> - Show quoted text -

I was under the impression that performance is not an issue but rather it is the fact that the two queries in the initial post are not returning the same result set.

The base query is an outer join. Since an outer join says return all rows from the base table plus the matching data from the joined table or null values when there is not matching row, duplicate output would potentially be expected but duplicate output could also indicate that a required join condition has not been specified.

The join should be exaimined to be sure all join conditions have been properly specified. The query should be checked to make sure it is the right query for the desired output.

Then we are back to if

select query
union
select same_query

and

select count(*) from ( select distinct query)

should produce the same result set counts.

Note that getting the same results set does potentially depend on none of the base tables in the join being updated during the time the two queries are ran else the view of the data will be different.

  • Mark D Powell --
Received on Thu Feb 28 2008 - 10:27:23 CST

Original text of this message