Re: Yes, this is probably a stupid question but...

From: Charles Hooper <hooperc2000_at_yahoo.com>
Date: Sat, 12 Apr 2008 21:25:32 -0700 (PDT)
Message-ID: <18fdad36-0c10-46c3-b151-5ea1515112e6@u12g2000prd.googlegroups.com>


On Apr 12, 9:58 pm, Pat <pat.ca..._at_service-now.com> wrote:
> On Apr 12, 5:51 pm, hpuxrac <johnbhur..._at_sbcglobal.net> wrote:
> > The nested loops technique works pretty well with a limited volume of
> > matching records but if you are really getting counts of 200k at times
> > well it's a non-performer from the get go.
>
> > A hash join might perfom a ( little ) better ... but still may not be
> > adequate.
>
> Unfortunately (from a database efficiency standpoint), exact counts
> are required, or at least expected by our current user base. Summary
> tables are probably not going to work either since I can't predict
> what the count is going to be over. In this case, this user is
> hammering the subset active=1, but tomorrow it may be based on some
> other specifier.
>
> Switching this to a hash join sounds like it has promise (or at least
> be worth a shot).
>
> Is there an optimizer hint I can add to force a hash join in this
> case? Be worth a shot.
>
> I have to admit I am a bit surprised this is all that slow though.
> It's a modern box with 3.0 ghz CPUs in it (intel) and virtually no
> load apart from this query. He can basically monopolize an entire core
> (and he does according to vmstat), and it's still taking him about 2.2
> seconds to finish. Seems like that's an aweful lot of CPU time just to
> join a couple of hundred thousand index nodes together and count the
> result. Virtually no physical IOs involved here either. Far as I can
> tell, this is pure CPU load.
>
> 10.2.0.4 64 bit on RHEL 5.0.

I set up a simple test case to test the suggestion to try a hash join, rather than a nested loops join. Your results may vary slightly from these results.

First, two tables that have sufficiently wide enough columns so that Oracle does not try a full table scan:
CREATE TABLE T1 (
  SYS_ID NUMBER(10) NOT NULL,

  ACTIVE NUMBER(1) NOT NULL,
  C3 VARCHAR2(255),
  C4 VARCHAR2(255),
  C5 VARCHAR2(255),
  C6 VARCHAR2(255),
  C7 VARCHAR2(255),
  C8 VARCHAR2(255),
  C9 VARCHAR2(255),

  C10 VARCHAR2(255)); CREATE UNIQUE INDEX T1_IND1 ON T1(ACTIVE,SYS_ID); CREATE TABLE T2 (
  SYS_ID NUMBER(10) NOT NULL,
  C2 VARCHAR2(255),
  C3 VARCHAR2(255),
  C4 VARCHAR2(255),
  C5 VARCHAR2(255),
  C6 VARCHAR2(255),
  C7 VARCHAR2(255),
  C8 VARCHAR2(255),
  C9 VARCHAR2(255),

  C10 VARCHAR2(255)); CREATE UNIQUE INDEX T2_IND1 ON T2(SYS_ID); Now, insert rows into the T1 table (note that the DECODE contains an error, reverse the last 1 and 0 in the DECODE, but this will work as is by slightly adjusting the SQL statement): INSERT INTO
  T1
SELECT
  ROWNUM,
  DECODE(SIGN(MOD(ROWNUM,100)-84),1,1,0),
  DBMS_RANDOM.STRING('A',255),
  DBMS_RANDOM.STRING('A',255),
  DBMS_RANDOM.STRING('A',255),
  DBMS_RANDOM.STRING('A',255),
  DBMS_RANDOM.STRING('A',255),
  DBMS_RANDOM.STRING('A',255),
  DBMS_RANDOM.STRING('A',255),
  DBMS_RANDOM.STRING('A',255)

FROM
  DUAL
CONNECT BY
  LEVEL<=180000;

Now to save time, we will use the T1 table to build the T2 table: INSERT INTO
  T2
SELECT
  SYS_ID,

  C3,
  C3,
  C4,
  C5,
  C6,
  C7,
  C8,
  C9,

  C10
FROM
  T1;

COMMIT; Now, collect statistics on the tables and indexes: EXEC
DBMS_STATS.GATHER_TABLE_STATS(OWNNAME=>USER,TABNAME=>'T1',CASCADE=>TRUE); EXEC
DBMS_STATS.GATHER_TABLE_STATS(OWNNAME=>USER,TABNAME=>'T2',CASCADE=>TRUE); Your query, rewritten to work with the above (note that a.active is specified as 0 in this case):
select count(*) as recordcount from (T1 a inner join T2 b on a.sys_id =
b.sys_id) where a.active = 0;

The DBMS_XPLAN showing actual timing:



| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers |

| 1 | SORT AGGREGATE | | 1 | 1 | 1 | 00:00:02.45 | 3000 |
| 2 | NESTED LOOPS | | 1 | 90000 | 153K| 00:00:02.14 | 3000 |
|   3 |    INDEX FAST FULL SCAN| T2_IND1 |      1 |    180K|    180K|
00:00:00.18 |     382 |
|*  4 |    INDEX UNIQUE SCAN   | T1_IND1 |    180K|      1 |    153K|
00:00:00.81 | 2618 |

Predicate Information (identified by operation id):


   4 - access("A"."ACTIVE"=0 AND "A"."SYS_ID"="B"."SYS_ID")

The above shows that the T1_IND1 index was probed 180,000 times. Note that the plan order is not the same as what you posted, likely because your table B (T2) contains more than 1 row per SYS_ID. Plan statistics from a 10046 trace, 3000 consistent reads:

       (Rows 1) SORT AGGREGATE (cr=3000 pr=0 pw=0 time=0 us)   (Rows 153000) NESTED LOOPS (cr=3000 pr=0 pw=0 time=2255352 us cost=121 size=1170000 card=90000)
  (Rows 180000) INDEX FAST FULL SCAN T2_IND1 (cr=382 pr=0 pw=0 time=266066 us cost=104 size=900000 card=180000)   (Rows 153000) INDEX UNIQUE SCAN T1_IND1 (cr=2618 pr=0 pw=0 time=0 us cost=0 size=8 card=1)

Slightly modifying your query with a hint to force a hash join: select /*+ use_hash(a b) */
count(*) as recordcount from (T1 a inner join T2 b on a.sys_id = b.sys_id) where a.active = 0;
The DBMS_XPLAN showing actual timing:



| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers | OMem | 1Mem | Used-Mem |
|   1 |  SORT AGGREGATE        |         |      1 |      1 |      1 |
00:00:02.42 |     818 |       |       |          |
|*  2 |   HASH JOIN            |         |      1 |  90000 |    153K|
00:00:02.13 |     818 |  3624K|  1381K| 5510K (0)|
|*  3 |    INDEX FAST FULL SCAN| T1_IND1 |      1 |  90000 |    153K|
00:00:00.16 |     436 |       |       |          |
|   4 |    INDEX FAST FULL SCAN| T2_IND1 |      1 |    180K|    180K|
00:00:00.18 |     382 |       |       |          |
-----------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):


   2 - access("A"."SYS_ID"="B"."SYS_ID")    3 - filter("A"."ACTIVE"=0)

The above executed 0.03 seconds faster, and the consistent reads dropped to 818.

       (Rows 1) SORT AGGREGATE (cr=818 pr=0 pw=0 time=0 us)   (Rows 153000) HASH JOIN (cr=818 pr=0 pw=0 time=1352636 us cost=458 size=1170000 card=90000)
  (Rows 153000) INDEX FAST FULL SCAN T1_IND1 (cr=436 pr=0 pw=0 time=238965 us cost=120 size=720000 card=90000)   (Rows 180000) INDEX FAST FULL SCAN T2_IND1 (cr=382 pr=0 pw=0 time=278655 us cost=104 size=900000 card=180000)

With just an ORDERED hint to force the same plan order as what you posted:
| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers |



| 1 | SORT AGGREGATE | | 1 | 1 | 1 | 00:00:02.24 | 3075 |
| 2 | NESTED LOOPS | | 1 | 90000 | 153K| 00:00:01.99 | 3075 |
|*  3 |    INDEX FAST FULL SCAN| T1_IND1 |      1 |  90000 |    153K|
00:00:00.16 |     436 |
|*  4 |    INDEX UNIQUE SCAN   | T2_IND1 |    153K|      1 |    153K|
00:00:00.72 | 2639 |

Predicate Information (identified by operation id):


   3 - filter("A"."ACTIVE"=0)
   4 - access("A"."SYS_ID"="B"."SYS_ID")

0.21 seconds faster than the first execution, even though the consistent reads increased to 3075

       (Rows 1) SORT AGGREGATE (cr=3075 pr=0 pw=0 time=0 us)   (Rows 153000) NESTED LOOPS (cr=3075 pr=0 pw=0 time=2026659 us cost=128 size=1170000 card=90000)
  (Rows 153000) INDEX FAST FULL SCAN T1_IND1 (cr=436 pr=0 pw=0 time=239888 us cost=120 size=720000 card=90000)   (Rows 153000) INDEX UNIQUE SCAN T2_IND1 (cr=2639 pr=0 pw=0 time=0 us cost=0 size=5 card=1)

In the plans, look closely at the timings of each step in the plan. The timings of a child step will roll up into the parent step (indicated by the indentation in the plan). The actual scan of the indexes happens reasonably fast, it is the join that takes the majority of the elapsed time.

Interesting fun with mathematics, which may not be entirely relevant. On a computer with a computer marketed as having a 1333MHz bus speed, using 333MHz quad pumped dual channel memory chips, each memory clock cycle retrieves up to 32 bytes in 0.000000003003003 seconds (maximum transfer speed of 10,162.35 MB per second), and the CPU core will wait for this duration on every memory access. A standard 8KB block requires a minimum of 256 memory clock cycles to be read, resulting in a minimum delay of 0.000000768768768 seconds to read an 8KB block from system memory. If you require the computer to perform 180,000 8KB reads (assuming the data is not cached in the CPU registers, L1, L2, or L3 caches), it will take a minimum of 0.138 seconds (consistent reads of 8KB blocks might take 5-10 times longer). What seems like a simple problem becomes a bit complicated when you dig into the details.

Charles Hooper
IT Manager/Oracle DBA
K&M Machine-Fabricating, Inc. Received on Sat Apr 12 2008 - 23:25:32 CDT

Original text of this message