Re: PDML and PQ tuning

From: Radoulov, Dimitre <>
Date: Tue, 22 Nov 2011 17:32:11 +0100
Message-ID: <>

Hi Dave,
comments and replies in-line.

On 21/11/2011 17:41, Herring Dave - dherri wrote:
> In an ideal world you'd have some sort of test system where you could run the same query over and over, capturing statistics and checking the impact of various parameter changes to see exactly what is happening and why. I've had clients before where there weren't any test systems, so I understand if you're stuck.

Actually there is a test environment: the hardware is quite similar, but there are 27 (yes, 27) active Oracle instances on the test machine. With 200m of buffer cache and 25m of PGA it's rather difficult to analyze the execution time.
With workarea_size_policy set to manual and 60m for sort_area_size/hash_area_size I got the following error:

ORA-04030: out of process memory when trying to allocate 1033216 bytes (hash-join subh,kllcqas:kllsltba)
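For reference, the manual workarea test amounted to session-level settings along these lines (the exact byte values are my reconstruction of the 60m figure above):

```sql
alter session set workarea_size_policy = manual;
alter session set sort_area_size = 62914560;  -- ~60m, per the figure above
alter session set hash_area_size = 62914560;  -- ~60m
-- the hash join then failed with ORA-04030
```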

> When you captured statistics below with Tanel's "snapper" script, did you limit the SIDs in any way? It's important that with parallel operations you include all the PX slaves, not just the QC.

Yes, my session was the only one when I tested it, but I'll repeat the test to get more details.

> Can you run any of the queries on your own? If so, enable event 10046 level 12 beforehand at the session level to capture all waits for both your session (QC) and all the PX slaves. That way you can find out exactly what waits there are for the PX slaves as well.

Unfortunately, as I said, I have rather limited access to the machine, so for now I cannot read the trace files. Sorry for the incomplete information on the wait events (actually, the most important detail ...), but right now I can run only limited SQL on the production database ...
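For completeness, the tracing you suggest would look roughly like this (standard 10046 syntax; my understanding is that on 9i the PX slaves inherit the event from the QC session if it is set before the parallel statement starts, each slave writing its own trace file):

```sql
alter session set timed_statistics = true;
alter session set max_dump_file_size = unlimited;
alter session set events '10046 trace name context forever, level 12';

-- run the query here ...

alter session set events '10046 trace name context off';
```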

Some details:

t1: num rows 3.008.738, size 3.776 mb
t2: num rows 2.494.090, size 3.008 mb

Instance configuration:

shared pool 208m

db_cache_size 800m (I've requested to bump it to at least 1.5g; there are 4 other instances on the machine and, given the memory currently available, 1.5g seems reasonable at this point ...)

large pool 32m

db_file_multiblock_read_count 16

parallel_automatic_tuning false

N.B. Parallel execution allocates buffers out of the large pool only when PARALLEL_AUTOMATIC_TUNING is set to true.
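As a quick check of where the PX message buffers are actually being allocated, something like this against v$sgastat should show it (the pool entry name 'PX msg pool' is what I'd expect here, but treat that as an assumption):

```sql
select pool, name, round(bytes/1024/1024, 1) mb
from   v$sgastat
where  name like 'PX msg pool%'
   or  name = 'free memory';
```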

workarea_size_policy auto

pga_aggregate_target 100m
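To see how hard the 100m pga_aggregate_target is being pushed, a few rows from v$pgastat should be enough (statistic names per the 9i dynamic view, to the best of my memory):

```sql
select name, value
from   v$pgastat
where  name in ('aggregate PGA target parameter',
                'total PGA allocated',
                'over allocation count');
```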

So I ran the statement serially, at parallel 2, and at parallel 4, snapping v$session_wait for the sessions I was investigating (QC and PX slaves). With parallel 2 the waits were mostly direct path read (I've already requested to raise the PGA from 100m to 1024m) and PX Deq: Execution Msg/PX Deq Table Q Normal. With parallel 4 there were PX qref latch, PX Deq: Execute Reply and PX Deq: Table Q Normal:

select
   _long_list_of_columns_from_both_tables_
from
   t1 a
  ,t2 b
where
   t1.col1 = t2.col1 (+)
and t1.col2 = t2.col2 (+)
and t2.col1 is null;
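The degree for the parallel runs can be forced either via a table-level degree or via hints on the same statement; schematically, the hinted form would be (column list elided as above, degree adjusted per run):

```sql
select /*+ full(a) full(b) parallel(a, 2) parallel(b, 2) */
       _long_list_of_columns_from_both_tables_
from   t1 a, t2 b
where  t1.col1 = t2.col1 (+)
and    t1.col2 = t2.col2 (+)
and    t2.col1 is null;
```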

  • first run, no parallel --

514654 rows selected.

Elapsed: 00:17:16.84

Execution Plan

    0      SELECT STATEMENT Optimizer=CHOOSE (Cost=153978 Card=3008556

    1    0   FILTER
    2    1     HASH JOIN (OUTER)
    3    2       TABLE ACCESS (FULL) OF 't1' (Cost=23081 Card=
           3008556 Bytes=3393651168)

    4    2       TABLE ACCESS (FULL) OF 't2' (Cost=18506 Ca
           rd=2493915 Bytes=244403670)


           0  recursive calls
           0  db block gets
      432452  consistent gets
      735840  physical reads
           0  redo size
   403453605  bytes sent via SQL*Net to client
      382569  bytes received via SQL*Net from client
       34312  SQL*Net roundtrips to/from client
           0  sorts (memory)
           0  sorts (disk)
      514654  rows processed

  • second run, parallel 2 --

514654 rows selected.

Elapsed: 00:17:56.50

Execution Plan

    0      SELECT STATEMENT Optimizer=CHOOSE (Cost=59815 Card=3008556 B

    1    0   FILTER*                                                    

    2    1     HASH JOIN* (OUTER)                                       

    3    2       TABLE ACCESS* (FULL) OF 't1' (Cost=11541 Card :Q572800
           =3008556 Bytes=3393651168)                                   0

    4    2       TABLE ACCESS* (FULL) OF 't2' (Cost=9253 Ca :Q572800
           rd=2493915 Bytes=244403670)                                  1




        4184  recursive calls
           4  db block gets
     1665697  consistent gets
      720103  physical reads
         856  redo size
   403563666  bytes sent via SQL*Net to client
      382604  bytes received via SQL*Net from client
       34312  SQL*Net roundtrips to/from client
           2  sorts (memory)
           0  sorts (disk)
      514654  rows processed

I've already tried different values for the sqlplus arraysize (150 - 500) and db_file_multiblock_read_count (128); the impact on the execution time was minimal.
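For reference, those were changed with the usual sqlplus/session commands (values shown are the upper end of what I tried):

```sql
set arraysize 500
alter session set db_file_multiblock_read_count = 128;
```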

At 16:39:16 I ran the query with parallel 4. At 17:21:06 I ran the following statement a few times; the output was something like this:

select sid, event, p1text, p1raw, p2text, p2raw, wait_time, state
from v$session_wait
where sid in (20, 24, 15, 18, 22)
order by wait_time;

       SID EVENT                   P1TEXT              P1RAW            P2TEXT     P2RAW             WAIT_TIME STATE
---------- ----------------------- ------------------- ---------------- ---------- ---------------- ---------- -------
        18 PX qref latch           function            0000000000000001 sleeptime  0000000000000064          0 WAITING
        24 PX qref latch           function            0000000000000001 sleeptime  0000000000000064          0 WAITING
        22 PX Deq: Execute Reply   sleeptime/senderid  00000000000000C8 passes     000000000000001F          0 WAITING
        15 PX Deq: Table Q Normal  sleeptime/senderid  00000000000000C8 passes     0000000000000001          0 WAITING
        20 PX Deq: Table Q Normal  sleeptime/senderid  00000000000000C8 passes     0000000000000002          0 WAITING

Then I hit ctrl-c and interrupted the parallel 4 statement:

ERROR at line 218:
ORA-01013: user requested cancel of current operation

Elapsed: 00:42:58.41

I'll try to provide more details later.

> Also query V$PQ_TQSTAT immediately after the query finishes, in the same session. Or if the system is idle outside of these queries you could also perform manual statspack snapshots before and after.
> Also, the xplans would be helpful in being able to see details in the parallelism columns - TQ|IN-OUT|PQ Distrib. I believe under we ran into a CPU patch that changed the PQ distribution for one particular query, causing a huge performance problem.

I'll try this the next time I have access to the machine.
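For when I do get access, the V$PQ_TQSTAT check would be along these lines (it has to be queried in the same session, immediately after the parallel query completes, since the view only reflects that session's last parallel operation):

```sql
select dfo_number, tq_id, server_type, process,
       num_rows, bytes, waits, timeouts
from   v$pq_tqstat
order  by dfo_number, tq_id, server_type desc, process;
```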

> It'd be nice to know why there are waits on the PX qref latch, meaning there's potentially an imbalance in the speed of producers vs. consumers, but if you can't get any details or do testing then you might need to rely on increasing PEMS. Just note that I believe under 9i if you don't set a value for the large pool then PEMS buffers will come out of the shared pool. PARALLEL_AUTOMATIC_TUNING may need to be set as well, I can't remember.
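If it does come to increasing PEMS, my understanding is that parallel_execution_message_size is static, so it would be something like this (spfile case; the value is just an example, not a recommendation):

```sql
alter system set parallel_execution_message_size = 8192 scope = spfile;
-- an instance restart is required for the new message size to take effect
```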


Best regards

Received on Tue Nov 22 2011 - 10:32:11 CST
