RE: PDML and PQ tuning

From: Herring Dave - dherri <Dave.Herring_at_acxiom.com>
Date: Wed, 23 Nov 2011 14:45:51 +0000
Message-ID: <BD475CE0B3EE894DA0CAB36CE2F7DEB4454C9E95_at_LITIGMBCRP02.Corp.Acxiom.net>



Dimitre,

Increasing the right memory parameters may be your best bet, and it sounds like you've already planned to do so. With parallel operations you typically need more memory, as each PX slave needs its own work areas. Parallel operations work best when you can throw resources at them; it's pretty much a brute-force method. It'd be interesting to see what the PGA advisor says in a statspack report, i.e. how much of an increase would help, if any.
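If you can query the instance directly, the same advisor data is exposed in v$pga_target_advice; something along these lines (assuming pga_aggregate_target is in use and statistics_level is at least TYPICAL) should show how far an increase would go:

   select pga_target_for_estimate/1024/1024 target_mb,
          estd_pga_cache_hit_percentage,
          estd_overalloc_count
   from v$pga_target_advice
   order by pga_target_for_estimate;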

SQL*Plus ARRAYSIZE usually doesn't help much past a value of 300, at least in my testing, and it only affects results returned to your session. In the production setting, are they really going to run this query expecting to return 400 MB or so of data (500,000+ rows) to a PC? If not, then for testing try to rewrite the query so it returns almost no rows yet keeps the same plan: for example, make the main query an inline view, with the outer query performing some aggregation so only a few rows are returned (see the sketch below). Obviously you'll need to run a number of EXPLAIN PLAN FOR ... checks to make sure the plan hasn't changed, as Oracle may optimize the query differently because of the aggregation.
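A sketch of what I mean, reusing your placeholder column list, with a NO_MERGE hint to keep the inline view from being merged away (again, verify with EXPLAIN PLAN that the plan is unchanged):

   select /*+ no_merge(v) */ count(*)
   from (
         select _long_list_of_columns_from_both_tables_
         from t1, t2
         where t1.col1 = t2.col1 (+)
         and   t1.col2 = t2.col2 (+)
         and   t2.col1 is null
        ) v;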

If the goal really is to return 500,000+ rows to a client's PC, then parallelism won't help: you're still limited by the time it takes to ship the data to the PC, so in a sense you're just making more processes wait.

DAVID HERRING
DBA
Acxiom Corporation

EML   dave.herring_at_acxiom.com
TEL    630.944.4762
MBL   630.430.5988 

1501 Opus Pl, Downers Grove, IL 60515, USA  WWW.ACXIOM.COM

-----Original Message-----
From: Radoulov, Dimitre [mailto:cichomitiko_at_gmail.com]
Sent: Tuesday, November 22, 2011 10:32 AM
To: Herring Dave - dherri
Cc: oracle-l_at_freelists.org
Subject: Re: PDML and PQ tuning

Hi Dave,
comments and replies in-line.

On 21/11/2011 17:41, Herring Dave - dherri wrote:
> In an ideal world you'd have some sort of test system where you could run the same query over and over, capturing statistics and checking the impact of various parameter changes to see exactly what is happening and why. I've had clients before where there weren't any test systems, so I understand if you're stuck.

Actually there is a test environment: the hardware is quite similar; it's just that there are 27 (yes, 27) active Oracle instances on the test machine. With 200m of buffer cache and 25m of PGA it's rather difficult to analyze the execution time.
With manual workarea_size_policy and 60m for the sort/hash areas I got the following error:

ORA-04030: out of process memory when trying to allocate 1033216 bytes (hash-join subh,kllcqas:kllsltba)
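(For completeness, the manual settings were along these lines; the exact byte values here are assumed from the 60m figure:)

   alter session set workarea_size_policy = manual;
   alter session set sort_area_size = 62914560;  -- ~60m, assumed value
   alter session set hash_area_size = 62914560;  -- ~60m, assumed value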

> When you captured statistics below with Tanel's "snapper" script, did you limit the SIDs in any way? It's important that with parallel operations you include all the PX slaves, not just the QC.

Yes, my session was the only one when I tested it, but I'll repeat the test to get more details.
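For the repeat I'll include all the slave SIDs together with the QC; a quick way to list them (v$px_session should be available here on 9i) is something like:

   select qcsid, sid, serial#, server_group, server_set, server#, degree
   from v$px_session
   order by qcsid, server_group, server_set, server#;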

>
> Can you run any of the queries on your own? If so, enable event 10046 level 12 beforehand at the session level to capture all waits for both your session (QC) and all the PX slaves. That way you can find out exactly what waits there are for the PX slaves as well.

Unfortunately, as I said, I have rather limited access to the machine, so for now I cannot read the trace files. Sorry for the incomplete information on the wait events (actually, the most important detail ...), but right now I can run only limited SQL on the production database ...
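(Once I can get at the trace files, the plan is to set the event at session level as Dave suggests, something like:)

   alter session set events '10046 trace name context forever, level 12';
   -- ... run the statement ...
   alter session set events '10046 trace name context off';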

Some details:

t1:  num rows 3,008,738   size 3,776 MB
t2:  num rows 2,494,090   size 3,008 MB

Instance configuration:

shared pool 208m

db_cache_size 800m (I've requested to bump it to at least 1.5g; there are 4 other instances on the machine, and given the memory currently available, 1.5g seems reasonable at this point ...)

large pool 32m

db_file_multiblock_read_count 16

parallel_automatic_tuning false

N.B. Parallel execution allocates buffers out of the large pool only when PARALLEL_AUTOMATIC_TUNING is set to true.

workarea_size_policy auto

pga_aggregate_target 100m

So I ran the statement serially, at parallel 2, and at parallel 4, snapping v$session_wait for the sessions I was investigating (QC and PX slaves). With parallel 2 the waits were mostly direct path read (I've already requested that the PGA target be raised from 100m to 1024m) and PX Deq: Execution Msg/PX Deq: Table Q Normal. With parallel 4 they were PX qref latch, PX Deq: Execute Reply and PX Deq: Table Q Normal:

   select
      _long_list_of_columns_from_both_tables_
   from
      t1 a,
      t2 b
   where
      t1.col1 = t2.col1 (+)
   and
      t1.col2 = t2.col2 (+)
   and
      t2.col1 is null;

-- first run, no parallel --

514654 rows selected.

Elapsed: 00:17:16.84

Execution Plan


    0      SELECT STATEMENT Optimizer=CHOOSE (Cost=153978 Card=3008556 Bytes=3688489656)
    1    0   FILTER
    2    1     HASH JOIN (OUTER)
    3    2       TABLE ACCESS (FULL) OF 't1' (Cost=23081 Card=3008556 Bytes=3393651168)
    4    2       TABLE ACCESS (FULL) OF 't2' (Cost=18506 Card=2493915 Bytes=244403670)





Statistics


           0  recursive calls
           0  db block gets
      432452  consistent gets
      735840  physical reads
           0  redo size
   403453605  bytes sent via SQL*Net to client
      382569  bytes received via SQL*Net from client
       34312  SQL*Net roundtrips to/from client
           0  sorts (memory)
           0  sorts (disk)
      514654  rows processed


-- second run, parallel 2 --

514654 rows selected.

Elapsed: 00:17:56.50

Execution Plan


    0      SELECT STATEMENT Optimizer=CHOOSE (Cost=59815 Card=3008556 Bytes=3688489656)
    1    0   FILTER*                                                                      :Q5728002
    2    1     HASH JOIN* (OUTER)                                                         :Q5728002
    3    2       TABLE ACCESS* (FULL) OF 't1' (Cost=11541 Card=3008556 Bytes=3393651168)  :Q5728000
    4    2       TABLE ACCESS* (FULL) OF 't2' (Cost=9253 Card=2493915 Bytes=244403670)    :Q5728001


    1 PARALLEL_COMBINED_WITH_CHILD
    2 PARALLEL_TO_SERIAL            SELECT /*+ ORDERED NO_EXPAND USE_HASH(A2) */ A1.C0,A2.C0,A1.C1,A1.C2,A1.C3,A1.C4
    3 PARALLEL_TO_PARALLEL          SELECT /*+ NO_EXPAND ROWID(A1) */ A1."c1" C0,A1."c2" C1,A1."c3
    4 PARALLEL_TO_PARALLEL          SELECT /*+ NO_EXPAND ROWID(A1) */ A1."c1OLA" C0,A1."c2" C1,A1."c3

Statistics


        4184  recursive calls
           4  db block gets
     1665697  consistent gets
      720103  physical reads
         856  redo size
   403563666  bytes sent via SQL*Net to client
      382604  bytes received via SQL*Net from client
       34312  SQL*Net roundtrips to/from client
           2  sorts (memory)
           0  sorts (disk)
      514654  rows processed

I've already tried different values for the SQL*Plus ARRAYSIZE (150-500) and for db_file_multiblock_read_count (128); the impact on the execution time was minimal.

At 16:39:16 I ran the query with parallel 4. At 17:21:06 I ran the following statement a few times; the output was something like this:

   1  select sid, EVENT, P1TEXT, P1RAW, P2TEXT, P2RAW, WAIT_TIME, STATE
   2  from v$session_wait
   3  where sid in (20, 24, 15, 18, 22)
   4* order by WAIT_TIME

       SID EVENT                  P1TEXT              P1RAW            P2TEXT     P2RAW             WAIT_TIME STATE
---------- ---------------------- ------------------- ---------------- ---------- ---------------- ---------- -------
        18 PX qref latch          function            0000000000000001 sleeptime  0000000000000064          0 WAITING
        24 PX qref latch          function            0000000000000001 sleeptime  0000000000000064          0 WAITING
        22 PX Deq: Execute Reply  sleeptime/senderid  00000000000000C8 passes     000000000000001F          0 WAITING
        15 PX Deq: Table Q Normal sleeptime/senderid  00000000000000C8 passes     0000000000000001          0 WAITING
        20 PX Deq: Table Q Normal sleeptime/senderid  00000000000000C8 passes     0000000000000002          0 WAITING

Then I hit Ctrl-C and interrupted the parallel 4 statement:

ERROR at line 218:
ORA-01013: user requested cancel of current operation

Elapsed: 00:42:58.41
             ^^^^^

I'll try to provide greater details later.

> Also query V$PQ_TQSTAT immediately after the query finishes, in the same session. Or if the system is idle outside of these queries you could also perform manual statspack snapshots before and after.
> Also, the xplans would be helpful in being able to see details in the parallelism columns - TQ|IN-OUT|PQ Distrib. I believe under 9.2.0.6 we ran into a CPU patch that changed the PQ distribution for one particular query, causing a huge performance problem.

I'll try this the next time I have access to the machine.
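For the V$PQ_TQSTAT check I'll use something along these lines (columns as documented for 9i), run in the same session right after the query completes:

   select dfo_number, tq_id, server_type, process, num_rows, bytes, waits, timeouts
   from v$pq_tqstat
   order by dfo_number, tq_id, server_type desc, process;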

> It'd be nice to know why there are waits on the PX qref latch, meaning there's potentially an imbalance in the speed of producers vs. consumers, but if you can't get any details or do testing then you might need to rely on increasing PEMS (parallel_execution_message_size). Just note that I believe under 9i if you don't set a value for the large pool then PEMS buffers will come out of the shared pool. PARALLEL_AUTOMATIC_TUNING may need to be set as well, I can't remember.
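If it comes to that, I'd change PEMS together with the large pool; a sketch only, with illustrative values (in 9i parallel_execution_message_size is static, so assuming an spfile this needs a restart):

   alter system set large_pool_size = 64m scope=spfile;
   alter system set parallel_execution_message_size = 4096 scope=spfile;
   -- restart the instance for the new message size to take effect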

Thanks!

Best regards
Dimitre

--
http://www.freelists.org/webpage/oracle-l
Received on Wed Nov 23 2011 - 08:45:51 CST
