Skip navigation.

Randolf Geist

Syndicate content
Updated: 11 hours 31 min ago

Parallel Projection

Sun, 2015-08-16 08:09
A recent case at a client reminded me of something that isn't really new but not so well known - Oracle by default performs evaluation at the latest possible point in the execution plan.

So if you happen to have expressions in the projection of a simple SQL statement that runs parallel it might be counter-intuitive that by default Oracle won't evaluate the projection in the Parallel Slaves but in the Query Coordinator - even if it was technically possible - because the latest possible point is the SELECT operation with the ID = 0 of the plan, which is always performed by the Query Coordinator.

Of course, if you make use of expressions that can't be evaluated in parallel or aren't implemented for parallel evaluation, then there is no other choice than doing this in the Query Coordinator.

The specific case in question was a generic export functionality that allowed exporting report results to some CSV or Excel like format, and some of these reports had a lot of rows and complex - in that case CPU intensive - expressions in their projection clause.

When looking at the run time profile of such an export query it became obvious that although it was a (very simple) parallel plan, all of the time was spent in the Query Coordinator, effectively turning this at runtime into a serial execution.

This effect can be reproduced very easily:

create table t_1
compress
as
select /*+ use_nl(a b) */
rownum as id
, rpad('x', 100) as filler
from
(select /*+ cardinality(1e5) */ * from dual
connect by
level <= 1e5) a, (select /*+ cardinality(20) */ * from dual connect by level <= 20) b
;

exec dbms_stats.gather_table_stats(null, 't_1', method_opt=>'for all columns size 1')

alter table t_1 parallel cache;

-- Run some CPU intensive expressions in the projection
-- of a simple parallel Full Table Scan
set echo on timing on time on

set autotrace traceonly statistics

set arraysize 500

select
regexp_replace(filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'c') as some_cpu_intensive_exp1
, regexp_replace(filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'i') as some_cpu_intensive_exp2
, regexp_replace(filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'm') as some_cpu_intensive_exp3
from t_1
;

-- The plan is clearly parallel
--------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | TQ |IN-OUT| PQ Distrib |
--------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 2000K| 192M| 221 (1)| 00:00:01 | | | |
| 1 | PX COORDINATOR | | | | | | | | |
| 2 | PX SEND QC (RANDOM)| :TQ10000 | 2000K| 192M| 221 (1)| 00:00:01 | Q1,00 | P->S | QC (RAND) |
| 3 | PX BLOCK ITERATOR | | 2000K| 192M| 221 (1)| 00:00:01 | Q1,00 | PCWC | |
| 4 | TABLE ACCESS FULL| T_1 | 2000K| 192M| 221 (1)| 00:00:01 | Q1,00 | PCWP | |
--------------------------------------------------------------------------------------------------------------

-- But the runtime profile looks more serial
-- although the Parallel Slaves get used to run the Full Table Scan
-- All time spent in the operation ID = 0
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id | Pid | Operation | Name | Execs | A-Rows| ReadB | ReadReq | Start | Dur(T)| Dur(A)| Time Active Graph | Parallel Distribution ASH | Parallel Execution Skew ASH | Activity Graph ASH | Top 5 Activity ASH |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| 0 | | SELECT STATEMENT | | 5 | 2000K | | | 3 | 136 | 120 | #################### | 1:sqlplus.exe(120)[2000K],P008(0)[0],P009(0)[0],P00A(0)[0],P00B(0)[0] | ################################ | @@@@@@@@@@@@@@@@@@@ ( 98%) | ON CPU(120) |
| 1 | 0 | PX COORDINATOR | | 5 | 2000K | | | 119 | 1 | 1 | # | 1:sqlplus.exe(1)[2000K],P008(0)[0],P009(0)[0],P00A(0)[0],P00B(0)[0] | | ( .8%) | ON CPU(1) |
| 2 | 1 | PX SEND QC (RANDOM)| :TQ10000 | 4 | 2000K | | | 66 | 11 | 2 | ## | 2:P00B(1)[508K],P00A(1)[490K],P008(0)[505K],P009(0)[497K],sqlplus.exe(0)[0] | | (1.6%) | PX qref latch(2) |
| 3 | 2 | PX BLOCK ITERATOR | | 4 | 2000K | | | | | | | 0:P00B(0)[508K],P008(0)[505K],P009(0)[497K],P00A(0)[490K],sqlplus.exe(0)[0] | | | |
|* 4 | 3 | TABLE ACCESS FULL| T_1 | 52 | 2000K | 23M | 74 | | | | | 0:P00B(0)[508K],P008(0)[505K],P009(0)[497K],P00A(0)[490K],sqlplus.exe(0)[0] | | | |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Fortunately there is a simple and straightforward way to make use of the Parallel Slaves for evaluation of projection expressions that can be evaluated in parallel - simply add a suitable NO_MERGE hint for the query block that you want the projection to be evaluated for in the Parallel Slaves.

If you don't want to have side effects on the overall plan shape by not merging views you could always wrap the original query in an outer SELECT and not merging the now inner query block. There seems to be a rule that the projection of a view always get evaluated at the VIEW operator, and if we check the execution plan we can see that the VIEW operator is marked parallel:

set echo on timing on time on

set autotrace traceonly statistics

set arraysize 500

select /*+ no_merge(x) */ * from (
select
regexp_replace(filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'c') as some_cpu_intensive_exp1
, regexp_replace(filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'i') as some_cpu_intensive_exp2
, regexp_replace(filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'm') as some_cpu_intensive_exp3
from t_1
) x
;

-- View operator is marked parallel
-- This is were the projection clause of the VIEW will be evaluated
---------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | TQ |IN-OUT| PQ Distrib |
---------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 2000K| 11G| 221 (1)| 00:00:01 | | | |
| 1 | PX COORDINATOR | | | | | | | | |
| 2 | PX SEND QC (RANDOM) | :TQ10000 | 2000K| 11G| 221 (1)| 00:00:01 | Q1,00 | P->S | QC (RAND) |
| 3 | VIEW | | 2000K| 11G| 221 (1)| 00:00:01 | Q1,00 | PCWP | |
| 4 | PX BLOCK ITERATOR | | 2000K| 192M| 221 (1)| 00:00:01 | Q1,00 | PCWC | |
| 5 | TABLE ACCESS FULL| T_1 | 2000K| 192M| 221 (1)| 00:00:01 | Q1,00 | PCWP | |
---------------------------------------------------------------------------------------------------------------

-- Runtime profile now shows effective usage of Parallel Slaves
-- for doing the CPU intensive work
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id | Pid | Operation | Name | Execs | A-Rows| Start | Dur(T)| Dur(A)| Time Active Graph | Parallel Distribution ASH | Parallel Execution Skew ASH| Activity Graph ASH | Top 5 Activity ASH |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| 0 | | SELECT STATEMENT | | 5 | 2000K | | | | | 0:sqlplus.exe(0)[2000K],P000(0)[0],P001(0)[0],P002(0)[0],P003(0)[0] | | | |
| 1 | 0 | PX COORDINATOR | | 5 | 2000K | 17 | 63 | 10 | # ## # #### | 1:sqlplus.exe(10)[2000K],P000(0)[0],P001(0)[0],P002(0)[0],P003(0)[0] | #### | * (5.6%) | resmgr:cpu quantum(10) |
| 2 | 1 | PX SEND QC (RANDOM) | :TQ10000 | 4 | 2000K | 5 | 61 | 10 | ## # ## ## ## # | 3:P002(5)[544K],P001(4)[487K],P000(1)[535K],P003(0)[434K],sqlplus.exe(0)[0] | # | (5.6%) | ON CPU(7),resmgr:cpu quantum(3) |
| 3 | 2 | VIEW | | 4 | 2000K | 2 | 82 | 69 | #################### | 4:P003(42)[434K],P001(35)[487K],P000(26)[535K],P002(22)[544K],sqlplus.exe(0)[0] | ############ | @@@@@@@@@@@@@@@@@@@ ( 70%) | ON CPU(125) |
| 4 | 3 | PX BLOCK ITERATOR | | 4 | 2000K | | | | | 0:P002(0)[544K],P000(0)[535K],P001(0)[487K],P003(0)[434K],sqlplus.exe(0)[0] | | | |
|* 5 | 4 | TABLE ACCESS FULL| T_1 | 52 | 2000K | 3 | 78 | 29 | ###### ####### # ### | 4:P000(11)[535K],P002(8)[544K],P001(8)[487K],P003(7)[434K],sqlplus.exe(0)[0] | ### | ***** ( 19%) | resmgr:cpu quantum(30),ON CPU(4) |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
At runtime the duration of the query now gets reduced significantly and we can see the Parallel Slaves getting used when the VIEW operator gets evaluated. Although the overall CPU time used is similar to the previous example, the duration of the query execution is less since this CPU time is now spent in parallel in the slaves instead in the Query Coordinator.

Summary
By default Oracle performs evaluation at the latest possible point of the execution plan. Sometimes you can improve runtime by actively influencing when the projection will be evaluated by preventing view merging and introducing a VIEW operator that will be used to evaluate the projection clause.

The optimizer so far doesn't seem to incorporate such possibilities in its evaluations of possible plan shapes, so this is something you need to do manually up to and including Oracle 12c (version 12.1.0.2 as of time of writing this).

12c Parallel Execution New Features: Parallel FILTER Subquery Evaluation - Part 1: Introduction

Sun, 2015-07-26 11:11
12c introduces another interesting new Parallel Execution feature - the parallel evaluation of FILTER subqueries. In pre-12c FILTER subqueries always had to be evaluated in the Query Coordinator. This had several consequences, in particular the data driving the FILTER subquery always had to flow through the Query Coordinator, and hence represented a forced serial execution part of a parallel execution plan. This limitation also meant that depending on the overall plan shape the parallel plan was possibly decomposed into multiple DFO trees, leading to other side effects I've outlined in some of my other publications already.

In 12c now the FILTER subquery can be evaluated in the Parallel Slaves, and the driving data no longer needs to be processed in the Query Coordinator. However, the resulting plan shape can be a little bit confusing. Let's have a look at a simple example:

create table t_1
compress
as
select /*+ use_nl(a b) */
rownum as id
, rpad('x', 100) as filler
from
(select /*+ cardinality(1e5) */ * from dual
connect by
level <= 1e5) a, (select /*+ cardinality(20) */ * from dual connect by level <= 20) b
;

exec dbms_stats.gather_table_stats(null, 't_1', method_opt=>'for all columns size 1')

alter table t_1 parallel 4;

create index t_1_idx on t_1 (id) invisible;

explain plan for
select /*+
--optimizer_features_enable('11.2.0.4')
*/ count(*) from
t_1 t
where exists (select /*+ no_unnest */ null from t_1 where t.id = t_1.id);

-- 11.2.0.4 plan shape with index invisible
----------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | TQ |IN-OUT| PQ Distrib |
----------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 6 | 440M (2)| 04:47:04 | | | |
| 1 | SORT AGGREGATE | | 1 | 6 | | | | | |
|* 2 | FILTER | | | | | | | | |
| 3 | PX COORDINATOR | | | | | | | | |
| 4 | PX SEND QC (RANDOM)| :TQ20000 | 2000K| 11M| 221 (1)| 00:00:01 | Q2,00 | P->S | QC (RAND) |
| 5 | PX BLOCK ITERATOR | | 2000K| 11M| 221 (1)| 00:00:01 | Q2,00 | PCWC | |
| 6 | TABLE ACCESS FULL| T_1 | 2000K| 11M| 221 (1)| 00:00:01 | Q2,00 | PCWP | |
| 7 | PX COORDINATOR | | | | | | | | |
| 8 | PX SEND QC (RANDOM)| :TQ10000 | 1 | 6 | 222 (2)| 00:00:01 | Q1,00 | P->S | QC (RAND) |
| 9 | PX BLOCK ITERATOR | | 1 | 6 | 222 (2)| 00:00:01 | Q1,00 | PCWC | |
|* 10 | TABLE ACCESS FULL| T_1 | 1 | 6 | 222 (2)| 00:00:01 | Q1,00 | PCWP | |
----------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - filter( EXISTS (SELECT /*+ NO_UNNEST */ 0 FROM "T_1" "T_1" WHERE "T_1"."ID"=:B1))
10 - filter("T_1"."ID"=:B1)

-- 12.1.0.2 plan shape with index invisible
-----------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | TQ |IN-OUT| PQ Distrib |
-----------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 6 | 1588M (2)| 17:14:09 | | | |
| 1 | SORT AGGREGATE | | 1 | 6 | | | | | |
| 2 | PX COORDINATOR | | | | | | | | |
| 3 | PX SEND QC (RANDOM) | :TQ10000 | 1 | 6 | | | Q1,00 | P->S | QC (RAND) |
| 4 | SORT AGGREGATE | | 1 | 6 | | | Q1,00 | PCWP | |
|* 5 | FILTER | | | | | | Q1,00 | PCWC | |
| 6 | PX BLOCK ITERATOR | | 2000K| 11M| 221 (1)| 00:00:01 | Q1,00 | PCWC | |
| 7 | TABLE ACCESS FULL| T_1 | 2000K| 11M| 221 (1)| 00:00:01 | Q1,00 | PCWP | |
|* 8 | TABLE ACCESS FULL | T_1 | 1 | 6 | 798 (2)| 00:00:01 | | | |
-----------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

5 - filter( EXISTS (SELECT /*+ NO_UNNEST */ 0 FROM "T_1" "T_1" WHERE "T_1"."ID"=:B1))
8 - filter("T_1"."ID"=:B1)

-- 11.2.0.4 plan shape with index visible
----------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | TQ |IN-OUT| PQ Distrib |
----------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 6 | 5973K (1)| 00:03:54 | | | |
| 1 | SORT AGGREGATE | | 1 | 6 | | | | | |
|* 2 | FILTER | | | | | | | | |
| 3 | PX COORDINATOR | | | | | | | | |
| 4 | PX SEND QC (RANDOM)| :TQ10000 | 2000K| 11M| 221 (1)| 00:00:01 | Q1,00 | P->S | QC (RAND) |
| 5 | PX BLOCK ITERATOR | | 2000K| 11M| 221 (1)| 00:00:01 | Q1,00 | PCWC | |
| 6 | TABLE ACCESS FULL| T_1 | 2000K| 11M| 221 (1)| 00:00:01 | Q1,00 | PCWP | |
|* 7 | INDEX RANGE SCAN | T_1_IDX | 1 | 6 | 3 (0)| 00:00:01 | | | |
----------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - filter( EXISTS (SELECT /*+ NO_UNNEST */ 0 FROM "T_1" "T_1" WHERE "T_1"."ID"=:B1))
7 - access("T_1"."ID"=:B1)

-- 12.1.0.2 plan shape with index visible
-----------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | TQ |IN-OUT| PQ Distrib |
-----------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 6 | 5973K (1)| 00:03:54 | | | |
| 1 | SORT AGGREGATE | | 1 | 6 | | | | | |
| 2 | PX COORDINATOR | | | | | | | | |
| 3 | PX SEND QC (RANDOM) | :TQ10000 | 1 | 6 | | | Q1,00 | P->S | QC (RAND) |
| 4 | SORT AGGREGATE | | 1 | 6 | | | Q1,00 | PCWP | |
|* 5 | FILTER | | | | | | Q1,00 | PCWC | |
| 6 | PX BLOCK ITERATOR | | 2000K| 11M| 221 (1)| 00:00:01 | Q1,00 | PCWC | |
| 7 | TABLE ACCESS FULL| T_1 | 2000K| 11M| 221 (1)| 00:00:01 | Q1,00 | PCWP | |
|* 8 | INDEX RANGE SCAN | T_1_IDX | 1 | 6 | 3 (0)| 00:00:01 | | | |
-----------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

5 - filter( EXISTS (SELECT /*+ NO_UNNEST */ 0 FROM "T_1" "T_1" WHERE "T_1"."ID"=:B1))
8 - access("T_1"."ID"=:B1)

I've included two variations of the setup, one without available index for evaluating the FILTER subquery and one with index.

The pre-12c plan shape without index makes the former limitation particularly obvious: The FILTER operator is above the PX COORDINATOR and marked serial, and the table scan in the FILTER subquery gets parallelized as separate DFO tree (indicated among others by the two PX COORDINATOR operators), which means that each time this separate DFO tree starts, a separate set of Parallel Slave will be allocated/deallocated, adding possibly a lot of overhead to a probably already inefficient execution plan anyway - assuming the FILTER subquery needs to be evaluated many times.

In 12c the FILTER operator is marked parallel and the need for a separate DFO tree is gone. What might be confusing with this plan shape is that the operations of the FILTER subquery are not marked parallel. In my opinion this is misleading and should actually be marked parallel, because at runtime the operations will be performed by the Parallel Slaves, and in case of a Full Table Scan each slave will run the entire full table scan (so no PX ITERATOR for dividing the scan into chunks / granules), which is comparable to what happens when a parallel Nested Loop join runs or the new PQ_REPLICATE feature gets used - and in those cases the operations are marked parallel:

-- 11.2.0.4 / 12.1.0.2 plan shape with index invisible
-- and subquery unnested using NL SEMI join
-----------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | TQ |IN-OUT| PQ Distrib |
-----------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 12 | 442M (2)| 04:48:03 | | | |
| 1 | SORT AGGREGATE | | 1 | 12 | | | | | |
| 2 | PX COORDINATOR | | | | | | | | |
| 3 | PX SEND QC (RANDOM) | :TQ10000 | 1 | 12 | | | Q1,00 | P->S | QC (RAND) |
| 4 | SORT AGGREGATE | | 1 | 12 | | | Q1,00 | PCWP | |
| 5 | NESTED LOOPS SEMI | | 2000K| 22M| 442M (2)| 04:48:03 | Q1,00 | PCWP | |
| 6 | PX BLOCK ITERATOR | | | | | | Q1,00 | PCWC | |
| 7 | TABLE ACCESS FULL| T_1 | 2000K| 11M| 221 (1)| 00:00:01 | Q1,00 | PCWP | |
|* 8 | TABLE ACCESS FULL | T_1 | 2000K| 11M| 796 (2)| 00:00:01 | Q1,00 | PCWP | |
-----------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

8 - filter("T"."ID"="T_1"."ID")

Summary
So the operators of the FILTER subquery can now be run in the slaves, which is the main point of this feature, although being represented in a confusing way in the execution plan. Avoiding the potential decomposition into multiple DFO trees is another possible side effect. Decreased query duration should be possible if the evaluation of the FILTER subquery requires significant time and can now be run in the Parallel Slaves instead of serial execution through the Query Coordinator.

Note that depending on the plan shape and SQL features used, it's still possible that 12c reverts to the old serial FILTER subquery evaluation plan shape, so the new feature doesn't get used always.

There is more to say about this feature. In the next part of this instalment I'll focus on the different distribution methods possible with the new parallel FILTER operator - there is a new PQ_FILTER hint that allows controlling the distribution, but there are also some interesting points to make about how the optimizer seems to make its choice which distribution method to use automatically. In the examples shown here there's no separate distribution for the FILTER, by the way, but this can look differently, as I'll show in the next part.

Video Tutorial: XPLAN_ASH Active Session History - Part 6

Sun, 2015-06-28 15:30
The next part of the video tutorial explaining the XPLAN_ASH Active Session History functionality continuing the actual walk-through of the script output.

More parts to follow.

12c Parallel Execution New Features: 1 SLAVE distribution

Mon, 2015-06-22 14:39
When certain SQL features get used in pre-12c versions that force non-parallel evaluation, like using ROWNUM or certain Analytic Functions like LAG/LEAD, then - depending on the overall plan shape - Oracle can start to decompose the parallel execution plan into several so called DFO trees (If you want learn more about DFO trees and DFOs I recommend watching my online tutorial on my Youtube channel).

Now having multiple DFO trees in a single parallel execution plan comes with several side effects that are confusing and complicate matters unnecessarily, like each DFO tree allocates its own PX slave set(s), and so each one can potenially end up with a different DOP, which means you can have more than one DOP in a single parallel execution plan.

Depending on the overall plan shape this might also mean that a DFO tree can get started multiple times, and again this means that each time it is started / completed PX slaves need to be allocated and de-allocated, potentially causing a significant overhead coordinating all that activity that is not directly related to the actual execution.

This also means that having multiple DFO trees can lead to a situation where (a lot) more PX slaves are allocated than expected, in case multiple DFO trees are active at the same time - which again means that if you believe you can limit the number of PX slaves allocated by a single parallel execution using Resource Manager directives you might be wrong.

Since all these are undesirable side effects, starting with release 12c Oracle has put effort into new features that minimize the need for such a decomposition into multiple DFO trees. One of these new features is the so called "1 SLAVE" distribution method that can get used if such a non-parallel evaluation is required.

Quite similar to the recently described "PX SELECTOR" operator the "1 SLAVE" distribution uses a single PX slave out of a slave set to execute the non-parallel operations instead of the Query Coordinator. The main difference this makes is that the parallel and serial operations now are still part of the same DFO tree instead of having parts of the execution plan executed by the Query Coordinator and different DFO trees before and after such serial operations.

Let's have a look at a simple example to demonstrate the new feature. I use here three identical tables, just for the sake of being able to differentiate the tables in the plan output - in principle re-using a single table three times would be sufficient.

create table t2
compress
as
select
(rownum * 2) + 1 as id
, mod(rownum, 2000) + 1 as id2
, rpad('x', 100) as filler
from
(select /*+ cardinality(100000) */ * from dual
connect by
level <= 100000) a, (select /*+ cardinality(20) */ * from dual connect by level <= 20) b
;

exec dbms_stats.gather_table_stats(null, 't2')

create table t4
compress
as
select
(rownum * 2) + 1 as id
, mod(rownum, 2000) + 1 as id2
, rpad('x', 100) as filler
from
(select /*+ cardinality(100000) */ * from dual
connect by
level <= 100000) a, (select /*+ cardinality(20) */ * from dual connect by level <= 20) b
;

exec dbms_stats.gather_table_stats(null, 't4')

create table t6
compress
as
select
(rownum * 2) + 1 as id
, mod(rownum, 2000) + 1 as id2
, rpad('x', 100) as filler
from
(select /*+ cardinality(100000) */ * from dual
connect by
level <= 100000) a, (select /*+ cardinality(20) */ * from dual connect by level <= 20) b
;

exec dbms_stats.gather_table_stats(null, 't6')

explain plan for
select /*+ no_merge(x) */
*
from
(
select /*+ parallel(t6 4)
--optimizer_features_enable('11.2.0.4')
*/
*
from
(
select /*+ parallel(t2 4) */
--lag(id) over (order by id) as v1_rn
rownum as v1_rn
, t2.id as v1_id
, t2.filler as v1_filler
from
t2
) v1
, (
select /*+ parallel(t4 2) */
--lag(id) over (order by id) as v2_rn
rownum as v2_rn
, t4.id as v2_id
, t4.filler as v2_filler
from
t4
) v2
, t6
where
v1_id = v2_id
and v1_id = t6.id
) x
where
rownum > 1
;

-- 11.2.0.4 plan shape
----------------------------------------------------------------------------------
| Id | Operation | Name | TQ |IN-OUT| PQ Distrib |
----------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | |
| 1 | COUNT | | | | |
|* 2 | FILTER | | | | |
| 3 | PX COORDINATOR | | | | |
| 4 | PX SEND QC (RANDOM) | :TQ30002 | Q3,02 | P->S | QC (RAND) |
| 5 | VIEW | | Q3,02 | PCWP | |
|* 6 | HASH JOIN | | Q3,02 | PCWP | |
| 7 | PX RECEIVE | | Q3,02 | PCWP | |
| 8 | PX SEND HASH | :TQ30001 | Q3,01 | P->P | HASH |
| 9 | PX BLOCK ITERATOR | | Q3,01 | PCWC | |
| 10 | TABLE ACCESS FULL | T6 | Q3,01 | PCWP | |
| 11 | BUFFER SORT | | Q3,02 | PCWC | |
| 12 | PX RECEIVE | | Q3,02 | PCWP | |
| 13 | PX SEND HASH | :TQ30000 | | S->P | HASH |
|* 14 | HASH JOIN | | | | |
| 15 | VIEW | | | | |
| 16 | COUNT | | | | |
| 17 | PX COORDINATOR | | | | |
| 18 | PX SEND QC (RANDOM)| :TQ10000 | Q1,00 | P->S | QC (RAND) |
| 19 | PX BLOCK ITERATOR | | Q1,00 | PCWC | |
| 20 | TABLE ACCESS FULL| T2 | Q1,00 | PCWP | |
| 21 | VIEW | | | | |
| 22 | COUNT | | | | |
| 23 | PX COORDINATOR | | | | |
| 24 | PX SEND QC (RANDOM)| :TQ20000 | Q2,00 | P->S | QC (RAND) |
| 25 | PX BLOCK ITERATOR | | Q2,00 | PCWC | |
| 26 | TABLE ACCESS FULL| T4 | Q2,00 | PCWP | |
----------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - filter(ROWNUM>1)
6 - access("V1_ID"="T6"."ID")
14 - access("V1_ID"="V2_ID")

-- 12.1.0.2 plan shape
---------------------------------------------------------------------------------
| Id | Operation | Name | TQ |IN-OUT| PQ Distrib |
---------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | |
| 1 | COUNT | | | | |
|* 2 | FILTER | | | | |
| 3 | PX COORDINATOR | | | | |
| 4 | PX SEND QC (RANDOM) | :TQ10004 | Q1,04 | P->S | QC (RAND) |
| 5 | VIEW | | Q1,04 | PCWP | |
|* 6 | HASH JOIN BUFFERED | | Q1,04 | PCWP | |
| 7 | PX RECEIVE | | Q1,04 | PCWP | |
| 8 | PX SEND HASH | :TQ10002 | Q1,02 | P->P | HASH |
| 9 | PX BLOCK ITERATOR | | Q1,02 | PCWC | |
| 10 | TABLE ACCESS FULL | T6 | Q1,02 | PCWP | |
| 11 | PX RECEIVE | | Q1,04 | PCWP | |
| 12 | PX SEND HASH | :TQ10003 | Q1,03 | S->P | HASH |
|* 13 | HASH JOIN BUFFERED | | Q1,03 | SCWC | |
| 14 | VIEW | | Q1,03 | SCWC | |
| 15 | COUNT | | Q1,03 | SCWP | |
| 16 | PX RECEIVE | | Q1,03 | SCWP | |
| 17 | PX SEND 1 SLAVE | :TQ10000 | Q1,00 | P->S | 1 SLAVE |
| 18 | PX BLOCK ITERATOR | | Q1,00 | PCWC | |
| 19 | TABLE ACCESS FULL| T2 | Q1,00 | PCWP | |
| 20 | VIEW | | Q1,03 | SCWC | |
| 21 | COUNT | | Q1,03 | SCWP | |
| 22 | PX RECEIVE | | Q1,03 | SCWP | |
| 23 | PX SEND 1 SLAVE | :TQ10001 | Q1,01 | P->S | 1 SLAVE |
| 24 | PX BLOCK ITERATOR | | Q1,01 | PCWC | |
| 25 | TABLE ACCESS FULL| T4 | Q1,01 | PCWP | |
---------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - filter(ROWNUM>1)
6 - access("V1_ID"="T6"."ID")
13 - access("V1_ID"="V2_ID")
Let's start with the 11.2.0.4 plan shape: We can see from multiple occurrences of the PX COORDINATOR operator, the TQ column and the corresponding TQ entries in the NAME column that this parallel execution plan consists of three DFO trees. When analyzing the actual run time activity it would also become obvious that the different DFO trees actually run at a different DOP and use different PX slave sets. In this particular case here it would even become obvious that two DFO trees are active at the same time - so it's pretty much demonstrating all the undesirable side effects of having multiple DFO trees that I mentioned above.

The HASH JOIN between V1 and V2 runs serially, whereas the HASH JOIN between T6 and the join result of V1 and V2 runs parallel. Since we have a Serial->Parallel distribution between these two HASH JOINs, an additional BUFFER SORT operation gets added - as outlined in the PX SELECTOR note.

If we now look at the 12.1.0.2 plan shape we notice that the execution plan consists of a single DFO tree, and hence all the side effects I've outlined are gone. The serial evaluation is done by a single PX slave of one set, again indicated by the SCWC/SCWP decorator, similar to the PX SELECTOR operator.

However, the plan shape also demonstrates one possible disadvantage of the new feature: Since now everything is part of a single DFO tree we end up with more re-distributions per DFO tree, and hence might have an increased demand for the dreaded BUFFERing to prevent multiple re-distributions being active at the same time - here this can be seen in operation ID 13 - the HASH JOIN executed serially by one PX slave is actually turned into a HASH JOIN BUFFERED (so the worst case for a (not really) "parallel" hash join - executed only by a single PX slave and needs to buffer the right hand row source in addition), and also the second HASH JOIN is turned into its BUFFERED variant. This means compared to the 11.2.0.4 plan shape that contains no HASH JOIN BUFFERED but an additional BUFFER SORT operation, the 12.1.0.2 plan shape with this data pattern actually requires more PGA / TEMP space than the 11.2.0.4 plan shape due to the double buffering now necessary.

Footnote
The new 1 SLAVE distribution doesn't get used always in 12c. If you run just the join between V1 and V2 for example, then the old plan shape will be used, and there are again multiple DFO trees. Furthermore, in this particular case, when you start changing the DOP used in the PARALLEL hints you also might end up with a plan shape where one view uses the 1 SLAVE distribution whereas the other one uses the old plan shape with Query Coordinator activity - I haven't investigated further why this happens.

If Analytic Functions get used, you might also see a "1 SLAVE (ORDER)" variation of the distribution that enforces a certain order when re-distributing the data, similar to the "PX SEND QC (ORDER)" operator.

12c Parallel Execution New Features: PX SELECTOR

Sun, 2015-06-14 13:49
Continuing my series on new 12c Parallel Execution features: I've already mentioned the new PX SELECTOR operator as part of the new Concurrent UNION ALL feature where it plays a key role. However, in general starting from 12c this new operator usually will get used when it comes to executing a serial part of the execution plan, like a full scan of an object not marked parallel, or an index based operation that can't be parallelized.

In pre-12c such serial parts get executed by the Query Coordinator itself, and the new PX SELECTOR changes that so that one of the PX slaves of a PX slave set is selected to execute that serial part.

There is not much left to say about that functionality, except that it doesn't get used always - there are still plan shapes possible in 12c, depending on the SQL constructs used and combined, that show the pre-12c plan shape where the Query Coordinator executes the serial part.

Let's have a look at a simple example to see in more detail what difference the new operator makes to the overall plan shape and runtime behaviour:

create table t1 as select * from dba_objects;

exec dbms_stats.gather_table_stats(null, 't1')

alter table t1 parallel;

create table t2 as select * from dba_objects;

exec dbms_stats.gather_table_stats(null, 't2')

create index t2_idx on t2 (object_name);

select /*+ optimizer_features_enable('11.2.0.4') */
*
from
t1
, t2
where
t1.object_id = t2.object_id
and t2.object_name like 'BLUB%'
;

-- 11.2.0.4 plan shape
-----------------------------------------------------------------------------------
| Id | Operation | Name | TQ |IN-OUT| PQ Distrib |
-----------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | |
| 1 | PX COORDINATOR | | | | |
| 2 | PX SEND QC (RANDOM) | :TQ10001 | Q1,01 | P->S | QC (RAND) |
|* 3 | HASH JOIN | | Q1,01 | PCWP | |
| 4 | BUFFER SORT | | Q1,01 | PCWC | |
| 5 | PX RECEIVE | | Q1,01 | PCWP | |
| 6 | PX SEND BROADCAST | :TQ10000 | | S->P | BROADCAST |
| 7 | TABLE ACCESS BY INDEX ROWID| T2 | | | |
|* 8 | INDEX RANGE SCAN | T2_IDX | | | |
| 9 | PX BLOCK ITERATOR | | Q1,01 | PCWC | |
|* 10 | TABLE ACCESS FULL | T1 | Q1,01 | PCWP | |
-----------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

3 - access("T1"."OBJECT_ID"="T2"."OBJECT_ID")
8 - access("T2"."OBJECT_NAME" LIKE 'BLUB%')
filter("T2"."OBJECT_NAME" LIKE 'BLUB%')
10 - filter(SYS_OP_BLOOM_FILTER(:BF0000,"T1"."OBJECT_ID"))

-- 12.1.0.2 plan shape
--------------------------------------------------------------------------------------------
| Id | Operation | Name | TQ |IN-OUT| PQ Distrib |
--------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | |
| 1 | PX COORDINATOR | | | | |
| 2 | PX SEND QC (RANDOM) | :TQ10001 | Q1,01 | P->S | QC (RAND) |
|* 3 | HASH JOIN | | Q1,01 | PCWP | |
| 4 | JOIN FILTER CREATE | :BF0000 | Q1,01 | PCWP | |
| 5 | PX RECEIVE | | Q1,01 | PCWP | |
| 6 | PX SEND BROADCAST | :TQ10000 | Q1,00 | S->P | BROADCAST |
| 7 | PX SELECTOR | | Q1,00 | SCWC | |
| 8 | TABLE ACCESS BY INDEX ROWID BATCHED| T2 | Q1,00 | SCWC | |
|* 9 | INDEX RANGE SCAN | T2_IDX | Q1,00 | SCWP | |
| 10 | JOIN FILTER USE | :BF0000 | Q1,01 | PCWP | |
| 11 | PX BLOCK ITERATOR | | Q1,01 | PCWC | |
|* 12 | TABLE ACCESS FULL | T1 | Q1,01 | PCWP | |
--------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

3 - access("T1"."OBJECT_ID"="T2"."OBJECT_ID")
9 - access("T2"."OBJECT_NAME" LIKE 'BLUB%')
filter("T2"."OBJECT_NAME" LIKE 'BLUB%')
12 - filter(SYS_OP_BLOOM_FILTER(:BF0000,"T1"."OBJECT_ID"))
The pre-12c plan shape here shows two significant things that I want to emphasize:

First this plan shape only requires a single PX slave set since the Query Coordinator takes over the part that needs to be re-distributed, so although we have a plan shape that requires re-distribution there's only a single PX slave set involved. In case there is at least one operation that gets executed in parallel and requires re-distribution there always will be two PX slave sets.

Second the plan shape demonstrates that parts of a Parallel Execution plan that get executed serially by the Query Coordinator require an additional BUFFER SORT operation. The HASH JOIN operation itself is blocking while it is consuming the left row source for building the hash table, so there is no true requirement to add another BUFFER SORT after the PX RECEIVE operation, but it looks like a pretty strict rule that any serial activity that involves the Query Coordinator adda a BUFFER SORT operation after re-distribution - I assume the reasoning for this is that the Query Coordinator isn't available for "coordinating" the PX slaves as along as it is actively involved in executing serial operations, hence the need to block any other parallel activity.

This normally shouldn't be too relevant to performance since you should only execute operations serially that are tiny and not worth to run parallel, so buffering them shouldn't add much overhead, but it's just another reason why you see additional BUFFER SORT operations in parallel plans that are not there in serial-only plans.

The 12c plan shape shows the new PX SELECTOR operator that executes now the serial part of the execution plan instead of the Query Coordinator. This also adds new decorators in the IN-OUT column called "SCWC" and "SCWP" respectivley, which you won't find in pre-12c plans - they are probably meant to read "Serial Combined With Child/Parent", similar to "PCWC/PCWP".

The good thing about the new PX SELECTOR is that the need for an additional BUFFER SORT operator is now gone.

However, one side-effect of the new operator for this particular plan shape here is that now a second PX slave set is allocated, although only one PX slave actually will get used at runtime. Note that for other plan shapes that need two PX slave sets anyway this doesn't matter.

Another good thing about the new PX SELECTOR operator is that it avoids an odd bug that sometimes happens with Serial->Parallel redistributions when the Query Coordinator is involved. This bug causes some delay to the overall execution that usually isn't too relevant since it only adds approx 1-2 seconds delay (but it can occur several times per execution so these seconds can add up) and therefore is rarely noticed when a Parallel Execution might take several seconds / minutes typically. I might cover this bug in a separate blog post.

Unrelated to the PX SELECTOR operator, the 12c plan shape also demonstrates that in 12c the way Bloom filters are shown in the plan has been improved. The 11.2.0.4 version includes the same Bloom filter as you can see from the "Predicate Information" section of the plan but doesn't make it that obvious from the plan shape that it is there (and sometimes in pre-12c it even doesn't show up in the "Predicate Information" section but is still used)

Temp Table Transformation Cardinality Estimates - 2

Fri, 2015-05-29 02:00
Continuing from the previous part - which was about the Temp Table Transformation and join cardinality estimates - using the same simple table setup here is a slight variation of the previously used query to demonstrate the potential impact on single table cardinality estimates:

explain plan for
with
cte as (
select /* inline */ id from t1 t
where 1 = 1
)
select /*+
no_merge(a) no_merge(b)
*/ * from cte a, cte b
where a.id = b.id
and a.id > 990 and b.id > 990
;

-- 11.2.0.x Plan with TEMP transformation
--------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes |
--------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1000 | 26000 |
| 1 | TEMP TABLE TRANSFORMATION | | | |
| 2 | LOAD AS SELECT | SYS_TEMP_0FD9D661C_275FD9 | | |
| 3 | TABLE ACCESS FULL | T1 | 1000 | 4000 |
|* 4 | HASH JOIN | | 1000 | 26000 |
|* 5 | VIEW | | 1000 | 13000 |
| 6 | TABLE ACCESS FULL | SYS_TEMP_0FD9D661C_275FD9 | 1000 | 4000 |
|* 7 | VIEW | | 1000 | 13000 |
| 8 | TABLE ACCESS FULL | SYS_TEMP_0FD9D661C_275FD9 | 1000 | 4000 |
--------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

4 - access("A"."ID"="B"."ID")
5 - filter("A"."ID">990)
7 - filter("B"."ID">990)

-- 11.2.0.x Plan with INLINE hint
----------------------------------------------------
| Id | Operation | Name | Rows | Bytes |
----------------------------------------------------
| 0 | SELECT STATEMENT | | 10 | 260 |
|* 1 | HASH JOIN | | 10 | 260 |
| 2 | VIEW | | 10 | 130 |
|* 3 | TABLE ACCESS FULL| T1 | 10 | 40 |
| 4 | VIEW | | 10 | 130 |
|* 5 | TABLE ACCESS FULL| T1 | 10 | 40 |
----------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

1 - access("A"."ID"="B"."ID")
3 - filter("ID">990)
5 - filter("ID">990)
Again it's obvious that the Temp Table Transformation can have significant impact on the single table cardinality estimates.

In particular:

- Although the same filter is applied in both cases to the rowsources A and B, in case of the Temp Table Transformation it doesn't reduce the cardinality. So it's not uncommon to end up with significant cardinality overestimates in case the transformation gets used

- For Exadata environments particularly bad is that the filter isn't pushed into the TABLE ACCESS FULL operator, but only applied in the VIEW operator above, which means that it can't be offloaded - all the data needs to be sent from the Storage Cells to the Compute Nodes and filtered there. Not a very efficient way to operate on Exadata

The behaviour is still the same in 12c.

Temp Table Transformation Cardinality Estimates - 1

Mon, 2015-05-25 13:26
Having published recently two notes about the Temp Table Transformation highlighting the heuristics based decision and other weaknesses, for example regarding the projection of columns, it's time to publish some more notes about it.

The transformation can also have significant impact on cardinality estimates, both join and single table cardinality.

Looking at the difference in the join cardinality estimates of following simple example:

create table t1
as
select
rownum as id
, mod(rownum, 10) + 1 as id2
, rpad('x', 100) as filler
from
dual
connect by
level <= 1000
;

exec dbms_stats.gather_table_stats(null, 't1')

alter session set tracefile_identifier = 'temp_trans_join_card';

alter session set events '10053 trace name context forever, level 1';

explain plan for
with
cte as (
select /* inline */ id + 1 as id from t1 t
where 1 = 1
)
select /*+
--opt_estimate(@"SEL$2" join("A"@"SEL$2" "B"@"SEL$2") rows=1000)
no_merge(a) no_merge(b)
*/ * from cte a, cte b
where a.id = b.id
;

alter session set events '10053 trace name context off';

-- 11.2.0.x Plan with TEMP transformation
--------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes |
--------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 26 |
| 1 | TEMP TABLE TRANSFORMATION | | | |
| 2 | LOAD AS SELECT | SYS_TEMP_0FD9D660C_27269C | | |
| 3 | TABLE ACCESS FULL | T1 | 1000 | 4000 |
|* 4 | HASH JOIN | | 1 | 26 |
| 5 | VIEW | | 1000 | 13000 |
| 6 | TABLE ACCESS FULL | SYS_TEMP_0FD9D660C_27269C | 1000 | 4000 |
| 7 | VIEW | | 1000 | 13000 |
| 8 | TABLE ACCESS FULL | SYS_TEMP_0FD9D660C_27269C | 1000 | 4000 |
--------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

4 - access("A"."ID"="B"."ID")

-- 11.2.0.x Plan with INLINE hint
----------------------------------------------------
| Id | Operation | Name | Rows | Bytes |
----------------------------------------------------
| 0 | SELECT STATEMENT | | 10000 | 253K|
|* 1 | HASH JOIN | | 10000 | 253K|
| 2 | VIEW | | 1000 | 13000 |
| 3 | TABLE ACCESS FULL| T1 | 1000 | 4000 |
| 4 | VIEW | | 1000 | 13000 |
| 5 | TABLE ACCESS FULL| T1 | 1000 | 4000 |
----------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

1 - access("A"."ID"="B"."ID")
the following becomes obvious:

- There are vastly different cardinality estimates possible depending on whether the transformation gets used or not. In principle due to the NO_MERGE hints used, the transformation shouldn't have any impact on the estimates, but it does

- Looking at the optimizer trace file some information seems to get lost when the transformation gets used, in particular column related statistics

- This information loss, like in the example here, can lead to join cardinality estimates of 0 (rounded to 1 in the plan output)

- And even worse, at least in pre-12c versions, such a 0 cardinality estimate can't be corrected via OPT_ESTIMATE hints, since somehow the correction gets ignored/lost although being mentioned in the optimizer trace:

11.2.0.1:
Column (#1): ID(
AvgLen: 22 NDV: 1 Nulls: 0 Density: 0.000000
Column (#1): ID(
AvgLen: 22 NDV: 1 Nulls: 0 Density: 0.000000
Join Card: 0.000000 = = outer (1000.000000) * inner (1000.000000) * sel (0.000000)
>> Join Card adjusted from 0.000000 to: 1000.000000, prelen=2
Adjusted Join Cards: adjRatio=1.00 cardHjSmj=0.000000 cardHjSmjNPF=0.000000 cardNlj=0.000000 cardNSQ=1000.000000 cardNSQ_na=0.000000
Join Card - Rounded: 1 Computed: 0.00
The behaviour regarding the OPT_ESTIMATE hint changes in 12c, but then there are other oddities introduced in 12c that are not there in pre-12c - have a look at the "Query Block" section when using the INLINE variant of the query - there are two identical fully qualified object names, clearly a bug, making hinting using global hint syntax impossible for that query block.

Although my simple example here can be corrected via extended statistics on the join column expression used in the CTE query my point here is that depending on whether the transformation gets used or not vastly different and extreme cardinality estimates are possible - and those extreme cases even can't be corrected in pre-12c versions.

For example I recently had a real life case where two columns were joined that had a significant number of NULL values, one coming from a temp table transformation row source. Without the transformation the join cardinality estimates were reasonable, but the transformation again lead to such a 0 cardinality estimate (that couldn't be corrected via a (correctly specified) OPT_ESTIMATE hint), ruining the whole plan.