Randolf Geist

Video Tutorial: XPLAN_ASH Active Session History - Part 9

Sun, 2016-07-24 09:21
The next part of the video tutorial explaining the XPLAN_ASH Active Session History functionality continuing the actual walk-through of the script output.

More parts to follow.

Video Tutorial: XPLAN_ASH Active Session History - Part 8

Sun, 2016-07-17 09:54
The next part of the video tutorial explaining the XPLAN_ASH Active Session History functionality continuing the actual walk-through of the script output.

More parts to follow.


Comparing Columns Containing NULL Values

Sun, 2016-07-10 06:32
Prompted by a (not really that) recent discussion on the OTN forum, I've decided to publish this note.

Sometimes you have the task of comparing column values, and handling the NULL cases correctly makes this rather cumbersome for columns that are allowed to be NULL.

The "official" SQL way of comparing two column values and to find out whether they are equal or not - under the assumption that having NULL in both columns should be treated as equal (a point that can be argued) would read in SQL something like the following for the "unequal" case:

column1 != column2 or (column1 is null and column2 is not null) or (column1 is not null and column2 is null)
and

column1 = column2 or (column1 is null and column2 is null)
for the "equal" case.

Imagine this being done for a lot of columns and it becomes obvious that people might be looking for more concise ways to express this.
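To illustrate, here is a minimal sketch (table and data made up just for this demo) showing the verbose "equal" expression at work against the possible NULL / non-NULL combinations:

with s as (
          select 1 as col1, 1 as col2 from dual
union all select 1, 2 from dual
union all select null, 1 from dual
union all select 1, null from dual
union all select null, null from dual
)
select col1, col2
     , case
         when col1 = col2 or (col1 is null and col2 is null)
         then 'EQUAL'
         else 'UNEQUAL'
       end as cmp
from s;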

Sometimes you might have come across code that uses NVL or a similar function with some "special" value that is not supposed to be part of the column data to be compared - like -1 for (positive) integer values - but this has the inherent risk of producing wrong results if the "special" value is (or in the future might become) not that special after all.

The not really documented SYS_OP_MAP_NONNULL function could come in handy if it were official, because it covers the NULL cases automatically.

Note: I would still call it undocumented even though it shows up in some "Materialized View" related part of the official 12c documentation, because it is not part of the official SQL language manual.

SYS_OP_MAP_NONNULL(column1) != SYS_OP_MAP_NONNULL(column2)
is all that is needed. More concise than the above, but undocumented.

And there is one obvious drawback to SYS_OP_MAP_NONNULL: since it adds one byte to the input value to ensure the mapped NULL value differs from any other value, it doesn't work properly with maxed out columns - for example this:

SQL> create table t
2 as
3 select rpad('x', 4000) as col1, rpad('x', 4000) as col2 from dual
4 union all
5 select rpad('x', 4000) as col1, rpad('y', 4000) as col2 from dual
6 union all
7 select null as col1, rpad('y', 4000) as col2 from dual
8 union all
9 select rpad('x', 4000) as col1, null from dual
10 union all
11 select null as col1, null as col2 from dual
12 ;

Table created.

SQL>
SQL> exec dbms_stats.gather_table_stats(null, 't')

PL/SQL procedure successfully completed.

SQL>
SQL> select count(*) from t where sys_op_map_nonnull(col1) = sys_op_map_nonnull(col2);
select count(*) from t where sys_op_map_nonnull(col1) = sys_op_map_nonnull(col2)
*
ERROR at line 1:
ORA-01706: user function result value was too large
errors out with "ORA-01706: user function result value was too large".
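To see the extra byte at work you can inspect the mapped values directly - a quick check like the following (I omit the output here since the exact DUMP representation depends on input and version):

select sys_op_map_nonnull(null) as mapped_null
     , dump(sys_op_map_nonnull('x')) as mapped_x
from dual;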

DECODE
DECODE is an exception to the rule of how Oracle treats NULLs in SQL - it returns the "equal" result if both operands to compare are NULL, which allows the following construct to achieve the above comparison (see the OTN forum discussion mentioned above for credits where this expression appeared first; more recently it also showed up on Stew Ashton's blog):

For checking for inequality:

decode(column1, column2, 0, 1) = 1
For equality:

decode(column1, column2, 0, 1) = 0
Very concise, and officially documented.
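A quick one-liner to verify the underlying NULL treatment of DECODE - since both compare operands are NULL, this returns 'EQUAL':

select decode(null, null, 'EQUAL', 'UNEQUAL') as null_cmp from dual;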

Interestingly, starting with 11.2.0.2 there is an implicit rewrite, enabled by default, that transforms the following expression (and only this variation testing for equality, not the one testing for inequality):

decode(column1, column2, 0, 1) = 0
into

sys_op_map_nonnull(column1) = sys_op_map_nonnull(column2)
This is controlled via fix control 8551880 ("Rewrite decode predicate to join").
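The current value of the fix control can be checked for your own session like this (V$SESSION_FIX_CONTROL should contain a corresponding row in versions that include the fix):

select bugno, value, description
from v$session_fix_control
where bugno = 8551880;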

Guess what happens when running the following queries in 11.2.0.2 or later against the above data set (the fix control is enabled by default in these versions):

SQL> select /*+ opt_param('_fix_control', '8551880:0') */ count(*) from t where decode(col1, col2, 0, 1) = 0;

COUNT(*)
----------
2

SQL>
SQL> select /*+ opt_param('_fix_control', '8551880:1') */ count(*) from t where decode(col1, col2, 0, 1) = 0;
select /*+ opt_param('_fix_control', '8551880:1') */ count(*) from t where decode(col1, col2, 0, 1) = 0
*
ERROR at line 1:
ORA-01706: user function result value was too large
Finally, what about performance? For that purpose I created the following compressed table with 50M rows, covering different cases:

create table t1 compress
as
select
       case mod(rownum, 5) + 1
         when 1 then cast('C1' as varchar2(20))
         when 2 then cast(null as varchar2(20))
         when 3 then cast('C3' as varchar2(20))
         when 4 then cast(null as varchar2(20))
         when 5 then cast('C5' as varchar2(20))
       end as col1
     , case mod(rownum, 5) + 1
         when 1 then cast('C2' as varchar2(20))
         when 2 then cast(null as varchar2(20))
         when 3 then cast('C3' as varchar2(20))
         when 4 then cast('C4' as varchar2(20))
         when 5 then cast(null as varchar2(20))
       end as col2
from
       (select /*+ cardinality(1e4) */ null from dual connect by level <= 1e4)
     , (select /*+ cardinality(5e3) */ null from dual connect by level <= 5e3)
;
and got the following results from 11.2.0.4 (ignoring the different cardinality estimates for the moment, since I'm only interested in the elapsed time here):

SQL> select count(*) from t1;

Elapsed: 00:00:00.94

Execution Plan
----------------------------------------------------------
Plan hash value: 3724264953

-------------------------------------------------------------------
| Id | Operation | Name | Rows | Cost (%CPU)| Time |
-------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 18743 (2)| 00:03:45 |
| 1 | SORT AGGREGATE | | 1 | | |
| 2 | TABLE ACCESS FULL| T1 | 50M| 18743 (2)| 00:03:45 |
-------------------------------------------------------------------


Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
68456 consistent gets
0 physical reads
0 redo size
346 bytes sent via SQL*Net to client
364 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed

SQL>
SQL> select count(*) from t1 where decode(col1, col2, 0, 1) = 1;

Elapsed: 00:00:02.55

Execution Plan
----------------------------------------------------------
Plan hash value: 3724264953

---------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 4 | 19176 (4)| 00:03:51 |
| 1 | SORT AGGREGATE | | 1 | 4 | | |
|* 2 | TABLE ACCESS FULL| T1 | 500K| 1953K| 19176 (4)| 00:03:51 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - filter(DECODE("COL1","COL2",0,1)=1)


Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
68456 consistent gets
0 physical reads
0 redo size
346 bytes sent via SQL*Net to client
364 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed

SQL>
SQL> select count(*) from t1 where SYS_OP_MAP_NONNULL("COL1")!=SYS_OP_MAP_NONNULL("COL2");

Elapsed: 00:00:03.04

Execution Plan
----------------------------------------------------------
Plan hash value: 3724264953

---------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 4 | 19515 (6)| 00:03:55 |
| 1 | SORT AGGREGATE | | 1 | 4 | | |
|* 2 | TABLE ACCESS FULL| T1 | 15M| 60M| 19515 (6)| 00:03:55 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - filter(SYS_OP_MAP_NONNULL("COL1")<>SYS_OP_MAP_NONNULL("COL2"))


Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
68456 consistent gets
0 physical reads
0 redo size
346 bytes sent via SQL*Net to client
364 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed

SQL>
SQL> select count(*) from t1 where (col1 != col2 or (col1 is null and col2 is not null) or (col1 is not null and col2 is null));

Elapsed: 00:00:03.51

Execution Plan
----------------------------------------------------------
Plan hash value: 3724264953

---------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 4 | 18876 (2)| 00:03:47 |
| 1 | SORT AGGREGATE | | 1 | 4 | | |
|* 2 | TABLE ACCESS FULL| T1 | 30M| 115M| 18876 (2)| 00:03:47 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - filter("COL1" IS NULL AND "COL2" IS NOT NULL OR "COL2" IS NULL
AND "COL1" IS NOT NULL OR "COL1"<>"COL2")



Statistics
----------------------------------------------------------
1 recursive calls
0 db block gets
68456 consistent gets
0 physical reads
0 redo size
346 bytes sent via SQL*Net to client
364 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed

SQL>
SQL> select /*+ opt_param('_fix_control', '8551880:0') */ count(*) from t1 where decode(col1, col2, 0, 1) = 0;

Elapsed: 00:00:02.49

Execution Plan
----------------------------------------------------------
Plan hash value: 3724264953

---------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 4 | 19176 (4)| 00:03:51 |
| 1 | SORT AGGREGATE | | 1 | 4 | | |
|* 2 | TABLE ACCESS FULL| T1 | 500K| 1953K| 19176 (4)| 00:03:51 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - filter(DECODE("COL1","COL2",0,1)=0)


Statistics
----------------------------------------------------------
1 recursive calls
0 db block gets
68456 consistent gets
0 physical reads
0 redo size
346 bytes sent via SQL*Net to client
364 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed

SQL>
SQL> select /*+ opt_param('_fix_control', '8551880:1') */ count(*) from t1 where decode(col1, col2, 0, 1) = 0;

Elapsed: 00:00:03.04

Execution Plan
----------------------------------------------------------
Plan hash value: 3724264953

---------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 4 | 19515 (6)| 00:03:55 |
| 1 | SORT AGGREGATE | | 1 | 4 | | |
|* 2 | TABLE ACCESS FULL| T1 | 6000K| 22M| 19515 (6)| 00:03:55 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - filter(SYS_OP_MAP_NONNULL("COL1")=SYS_OP_MAP_NONNULL("COL2"))


Statistics
----------------------------------------------------------
1 recursive calls
0 db block gets
68456 consistent gets
0 physical reads
0 redo size
346 bytes sent via SQL*Net to client
364 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed

SQL>
SQL> select count(*) from t1 where col1 = col2 or (col1 is null and col2 is null);

Elapsed: 00:00:02.30

Execution Plan
----------------------------------------------------------
Plan hash value: 3724264953

---------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 4 | 18865 (2)| 00:03:47 |
| 1 | SORT AGGREGATE | | 1 | 4 | | |
|* 2 | TABLE ACCESS FULL| T1 | 13M| 49M| 18865 (2)| 00:03:47 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - filter("COL1" IS NULL AND "COL2" IS NULL OR "COL1"="COL2")


Statistics
----------------------------------------------------------
1 recursive calls
0 db block gets
68456 consistent gets
0 physical reads
0 redo size
346 bytes sent via SQL*Net to client
364 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed
Surprisingly, in my tests SYS_OP_MAP_NONNULL always performed worst, and the DECODE expression performed pretty close to the more verbose SQL expression - depending on the data set it was sometimes even faster.

So using SYS_OP_MAP_NONNULL, and in particular relying on the implicit rewrite, might not be such a good idea after all. And now you know how the automatic rewrite can be disabled if desired.
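If you prefer disabling the rewrite for a whole session instead of per statement via the OPT_PARAM hint shown above, a sketch of the session-level variant (same mechanism, just session scope):

alter session set "_fix_control" = '8551880:0';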

New Version Of XPLAN_ASH Utility

Tue, 2016-06-28 18:23
A new version 4.23 of the XPLAN_ASH utility is available for download.

As usual the latest version can be downloaded here.

This version comes only with minor changes, see the change log below.

Here are the notes from the change log:

- Finally corrected the very old and wrong description of "wait times" in the script comments, which talked about "in-flight" wait events - that is not correct: ASH performs a "fix-up" of the last 255 samples or so and updates them with the time waited, so these wait events are not "in-flight"

- Removed some of the clean-up code added in 4.22 at the beginning of the script, because it didn't really help much and the spooled script output always contained error messages about non-existent column definitions being cleared

- The "Concurrent I/O" sections will now also be skipped in LIMITED_ASH mode

- Some more fixes to the I/O figures in the "Activity Timeline based on ASH" - the spreading introduced in 4.22 needed some further refinement (see 4.22 change log for more details)

Combining Features - Wrong Results With Scalar Subquery Caching

Wed, 2016-06-08 15:39
Quite often you can get into trouble with Oracle when you start combining different features.

In the case of one of my clients it is the combination of user-defined PL/SQL functions that can raise exceptions (think of currency conversion where a non-existent currency code gets passed into the function), DML error logging, and attempting to improve performance by wrapping the PL/SQL function call into a scalar subquery to benefit from the built-in scalar subquery caching feature of the SQL runtime engine.

As long as the scalar subquery wasn't used everything worked as expected, but some time after adding the scalar subquery it became obvious that wrong results occurred: rows that should have been rejected and written to the error logging table - due to the exception raised in the user-defined PL/SQL function - suddenly showed up in the target table. Even more worrying, they included a converted amount, obviously using some arbitrary conversion rate, although the PL/SQL function actually didn't return any value for them but raised an exception for unknown currency codes.

The interesting point here is that everything works as expected if the error logging feature doesn't get used, or if the scalar subquery doesn't get used.

In case of no error logging the whole statement will roll back if an exception occurs, and that holds true and works even with scalar subquery caching in place.

Without scalar subquery caching the correct result gets produced, all rows that are supposed to be rejected are written to the error logging table and don't show up in the target table.

And if no exception gets raised (only valid and existing currency codes get used), again the results are correct in the target table even when using scalar subquery caching - so the caching of the values in principle works correctly.

Only when combining those features and dealing with invalid data does the problem show up - and, as can be demonstrated with a simple test case, even then it only occurs under certain circumstances.

Here is a simple test case for reproducing the issue:

set echo on timing on time on

drop table t_source purge;

drop table t_dest purge;

drop table ERR$_T_DEST purge;

drop package test_plsql_caching_with_except;

create table t_source
as
select
       rownum as id
     , cast('C' || to_char(round(trunc(rownum / 100000)) + 1, 'FM00') as varchar2(3)) as currency_code_clust
     , cast('C' || to_char(mod(rownum, 10) + 1, 'FM00') as varchar2(3)) as currency_code_scat
     , 100 as amount
from
       (select /*+ cardinality(1e3) */ null from dual connect by level <= 1e3) gen1
     , (select /*+ cardinality(1e3) */ null from dual connect by level <= 1e3) gen2
;

exec dbms_stats.gather_table_stats(null, 't_source')

create or replace package test_plsql_caching_with_except
as
  function test_plsql_caching_with_except(s_currency_code in varchar2, s_do_excpt_on in varchar2 default 'C10') return number parallel_enable;
end;
/

create or replace package body test_plsql_caching_with_except
as
  function test_plsql_caching_with_except(s_currency_code in varchar2, s_do_excpt_on in varchar2 default 'C10') return number parallel_enable
  is
  begin
    if s_currency_code = s_do_excpt_on then
      raise_application_error(-20001, 'exception encountered');
    end if;
    --
    return to_number(substr(s_currency_code, 2));
  end;
end;
/

create table t_dest (id number(*, 0), currency_code varchar2(3), org_amount number, conv_amount number);

exec DBMS_ERRLOG.create_error_log (dml_table_name => 't_dest')

truncate table t_dest;

truncate table ERR$_T_DEST;

alter session enable parallel dml;

-- Using "currency_code_scat" as function parameter gives correct result in T_DEST
-- But column CONV_AMOUNT in ERR$_T_DEST shows some function result that shouldn't be there since the function raised an exception (reproduces in 11.2.0.4 but not 12.1.0.2)
--
-- define ccy_code = currency_code_scat

-- Using "currency_code_clust" as function parameter gives wrong result in T_DEST - rows that should go to ERR$_T_DEST
-- due to exception being raised in PL/SQL function end up in T_DEST (with some "cached" function result applied)
-- There shouldn't be any rows in T_DEST with "currency_code = 'C10'"
-- Applies to both serial and Parallel Execution, conventional and direct path insert
define ccy_code = currency_code_clust

-- Use commented, non-cached function call to see correct results
insert /*+
append
no_parallel
-- parallel(n)
*/
into t_dest (id, currency_code, org_amount, conv_amount)
select id, &ccy_code, amount,
--test_plsql_caching_with_except.test_plsql_caching_with_except(&ccy_code) * amount
(select test_plsql_caching_with_except.test_plsql_caching_with_except(&ccy_code) * amount from dual)

from t_source
log errors
reject limit unlimited;

commit;

-- This is supposed to be 900K
select count(*) from t_dest;

-- This is supposed to be 0
select count(*) from t_dest where currency_code = 'C10';

-- This is supposed to be 100K
select count(*) from ERR$_T_DEST;

-- This is supposed to be 0
select count(*) from ERR$_T_DEST where conv_amount is not null;
The second parameter to the PL/SQL function determines for which code an exception should be raised (if at all), and hence allows reproducing different scenarios. By default it raises an exception for code "C10", which represents 10% (100K rows) of the data in T_SOURCE. So 900K rows should end up in the destination table and the 100K rows having code "C10" should go into the error logging table.

Running the test case without the scalar subquery gets the correct result:

SQL> -- This is supposed to be 900K
SQL> select count(*) from t_dest;

COUNT(*)
----------
900000

Elapsed: 00:00:00.70
SQL>
SQL> -- This is supposed to be 0
SQL> select count(*) from t_dest where currency_code = 'C10';

COUNT(*)
----------
0

Elapsed: 00:00:00.21
SQL>
SQL> -- This is supposed to be 100K
SQL> select count(*) from ERR$_T_DEST;

COUNT(*)
----------
100000

Elapsed: 00:00:00.70
SQL>
SQL> -- This is supposed to be 0
SQL> select count(*) from ERR$_T_DEST where conv_amount is not null;

COUNT(*)
----------
0

Elapsed: 00:00:00.04
SQL>
Use the scalar subquery, and suddenly, instead of the 100K rows that should be rejected, only a single one gets rejected and 99,999 make it into the target table with some converted amount:

SQL> -- This is supposed to be 900K
SQL> select count(*) from t_dest;

COUNT(*)
----------
999999

Elapsed: 00:00:00.23
SQL>
SQL> -- This is supposed to be 0
SQL> select count(*) from t_dest where currency_code = 'C10';

COUNT(*)
----------
99999

Elapsed: 00:00:00.20
SQL>
SQL> -- This is supposed to be 100K
SQL> select count(*) from ERR$_T_DEST;

COUNT(*)
----------
1

Elapsed: 00:00:00.00
SQL>
SQL> -- This is supposed to be 0
SQL> select count(*) from ERR$_T_DEST where conv_amount is not null;

COUNT(*)
----------
1

Elapsed: 00:00:00.01
SQL>
Now the critical point is the order of the column values in the column passed into the PL/SQL function. The problem can only be reproduced when using the "clustered" variant of the currency codes (column CURRENCY_CODE_CLUST in T_SOURCE). Switch to the "scattered" variant (CURRENCY_CODE_SCAT), and the (almost) correct result gets produced, at least in the target table. I say "almost" because looking closer at the rows written to the error logging table it becomes obvious that they contain a converted amount that definitely shouldn't be there - so the wrong result error already shows up here, since the row was written to the error logging table due to the PL/SQL function raising an exception, hence the converted amount should be NULL. This "wrong converted amount in the error logging table" no longer occurs in 12c (12.1.0.2), so something has changed but the problem of wrong results (rows with errors / exceptions go into the target table) is still reproducible in 12c.

The problem seems to be related to the way the scalar subquery caching works when the same value comes in again - Jonathan Lewis described this feature in "Cost Based Oracle" a long time ago. There seems to be a special optimisation that avoids the cache lookup in case the same value as before is used again - like a "single value cache" on top of the actual cache. And it looks like it is this special treatment that doesn't handle correctly the case of an exception being raised in the PL/SQL function.

The test case also allows checking whether other caching mechanisms suffer from the same problem - declaring the function DETERMINISTIC (although in the real life case it isn't really deterministic and hence shouldn't be declared as such, since that opens up other possibilities like building a Materialized View or a function-based index on top of the PL/SQL function, which potentially would lead to wrong results again) or making use of the PL/SQL function result cache - and they don't. Both produce the correct results under all scenarios tested.
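For reference, a sketch of how the two alternative caching variants would be declared - simplified stand-alone functions with made-up names, not the actual client code:

-- DETERMINISTIC: only legitimate if the function really is deterministic
create or replace function f_deterministic(p_val in varchar2) return number
deterministic
is
begin
  return length(p_val);
end;
/

-- PL/SQL function result cache: the caching happens in the PL/SQL engine
create or replace function f_result_cache(p_val in varchar2) return number
result_cache
is
begin
  return length(p_val);
end;
/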

I have a more complex variant of the test case that allows spotting more clearly the critical difference between the "scalar subquery / deterministic" caching, which obviously happens in the SQL engine, and the PL/SQL function result cache, which is implemented in the PL/SQL engine: it counts the number of times the function code actually gets executed and how many times the PL/SQL code block got invoked according to the Library Cache information.
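A minimal sketch of that counting idea (greatly simplified, all object names made up for the demo; the Library Cache part is left out): a package variable tracks the actual function executions, and comparing a direct call with the scalar subquery wrapper makes the caching effect visible:

set serveroutput on

create or replace package pkg_call_counter
as
  g_calls number := 0;
  function f(p_val in varchar2) return number;
end;
/

create or replace package body pkg_call_counter
as
  function f(p_val in varchar2) return number
  is
  begin
    -- count how often the function body actually executes
    g_calls := g_calls + 1;
    return length(p_val);
  end;
end;
/

-- 1000 rows, but only 10 distinct values, clustered like CURRENCY_CODE_CLUST
create table t_demo
as
select 'V' || to_char(mod(trunc((rownum - 1) / 100), 10)) as val
from dual connect by level <= 1000;

-- direct call: expect one execution of the function body per row
exec pkg_call_counter.g_calls := 0
select count(pkg_call_counter.f(val)) from t_demo;
exec dbms_output.put_line('direct call executions: ' || pkg_call_counter.g_calls)

-- scalar subquery wrapper: expect far fewer executions due to caching
exec pkg_call_counter.g_calls := 0
select count((select pkg_call_counter.f(val) from dual)) from t_demo;
exec dbms_output.put_line('scalar subquery executions: ' || pkg_call_counter.g_calls)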

In this case here, where the overhead of the actual function call is very small but the function is called very frequently, the PL/SQL function result cache adds much more overhead than the SQL engine caching, because the context switch to PL/SQL needs to be done for every function invocation, only to then be resolved in the PL/SQL function result cache instead of actually executing the function code.

The "scalar subquery / deterministic" cache, in contrast, avoids the context switch to PL/SQL and uses values from a cache maintained in the SQL engine.

So the smaller the overhead of the actual PL/SQL function code and the more frequently the function gets called, the more overhead the usage of the PL/SQL function result cache will produce in comparison to the SQL engine based caching mechanisms.

Of course the PL/SQL function result cache comes with other features the others don't have - like cross-session / cross-statement caching - that might make a big difference if a single function call causes a lot of overhead.

Oracle Database Cloud (DBaaS) Performance Consistency - Part 1

Sun, 2016-06-05 15:58
As an Oracle ACE Director I got an extended trial license for Oracle's Cloud offerings, in particular the "Database as a Service" offering. As part of the (ongoing) evaluation I try to get an idea how consistent the performance of such a service is, which might be one of the concerns when considering cloud offerings in general.

For my tests I've set up an 11.2.0.4 single instance database using "4 OCPUs" (Oracle CPUs), which ends up as an Oracle Linux 6 system showing 8 CPUs *and* 8 cores of type "Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz".

As edition for the database I've chosen the "Extreme Performance" Enterprise Edition, which also shows up in the version banner (note the difference to regular database installations - I'm not sure whether this might break some applications that don't expect this banner):

BANNER
-------------------------------------------------------------------------
Oracle Database 11g EE Extreme Perf Release 11.2.0.4.0 - 64bit Production

I personally find the names of the different available editions not really a good choice: as far as I understand them, the editions, although being called "Standard Edition", "Enterprise Edition", "Enterprise Edition - High Performance" and "Enterprise Edition - Extreme Performance", only differ in the number of available and usable options, like the Diagnostics and Tuning Pack license or the In-Memory Column Store (12c), but not in the basic performance profile, like CPU speed or I/O performance (although Oracle also has an Exadata Service Cloud offering for higher demands in terms of performance). Note that Oracle also offers a "High Memory" configuration that doubles the available memory for the same number of available CPU cores.
For evaluating the performance consistency I plan to run different flavors of tests in this test environment - the first and most simple one is a straight CPU burning PL/SQL loop, spawned as many times as there are CPUs (so eight times on this test system here). I deliberately disabled the database Resource Manager for this test.
For comparison I ran exactly the same test on a physical test system of my own to see how the two systems compare in terms of performance consistency. The physical test system is older and slower and only has four cores available, so the test script was only spawned four times in this case.
The loop executed n times concurrently was simply the following, and it was kept running for several days to also get an idea whether there are noticeable differences on different days of the week:

set echo on

define thread_id = &1;

declare
  n number;
begin
  loop
    n := 0;
    for i in 1..1000000000 loop
      n := n + 1;
    end loop;
    insert into timings(testtype, thread_id, ts) values ('PLSQL', &thread_id, systimestamp);
    commit;
  end loop;
end;
/
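The TIMINGS table the loop inserts into isn't shown above - a minimal definition matching the INSERT statement (an assumption on my side, the original DDL isn't published) could look like this:

create table timings (
  testtype  varchar2(10)
, thread_id number
, ts        timestamp
);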

These were the results for the DBaaS service:
And these are the results for the physical host:

The graphs are supposed to show how many of the runs deviated how much from the overall median runtime: the Y axis represents the percentage of runs, and the X axis represents the deviation from the median runtime - for example, 0.5 on the X axis means 0.5 percent deviation from the median value.
Looking at the results the following becomes obvious:
- The physical host provides a slightly more consistent performance (as expected): 73% of its runs deviated 0.5 percent or less from the median runtime, whereas the DBaaS service had "only" 64%. It should be noted that on the physical system I tried to minimize any "noise" generated by other processes, which I didn't attempt for the DBaaS service - although in principle I could have, because you get root access for this kind of service and can do whatever you want. Interestingly, the physical host shows a much more significant number of runs deviating more than one percent from the median value, which is not the case for the DBaaS service. This also means that if I change the above analysis to the number of runs with 1 percent or less deviation from the median runtime, the DBaaS service (85%) actually is more consistent than my physical host (still 73%).
- The DBaaS service has some more, and some more extreme, outliers, but nothing really significant
- The overall performance consistency of the DBaaS service is pretty close to the physical host
The same graph on a per day basis (DBaaS service):
and physical host:

- After a couple of days the DBaaS service showed a very stable consistency pattern; only during the first four days was the pattern different:
First days:
Remaining days:
- For this test run no significant differences between different days could be spotted, except for the noticeable difference between the first few days and the remaining days
One interesting point is that in the DBaaS service the different threads showed consistent but different runtimes - something that couldn't be seen on the physical host, where all threads showed similar runtimes. Whether this might have to do with some NUMA configuration or similar I can't tell, but it is obvious for this test run - the Y axis represents the median duration per run per day:
DBaas service:
 Physical host:
The next test will be a CPU bound SQL operation that performs logical I/Os only, so no physical I/O involved, whereas as a final test run I plan a physical I/O bound setup.

SOUG Training Day May 2016 in Switzerland

Wed, 2016-02-24 12:27
I will be one of the speakers at the first SOUG performance training day, together with Christian Antognini, Franck Pachot and Clemens Bleile.

The event will take place in May this year in Switzerland at two different locations / on two different days (one in German, one in French - except my presentations, which will be in English).

My presentations will be:

- Analyzing and Troubleshooting Oracle Parallel Execution

- Advanced Oracle Troubleshooting

Hope to see you there!

Update: Here is the link to the official landing page of the event on the SOUG website.

Below is the official announcement published by SOUG with more details:


"Block the date ! SOUG performance training day the 18th and 19th of May exclusively in Switzerland"

In 2016 the Swiss Oracle User Group (SOUG) will introduce a new kind of event called the "SOUG training day".

During a whole day several world class speakers will teach the participants on a particular topic.

This year, Christian Antognini (Trivadis), Randolf Geist, Franck Pachot (dbi services) and Clemens Bleile (dbi services) will help the participant to manage the performances of their Oracle databases. The following topics will be addressed and explained in depth :

Strategies for Keeping Object Statistics Up-to-Date
Next-Generation Oracle Database - New Performance Features
Analyzing and Troubleshooting Oracle Parallel Execution
Advanced Oracle Troubleshooting
Interpreting AWR Reports - Straight to the Goal
All About Table Locks: DML, DDL, Foreign Key, Online Operations in regards to performance

The SOUG invites you to block the following dates to be sure that you do not miss such an incredible opportunity to meet the cracks :

- 18th of May in Olten - presentations in German
- 19th of May in Geneva - presentations in French (Mr Randolf Geist will present 2h in english)

This training day will be proposed to very attractive conditions :
- CHF 500.- for SOUG members
- CHF 1000.- for non-SOUG members

Advanced Oracle Troubleshooting - One Day Seminar, Moscow

Sun, 2016-02-14 16:05
I got invited by Luxoft Training to deliver my one day seminar "Advanced Oracle Troubleshooting" in Moscow at the end of March.

More details about what I cover in this seminar can be found here.

If you're interested, Luxoft Training has set up a page in Russian where you can find more details - but note that the seminar language will be English.

Big Nodes, Concurrent Parallel Execution And High System/Kernel Time

Sat, 2016-02-06 17:47
The following is probably only relevant for customers running Oracle on big servers with lots of cores in single instance mode (this specific problem doesn't reproduce in a RAC environment - see below for an explanation why), like one of my clients making use of the Exadata Xn-8 servers, for example an X4-8 with 120 cores / 240 CPUs per node (but the problem also reproduced on older and smaller boxes with 64 cores / 128 CPUs per node).

They recently came up with a re-write of a core application functionality. Part of this code started the same code path for different data sets, potentially several times concurrently, ending up with many sessions making use of Parallel Execution. In addition a significant part of the queries used by this code made questionable use of Parallel Execution, in the sense that queries of very short duration used Parallel Execution, hence ending up with several Parallel Execution starts per second. You could see this pattern in the AWR reports, showing several "DFO trees" parallelized per second on average over an hour period:



When the new code was tested with production-like data volumes and patterns, in the beginning the CPU profile of such a big node (running in single instance mode) looked like this, when nothing else was running on that box:



As you can see, the node was completely CPU bound, spending most of the time in System/Kernel time. The AWR reports showed some pretty unusual PX wait events as significant:



"PX Deq: Slave Session Stats" shouldn't be a relevant wait event since it is about the PX slaves at the end of a PX execution passing an array of session statistics to the PX coordinator for aggregating the statistics on coordinator level. So obviously something was slowing down this PX communication significantly (and the excessive usage of Parallel Execution was required to see this happen).

Also some of the more complex Parallel Execution queries performing many joins and ending up with a significant number of data redistributions ran as if in slow motion, although claiming to spend 100% of their time on CPU - but according to Active Session History almost 90% of that time was spent on the redistribution operations:

SQL statement execution ASH Summary
-----------------------------------------------

              |               |               |
              |PX SEND/RECEIVE|PX SEND/RECEIVE|
PERCENTAGE_CPU|        PERCENT|    CPU PERCENT|
--------------|---------------|---------------|
            98|             86|             87|


Running the same query with the same execution plan on the same data and the same box during idle times showed an almost 20 times better performance, with less than 40% of the time spent on redistribution:

SQL statement execution ASH Summary
-----------------------------------------------

              |               |               |
              |PX SEND/RECEIVE|PX SEND/RECEIVE|
PERCENTAGE_CPU|        PERCENT|    CPU PERCENT|
--------------|---------------|---------------|
            96|             38|             37|

So it looked like those queries ran into some kind of contention that wasn't instrumented in Oracle but happened outside on O/S level, showing up as CPU Kernel time - similar to what could be seen in previous versions of Oracle when spinning on mutexes.

Reducing the excessive usage of Parallel Execution showed a significant reduction in CPU time, but still the high System/Kernel time was rather questionable:



So the big question was - where was that time spent in the kernel to see whether this gives further clues.

Analysis
Running "perf top" on the node during such a run showed this profile:

  PerfTop:  129074 irqs/sec  kernel:76.4%  exact:  0.0% [1000Hz cycles],  (all, 128 CPUs)
-------------------------------------------------------------------------------------------------------------------

             samples  pcnt function                 DSO
             _______ _____ ________________________ ___________________________________________________________

          1889395.00 67.8% __ticket_spin_lock       /usr/lib/debug/lib/modules/2.6.39-400.128.17.el5uek/vmlinux
            27746.00  1.0% ktime_get                /usr/lib/debug/lib/modules/2.6.39-400.128.17.el5uek/vmlinux
            24622.00  0.9% weighted_cpuload         /usr/lib/debug/lib/modules/2.6.39-400.128.17.el5uek/vmlinux
            23169.00  0.8% find_busiest_group       /usr/lib/debug/lib/modules/2.6.39-400.128.17.el5uek/vmlinux
            17243.00  0.6% pfrfd1_init_locals       /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
            16961.00  0.6% sxorchk                  /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
            15434.00  0.6% kafger                   /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
            11531.00  0.4% try_atomic_semop         /usr/lib/debug/lib/modules/2.6.39-400.128.17.el5uek/vmlinux
            11006.00  0.4% __intel_new_memcpy       /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
            10557.00  0.4% kaf_typed_stuff          /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
            10380.00  0.4% idle_cpu                 /usr/lib/debug/lib/modules/2.6.39-400.128.17.el5uek/vmlinux
             9977.00  0.4% kxfqfprFastPackRow       /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
             9070.00  0.3% pfrinstr_INHFA1          /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
             8905.00  0.3% kcbgtcr                  /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
             8757.00  0.3% ktime_get_update_offsets /usr/lib/debug/lib/modules/2.6.39-400.128.17.el5uek/vmlinux
             8641.00  0.3% kgxSharedExamine         /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
             7487.00  0.3% update_queue             /usr/lib/debug/lib/modules/2.6.39-400.128.17.el5uek/vmlinux
             7233.00  0.3% kxhrPack                 /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
             6809.00  0.2% rworofprFastUnpackRow    /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
             6581.00  0.2% ksliwat                  /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
             6242.00  0.2% kdiss_fetch              /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
             6126.00  0.2% audit_filter_syscall     /usr/lib/debug/lib/modules/2.6.39-400.128.17.el5uek/vmlinux
             5860.00  0.2% cpumask_next_and         /usr/lib/debug/lib/modules/2.6.39-400.128.17.el5uek/vmlinux
             5618.00  0.2% kaf4reasrp1km            /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
             5482.00  0.2% kaf4reasrp0km            /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
             5314.00  0.2% kopp2upic                /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
             5129.00  0.2% find_next_bit            /usr/lib/debug/lib/modules/2.6.39-400.128.17.el5uek/vmlinux
             4991.00  0.2% kdstf01001000000km       /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
             4842.00  0.2% ktrgcm                   /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
             4762.00  0.2% evadcd                   /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
             4580.00  0.2% kdiss_mf_sc              /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle

Running "perf" on a number of Parallel Slaves being busy on CPU showed this profile:

     0.36%     ora_xxxx  [kernel.kallsyms]             [k] 

__ticket_spin_lock
               |
               --- __ticket_spin_lock
                  |          
                  |--99.98%-- _raw_spin_lock
                  |          |          
                  |          |--100.00%-- ipc_lock
                  |          |          ipc_lock_check
                  |          |          |          
                  |          |          |--99.83%-- semctl_main
                  |          |          |          sys_semctl
                  |          |          |          system_call
                  |          |          |          __semctl
                  |          |          |          |          
                  |          |          |           --100.00%-- skgpwpost
                  |          |          |                     kslpsprns
                  |          |          |                     kskpthr
                  |          |          |                     ksl_post_proc
                  |          |          |                     kxfprienq
                  |          |          |                     kxfpqrenq
                  |          |          |                     |          
                  |          |          |                     |--98.41%-- kxfqeqb
                  |          |          |                     |          kxfqfprFastPackRow
                  |          |          |                     |          kxfqenq
                  |          |          |                     |          qertqoRop
                  |          |          |                     |          kdstf01001010000100km
                  |          |          |                     |          kdsttgr
                  |          |          |                     |          qertbFetch
                  |          |          |                     |          qergiFetch
                  |          |          |                     |          rwsfcd
                  |          |          |                     |          qertqoFetch
                  |          |          |                     |          qerpxSlaveFetch
                  |          |          |                     |          qerpxFetch
                  |          |          |                     |          opiexe
                  |          |          |                     |          kpoal8

Running "strace" on those Parallel Slaves showed this:

.
.
.
semctl(1347842, 397, SETVAL, 0x1)       = 0
semctl(1347842, 388, SETVAL, 0x1)       = 0
semctl(1347842, 347, SETVAL, 0x1)       = 0
semctl(1347842, 394, SETVAL, 0x1)       = 0
semctl(1347842, 393, SETVAL, 0x1)       = 0
semctl(1347842, 392, SETVAL, 0x1)       = 0
semctl(1347842, 383, SETVAL, 0x1)       = 0
semctl(1347842, 406, SETVAL, 0x1)       = 0
semctl(1347842, 389, SETVAL, 0x1)       = 0
semctl(1347842, 380, SETVAL, 0x1)       = 0
semctl(1347842, 395, SETVAL, 0x1)       = 0
semctl(1347842, 386, SETVAL, 0x1)       = 0
semctl(1347842, 385, SETVAL, 0x1)       = 0
semctl(1347842, 384, SETVAL, 0x1)       = 0
semctl(1347842, 375, SETVAL, 0x1)       = 0
semctl(1347842, 398, SETVAL, 0x1)       = 0
semctl(1347842, 381, SETVAL, 0x1)       = 0
semctl(1347842, 372, SETVAL, 0x1)       = 0
semctl(1347842, 387, SETVAL, 0x1)       = 0
semctl(1347842, 378, SETVAL, 0x1)       = 0
semctl(1347842, 377, SETVAL, 0x1)       = 0
semctl(1347842, 376, SETVAL, 0x1)       = 0
semctl(1347842, 367, SETVAL, 0x1)       = 0
semctl(1347842, 390, SETVAL, 0x1)       = 0
semctl(1347842, 373, SETVAL, 0x1)       = 0
semctl(1347842, 332, SETVAL, 0x1)       = 0
semctl(1347842, 379, SETVAL, 0x1)       = 0
semctl(1347842, 346, SETVAL, 0x1)       = 0
semctl(1347842, 369, SETVAL, 0x1)       = 0
semctl(1347842, 368, SETVAL, 0x1)       = 0
semctl(1347842, 359, SETVAL, 0x1)       = 0
.
.
.

So the conclusion was: a lot of CPU time is spent spinning on the "spin lock" (critical code section), caused by calls to "semctl" (semaphores), which are part of the PX code path and are reached via "ipc_lock" / "_raw_spin_lock". "strace" shows that all of the "semctl" calls use the same semaphore set (first parameter), which explains the contention on that particular semaphore set (and indicates that the locking granule is the semaphore set, not the individual semaphore).

Solution
Based on the "perf" results an Oracle engineer found a suitable - unfortunately unpublished and closed - bug from 2013 for 12.1.0.2 that comes up with three different ways to address the problem:

- Run with "cluster_database" = true: This will take a different code path which simply reduces the number of semaphore calls by two orders of magnitude. We tested this approach and it showed immediate relief on kernel time - that is the explanation why in a RAC environment this specific issue doesn't reproduce.

- Run with different "kernel.sem" settings: The Exadata boxes came with the following predefined semaphore configuration:

kernel.sem = 2048 262144 256 256

"ipcs" showed the following semaphore arrays with this configuration when starting the Oracle instance:

------ Semaphore Arrays --------
key        semid      owner     perms      nsems    
.
.
.
0xd87a8934 12941057   oracle    640        1502     
0xd87a8935 12973826   oracle    640        1502     
0xd87a8936 12006595   oracle    640        1502    

By reducing the number of semaphores per set and increasing the number of sets, like this:

kernel.sem = 100 262144 256 4096

the following "ipcs" output could be seen:

------ Semaphore Arrays --------
key        semid      owner     perms      nsems    
.
.
.
0xd87a8934 13137665   oracle    640        93       
0xd87a8935 13170434   oracle    640        93       
0xd87a8936 13203203   oracle    640        93       
0xd87a8937 13235972   oracle    640        93       
0xd87a8938 13268741   oracle    640        93       
0xd87a8939 13301510   oracle    640        93       
0xd87a893a 13334279   oracle    640        93       
0xd87a893b 13367048   oracle    640        93       
0xd87a893c 13399817   oracle    640        93       
0xd87a893d 13432586   oracle    640        93       
0xd87a893e 13465355   oracle    640        93       
0xd87a893f 13498124   oracle    640        93       
0xd87a8940 13530893   oracle    640        93       
0xd87a8941 13563662   oracle    640        93       
0xd87a8942 13596431   oracle    640        93       
0xd87a8943 13629200   oracle    640        93       
0xd87a8944 13661969   oracle    640        93       
0xd87a8945 13694738   oracle    640        93       
0xd87a8946 13727507   oracle    640        93       
0xd87a8947 13760276   oracle    640        93       
0xd87a8948 13793045   oracle    640        93       
0xd87a8949 13825814   oracle    640        93       
0xd87a894a 13858583   oracle    640        93       
0xd87a894b 13891352   oracle    640        93       
0xd87a894c 13924121   oracle    640        93       
0xd87a894d 13956890   oracle    640        93       
0xd87a894e 13989659   oracle    640        93       
0xd87a894f 14022428   oracle    640        93       
0xd87a8950 14055197   oracle    640        93       
0xd87a8951 14087966   oracle    640        93       
0xd87a8952 14120735   oracle    640        93       
0xd87a8953 14153504   oracle    640        93       
0xd87a8954 14186273   oracle    640        93       
0xd87a8955 14219042   oracle    640        93


So Oracle now allocated a lot more sets with fewer semaphores per set. We tested this configuration instead of using "cluster_database = TRUE" and got the same low kernel CPU times.

- The bug comes up with a third option for fixing this, which has the advantage that the host configuration doesn't need to be changed and the setting can be done per instance: there is an undocumented parameter "_sem_per_sem_id" that defines the upper limit of semaphores to allocate per set. By setting this parameter to a comparable value like 100 or 128 the net result ought to be the same - Oracle allocates more sets with fewer semaphores per set - but we haven't tested this option.
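A sketch of what this third option could look like - explicitly untested here, and since it is an undocumented parameter it should only be touched in agreement with Oracle Support:

-- limit the number of semaphores Oracle allocates per semaphore set (untested sketch)
alter system set "_sem_per_sem_id" = 128 scope = spfile;
-- an instance restart is required so the semaphore sets get re-allocated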

Conclusion
So the bottom line was this: certain usage patterns of the Oracle instance lead to contention on spin locks on Linux O/S level if Oracle runs in single instance mode with the semaphore settings recommended so far, because all semaphore calls then go for the same semaphore set. By having Oracle allocate more semaphore sets the calls were spread over more sets, significantly reducing the contention.

There is probably some internal note available at Oracle indicating that the default semaphore settings recommended for big nodes are not optimal for running single instance mode under certain circumstances, but I don't know whether a definitive, official guide is available yet.

This is the CPU profile of exactly the same test workload as before using the changed "kernel.sem" settings:



Also in the AWR report the unusual PX related wait events went away and performance improved significantly, in particular also for those complex queries mentioned above.

DML Operations On Partitioned Tables Can Restart On Invalidation

Sun, 2016-01-17 13:12
It's probably not that well known that Oracle can actually rollback / re-start the execution of a DML statement should the cursor become invalidated. By rollback / re-start I mean that Oracle performs a statement level rollback (so any modification already performed by that statement until that point gets rolled back), performs another optimization phase of the statement on re-start (due to the invalidation) and begins the execution of the statement from scratch. Note that this can happen multiple times - in fact it's possible to end up in a kind of infinite loop, leading to statements that can run for a very, very long time (I've seen statements on production environments executing for several days although a single execution would only take minutes).

The pre-requisites to meet for this to happen are not that complex or exotic:

- The target table to manipulate needs to be partitioned

- The cursor currently executing gets invalidated, either by running DDL (typically think of partition related operations) or simply by gathering statistics on one of the objects involved in the statement

- The DML statement hasn't yet touched one of the partitions of the target table but attempts to do so after the cursor got invalidated

When the last condition is met, the statement performs a rollback, and since it got invalidated - which is one of the conditions to be met - another optimization phase happens, meaning that it's also possible to get different execution plans for the different execution attempts. When the execution plan is ready the execution begins from scratch.

According to my tests the issue described here applies to both conventional and direct-path inserts, merge statements (insert / update / delete) as well as serial and parallel execution. I haven't explicitly tested UPDATE and DELETE statements, but the assumption is that they are affected, too.

The behaviour is documented in the following note on MOS: "Insert Statement On Partitioned Tables Is RE-Started After Invalidation (Doc ID 1462003.1)" which links to Bug "14102209 : INSERT STATEMENT' IS RESTARTING BY ITSELF AFTER INVALIDATION" where you can also find some more comments on this behaviour. The issue seems to be that Oracle at that point is no longer sure if the partition information compiled into the cursor for the partitioned target table is still correct or not (and internally raises and catches a corresponding error, like "ORA-14403: Cursor invalidation detected after getting DML partition lock", leading to the re-try), so it needs to refresh that information, hence the re-optimization and re-start of the cursor.

Note that this also means the DML statement might already have performed modifications to other partitions when, after being invalidated, it attempts to modify another partition it hasn't touched yet - it just takes an attempt to modify a partition not yet touched by that statement.

It's also kind of nasty that the statement keeps running the potentially lengthy query part after being invalidated, only to find out that it needs to re-start once the first row is attempted to be applied to a target table partition not touched yet.

Note that applications typically run into this problem, when they behave like the following:

- There are longer running DML statements that typically take several seconds / minutes until they actually attempt to perform a modification of the partitioned target table

- They either use DBMS_STATS to gather stats on one of the involved tables, typically using NO_INVALIDATE=>FALSE, which leads to an immediate invalidation of all affected cursors

- And/or they perform partition related operations on one of the tables involved, like truncating, creating or exchanging partitions. It is important to point out that it doesn't matter which object gets DDL / stats applied - the issue is not limited to activity on the partitioned target table being modified; any object involved in the query can cause the cursor invalidation

In principle this is another variation of the general theme "Don't mix concurrent DDL with DML / queries on the same objects". Doing so leads to all kinds of side effects, and the way the Oracle engine is designed means that it doesn't cope very well with it.
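A possible mitigation sketch for the statistics gathering part (an assumption on my side - it doesn't prevent invalidations, it merely avoids making them immediate): using DBMS_STATS.AUTO_INVALIDATE instead of NO_INVALIDATE=>FALSE lets Oracle spread the cursor invalidations over time:

exec dbms_stats.gather_table_stats(null, 't_source', no_invalidate => dbms_stats.auto_invalidate)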

Here is a simple test case for reproducing the issue, using INSERTs in this case here (either via INSERT or MERGE statement):

create table t_target (
  id number(*, 0) not null,
  pkey number(*, 0) not null,
  filler varchar2(500)
)
--segment creation immediate
partition by range (pkey) --interval (1)
(
  partition pkey_0 values less than (1)
, partition pkey_1 values less than (2)
, partition pkey_2 values less than (3)
, partition pkey_3 values less than (4)
);

create table t_source
compress
as
select 1 as id, rpad('x', 100) as filler
from
  (select /*+ cardinality(1e3) */ null from dual connect by level <= 1e3),
  (select /*+ cardinality(1e0) */ null from dual connect by level <= 1e0)
union all
select 1 as id, rpad('y', 100) as filler from dual;

-- Run this again once the DML statement below got started
exec dbms_stats.gather_table_stats(null, 't_source', no_invalidate=>false)

exec dbms_stats.gather_table_stats(null, 't_target', no_invalidate=>false)

----------------------------------------------------------------------------------------------------------------------------------
-- INSERT example --
-- Run above DBMS_STATS calls or any other command that invalidates the cursor during execution to force re-start of the cursor --
----------------------------------------------------------------------------------------------------------------------------------

set echo on timing on time on

-- alter session set tracefile_identifier = 'insert_restart';

-- alter session set events '10046 trace name context forever, level 12';

-- exec sys.dbms_monitor.session_trace_enable(waits => true, binds => true/*, plan_stat => 'all_executions'*/)

insert /* append */ into t_target (id, pkey, filler)
select * from (
select /*+
use_hash(a b)
no_eliminate_oby
*/
a.id, 1 as pkey, a.filler
from t_source a, t_source b
where a.id = b.id
and (
regexp_replace(a.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'c') != regexp_replace(b.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'c')
--or regexp_replace(a.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'i') != regexp_replace(b.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'i')
--or regexp_replace(a.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'm') != regexp_replace(b.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'm')
or (b.filler = rpad('y', 100) and a.filler = rpad('y', 100))
)
order by a.id
)
union all
select * from (
select /*+
use_hash(a b)
no_eliminate_oby
*/
a.id, 2 as pkey, a.filler
from t_source a, t_source b
where a.id = b.id
and (
regexp_replace(a.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'c') != regexp_replace(b.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'c')
--or regexp_replace(a.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'i') != regexp_replace(b.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'i')
--or regexp_replace(a.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'm') != regexp_replace(b.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'm')
or (b.filler = rpad('y', 100) and a.filler = rpad('y', 100))
)
order by a.id
)
union all
select * from (
select /*+
use_hash(a b)
no_eliminate_oby
*/
a.id, 3 as pkey, a.filler
from t_source a, t_source b
where a.id = b.id
and (
regexp_replace(a.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'c') != regexp_replace(b.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'c')
--or regexp_replace(a.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'i') != regexp_replace(b.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'i')
--or regexp_replace(a.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'm') != regexp_replace(b.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'm')
or (b.filler = rpad('y', 100) and a.filler = rpad('y', 100))
)
order by a.id
)
;

-- exec sys.dbms_monitor.session_trace_disable

----------------------------------------------------------------------------------------------------------------------------------
-- MERGE example --
-- Run above DBMS_STATS calls or any other command that invalidates the cursor during execution to force re-start of the cursor --
----------------------------------------------------------------------------------------------------------------------------------

set echo on timing on time on

merge /* append */ into t_target t
using (
select * from (
select /*+
use_hash(a b)
no_eliminate_oby
*/
a.id, 1 as pkey, a.filler
from t_source a, t_source b
where a.id = b.id
and (
regexp_replace(a.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'c') != regexp_replace(b.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'c')
--or regexp_replace(a.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'i') != regexp_replace(b.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'i')
--or regexp_replace(a.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'm') != regexp_replace(b.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'm')
or (b.filler = rpad('y', 100) and a.filler = rpad('y', 100))
)
order by a.id
)
union all
select * from (
select /*+
use_hash(a b)
no_eliminate_oby
*/
a.id, 2 as pkey, a.filler
from t_source a, t_source b
where a.id = b.id
and (
regexp_replace(a.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'c') != regexp_replace(b.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'c')
--or regexp_replace(a.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'i') != regexp_replace(b.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'i')
--or regexp_replace(a.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'm') != regexp_replace(b.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'm')
or (b.filler = rpad('y', 100) and a.filler = rpad('y', 100))
)
order by a.id
)
union all
select * from (
select /*+
use_hash(a b)
no_eliminate_oby
*/
a.id, 3 as pkey, a.filler
from t_source a, t_source b
where a.id = b.id
and (
regexp_replace(a.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'c') != regexp_replace(b.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'c')
--or regexp_replace(a.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'i') != regexp_replace(b.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'i')
--or regexp_replace(a.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'm') != regexp_replace(b.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'm')
or (b.filler = rpad('y', 100) and a.filler = rpad('y', 100))
)
order by a.id
)
) s
on (s.id = t.id)
when not matched then
insert (id, pkey, filler) values (s.id, s.pkey, s.filler)
;
The idea of the test case is to maximise the time until each UNION ALL branch produces data to insert by performing an inefficient HASH JOIN (that in fact generates a Cartesian product and needs to apply a costly REGEXP filter on that huge intermediate result) and by forcing a sort on the join result, so rows will only be handed over to the parent operations once all rows have been processed in the join operation - and each branch generates data for a different partition of the target table. Typically each branch should take several seconds to execute (if you need more time just un-comment the additional REGEXP_REPLACE filters), so you should have plenty of time to cause the invalidation from another session.

This means during the execution of each branch invalidating the cursor (for example by executing either of the two DBMS_STATS calls on the source or target table using NO_INVALIDATE=>FALSE) will lead to a re-start of the statement at the next attempt to write into a new target partition, possibly rolling back rows already inserted into other partitions.

Diagnostics
If you run the provided INSERT or MERGE statement on newer versions of Oracle that include SQL_EXEC_START and SQL_EXEC_ID in V$ACTIVE_SESSION_HISTORY (or V$SESSION for that matter), and invalidate the cursor during execution and before a partition of the target table gets inserted into for the first time, then you can see that these entries change as the statement re-starts.

In such cases the INVALIDATIONS and LOADS increase in V$SQL accordingly and the OBJECT_STATUS changes from INVALID_UNAUTH to VALID again with each re-start attempt. In newer versions where you can configure the "plan_stat" information for SQL trace to "all_executions" you'll find STAT lines for each execution attempt dumped to the trace file, but only a single final EXEC line, where the elapsed time covers all execution attempts.
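
As a minimal sketch of such a check - assuming 11g or later where these columns exist, with the SQL_ID of the running DML as a placeholder to be filled in - the re-starts can be watched from a second session:

-- SQL_EXEC_ID / SQL_EXEC_START change with each re-start attempt
select sid, sql_id, sql_exec_id, sql_exec_start
from v$session
where sql_id = '&sql_id';

-- INVALIDATIONS / LOADS increase and OBJECT_STATUS toggles accordingly
select invalidations, loads, executions, object_status
from v$sql
where sql_id = '&sql_id';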

The oldest version I've tested was 10.2.0.4, and that one already showed the re-start behaviour, although I would be inclined to think that this wasn't the case with older versions. So if anybody still runs versions older than 10.2.0.4 I would be interested to hear whether the behaviour reproduces or not.

Video Tutorial: XPLAN_ASH Active Session History - Part 7

Mon, 2016-01-04 03:00
The next part of the video tutorial explaining the XPLAN_ASH Active Session History functionality continuing the actual walk-through of the script output.

More parts to follow.

12c Parallel Execution New Features: Parallel FILTER Subquery Evaluation - Part 3: The Optimizer And Distribution Methods

Wed, 2015-12-30 13:45
As mentioned in the first and second part of this instalment the different available distribution methods of the new parallel FILTER are selected automatically by the optimizer - in this last post of this series I want to focus on that optimizer behaviour.

It looks like there are two new optimizer related parameters that control the behaviour of the new feature: "_px_filter_parallelized" is the overall switch to enable/disable the new parallel filter capability and defaults to "true" in 12c, while "_px_filter_skew_handling" influences how the optimizer determines the distribution methods. The naming of the latter suggests that it somehow has to do with some kind of "skew" - note that the internal parameter that handles the new automatic join skew handling is called "_px_join_skew_handling", rather similar in name.

But even after playing around with the feature for quite a while I couldn't come up with a good test case where the optimizer chose a different distribution method based on the typical data distribution skew patterns - i.e. where the expression used for the FILTER lookup had some more popular values than others. So I got in touch with Yasin Baskan, product manager for Parallel Execution at Oracle, asking what kind of skew needs to be in place to see a difference in behaviour.

As it turns out "skew" means something different in this context. When the mentioned parameter "_px_filter_skew_handling" is set to "true" (the default value in 12c) the optimizer will choose a different distribution method based on the size of the object driving the filter. According to my tests this effectively means: If the object is so small that only one granule (usually 13 blocks) per PX slave can be assigned, the optimizer will automatically use a HASH distribution; otherwise - if the object is larger than this threshold - no re-distribution will be selected. I wasn't able to come up with an example where the optimizer automatically chooses the other available distribution method, which is RANDOM / ROUND-ROBIN (see the previous post). To demonstrate the point, here is a small example:

create table t2 as select * from dba_objects where rownum <= 90000;

exec dbms_stats.gather_table_stats(null, 't2')

create table t3 as select * from dba_objects where rownum <= 90000;

exec dbms_stats.gather_table_stats(null, 't3')

explain plan for
select /*+ monitor
parallel(4)
--opt_param('_px_filter_skew_handling' 'false')
*/ count(*) from
t3 t
--(select /*+ no_merge */ a.* from t3 a) t
--(select a.* from t3 a, t3 b where a.object_id = b.object_id) t
where exists (select /*+ no_unnest */ 1 from t2 where t.object_id=t2.object_id);

-- Default plan, no redistribution before parallel FILTER
-----------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | TQ |IN-OUT| PQ Distrib |
-----------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 5 | 32M (1)| 00:21:13 | | | |
| 1 | SORT AGGREGATE | | 1 | 5 | | | | | |
| 2 | PX COORDINATOR | | | | | | | | |
| 3 | PX SEND QC (RANDOM) | :TQ10000 | 1 | 5 | | | Q1,00 | P->S | QC (RAND) |
| 4 | SORT AGGREGATE | | 1 | 5 | | | Q1,00 | PCWP | |
|* 5 | FILTER | | | | | | Q1,00 | PCWC | |
| 6 | PX BLOCK ITERATOR | | 90000 | 439K| 114 (0)| 00:00:01 | Q1,00 | PCWC | |
| 7 | TABLE ACCESS FULL| T3 | 90000 | 439K| 114 (0)| 00:00:01 | Q1,00 | PCWP | |
|* 8 | TABLE ACCESS FULL | T2 | 1 | 5 | 412 (1)| 00:00:01 | | | |
-----------------------------------------------------------------------------------------------------------------

exec dbms_stats.set_table_stats(null, 't3', numblks => 52)

-- Setting stats of T3 to 52 (13 * DOP) blocks or smaller - HASH distribution will be used, 53 blocks or greater => no redistribution
-------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | TQ |IN-OUT| PQ Distrib |
-------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 5 | 32M (1)| 00:21:13 | | | |
| 1 | SORT AGGREGATE | | 1 | 5 | | | | | |
| 2 | PX COORDINATOR | | | | | | | | |
| 3 | PX SEND QC (RANDOM) | :TQ10001 | 1 | 5 | | | Q1,01 | P->S | QC (RAND) |
| 4 | SORT AGGREGATE | | 1 | 5 | | | Q1,01 | PCWP | |
|* 5 | FILTER | | | | | | Q1,01 | PCWP | |
| 6 | PX RECEIVE | | 90000 | 439K| 5 (20)| 00:00:01 | Q1,01 | PCWP | |
| 7 | PX SEND HASH | :TQ10000 | 90000 | 439K| 5 (20)| 00:00:01 | Q1,00 | P->P | HASH |
| 8 | PX BLOCK ITERATOR | | 90000 | 439K| 5 (20)| 00:00:01 | Q1,00 | PCWC | |
| 9 | TABLE ACCESS FULL| T3 | 90000 | 439K| 5 (20)| 00:00:01 | Q1,00 | PCWP | |
|* 10 | TABLE ACCESS FULL | T2 | 1 | 5 | 412 (1)| 00:00:01 | | | |
-------------------------------------------------------------------------------------------------------------------
So this example shows that the HASH distribution will be used by the optimizer if the object T3 driving the FILTER operation is 52 blocks or smaller, which corresponds to 13 blocks per PX slave at a degree of 4.
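
A hedged cross-check of this apparent threshold, assuming the granule size stays at 13 blocks: at a DOP of 8 the cut-over point should then be at 13 * 8 = 104 blocks:

exec dbms_stats.set_table_stats(null, 't3', numblks => 104)

explain plan for
select /*+ parallel(8) */ count(*) from
t3 t
where exists (select /*+ no_unnest */ 1 from t2 where t.object_id=t2.object_id);

-- Expectation under the above assumption: HASH distribution at 104 blocks
-- or smaller, no re-distribution at 105 blocks or greater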

Now I find this behaviour pretty odd - usually you wouldn't want to use Parallel Execution on such a small object anyway. But things become even worse: Not only is the "skew" handling based on the object size questionable to me, the behaviour can also become a potential threat if the row source driving the FILTER operator is no longer a plain table but the result of a more complex operation, which can be simply a join or a non-mergeable view:

-- Resetting stats to true size of table - this would mean no redistribution at a DOP of 4, see above
exec dbms_stats.gather_table_stats(null, 't3')

explain plan for
select /*+ monitor
parallel(4)
--opt_param('_px_filter_skew_handling' 'false')
*/ count(*) from
--t3 t
(select /*+ no_merge */ a.* from t3 a) t
--(select a.* from t3 a, t3 b where a.object_id = b.object_id) t
where exists (select /*+ no_unnest */ 1 from t2 where t.object_id=t2.object_id);

-- But simply using a NO_MERGE hint on the select from the simple T3 row source results in an unnecessary HASH re-distribution
--------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | TQ |IN-OUT| PQ Distrib |
--------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 13 | 9755K (1)| 00:06:22 | | | |
| 1 | SORT AGGREGATE | | 1 | 13 | | | | | |
| 2 | PX COORDINATOR | | | | | | | | |
| 3 | PX SEND QC (RANDOM) | :TQ10001 | 1 | 13 | | | Q1,01 | P->S | QC (RAND) |
| 4 | SORT AGGREGATE | | 1 | 13 | | | Q1,01 | PCWP | |
|* 5 | FILTER | | | | | | Q1,01 | PCWP | |
| 6 | PX RECEIVE | | 90000 | 1142K| 114 (0)| 00:00:01 | Q1,01 | PCWP | |
| 7 | PX SEND HASH | :TQ10000 | 90000 | 1142K| 114 (0)| 00:00:01 | Q1,00 | P->P | HASH |
| 8 | VIEW | | 90000 | 1142K| 114 (0)| 00:00:01 | Q1,00 | PCWP | |
| 9 | PX BLOCK ITERATOR | | 90000 | 439K| 114 (0)| 00:00:01 | Q1,00 | PCWC | |
| 10 | TABLE ACCESS FULL| T3 | 90000 | 439K| 114 (0)| 00:00:01 | Q1,00 | PCWP | |
|* 11 | TABLE ACCESS FULL | T2 | 1 | 5 | 114 (0)| 00:00:01 | | | |
--------------------------------------------------------------------------------------------------------------------

explain plan for
select /*+ monitor
parallel(4)
--opt_param('_px_filter_skew_handling' 'false')
*/ count(*) from
--t3 t
--(select /*+ no_merge */ a.* from t3 a) t
(select a.* from t3 a, t3 b where a.object_id = b.object_id) t
where exists (select /*+ no_unnest */ 1 from t2 where t.object_id=t2.object_id);

-- If we use a simple join as driving row source again a HASH re-distribution before the FILTER gets added
-- As a result the dreaded HASH JOIN BUFFERED will be used instead of the plain HASH JOIN
-------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | TQ |IN-OUT| PQ Distrib |
-------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 10 | 32M (1)| 00:21:13 | | | |
| 1 | SORT AGGREGATE | | 1 | 10 | | | | | |
| 2 | PX COORDINATOR | | | | | | | | |
| 3 | PX SEND QC (RANDOM) | :TQ10003 | 1 | 10 | | | Q1,03 | P->S | QC (RAND) |
| 4 | SORT AGGREGATE | | 1 | 10 | | | Q1,03 | PCWP | |
|* 5 | FILTER | | | | | | Q1,03 | PCWP | |
| 6 | PX RECEIVE | | 90000 | 878K| 229 (1)| 00:00:01 | Q1,03 | PCWP | |
| 7 | PX SEND HASH | :TQ10002 | 90000 | 878K| 229 (1)| 00:00:01 | Q1,02 | P->P | HASH |
|* 8 | HASH JOIN BUFFERED | | 90000 | 878K| 229 (1)| 00:00:01 | Q1,02 | PCWP | |
| 9 | PX RECEIVE | | 90000 | 439K| 114 (0)| 00:00:01 | Q1,02 | PCWP | |
| 10 | PX SEND HYBRID HASH | :TQ10000 | 90000 | 439K| 114 (0)| 00:00:01 | Q1,00 | P->P | HYBRID HASH|
| 11 | STATISTICS COLLECTOR | | | | | | Q1,00 | PCWC | |
| 12 | PX BLOCK ITERATOR | | 90000 | 439K| 114 (0)| 00:00:01 | Q1,00 | PCWC | |
| 13 | TABLE ACCESS FULL | T3 | 90000 | 439K| 114 (0)| 00:00:01 | Q1,00 | PCWP | |
| 14 | PX RECEIVE | | 90000 | 439K| 114 (0)| 00:00:01 | Q1,02 | PCWP | |
| 15 | PX SEND HYBRID HASH | :TQ10001 | 90000 | 439K| 114 (0)| 00:00:01 | Q1,01 | P->P | HYBRID HASH|
| 16 | PX BLOCK ITERATOR | | 90000 | 439K| 114 (0)| 00:00:01 | Q1,01 | PCWC | |
| 17 | TABLE ACCESS FULL | T3 | 90000 | 439K| 114 (0)| 00:00:01 | Q1,01 | PCWP | |
|* 18 | TABLE ACCESS FULL | T2 | 1 | 5 | 412 (1)| 00:00:01 | | | |
-------------------------------------------------------------------------------------------------------------------------

explain plan for
select /*+ monitor
parallel(4)
opt_param('_px_filter_skew_handling' 'false')
*/ count(*) from
--t3 t
--(select /*+ no_merge */ a.* from t3 a) t
(select a.* from t3 a, t3 b where a.object_id = b.object_id) t
where exists (select /*+ no_unnest */ 1 from t2 where t.object_id=t2.object_id);

-- Disabling the FILTER skew handling behaviour means no re-distribution before the FILTER, and hence no HASH JOIN BUFFERED
-----------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | TQ |IN-OUT| PQ Distrib |
-----------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 10 | 32M (1)| 00:21:13 | | | |
| 1 | SORT AGGREGATE | | 1 | 10 | | | | | |
| 2 | PX COORDINATOR | | | | | | | | |
| 3 | PX SEND QC (RANDOM) | :TQ10002 | 1 | 10 | | | Q1,02 | P->S | QC (RAND) |
| 4 | SORT AGGREGATE | | 1 | 10 | | | Q1,02 | PCWP | |
|* 5 | FILTER | | | | | | Q1,02 | PCWC | |
|* 6 | HASH JOIN | | 90000 | 878K| 229 (1)| 00:00:01 | Q1,02 | PCWP | |
| 7 | PX RECEIVE | | 90000 | 439K| 114 (0)| 00:00:01 | Q1,02 | PCWP | |
| 8 | PX SEND HYBRID HASH | :TQ10000 | 90000 | 439K| 114 (0)| 00:00:01 | Q1,00 | P->P | HYBRID HASH|
| 9 | STATISTICS COLLECTOR | | | | | | Q1,00 | PCWC | |
| 10 | PX BLOCK ITERATOR | | 90000 | 439K| 114 (0)| 00:00:01 | Q1,00 | PCWC | |
| 11 | TABLE ACCESS FULL | T3 | 90000 | 439K| 114 (0)| 00:00:01 | Q1,00 | PCWP | |
| 12 | PX RECEIVE | | 90000 | 439K| 114 (0)| 00:00:01 | Q1,02 | PCWP | |
| 13 | PX SEND HYBRID HASH | :TQ10001 | 90000 | 439K| 114 (0)| 00:00:01 | Q1,01 | P->P | HYBRID HASH|
| 14 | PX BLOCK ITERATOR | | 90000 | 439K| 114 (0)| 00:00:01 | Q1,01 | PCWC | |
| 15 | TABLE ACCESS FULL | T3 | 90000 | 439K| 114 (0)| 00:00:01 | Q1,01 | PCWP | |
|* 16 | TABLE ACCESS FULL | T2 | 1 | 5 | 412 (1)| 00:00:01 | | | |
-----------------------------------------------------------------------------------------------------------------------
So it looks like if the row source driving the parallel FILTER operator is complex (by complex I mean here: not a simple table) the optimizer will always add a HASH distribution unconditionally before the FILTER. It is obvious that such a re-distribution adds overhead - it requires resources to perform. What is even worse: In general the rule is that the more redistributions there are, the more likely the dreaded buffering will be added to the execution plan, as can be seen from the example above, where the HASH JOIN turns into a HASH JOIN BUFFERED due to the HASH distribution added by default by the optimizer after the join and before the FILTER. By disabling the filter "skew" handling this - in my opinion unnecessary - redistribution doesn't show up, and hence the HASH JOIN without buffering can be used in this example.

Summary
The new parallel FILTER operator comes with different distribution methods available to the optimizer. However, at present the way the optimizer determines automatically if and how to re-distribute the data seems to be questionable to me.

The skew handling is based on the size of the driving object - for very small objects a re-distribution gets added before the FILTER. For row sources driving the filter that are not simple tables the skew handling seems to add a re-distribution unconditionally.

For the reasons outlined, at present I would recommend considering disabling the filter skew handling by setting the parameter "_px_filter_skew_handling" to "false" - of course not without getting the blessing from Oracle Support before doing so. This should minimise the number of re-distributions added to an execution plan. Losing the capability of handling the "skew" caused by very small objects is in my opinion negligible in most cases.
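
For reference, a sketch how such a test could look - session-wide here, or per statement via the OPT_PARAM hint already shown commented out in the examples above:

-- Underscore parameters need to be quoted
alter session set "_px_filter_skew_handling" = false;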

New Version Of XPLAN_ASH Utility

Mon, 2015-12-28 13:35
A new version 4.22 of the XPLAN_ASH utility is available for download.

As usual the latest version can be downloaded here.

This version primarily addresses an issue with 12c - if the HIST mode got used to pull ASH information from AWR in 12c, it turned out that Oracle forgot to add the new DELTA_READ_MEM_BYTES column to DBA_HIST_ACTIVE_SESS_HISTORY - although it got officially added to V$ACTIVE_SESSION_HISTORY in 12c. So now I had to implement several additional if/then/else constructs in the script to handle this inconsistency. It's the first time that the HIST view doesn't seem to reflect all columns of the V$ view - very likely an oversight rather than by design, I assume.

Apart from that the I/O figures (Read Bytes / Write Bytes etc.) in the "Activity Timeline" make more sense for those cases where a process hasn't been sampled for several sample points (see below for more details).

Also, in case an execution plan could not be found, a corresponding message now makes it more obvious that you might be able to pull the execution plan from AWR by using different ASH modes (MIXED / HIST).

Here are the notes from the change log:

- Fixed a funny bug: in 12c they have forgotten to add the DELTA_READ_MEM_BYTES column to DBA_HIST_ACTIVE_SESS_HISTORY, so in HIST mode with 12c prior XPLAN_ASH versions could error out with an invalid column name

- Changed the way the I/O figures are treated in the "Activity Timeline based on ASH". Now the I/O per second is spread over the (previous) samples covered by DELTA_TIME. This should give a smoother representation of the I/O performed, much closer to what you see in Real-Time SQL Monitoring reports. The difference to prior versions is only visible in cases where a session wasn't sampled for quite a while and hence has a DELTA_TIME spanning multiple previous sample points. This also means that the I/O related columns in the "Activity Timeline based on ASH" now show only the PER SECOND values, no longer the totals as in prior versions

- Added a SET NULL "" in the configuration and initialization section for SQL*Plus environments that use a non-default SET NULL setting. This screwed up some internal switches so that XPLAN_ASH for example thought it was running in an S-ASH repository

- Added a note to the end of the output if no execution plan could be found and the script falls back to retrieving plan operation details from ASH. Also added the note to use the MIXED or HIST ASH source option if no execution plan could be found in CURR mode, in case the execution plan has been purged from the Shared Pool in the meantime

- Cloned the "cleanup" section from the end to the beginning of the script to ensure no current SQL*Plus environment settings influence the script execution. This is particularly relevant if the script execution gets cancelled before the final cleanup section is reached or some other, previous scripts left a mess behind

IT Tage 2015 - "Analysing and troubleshooting Parallel Execution" presentation material

Sun, 2015-12-20 12:24
Thanks to all attending my presentation "Analysing and troubleshooting Parallel Execution" at the IT Tage conference 2015 in Frankfurt, Germany. You can download the presentation material here in Powerpoint or PDF format, as well as check the Slideshare upload.

Note that the Powerpoint format adds value in that sense that many of the slides come with additional explanations in the notes section.

If you are interested in more details I recommend visiting this post which links to many other posts describing the different new features in greater depth.

DOAG 2015 - "Oracle 12c Parallel Execution New Features" presentation material

Sun, 2015-11-22 13:16
Thanks to the many attendees that came to my presentation "Oracle 12c Parallel Execution New Features" at the DOAG conference 2015. You can download the presentation material here in Powerpoint or PDF format, as well as check the Slideshare upload.

Note that the Powerpoint format adds value in that sense that many of the slides come with additional explanations in the notes section.

If you are interested in more details I recommend visiting this post which links to many other posts describing the different new features in greater depth.

12c Parallel Execution New Features: Parallel FILTER Subquery Evaluation - Part 2: Distribution Methods

Fri, 2015-10-16 15:31
Picking up from the first part of this instalment I'll focus in this post on the available distribution methods for the new parallel FILTER subquery feature.

In this post I won't go into the details how the optimizer selects the distribution method automatically - this will be covered in the last part.

Here I merely describe the different available methods and how to control them using the new PQ_FILTER hint, which is also mentioned in the official documentation, although I find the description there a bit hard to follow.

There are four different options available to the PQ_FILTER hint, and only two of them actually describe a distribution method. Another one tells the optimizer not to distribute the data at all, and the last one reverts to the former pre-12c plan shape:

SERIAL: This tells the optimizer to use the pre-12c plan shape where the FILTER operator is executed by the Query Coordinator. Depending on the SQL features used and combined, 12c sometimes still reverts to this plan shape and it looks like in those cases you can't force the parallel filter evaluation via the PQ_FILTER hint - the outline then contains a PQ_FILTER hint that is supposed to result in a parallel filter evaluation, but the serial plan shape gets used anyway despite the hint.

NONE: This tells the optimizer not to redistribute the data before executing the FILTER operator driving the subqueries, which means that, in whatever way the data happens to be distributed, the Parallel Execution Servers will execute the filter and the corresponding subqueries for the data they currently process

HASH: Redistribute the data by HASH before running the FILTER operator and the corresponding subqueries.

RANDOM: Redistribute the data by ROUND-ROBIN before running the FILTER operator and the corresponding subqueries. Despite being called RANDOM the distribution method shown in the plan is ROUND-ROBIN rather than RANDOM.

In this post I want to show an example for each of those variants and address the following questions in addition:

1. Where and how does the PQ_FILTER hint have to be specified? How should the hint be used in case of multiple FILTERs with subqueries in the execution plan?

2. In case of a HASH distribution and multiple subqueries, what keys get used as input for the HASH function used for the distribution?

Let's start with an example for each of the hints used to demonstrate the different plan shapes. For that purpose I'll re-use the same setup and query from the initial part of this instalment, with the index set to invisible:

create table t_1
compress
as
select /*+ use_nl(a b) */
rownum as id
, rpad('x', 100) as filler
from
(select /*+ cardinality(1e5) */ * from dual
connect by
level <= 1e5) a, (select /*+ cardinality(20) */ * from dual connect by level <= 20) b
;

exec dbms_stats.gather_table_stats(null, 't_1', method_opt=>'for all columns size 1')

alter table t_1 parallel 4;

create index t_1_idx on t_1 (id) invisible;

explain plan for
select /*+
pq_filter(<DIST_METHOD>)
*/ count(*) from
t_1 t
where exists (select /*+ no_unnest */ null from t_1 where t.id = t_1.id);

-- 12.1.0.2 plan shape with index invisible and PQ_FILTER(SERIAL)
------------------------------------------------------------------------
| Id | Operation | Name | TQ |IN-OUT| PQ Distrib |
------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | |
| 1 | SORT AGGREGATE | | | | |
|* 2 | FILTER | | | | |
| 3 | PX COORDINATOR | | | | |
| 4 | PX SEND QC (RANDOM)| :TQ20000 | Q2,00 | P->S | QC (RAND) |
| 5 | PX BLOCK ITERATOR | | Q2,00 | PCWC | |
| 6 | TABLE ACCESS FULL| T_1 | Q2,00 | PCWP | |
| 7 | PX COORDINATOR | | | | |
| 8 | PX SEND QC (RANDOM)| :TQ10000 | Q1,00 | P->S | QC (RAND) |
| 9 | PX BLOCK ITERATOR | | Q1,00 | PCWC | |
|* 10 | TABLE ACCESS FULL| T_1 | Q1,00 | PCWP | |
------------------------------------------------------------------------

-- 12.1.0.2 plan shape with index invisible and PQ_FILTER(NONE)
-------------------------------------------------------------------------
| Id | Operation | Name | TQ |IN-OUT| PQ Distrib |
-------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | |
| 1 | SORT AGGREGATE | | | | |
| 2 | PX COORDINATOR | | | | |
| 3 | PX SEND QC (RANDOM) | :TQ10000 | Q1,00 | P->S | QC (RAND) |
| 4 | SORT AGGREGATE | | Q1,00 | PCWP | |
|* 5 | FILTER | | Q1,00 | PCWC | |
| 6 | PX BLOCK ITERATOR | | Q1,00 | PCWC | |
| 7 | TABLE ACCESS FULL| T_1 | Q1,00 | PCWP | |
|* 8 | TABLE ACCESS FULL | T_1 | | | |
-------------------------------------------------------------------------

-- 12.1.0.2 plan shape with index invisible and PQ_FILTER(RANDOM)
---------------------------------------------------------------------------
| Id | Operation | Name | TQ |IN-OUT| PQ Distrib |
---------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | |
| 1 | SORT AGGREGATE | | | | |
| 2 | PX COORDINATOR | | | | |
| 3 | PX SEND QC (RANDOM) | :TQ10001 | Q1,01 | P->S | QC (RAND) |
| 4 | SORT AGGREGATE | | Q1,01 | PCWP | |
|* 5 | FILTER | | Q1,01 | PCWP | |
| 6 | PX RECEIVE | | Q1,01 | PCWP | |
| 7 | PX SEND ROUND-ROBIN| :TQ10000 | Q1,00 | P->P | RND-ROBIN |
| 8 | PX BLOCK ITERATOR | | Q1,00 | PCWC | |
| 9 | TABLE ACCESS FULL| T_1 | Q1,00 | PCWP | |
|* 10 | TABLE ACCESS FULL | T_1 | | | |
---------------------------------------------------------------------------

-- 12.1.0.2 plan shape with index invisible and PQ_FILTER(HASH)
---------------------------------------------------------------------------
| Id | Operation | Name | TQ |IN-OUT| PQ Distrib |
---------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | |
| 1 | SORT AGGREGATE | | | | |
| 2 | PX COORDINATOR | | | | |
| 3 | PX SEND QC (RANDOM) | :TQ10001 | Q1,01 | P->S | QC (RAND) |
| 4 | SORT AGGREGATE | | Q1,01 | PCWP | |
|* 5 | FILTER | | Q1,01 | PCWP | |
| 6 | PX RECEIVE | | Q1,01 | PCWP | |
| 7 | PX SEND HASH | :TQ10000 | Q1,00 | P->P | HASH |
| 8 | PX BLOCK ITERATOR | | Q1,00 | PCWC | |
| 9 | TABLE ACCESS FULL| T_1 | Q1,00 | PCWP | |
|* 10 | TABLE ACCESS FULL | T_1 | | | |
---------------------------------------------------------------------------
So we can see with PQ_FILTER(SERIAL) we get the former, pre-12c plan shape, which in this case results in a parallel full table scan in the filter subquery and a decomposition into multiple DFO trees, both things that should be avoided in general.

I've omitted the cost estimates, but there seems to be an odd inconsistency: The new parallel FILTER evaluation results in a higher cost estimate than the old serial one, simply due to the fact that the full table scan cost is lowered by the chunked parallel table scan (PX BLOCK ITERATOR) in the old plan shape, whereas the complete full table scan performed by each Parallel Execution Server is treated as a serial full table scan cost-wise. Nevertheless, by default the plan with the higher cost is selected by the optimizer in 12c.
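
If you want to see both cost estimates side by side, a simple sketch is to generate the pre-12c plan shape for comparison via the OPTIMIZER_FEATURES_ENABLE hint:

explain plan for
select /*+ optimizer_features_enable('11.2.0.4') */ count(*) from
t_1 t
where exists (select /*+ no_unnest */ null from t_1 where t.id = t_1.id);

-- Compare the Cost column to the default 12c (parallel FILTER) plan
select * from table(dbms_xplan.display);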

Let's address the first question, how to specify the hint: As we can see from the simple example here, when using the format of the hint mentioned in the official documentation it should be put into the main (driving) query, not into the filter subqueries - which makes sense, since there could be multiple subqueries but only one FILTER operation in the main query driving them.

Let's have a look at a slightly more complicated query:

explain plan for
select /*+
qb_name(main)
-- When merging the two query blocks there will be a single FILTER in the main query block
--pq_filter(@main hash)
-- Global form of the PQ_FILTER hint
no_merge(@qb1)
no_merge(@qb2)
pq_filter(@qb1 random)
pq_filter(@qb2 hash)
*/
*
from
(
select /*+
qb_name(qb1)
-- Alternatively use local form of the hints
--pq_filter(random)
--no_merge
*/ *
from
t_1 t
where exists (select /*+ no_unnest */ null from t_1 where t.id = t_1.id)
) a,
(
select /*+
qb_name(qb2)
-- Alternatively use local form of the hints
--pq_filter(hash)
--no_merge
*/ *
from t_1 t
where exists (select /*+ no_unnest */ null from t_1 where t.id = t_1.id)
) b
where b.id = a.id
;

-- Plan shape with the hints as specified
-------------------------------------------------------------------------------
| Id | Operation | Name | TQ |IN-OUT| PQ Distrib |
-------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | |
| 1 | PX COORDINATOR | | | | |
| 2 | PX SEND QC (RANDOM) | :TQ10004 | Q1,04 | P->S | QC (RAND) |
|* 3 | HASH JOIN BUFFERED | | Q1,04 | PCWP | |
| 4 | PX RECEIVE | | Q1,04 | PCWP | |
| 5 | PX SEND HYBRID HASH | :TQ10002 | Q1,02 | P->P | HYBRID HASH|
| 6 | STATISTICS COLLECTOR | | Q1,02 | PCWC | |
| 7 | BUFFER SORT | | Q1,02 | PCWP | |
| 8 | VIEW | | Q1,02 | PCWP | |
|* 9 | FILTER | | Q1,02 | PCWP | |
| 10 | PX RECEIVE | | Q1,02 | PCWP | |
| 11 | PX SEND ROUND-ROBIN| :TQ10000 | Q1,00 | P->P | RND-ROBIN |
| 12 | PX BLOCK ITERATOR | | Q1,00 | PCWC | |
| 13 | TABLE ACCESS FULL| T_1 | Q1,00 | PCWP | |
|* 14 | TABLE ACCESS FULL | T_1 | | | |
| 15 | PX RECEIVE | | Q1,04 | PCWP | |
| 16 | PX SEND HYBRID HASH | :TQ10003 | Q1,03 | P->P | HYBRID HASH|
| 17 | BUFFER SORT | | Q1,03 | PCWP | |
| 18 | VIEW | | Q1,03 | PCWP | |
|* 19 | FILTER | | Q1,03 | PCWP | |
| 20 | PX RECEIVE | | Q1,03 | PCWP | |
| 21 | PX SEND HASH | :TQ10001 | Q1,01 | P->P | HASH |
| 22 | PX BLOCK ITERATOR | | Q1,01 | PCWC | |
| 23 | TABLE ACCESS FULL | T_1 | Q1,01 | PCWP | |
|* 24 | TABLE ACCESS FULL | T_1 | | | |
-------------------------------------------------------------------------------
So in this example here we can see that multiple FILTER operations with subqueries are possible - in principle one per query block. Here I've used the global form of the hint, which includes a query block name to assign the hint to the corresponding FILTER operation. The same plan shape could be achieved by using the local hints within the query blocks. If no NO_MERGE hint gets used, both query blocks will be merged and there will be only a single FILTER operation - which then can be influenced by a corresponding PQ_FILTER hint in the main/outer query block.

Regarding the second question, what keys get used as input for the HASH distribution, we can check the corresponding "Projection information" section of the DBMS_XPLAN output, and we can see there that it is pretty straightforward and as expected: All columns / expressions used in the filter subqueries for correlation will be used as input. Using the variation of above query with merged query blocks and HASH distribution of the single FILTER with two subqueries results in the following:

explain plan for
select /*+
qb_name(main)
pq_filter(@main hash)
--no_merge(@qb1)
--no_merge(@qb2)
--pq_filter(@qb1 random)
--pq_filter(@qb2 hash)
*/
*
from
(
select /*+
qb_name(qb1)
--pq_filter(random)
--no_merge
*/ *
from
t_1 t
where exists (select /*+ no_unnest */ null from t_1 where t.id = t_1.id)
) a,
(
select /*+
qb_name(qb2)
--pq_filter(hash)
--no_merge
*/ *
from t_1 t
where exists (select /*+ no_unnest */ null from t_1 where t.id = t_1.id)
) b
where b.id = a.id
;

--------------------------------------------------------------------------------
| Id | Operation | Name | TQ |IN-OUT| PQ Distrib |
--------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | |
| 1 | PX COORDINATOR | | | | |
| 2 | PX SEND QC (RANDOM) | :TQ10003 | Q1,03 | P->S | QC (RAND) |
| 3 | BUFFER SORT | | Q1,03 | PCWP | |
|* 4 | FILTER | | Q1,03 | PCWP | |
| 5 | PX RECEIVE | | Q1,03 | PCWP | |
| 6 | PX SEND HASH | :TQ10002 | Q1,02 | P->P | HASH |
|* 7 | HASH JOIN BUFFERED | | Q1,02 | PCWP | |
| 8 | PX RECEIVE | | Q1,02 | PCWP | |
| 9 | PX SEND HYBRID HASH | :TQ10000 | Q1,00 | P->P | HYBRID HASH|
| 10 | STATISTICS COLLECTOR | | Q1,00 | PCWC | |
| 11 | PX BLOCK ITERATOR | | Q1,00 | PCWC | |
| 12 | TABLE ACCESS FULL | T_1 | Q1,00 | PCWP | |
| 13 | PX RECEIVE | | Q1,02 | PCWP | |
| 14 | PX SEND HYBRID HASH | :TQ10001 | Q1,01 | P->P | HYBRID HASH|
| 15 | PX BLOCK ITERATOR | | Q1,01 | PCWC | |
| 16 | TABLE ACCESS FULL | T_1 | Q1,01 | PCWP | |
|* 17 | TABLE ACCESS FULL | T_1 | | | |
|* 18 | TABLE ACCESS FULL | T_1 | | | |
--------------------------------------------------------------------------------

Column Projection Information (identified by operation id):
-----------------------------------------------------------
...
6 - (#keys=2) "T"."ID"[NUMBER,22], "T"."ID"[NUMBER,22], "T"."FILLER"[VARCHAR2,100], "T"."FILLER"[VARCHAR2,100]
...
So both T.IDs from each subquery are used as keys "(#keys=2)" for the hash function.
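
For reference, the "Projection Information" section isn't part of the default DBMS_XPLAN output - it can be requested explicitly via the PROJECTION format modifier:

select * from table(dbms_xplan.display(format => 'TYPICAL +PROJECTION'));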

In the last part of this instalment I'll focus on how the optimizer chooses the distribution method for the parallel filter operation.

Parallel Projection

Sun, 2015-08-16 09:09
A recent case at a client reminded me of something that isn't really new but not so well known - Oracle by default performs evaluation at the latest possible point in the execution plan.

So if you happen to have expressions in the projection of a simple SQL statement that runs parallel it might be counter-intuitive that by default Oracle won't evaluate the projection in the Parallel Slaves but in the Query Coordinator - even if it was technically possible - because the latest possible point is the SELECT operation with the ID = 0 of the plan, which is always performed by the Query Coordinator.

Of course, if you make use of expressions that can't be evaluated in parallel or aren't implemented for parallel evaluation, then there is no other choice than doing this in the Query Coordinator.
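
A common example of the latter is a user-defined PL/SQL function: declaring it PARALLEL_ENABLE marks it as safe for evaluation in the Parallel Slaves. A minimal sketch (F_EXPENSIVE is just a made-up name):

create or replace function f_expensive(p_in in varchar2)
return varchar2
parallel_enable -- marks the function as safe for parallel evaluation
is
begin
  return upper(p_in);
end;
/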

The specific case in question was a generic export functionality that allowed exporting report results to some CSV or Excel like format, and some of these reports had a lot of rows and complex - in that case CPU intensive - expressions in their projection clause.

When looking at the run time profile of such an export query it became obvious that although it was a (very simple) parallel plan, all of the time was spent in the Query Coordinator, effectively turning this at runtime into a serial execution.

This effect can be reproduced very easily:

create table t_1
compress
as
select /*+ use_nl(a b) */
rownum as id
, rpad('x', 100) as filler
from
(select /*+ cardinality(1e5) */ * from dual
connect by
level <= 1e5) a, (select /*+ cardinality(20) */ * from dual connect by level <= 20) b
;

exec dbms_stats.gather_table_stats(null, 't_1', method_opt=>'for all columns size 1')

alter table t_1 parallel cache;

-- Run some CPU intensive expressions in the projection
-- of a simple parallel Full Table Scan
set echo on timing on time on

set autotrace traceonly statistics

set arraysize 500

select
regexp_replace(filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'c') as some_cpu_intensive_exp1
, regexp_replace(filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'i') as some_cpu_intensive_exp2
, regexp_replace(filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'm') as some_cpu_intensive_exp3
from t_1
;

-- The plan is clearly parallel
--------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | TQ |IN-OUT| PQ Distrib |
--------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 2000K| 192M| 221 (1)| 00:00:01 | | | |
| 1 | PX COORDINATOR | | | | | | | | |
| 2 | PX SEND QC (RANDOM)| :TQ10000 | 2000K| 192M| 221 (1)| 00:00:01 | Q1,00 | P->S | QC (RAND) |
| 3 | PX BLOCK ITERATOR | | 2000K| 192M| 221 (1)| 00:00:01 | Q1,00 | PCWC | |
| 4 | TABLE ACCESS FULL| T_1 | 2000K| 192M| 221 (1)| 00:00:01 | Q1,00 | PCWP | |
--------------------------------------------------------------------------------------------------------------

-- But the runtime profile looks more serial
-- although the Parallel Slaves get used to run the Full Table Scan
-- All time spent in the operation ID = 0
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id | Pid | Operation | Name | Execs | A-Rows| ReadB | ReadReq | Start | Dur(T)| Dur(A)| Time Active Graph | Parallel Distribution ASH | Parallel Execution Skew ASH | Activity Graph ASH | Top 5 Activity ASH |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| 0 | | SELECT STATEMENT | | 5 | 2000K | | | 3 | 136 | 120 | #################### | 1:sqlplus.exe(120)[2000K],P008(0)[0],P009(0)[0],P00A(0)[0],P00B(0)[0] | ################################ | @@@@@@@@@@@@@@@@@@@ ( 98%) | ON CPU(120) |
| 1 | 0 | PX COORDINATOR | | 5 | 2000K | | | 119 | 1 | 1 | # | 1:sqlplus.exe(1)[2000K],P008(0)[0],P009(0)[0],P00A(0)[0],P00B(0)[0] | | ( .8%) | ON CPU(1) |
| 2 | 1 | PX SEND QC (RANDOM)| :TQ10000 | 4 | 2000K | | | 66 | 11 | 2 | ## | 2:P00B(1)[508K],P00A(1)[490K],P008(0)[505K],P009(0)[497K],sqlplus.exe(0)[0] | | (1.6%) | PX qref latch(2) |
| 3 | 2 | PX BLOCK ITERATOR | | 4 | 2000K | | | | | | | 0:P00B(0)[508K],P008(0)[505K],P009(0)[497K],P00A(0)[490K],sqlplus.exe(0)[0] | | | |
|* 4 | 3 | TABLE ACCESS FULL| T_1 | 52 | 2000K | 23M | 74 | | | | | 0:P00B(0)[508K],P008(0)[505K],P009(0)[497K],P00A(0)[490K],sqlplus.exe(0)[0] | | | |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Fortunately there is a simple and straightforward way to make use of the Parallel Slaves for the evaluation of projection expressions that can be evaluated in parallel - simply add a suitable NO_MERGE hint for the query block whose projection you want to be evaluated in the Parallel Slaves.

If you don't want to have side effects on the overall plan shape by not merging views, you could always wrap the original query in an outer SELECT and not merge the now inner query block. There seems to be a rule that the projection of a view always gets evaluated at the VIEW operator, and if we check the execution plan we can see that the VIEW operator is marked parallel:

set echo on timing on time on

set autotrace traceonly statistics

set arraysize 500

select /*+ no_merge(x) */ * from (
select
regexp_replace(filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'c') as some_cpu_intensive_exp1
, regexp_replace(filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'i') as some_cpu_intensive_exp2
, regexp_replace(filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'm') as some_cpu_intensive_exp3
from t_1
) x
;

-- View operator is marked parallel
-- This is were the projection clause of the VIEW will be evaluated
---------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | TQ |IN-OUT| PQ Distrib |
---------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 2000K| 11G| 221 (1)| 00:00:01 | | | |
| 1 | PX COORDINATOR | | | | | | | | |
| 2 | PX SEND QC (RANDOM) | :TQ10000 | 2000K| 11G| 221 (1)| 00:00:01 | Q1,00 | P->S | QC (RAND) |
| 3 | VIEW | | 2000K| 11G| 221 (1)| 00:00:01 | Q1,00 | PCWP | |
| 4 | PX BLOCK ITERATOR | | 2000K| 192M| 221 (1)| 00:00:01 | Q1,00 | PCWC | |
| 5 | TABLE ACCESS FULL| T_1 | 2000K| 192M| 221 (1)| 00:00:01 | Q1,00 | PCWP | |
---------------------------------------------------------------------------------------------------------------

-- Runtime profile now shows effective usage of Parallel Slaves
-- for doing the CPU intensive work
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id | Pid | Operation | Name | Execs | A-Rows| Start | Dur(T)| Dur(A)| Time Active Graph | Parallel Distribution ASH | Parallel Execution Skew ASH| Activity Graph ASH | Top 5 Activity ASH |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| 0 | | SELECT STATEMENT | | 5 | 2000K | | | | | 0:sqlplus.exe(0)[2000K],P000(0)[0],P001(0)[0],P002(0)[0],P003(0)[0] | | | |
| 1 | 0 | PX COORDINATOR | | 5 | 2000K | 17 | 63 | 10 | # ## # #### | 1:sqlplus.exe(10)[2000K],P000(0)[0],P001(0)[0],P002(0)[0],P003(0)[0] | #### | * (5.6%) | resmgr:cpu quantum(10) |
| 2 | 1 | PX SEND QC (RANDOM) | :TQ10000 | 4 | 2000K | 5 | 61 | 10 | ## # ## ## ## # | 3:P002(5)[544K],P001(4)[487K],P000(1)[535K],P003(0)[434K],sqlplus.exe(0)[0] | # | (5.6%) | ON CPU(7),resmgr:cpu quantum(3) |
| 3 | 2 | VIEW | | 4 | 2000K | 2 | 82 | 69 | #################### | 4:P003(42)[434K],P001(35)[487K],P000(26)[535K],P002(22)[544K],sqlplus.exe(0)[0] | ############ | @@@@@@@@@@@@@@@@@@@ ( 70%) | ON CPU(125) |
| 4 | 3 | PX BLOCK ITERATOR | | 4 | 2000K | | | | | 0:P002(0)[544K],P000(0)[535K],P001(0)[487K],P003(0)[434K],sqlplus.exe(0)[0] | | | |
|* 5 | 4 | TABLE ACCESS FULL| T_1 | 52 | 2000K | 3 | 78 | 29 | ###### ####### # ### | 4:P000(11)[535K],P002(8)[544K],P001(8)[487K],P003(7)[434K],sqlplus.exe(0)[0] | ### | ***** ( 19%) | resmgr:cpu quantum(30),ON CPU(4) |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
At runtime the duration of the query now gets reduced significantly and we can see the Parallel Slaves getting used when the VIEW operator gets evaluated. Although the overall CPU time used is similar to the previous example, the duration of the query execution is less, since this CPU time is now spent in parallel in the slaves instead of in the Query Coordinator.

Summary
By default Oracle performs evaluation at the latest possible point of the execution plan. Sometimes you can improve runtime by actively influencing when the projection will be evaluated by preventing view merging and introducing a VIEW operator that will be used to evaluate the projection clause.

The optimizer so far doesn't seem to incorporate such possibilities in its evaluation of possible plan shapes, so this is something you need to do manually up to and including Oracle 12c (version 12.1.0.2 as of the time of writing).

12c Parallel Execution New Features: Parallel FILTER Subquery Evaluation - Part 1: Introduction

Sun, 2015-07-26 12:11
12c introduces another interesting new Parallel Execution feature - the parallel evaluation of FILTER subqueries. In pre-12c FILTER subqueries always had to be evaluated in the Query Coordinator. This had several consequences, in particular the data driving the FILTER subquery always had to flow through the Query Coordinator, and hence represented a forced serial execution part of a parallel execution plan. This limitation also meant that depending on the overall plan shape the parallel plan was possibly decomposed into multiple DFO trees, leading to other side effects I've outlined in some of my other publications already.

In 12c now the FILTER subquery can be evaluated in the Parallel Slaves, and the driving data no longer needs to be processed in the Query Coordinator. However, the resulting plan shape can be a little bit confusing. Let's have a look at a simple example:

create table t_1
compress
as
select /*+ use_nl(a b) */
rownum as id
, rpad('x', 100) as filler
from
(select /*+ cardinality(1e5) */ * from dual
connect by
level <= 1e5) a, (select /*+ cardinality(20) */ * from dual connect by level <= 20) b
;

exec dbms_stats.gather_table_stats(null, 't_1', method_opt=>'for all columns size 1')

alter table t_1 parallel 4;

create index t_1_idx on t_1 (id) invisible;

explain plan for
select /*+
--optimizer_features_enable('11.2.0.4')
*/ count(*) from
t_1 t
where exists (select /*+ no_unnest */ null from t_1 where t.id = t_1.id);

-- 11.2.0.4 plan shape with index invisible
----------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | TQ |IN-OUT| PQ Distrib |
----------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 6 | 440M (2)| 04:47:04 | | | |
| 1 | SORT AGGREGATE | | 1 | 6 | | | | | |
|* 2 | FILTER | | | | | | | | |
| 3 | PX COORDINATOR | | | | | | | | |
| 4 | PX SEND QC (RANDOM)| :TQ20000 | 2000K| 11M| 221 (1)| 00:00:01 | Q2,00 | P->S | QC (RAND) |
| 5 | PX BLOCK ITERATOR | | 2000K| 11M| 221 (1)| 00:00:01 | Q2,00 | PCWC | |
| 6 | TABLE ACCESS FULL| T_1 | 2000K| 11M| 221 (1)| 00:00:01 | Q2,00 | PCWP | |
| 7 | PX COORDINATOR | | | | | | | | |
| 8 | PX SEND QC (RANDOM)| :TQ10000 | 1 | 6 | 222 (2)| 00:00:01 | Q1,00 | P->S | QC (RAND) |
| 9 | PX BLOCK ITERATOR | | 1 | 6 | 222 (2)| 00:00:01 | Q1,00 | PCWC | |
|* 10 | TABLE ACCESS FULL| T_1 | 1 | 6 | 222 (2)| 00:00:01 | Q1,00 | PCWP | |
----------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - filter( EXISTS (SELECT /*+ NO_UNNEST */ 0 FROM "T_1" "T_1" WHERE "T_1"."ID"=:B1))
10 - filter("T_1"."ID"=:B1)

-- 12.1.0.2 plan shape with index invisible
-----------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | TQ |IN-OUT| PQ Distrib |
-----------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 6 | 1588M (2)| 17:14:09 | | | |
| 1 | SORT AGGREGATE | | 1 | 6 | | | | | |
| 2 | PX COORDINATOR | | | | | | | | |
| 3 | PX SEND QC (RANDOM) | :TQ10000 | 1 | 6 | | | Q1,00 | P->S | QC (RAND) |
| 4 | SORT AGGREGATE | | 1 | 6 | | | Q1,00 | PCWP | |
|* 5 | FILTER | | | | | | Q1,00 | PCWC | |
| 6 | PX BLOCK ITERATOR | | 2000K| 11M| 221 (1)| 00:00:01 | Q1,00 | PCWC | |
| 7 | TABLE ACCESS FULL| T_1 | 2000K| 11M| 221 (1)| 00:00:01 | Q1,00 | PCWP | |
|* 8 | TABLE ACCESS FULL | T_1 | 1 | 6 | 798 (2)| 00:00:01 | | | |
-----------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

5 - filter( EXISTS (SELECT /*+ NO_UNNEST */ 0 FROM "T_1" "T_1" WHERE "T_1"."ID"=:B1))
8 - filter("T_1"."ID"=:B1)

-- 11.2.0.4 plan shape with index visible
----------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | TQ |IN-OUT| PQ Distrib |
----------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 6 | 5973K (1)| 00:03:54 | | | |
| 1 | SORT AGGREGATE | | 1 | 6 | | | | | |
|* 2 | FILTER | | | | | | | | |
| 3 | PX COORDINATOR | | | | | | | | |
| 4 | PX SEND QC (RANDOM)| :TQ10000 | 2000K| 11M| 221 (1)| 00:00:01 | Q1,00 | P->S | QC (RAND) |
| 5 | PX BLOCK ITERATOR | | 2000K| 11M| 221 (1)| 00:00:01 | Q1,00 | PCWC | |
| 6 | TABLE ACCESS FULL| T_1 | 2000K| 11M| 221 (1)| 00:00:01 | Q1,00 | PCWP | |
|* 7 | INDEX RANGE SCAN | T_1_IDX | 1 | 6 | 3 (0)| 00:00:01 | | | |
----------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - filter( EXISTS (SELECT /*+ NO_UNNEST */ 0 FROM "T_1" "T_1" WHERE "T_1"."ID"=:B1))
7 - access("T_1"."ID"=:B1)

-- 12.1.0.2 plan shape with index visible
-----------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | TQ |IN-OUT| PQ Distrib |
-----------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 6 | 5973K (1)| 00:03:54 | | | |
| 1 | SORT AGGREGATE | | 1 | 6 | | | | | |
| 2 | PX COORDINATOR | | | | | | | | |
| 3 | PX SEND QC (RANDOM) | :TQ10000 | 1 | 6 | | | Q1,00 | P->S | QC (RAND) |
| 4 | SORT AGGREGATE | | 1 | 6 | | | Q1,00 | PCWP | |
|* 5 | FILTER | | | | | | Q1,00 | PCWC | |
| 6 | PX BLOCK ITERATOR | | 2000K| 11M| 221 (1)| 00:00:01 | Q1,00 | PCWC | |
| 7 | TABLE ACCESS FULL| T_1 | 2000K| 11M| 221 (1)| 00:00:01 | Q1,00 | PCWP | |
|* 8 | INDEX RANGE SCAN | T_1_IDX | 1 | 6 | 3 (0)| 00:00:01 | | | |
-----------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

5 - filter( EXISTS (SELECT /*+ NO_UNNEST */ 0 FROM "T_1" "T_1" WHERE "T_1"."ID"=:B1))
8 - access("T_1"."ID"=:B1)

I've included two variations of the setup, one without an available index for evaluating the FILTER subquery and one with an index.

The pre-12c plan shape without index makes the former limitation particularly obvious: The FILTER operator is above the PX COORDINATOR and marked serial, and the table scan in the FILTER subquery gets parallelized as a separate DFO tree (indicated among others by the two PX COORDINATOR operators), which means that each time this separate DFO tree starts, a separate set of Parallel Slaves will be allocated/deallocated, possibly adding a lot of overhead to an execution plan that is probably already inefficient anyway - assuming the FILTER subquery needs to be evaluated many times.

In 12c the FILTER operator is marked parallel and the need for a separate DFO tree is gone. What might be confusing with this plan shape is that the operations of the FILTER subquery are not marked parallel. In my opinion this is misleading and they should actually be marked parallel, because at runtime the operations will be performed by the Parallel Slaves, and in case of a Full Table Scan each slave will run the entire full table scan (so no PX BLOCK ITERATOR for dividing the scan into chunks / granules), which is comparable to what happens when a parallel Nested Loop join runs or the new PQ_REPLICATE feature gets used - and in those cases the operations are marked parallel:

-- 11.2.0.4 / 12.1.0.2 plan shape with index invisible
-- and subquery unnested using NL SEMI join
-----------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | TQ |IN-OUT| PQ Distrib |
-----------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 12 | 442M (2)| 04:48:03 | | | |
| 1 | SORT AGGREGATE | | 1 | 12 | | | | | |
| 2 | PX COORDINATOR | | | | | | | | |
| 3 | PX SEND QC (RANDOM) | :TQ10000 | 1 | 12 | | | Q1,00 | P->S | QC (RAND) |
| 4 | SORT AGGREGATE | | 1 | 12 | | | Q1,00 | PCWP | |
| 5 | NESTED LOOPS SEMI | | 2000K| 22M| 442M (2)| 04:48:03 | Q1,00 | PCWP | |
| 6 | PX BLOCK ITERATOR | | | | | | Q1,00 | PCWC | |
| 7 | TABLE ACCESS FULL| T_1 | 2000K| 11M| 221 (1)| 00:00:01 | Q1,00 | PCWP | |
|* 8 | TABLE ACCESS FULL | T_1 | 2000K| 11M| 796 (2)| 00:00:01 | Q1,00 | PCWP | |
-----------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

8 - filter("T"."ID"="T_1"."ID")

Summary
So the operators of the FILTER subquery can now run in the slaves, which is the main point of this feature, even though this is represented in a confusing way in the execution plan. Avoiding the potential decomposition into multiple DFO trees is another welcome side effect. Decreased query duration becomes possible if the evaluation of the FILTER subquery requires significant time and can now run in the Parallel Slaves instead of serially through the Query Coordinator.

Note that depending on the plan shape and SQL features used, it's still possible for 12c to revert to the old serial FILTER subquery evaluation plan shape, so the new feature isn't always used.

There is more to say about this feature. In the next part of this instalment I'll focus on the different distribution methods possible with the new parallel FILTER operator - there is a new PQ_FILTER hint that allows controlling the distribution, but there are also some interesting points to make about how the optimizer seems to choose the distribution method automatically. In the examples shown here there is no separate distribution for the FILTER, by the way, but this can look different, as I'll show in the next part.
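
As a teaser for that next part, here is a sketch of the hint usage - the accepted arguments in 12c are SERIAL, NONE, HASH and RANDOM - based on the reconstructed statement from above:

-- force the pre-12c behaviour of evaluating the FILTER subquery serially
-- in the Query Coordinator (other arguments: NONE, HASH, RANDOM)
select /*+ parallel(t 4) pq_filter(serial) */
        count(*)
from
        t_1 t
where
        exists (select /*+ no_unnest */ 0 from t_1 where t_1.id = t.id)
;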

Video Tutorial: XPLAN_ASH Active Session History - Part 6

Sun, 2015-06-28 16:30
The next part of the video tutorial explaining the XPLAN_ASH Active Session History functionality continuing the actual walk-through of the script output.

More parts to follow.

12c Parallel Execution New Features: 1 SLAVE distribution

Mon, 2015-06-22 15:39
When certain SQL features that force non-parallel evaluation get used in pre-12c versions, like ROWNUM or certain Analytic Functions like LAG/LEAD, then - depending on the overall plan shape - Oracle can start to decompose the parallel execution plan into several so-called DFO trees (if you want to learn more about DFO trees and DFOs I recommend watching my online tutorial on my YouTube channel).

Having multiple DFO trees in a single parallel execution plan comes with several side effects that are confusing and complicate matters unnecessarily: each DFO tree allocates its own PX slave set(s), so each one can potentially end up with a different DOP, which means you can have more than one DOP in a single parallel execution plan.

Depending on the overall plan shape this might also mean that a DFO tree gets started multiple times, and each time it is started / completed PX slaves need to be allocated and de-allocated, potentially causing significant coordination overhead that is not directly related to the actual execution.

Having multiple DFO trees can also lead to a situation where (a lot) more PX slaves are allocated than expected, in case multiple DFO trees are active at the same time - which again means that if you believe you can limit the number of PX slaves allocated by a single parallel execution using Resource Manager directives, you might be wrong.
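
As a side note, one way to observe this effect while such a statement is running is to count the allocated slaves per server group - which roughly corresponds to a DFO tree - from a second session. A sketch, where the QCSID value 123 is just a placeholder for the actual SID of the Query Coordinator:

-- count allocated PX slaves per server group / set for a given QC session
-- (replace 123 with the actual SID of the Query Coordinator session)
select
        server_group
      , server_set
      , count(*) as num_slaves
from
        v$px_session
where
        qcsid = 123
and     sid != qcsid
group by
        server_group
      , server_set
order by
        server_group
      , server_set
;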

Since all these are undesirable side effects, starting with release 12c Oracle has put effort into new features that minimize the need for such a decomposition into multiple DFO trees. One of these new features is the so-called "1 SLAVE" distribution method that can be used when such a non-parallel evaluation is required.

Quite similar to the recently described "PX SELECTOR" operator, the "1 SLAVE" distribution uses a single PX slave out of a slave set to execute the non-parallel operations instead of the Query Coordinator. The main difference this makes is that the parallel and serial operations are still part of the same DFO tree, instead of having parts of the execution plan executed by the Query Coordinator and different DFO trees before and after such serial operations.

Let's have a look at a simple example to demonstrate the new feature. I use three identical tables here, just for the sake of being able to differentiate them in the plan output - in principle re-using a single table three times would be sufficient.

create table t2
compress
as
select
        (rownum * 2) + 1 as id
      , mod(rownum, 2000) + 1 as id2
      , rpad('x', 100) as filler
from
        (select /*+ cardinality(100000) */ * from dual
         connect by level <= 100000) a
      , (select /*+ cardinality(20) */ * from dual
         connect by level <= 20) b
;

exec dbms_stats.gather_table_stats(null, 't2')

create table t4
compress
as
select
        (rownum * 2) + 1 as id
      , mod(rownum, 2000) + 1 as id2
      , rpad('x', 100) as filler
from
        (select /*+ cardinality(100000) */ * from dual
         connect by level <= 100000) a
      , (select /*+ cardinality(20) */ * from dual
         connect by level <= 20) b
;

exec dbms_stats.gather_table_stats(null, 't4')

create table t6
compress
as
select
        (rownum * 2) + 1 as id
      , mod(rownum, 2000) + 1 as id2
      , rpad('x', 100) as filler
from
        (select /*+ cardinality(100000) */ * from dual
         connect by level <= 100000) a
      , (select /*+ cardinality(20) */ * from dual
         connect by level <= 20) b
;

exec dbms_stats.gather_table_stats(null, 't6')

explain plan for
select  /*+ no_merge(x) */
        *
from
        (
          select  /*+ parallel(t6 4)
                      --optimizer_features_enable('11.2.0.4')
                  */
                  *
          from
                  (
                    select  /*+ parallel(t2 4) */
                            --lag(id) over (order by id) as v1_rn
                            rownum as v1_rn
                          , t2.id as v1_id
                          , t2.filler as v1_filler
                    from
                            t2
                  ) v1
                , (
                    select  /*+ parallel(t4 2) */
                            --lag(id) over (order by id) as v2_rn
                            rownum as v2_rn
                          , t4.id as v2_id
                          , t4.filler as v2_filler
                    from
                            t4
                  ) v2
                , t6
          where
                  v1_id = v2_id
          and     v1_id = t6.id
        ) x
where
        rownum > 1
;
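
The plan output below is trimmed to the parallel execution related columns; a DBMS_XPLAN call along the following lines produces this kind of shape (the exact format string used originally is an assumption):

-- display the plan with only the basics plus parallel and predicate info
select * from table(dbms_xplan.display(format => 'BASIC +PARALLEL +PREDICATE'));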

-- 11.2.0.4 plan shape
----------------------------------------------------------------------------------
| Id | Operation | Name | TQ |IN-OUT| PQ Distrib |
----------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | |
| 1 | COUNT | | | | |
|* 2 | FILTER | | | | |
| 3 | PX COORDINATOR | | | | |
| 4 | PX SEND QC (RANDOM) | :TQ30002 | Q3,02 | P->S | QC (RAND) |
| 5 | VIEW | | Q3,02 | PCWP | |
|* 6 | HASH JOIN | | Q3,02 | PCWP | |
| 7 | PX RECEIVE | | Q3,02 | PCWP | |
| 8 | PX SEND HASH | :TQ30001 | Q3,01 | P->P | HASH |
| 9 | PX BLOCK ITERATOR | | Q3,01 | PCWC | |
| 10 | TABLE ACCESS FULL | T6 | Q3,01 | PCWP | |
| 11 | BUFFER SORT | | Q3,02 | PCWC | |
| 12 | PX RECEIVE | | Q3,02 | PCWP | |
| 13 | PX SEND HASH | :TQ30000 | | S->P | HASH |
|* 14 | HASH JOIN | | | | |
| 15 | VIEW | | | | |
| 16 | COUNT | | | | |
| 17 | PX COORDINATOR | | | | |
| 18 | PX SEND QC (RANDOM)| :TQ10000 | Q1,00 | P->S | QC (RAND) |
| 19 | PX BLOCK ITERATOR | | Q1,00 | PCWC | |
| 20 | TABLE ACCESS FULL| T2 | Q1,00 | PCWP | |
| 21 | VIEW | | | | |
| 22 | COUNT | | | | |
| 23 | PX COORDINATOR | | | | |
| 24 | PX SEND QC (RANDOM)| :TQ20000 | Q2,00 | P->S | QC (RAND) |
| 25 | PX BLOCK ITERATOR | | Q2,00 | PCWC | |
| 26 | TABLE ACCESS FULL| T4 | Q2,00 | PCWP | |
----------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - filter(ROWNUM>1)
6 - access("V1_ID"="T6"."ID")
14 - access("V1_ID"="V2_ID")

-- 12.1.0.2 plan shape
---------------------------------------------------------------------------------
| Id | Operation | Name | TQ |IN-OUT| PQ Distrib |
---------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | |
| 1 | COUNT | | | | |
|* 2 | FILTER | | | | |
| 3 | PX COORDINATOR | | | | |
| 4 | PX SEND QC (RANDOM) | :TQ10004 | Q1,04 | P->S | QC (RAND) |
| 5 | VIEW | | Q1,04 | PCWP | |
|* 6 | HASH JOIN BUFFERED | | Q1,04 | PCWP | |
| 7 | PX RECEIVE | | Q1,04 | PCWP | |
| 8 | PX SEND HASH | :TQ10002 | Q1,02 | P->P | HASH |
| 9 | PX BLOCK ITERATOR | | Q1,02 | PCWC | |
| 10 | TABLE ACCESS FULL | T6 | Q1,02 | PCWP | |
| 11 | PX RECEIVE | | Q1,04 | PCWP | |
| 12 | PX SEND HASH | :TQ10003 | Q1,03 | S->P | HASH |
|* 13 | HASH JOIN BUFFERED | | Q1,03 | SCWC | |
| 14 | VIEW | | Q1,03 | SCWC | |
| 15 | COUNT | | Q1,03 | SCWP | |
| 16 | PX RECEIVE | | Q1,03 | SCWP | |
| 17 | PX SEND 1 SLAVE | :TQ10000 | Q1,00 | P->S | 1 SLAVE |
| 18 | PX BLOCK ITERATOR | | Q1,00 | PCWC | |
| 19 | TABLE ACCESS FULL| T2 | Q1,00 | PCWP | |
| 20 | VIEW | | Q1,03 | SCWC | |
| 21 | COUNT | | Q1,03 | SCWP | |
| 22 | PX RECEIVE | | Q1,03 | SCWP | |
| 23 | PX SEND 1 SLAVE | :TQ10001 | Q1,01 | P->S | 1 SLAVE |
| 24 | PX BLOCK ITERATOR | | Q1,01 | PCWC | |
| 25 | TABLE ACCESS FULL| T4 | Q1,01 | PCWP | |
---------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - filter(ROWNUM>1)
6 - access("V1_ID"="T6"."ID")
13 - access("V1_ID"="V2_ID")
Let's start with the 11.2.0.4 plan shape: we can see from the multiple occurrences of the PX COORDINATOR operator, the TQ column and the corresponding TQ entries in the NAME column that this parallel execution plan consists of three DFO trees. When analyzing the actual run time activity it would also become obvious that the different DFO trees run at different DOPs and use different PX slave sets. In this particular case it would even become obvious that two DFO trees are active at the same time - so it pretty much demonstrates all the undesirable side effects of having multiple DFO trees that I mentioned above.
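
One way to cross-check this decomposition is V$PQ_TQSTAT, which gets populated in the executing session after the parallel statement completes - a sketch; three distinct DFO_NUMBER values would correspond to the three DFO trees of this plan shape:

-- table queues per DFO tree of the last parallel statement run in this session
select
        dfo_number
      , tq_id
      , server_type
      , count(distinct process) as num_processes
from
        v$pq_tqstat
group by
        dfo_number
      , tq_id
      , server_type
order by
        dfo_number
      , tq_id
      , server_type
;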

The HASH JOIN between V1 and V2 runs serially, whereas the HASH JOIN between T6 and the join result of V1 and V2 runs in parallel. Since we have a Serial->Parallel distribution between these two HASH JOINs, an additional BUFFER SORT operation gets added - as outlined in the PX SELECTOR note.

If we now look at the 12.1.0.2 plan shape we notice that the execution plan consists of a single DFO tree, and hence all the side effects I've outlined are gone. The serial evaluation is done by a single PX slave of one set, again indicated by the SCWC/SCWP decorators, similar to the PX SELECTOR operator.

However, the plan shape also demonstrates one possible disadvantage of the new feature: since everything is now part of a single DFO tree, we end up with more re-distributions per DFO tree, and hence might see an increased demand for the dreaded BUFFERing needed to prevent multiple re-distributions being active at the same time. Here this can be seen at operation ID 13: the HASH JOIN executed serially by one PX slave is turned into a HASH JOIN BUFFERED - the worst case for a (not really) "parallel" hash join, executed by only a single PX slave and in addition having to buffer the right hand row source - and the second HASH JOIN is also turned into its BUFFERED variant. Compared to the 11.2.0.4 plan shape, which contains no HASH JOIN BUFFERED but an additional BUFFER SORT operation, the 12.1.0.2 plan shape with this data pattern actually requires more PGA / TEMP space due to the double buffering now necessary.
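
To get an idea of that extra PGA / TEMP demand while the statement is running, the active work areas of the buffered hash joins can be monitored from a second session - again a sketch, with the QCSID value 123 as a placeholder:

-- active work areas of all sessions belonging to a given Query Coordinator;
-- a non-zero TEMPSEG_SIZE indicates spilling to TEMP
select
        sid
      , operation_type
      , operation_id
      , actual_mem_used
      , tempseg_size
from
        v$sql_workarea_active
where
        qcsid = 123
order by
        operation_id
      , sid
;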

Footnote
The new 1 SLAVE distribution doesn't always get used in 12c. If you run just the join between V1 and V2, for example, the old plan shape is used, and there are again multiple DFO trees. Furthermore, in this particular case, when you start changing the DOP used in the PARALLEL hints you might also end up with a plan shape where one view uses the 1 SLAVE distribution whereas the other one uses the old plan shape with Query Coordinator activity - I haven't investigated further why this happens.

If Analytic Functions get used, you might also see a "1 SLAVE (ORDER)" variation of the distribution that enforces a certain order when re-distributing the data, similar to the "PX SEND QC (ORDER)" operator.
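
For reference, the commented-out LAG expressions in the statement above are exactly the kind of change that can produce this variation - a sketch of the V1 part with the Analytic Function active (whether you actually get the ORDER variant may depend on version and plan shape):

-- LAG requires an ordered, non-parallel evaluation, which can show up
-- as a "1 SLAVE (ORDER)" distribution in the plan
select /*+ parallel(t2 4) */
        lag(id) over (order by id) as v1_rn
      , t2.id as v1_id
      , t2.filler as v1_filler
from
        t2
;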
