Randolf Geist


Oracle Database Cloud (DBaaS) Performance Consistency - Part 3

Sun, 2016-08-28 16:43
This is the third part of this series, comparing the performance consistency of the DBaaS cloud offering with a dedicated physical host. This time, instead of burning CPU using logical I/O only (see part 2), the test harness executes a SQL statement that mostly has to perform physical I/O. To be more precise, this round is a read-only test, so it doesn't include any modifications to data (except logging and other background, system-generated write activity like maintaining ASH / AWR data).

In order to maximize the physical I/O part, the database instance was configured with a minimum-sized buffer cache (16k block size in this case), and the following script was executed once per available CPU:

declare
  n number;
begin
  loop
    select /*+
              leading(t_o)
              use_nl(t_i)
              index(t_o)
              index(t_i)
          */
          sum(t_i.n)
          into n
    from
          t_o
        , t_i&tabname t_i
    where
          t_o.id_fk = t_i.id;
    insert into timings(testtype, thread_id, ts) values ('&testtype', &thread_id, systimestamp);
    commit;
  end loop;
end;
/

The outer table T_O was created like this:

create table t_o (id primary key, id_fk)
organization index
as
select
       rownum as id
     , case mod(rownum, 2) + 1
       when 1
       then mod(rownum, &tab_size / 20)
       else &tab_size / 10 - mod(rownum, &tab_size / 20) + 1
       end as id_fk
from dual
connect by level <= &tab_size;

Each thread got its dedicated inner table of the Nested Loop join, created like this:

create table t_i' || i || ' (id not null, n, filler)
pctfree 99 pctused 1
tablespace &tbs
as
select cast(rownum as integer) as id,
cast(rownum as number) as n,
cast(rpad('x', 200) as varchar2(200)) as filler
from dual
connect by level <= &tab_size / 10;

create index t_i' || i || '_idx on t_i' || i || ' (id, filler) pctfree 99 tablespace &tbs;


Due to the way the data was crafted and thanks to the Nested Loop join batching performed by Oracle this ensured that the sessions were performing "db file parallel read" I/O as much and hard as possible, so were submitting multiple I/O requests using a single call and taking advantage of asynchronous I/O where possible.

When running with 8 threads on those 8 CPUs on the DBaaS service this resulted in approx. 30,000 IOPS for those 16K blocks (approx. 480 MB / sec throughput). Note that I got these 30,000 IOPS for 16K blocks also on a different instance with just 4 CPUs (2 OCPUs) and 4 threads, so it looks like this is a common 16K block IOPS limit for a DBaaS instance, independent of the CPU count and without further tweaking (I didn't attempt to optimize this in any way but used the service as provided / configured by the Oracle Cloud). It also looks like this is more a throughput limitation at 480 MB / sec than an IOPS limitation: when using a 4K block size I was able to get 60,000 IOPS (middle value), but that IOPS rate fluctuated much more (between 20,000 and 90,000), whereas the 30,000 IOPS using the 16K block size were pretty stable.

On the physical host I got approx. half of these IOPS (roughly 15,500) for those 16K blocks (approx. 250 MB / sec throughput), which is very likely the throughput limit of the rather outdated hardware configuration, albeit one using a fairly modern Samsung SSD drive. It also meant that running at 4 threads I got a comparable number of IOPS per thread as in the DBaaS setup. Using a 4K block size the physical host maxed out at a very stable 52,000 IOPS.
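
For reference, this is a minimal sketch of how the sustained I/O rate could be cross-checked from within the database while the harness is running, assuming access to V$SYSMETRIC (GROUP_ID = 2 usually corresponds to the 60-second interval metrics):

-- Rough cross-check of the sustained physical read rate and throughput
select metric_name, round(value) as value, metric_unit
from   v$sysmetric
where  group_id = 2
and    metric_name in ('Physical Read Total IO Requests Per Sec'
                     , 'Physical Read Total Bytes Per Sec');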

The overall results look like this:

DBaaS:



Physical host:




Again the graph is supposed to show how many of the runs deviated how much from the overall median runtime. As in the previous, logical I/O only test, the DBaaS service shows a significantly different profile, with an even larger spread of deviation, up to almost 9 percent from the median runtime, whereas the physical host only shows significant deviation up to 2.5 percent.

Again, like in the previous test, the physical host shows more extreme outliers than the DBaaS service.
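
Just to illustrate the idea, here is a hypothetical sketch (not the actual script used to produce the graphs) of how such per-run deviations from the median runtime could be derived from the TIMINGS table populated by the test harness, taking the delta between consecutive timestamps per thread as the runtime of a single run:

-- Hypothetical sketch: distribution of per-run deviation from the median runtime
with runtimes as (
  select thread_id
       , ts - lag(ts) over (partition by thread_id order by ts) as run_interval
  from   timings
), runtimes_secs as (
  select thread_id
       , extract(hour from run_interval) * 3600
       + extract(minute from run_interval) * 60
       + extract(second from run_interval) as runtime_secs
  from   runtimes
  where  run_interval is not null
), med as (
  select median(runtime_secs) as med_runtime from runtimes_secs
)
select round((r.runtime_secs / m.med_runtime - 1) * 100, 1) as pct_deviation
     , count(*)                                             as num_runs
from   runtimes_secs r, med m
group by round((r.runtime_secs / m.med_runtime - 1) * 100, 1)
order by pct_deviation;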

The same graph on a per day basis this time shows significant differences between the days for the DBaaS service:
The physical host shows a very common pattern, except for the first day:
Looking at the individual performance of each thread the DBaaS shows a significant variation in performance per day:
The physical host shows a pretty consistent performance pattern, but interestingly the different threads show different, but consistent runtimes:
The next test round will include physical writes.

Adaptive Cursor Sharing Fail

Sun, 2016-08-21 10:34
Here is another example, besides the known limitations that Adaptive Cursor Sharing only gets evaluated during a PARSE call (still valid in 12c) and supports a maximum of 14 bind variables, that I've recently come across at a client site, where the default implementation of Adaptive Cursor Sharing fails to create a more suitable execution plan for different bind variable values.

Broken down to a bare minimum, the query was sometimes executed using non-existing values for a particular bind variable, but at other times these values were existing and very popular. There were two suitable candidate indexes, and one of them appeared more attractive to the optimizer in the "non-existing" value case. Unfortunately this index was a very bad choice for the "existing and very popular" case.

The critical point of the case is that, due to other, additional filters on the same table / index, the final number of rows produced by the row source operation was (almost) the same for the "non-existing" and the "existing and very popular" case. In the latter case, however, a large proportion of the bad index had to be scanned due to the popularity of the bind value, because it was the only predicate that could be used as an ACCESS predicate; the additional ones could only be applied as FILTER on the index.

So although there was a suitable histogram available on the column in question, and the optimizer in principle was aware of the different popularities of the values and marked the cursor as "bind sensitive", it looks like the executions of a monitored cursor are simply assigned to different buckets, as exposed via V$SQL_CS_HISTOGRAM, based on the number of rows produced by the row source. Since the operation in question didn't differ much in the final number of rows produced between the two cases (either 0 rows or 1 row), the different executions were all assigned to the same bucket, although the amount of work required to produce this single row was huge, as reflected in the execution time and the physical / logical I/Os performed per execution.
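
For the record, the bucket assignment can be inspected for a given cursor with a simple sketch like this, joining V$SQL and V$SQL_CS_HISTOGRAM (the SQL_ID is a placeholder):

select s.child_number
     , s.is_bind_sensitive
     , s.is_bind_aware
     , h.bucket_id
     , h.count as cnt
from   v$sql s
     , v$sql_cs_histogram h
where  s.sql_id = h.sql_id
and    s.child_number = h.child_number
and    s.sql_id = '&sql_id'
order by s.child_number, h.bucket_id;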

Obviously this vastly different amount of work performed to produce a similar number of rows is not used as criteria for marking a cursor as "bind aware" and evaluate different execution plans based on ranges of bind values.

There is a view V$SQL_CS_STATISTICS that includes information about CPU_TIME and BUFFER_GETS but in 11.2.0.4 it only reflects the initial execution when starting with the "non-existing" value case, but not the slow executions afterwards. In 12.1.0.2 the view is no longer populated at all, which suggests that this information is simply not used for deciding the "bind aware" status of a cursor.

Discussing the case with Mohamed Houri, who has spent significant time on investigating the "Adaptive Cursor Sharing" feature seemed to confirm this assumption.

Here is a simple test case that allows reproducing the issue:

-- FILTER1 is highly skewed (here one very popular value -1)
create table t
as
select
       rownum as id
     , rownum as n
     , case when rownum <= 100000 then rownum else -1 end as filter1
     , rownum as filter2
     , rpad('x', 200) as filler1
     , rownum as filter3
     , rpad('x', 200) as filler2
from
       dual
connect by level <= 1e6;

-- Histogram on FILTER1 should make the sample query "bind sensitive"
exec dbms_stats.gather_table_stats(null, 't', method_opt => 'for all columns size 1 for columns size 254 filter1')

-- The "good" index, ACCESS on all three columns
-- But higher BLEVEL (3) due to FILLER1 only one value and wide - needs to be part of every branch block
create index t_idx1 on t (filter1, filler1, filter2) compress 2 /*advanced low*/;

-- The "bad" index, ACCESS only on first column, FILTER on subsequent columns
-- But lower BLEVEL (2) due to FILTER3 before FILLER1, leads to much more compact branch blocks
create index t_idx2 on t (filter1, filter3, filler1, filler2, filter2) compress 1 /*advanced low*/;
The critical point is that the good index has a higher BLEVEL than the bad index. So if a non-existing value for FILTER1 gets used, the optimizer will favor the index with the lower BLEVEL, which in that case is the key cost component:
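
As a quick plausibility check, the BLEVEL difference between the two indexes can be verified via USER_INDEXES after creation (index statistics are gathered automatically as part of CREATE INDEX in recent releases):

select index_name, blevel, leaf_blocks, clustering_factor
from   user_indexes
where  index_name in ('T_IDX1', 'T_IDX2');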

set echo on

-- Non-existing value
-- T_IDX2 gets preferred
explain plan for
select sum(n) from t where filter1 = -999999 and filler1 = rpad('x', 200) and filter2 = 999999;

select * from table(dbms_xplan.display);

-----------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-----------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 216 | 4 (0)| 00:00:01 |
| 1 | SORT AGGREGATE | | 1 | 216 | | |
| 2 | TABLE ACCESS BY INDEX ROWID BATCHED| T | 1 | 216 | 4 (0)| 00:00:01 |
|* 3 | INDEX RANGE SCAN | T_IDX2 | 1 | | 3 (0)| 00:00:01 |
-----------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

3 - access("FILTER1"=(-999999) AND "FILLER1"='x

' AND
"FILTER2"=999999)
filter("FILTER2"=999999 AND "FILLER1"='x

')

-- Non-existing value
-- T_IDX1 has a higher cost
explain plan for
select /*+ index(t t_idx1) */ sum(n) from t where filter1 = -999999 and filler1 = rpad('x', 200) and filter2 = 999999;

select * from table(dbms_xplan.display);

-----------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-----------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 216 | 5 (0)| 00:00:01 |
| 1 | SORT AGGREGATE | | 1 | 216 | | |
| 2 | TABLE ACCESS BY INDEX ROWID BATCHED| T | 1 | 216 | 5 (0)| 00:00:01 |
|* 3 | INDEX RANGE SCAN | T_IDX1 | 1 | | 4 (0)| 00:00:01 |
-----------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

3 - access("FILTER1"=(-999999) AND "FILLER1"='x

' AND
"FILTER2"=999999)

-- Existing and very popular value
-- T_IDX1 gets preferred
explain plan for
select sum(n) from t where filter1 = -1 and filler1 = rpad('x', 200) and filter2 = 999999;

select * from table(dbms_xplan.display);

-----------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-----------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 216 | 5 (0)| 00:00:01 |
| 1 | SORT AGGREGATE | | 1 | 216 | | |
| 2 | TABLE ACCESS BY INDEX ROWID BATCHED| T | 1 | 216 | 5 (0)| 00:00:01 |
|* 3 | INDEX RANGE SCAN | T_IDX1 | 1 | | 4 (0)| 00:00:01 |
-----------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

3 - access("FILTER1"=(-1) AND "FILLER1"='x

' AND
"FILTER2"=999999)


-- Existing and very popular value
-- T_IDX2 is a bad idea, correctly reflected in the cost estimate
explain plan for
select /*+ index(t t_idx2) */ sum(n) from t where filter1 = -1 and filler1 = rpad('x', 200) and filter2 = 999999;

select * from table(dbms_xplan.display);

-----------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-----------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 216 | 58269 (1)| 00:00:03 |
| 1 | SORT AGGREGATE | | 1 | 216 | | |
| 2 | TABLE ACCESS BY INDEX ROWID BATCHED| T | 1 | 216 | 58269 (1)| 00:00:03 |
|* 3 | INDEX RANGE SCAN | T_IDX2 | 1 | | 58268 (1)| 00:00:03 |
-----------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

3 - access("FILTER1"=(-1) AND "FILLER1"='x

' AND
"FILTER2"=999999)
filter("FILTER2"=999999 AND "FILLER1"='x

')
So when optimizing for the two different cases the optimizer does the right thing and understands that for the "existing and very popular" case T_IDX2 is a bad choice. It's also obvious from the "Predicate Information" section that the index T_IDX2 can only use FILTER1 as an ACCESS predicate.

But when using bind variables the different cases are not recognized and the bad index is used for both cases when the optimization is based on the "non-existing value" case:

-- Default behaviour, although being BIND_SENSITIVE thanks to the histogram, no ACS kicks in
-- The V$SQL_CS_HISTOGRAM shows all executions being in bucket 0, so according to BIND_SENSITIVE monitoring no need for action
set echo on timing on time on

alter session set statistics_level = all;

variable n number

exec :n := -999999

declare
res number;
begin
select sum(n) into res from t where filter1 = :n and filler1 = rpad('x', 200) and filter2 = 999999;
end;
/

exec :n := -1

declare
res number;
begin
for i in 1..1000 loop
select sum(n) into res from t where filter1 = :n and filler1 = rpad('x', 200) and filter2 = 999999;
end loop;
end;
/

SQL> alter session set statistics_level = all;

Session altered.

Elapsed: 00:00:00.03
SQL>
SQL> variable n number
SQL>
SQL> exec :n := -999999

PL/SQL procedure successfully completed.

Elapsed: 00:00:00.03
SQL>
SQL> declare
2 res number;
3 begin
4 select sum(n) into res from t where filter1 = :n and filler1 = rpad('x', 200) and filter2 = 999999;
5 end;
6 /

PL/SQL procedure successfully completed.

Elapsed: 00:00:00.28
SQL>
SQL> exec :n := -1

PL/SQL procedure successfully completed.

Elapsed: 00:00:00.04
SQL>
SQL> declare
2 res number;
3 begin
4 for i in 1..1000 loop
5 select sum(n) into res from t where filter1 = :n and filler1 = rpad('x', 200) and filter2 = 999999;
6 end loop;
7 end;
8 /

PL/SQL procedure successfully completed.

Elapsed: 00:04:48.64
SQL>

-- With BIND_AWARE a second child cursors gets generated and used on second parse, interesting
set echo on timing on time on

alter session set statistics_level = all;

variable n number

exec :n := -999999

declare
res number;
begin
select /*+ bind_aware */ sum(n) into res from t where filter1 = :n and filler1 = rpad('x', 200) and filter2 = 999999;
end;
/

exec :n := -1

declare
res number;
begin
for i in 1..1000 loop
select /*+ bind_aware */ sum(n) into res from t where filter1 = :n and filler1 = rpad('x', 200) and filter2 = 999999;
end loop;
end;
/

SQL> alter session set statistics_level = all;

Session altered.

Elapsed: 00:00:00.03
SQL>
SQL> variable n number
SQL>
SQL> exec :n := -999999

PL/SQL procedure successfully completed.

Elapsed: 00:00:00.03
SQL>
SQL> declare
2 res number;
3 begin
4 select /*+ bind_aware */ sum(n) into res from t where filter1 = :n and filler1 = rpad('x', 200) and filter2 = 999999;
5 end;
6 /

PL/SQL procedure successfully completed.

Elapsed: 00:00:00.04
SQL>
SQL> exec :n := -1

PL/SQL procedure successfully completed.

Elapsed: 00:00:00.03
SQL>
SQL> declare
2 res number;
3 begin
4 for i in 1..1000 loop
5 select /*+ bind_aware */ sum(n) into res from t where filter1 = :n and filler1 = rpad('x', 200) and filter2 = 999999;
6 end loop;
7 end;
8 /

PL/SQL procedure successfully completed.

Elapsed: 00:00:00.20
SQL>

-- Or hinting for a fixed plan helps, too
set echo on timing on time on

alter session set statistics_level = all;

variable n number

exec :n := -999999

declare
res number;
begin
select /*+ index(t t_idx1) */ sum(n) into res from t where filter1 = :n and filler1 = rpad('x', 200) and filter2 = 999999;
end;
/

exec :n := -1

declare
res number;
begin
for i in 1..1000 loop
select /*+ index(t t_idx1) */ sum(n) into res from t where filter1 = :n and filler1 = rpad('x', 200) and filter2 = 999999;
end loop;
end;
/
Executing the unhinted, default case leads to a very inefficient execution of the thousand runs using the popular value (and the actual table / index at the client site was much larger than this example, and the query got executed very frequently); compare that to the execution times of the other two variants, either using the BIND_AWARE hint or simply requesting the good index.

So when requesting to mark the cursor immediately BIND_AWARE via the corresponding hint (only possible if the cursor qualifies to be bind sensitive), Oracle happily generates a second child cursor at the second PARSE call, with a different, more efficient execution plan for the popular value. The reason is that once a cursor is marked bind aware, a completely different algorithm gets used that is based on ranges of values, as exposed via V$SQL_CS_SELECTIVITY. Since the different bind values fall into different ranges, a new plan gets generated that differs from the previous one and hence gets used from then on for those (ranges of) values.
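
For completeness, once the cursor is marked bind aware the recorded selectivity ranges can be inspected with a simple sketch like this (the SQL_ID again being a placeholder):

select child_number
     , predicate
     , range_id
     , low
     , high
from   v$sql_cs_selectivity
where  sql_id = '&sql_id'
order by child_number, predicate, range_id;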

Nested Loop Join Physical I/O Optimizations

Mon, 2016-08-15 07:44
Having done my mini-series on Nested Loop join logical I/O optimizations a while ago I unfortunately never managed to publish anything regarding the Nested Loop join physical I/O optimizations, which are certainly much more relevant to real-life performance.

Therefore the main purpose of this blog post is to point you to Nikolay Savvinov's (whose blog I can recommend in general) great mini-series covering various aspects of these optimizations:

Part 1
Part 2
Part 3
Summary

One point that - at least to me - isn't entirely clear when reading Nikolay's series is which specific plan shape he refers to, in particular since in 12c even more plan shapes for a Nested Loop join are possible.

Hence here is an attempt to summarize the various two table Nested Loop join plan shapes as of 12c and what kind of physical I/O optimizations one can expect from them:

1. Nested Loop Join Batching, the default in most cases since 11g and also in 12c

-----------------------------------------------------------
| Id | Operation | Name |
-----------------------------------------------------------
| 0 | SELECT STATEMENT | |
| 1 | SORT AGGREGATE | |
| 2 | NESTED LOOPS | |
| 3 | NESTED LOOPS | |
| 4 | INDEX FULL SCAN | SYS_IOT_TOP_97632 |
|* 5 | INDEX RANGE SCAN | T_I6_IDX |
| 6 | TABLE ACCESS BY INDEX ROWID| T_I6 |
-----------------------------------------------------------
This is the plan shape that Nikolay calls "batch" or "batching" in his posts. It can provide batched I/O (mainly "db file parallel read" issuing multiple I/O requests in a single call, potentially asynchronous I/O) on both the INDEX RANGE SCAN as well as the TABLE ACCESS BY ROWID operation as described in Nikolay's blog post.

Please note that I deliberately write "can provide batched I/O", because, as Nikolay also points out in his posts, the runtime engine monitors the execution and can dynamically adjust the behaviour, which also means that it might decide to use conventional single block reads ("db file sequential read").

Please also note that in addition to the "db file parallel read" operation that submits multiple I/O requests in a single I/O submit call (see Frits Hoogland's blog post from some time ago about Linux internals of this I/O operation) the runtime engine might use "db file scattered read" multi-block I/Os under certain circumstances, in particular when performing "cache warmup prefetching".

Also note that as Nikolay points out when enabling SQL Trace or "rowsource statistics" the "Batched I/O" optimization for some reason gets disabled.

Also, at least in 12c, this plan shape seems to lead to inconsistencies in Real-Time SQL Monitoring in several ways. First, the number of "Executions" for the INDEX RANGE SCAN and TABLE ACCESS operations of the inner row source is not necessarily consistent with the number of rows in the driving row source; second, almost all activity and I/O volume seems to be attributed to the INDEX RANGE SCAN plan operation and not the TABLE ACCESS operation, even if it's the TABLE ACCESS that causes the physical I/O.
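
As a side note, whether the "Batched I/O" code path mentioned above was actually used by a session can be checked via the corresponding session statistics, for example with a sketch like the following (the exact statistic names may differ slightly between versions):

select sn.name, st.value
from   v$statname sn
     , v$sesstat  st
where  sn.statistic# = st.statistic#
and    st.sid = sys_context('userenv', 'sid')
and    sn.name like 'Batched IO%'
and    st.value > 0
order by sn.name;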

2. Nested Loop Join Prefetch plan shape including BATCHED ROWID table access (only from 12c on)

------------------------------------------------------------------
| Id | Operation | Name |
------------------------------------------------------------------
| 0 | SELECT STATEMENT | |
| 1 | SORT AGGREGATE | |
| 2 | TABLE ACCESS BY INDEX ROWID BATCHED| T_I5 |
| 3 | NESTED LOOPS | |
| 4 | INDEX FULL SCAN | SYS_IOT_TOP_97716 |
|* 5 | INDEX RANGE SCAN | T_I5_IDX |
------------------------------------------------------------------
This is the Nested Loop prefetching plan shape introduced in Oracle 9i combined with the new TABLE ACCESS BY INDEX ROWID BATCHED introduced in 12c. It's the plan shape Nikolay refers to as "BATCHED ROWID" in his posts. In my (and Nikolay's) tests this provides the most aggressive batching of I/O (the highest number of I/O requests submitted per call in "db file parallel read") for the TABLE ACCESS BY ROWID BATCHED, but didn't do the same on the index ("db file sequential read", single block reads on the index segment), which doesn't mean that different test and data setups might not provide that, too.
Note that you should get this plan shape only with explicit hinting or by disabling Nested Loop Join Batching via parameter.

In 12c, given a two table join between alias A and B, the following hints would be required to arrive at this plan shape:

leading(a b)
use_nl(b)
index(b)
no_nlj_batching(b)
--batch_table_access_by_rowid(b)
The "batch_table_access_by_rowid" isn't strictly necessary since it's the default behaviour in 12c.

Other variants are possible, like "use_nl_with_index" instead of "use_nl" and "index" separately or "opt_param('_nlj_batching_enabled', 0)" instead of "no_nlj_batching" to disable the batching (but then batching is disabled for the whole execution plan, not just for a particular join).
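
Putting the hints together, a minimal example against two hypothetical tables A and B (B indexed on the join column ID) that should produce this plan shape in 12c could look like this:

select /*+
          leading(a b)
          use_nl(b)
          index(b)
          no_nlj_batching(b)
       */
       sum(b.n)
from   a
     , b
where  b.id = a.id_fk;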

3. Classic Nested Loop Join Prefetch introduced in 9i

----------------------------------------------------------
| Id | Operation | Name |
----------------------------------------------------------
| 0 | SELECT STATEMENT | |
| 1 | SORT AGGREGATE | |
| 2 | TABLE ACCESS BY INDEX ROWID| T_I5 |
| 3 | NESTED LOOPS | |
| 4 | INDEX FULL SCAN | SYS_IOT_TOP_97716 |
|* 5 | INDEX RANGE SCAN | T_I5_IDX |
----------------------------------------------------------
This is what you get in pre-12c when preventing Nested Loop Join Batching, in 12c the BATCHED table access needs to be disabled in addition.

This plan shape provides the less aggressive table prefetching described in Nikolay's posts (he refers to this plan shape as "Prefetch"); the maximum number of requests submitted per call in a "db file parallel read" operation seems to be 39.

As mentioned above, it didn't provide index prefetching in my tests.

The session statistics don't mention "Batched I/O", so although the "db file parallel read" wait event is the same the internal implementation obviously is a different code path.

In 12c, given a two table join between alias A and B, the following hints would be required to arrive at this plan shape:

leading(a b)
use_nl(b)
index(b)
no_nlj_batching(b)
no_batch_table_access_by_rowid(b)
Other variants as above.

4. Classic Nested Loop Join plan shape with BATCHED ROWID table access (only from 12c on)

-------------------------------------------------------------------
| Id | Operation | Name |
-------------------------------------------------------------------
| 0 | SELECT STATEMENT | |
| 1 | SORT AGGREGATE | |
| 2 | NESTED LOOPS | |
| 3 | INDEX FULL SCAN | SYS_IOT_TOP_97716 |
| 4 | TABLE ACCESS BY INDEX ROWID BATCHED| T_I5 |
|* 5 | INDEX RANGE SCAN | T_I5_IDX |
-------------------------------------------------------------------
This is the classic Nested Loop join plan shape (pre-9i) combined with the new TABLE ACCESS BY INDEX ROWID BATCHED introduced in 12c.

In my tests it provides only "intra-loop" table prefetching, so if the INDEX RANGE SCAN of a single loop iteration points to different table blocks the TABLE ACCESS BY INDEX ROWID BATCHED can make use of "db file parallel read" to batch these I/O requests to the table blocks, and the session statistics show "Batched I/O" counters increasing.

However, it doesn't attempt to optimize "inter-loop" / multiple loop iterations, so if each loop iteration via the INDEX RANGE SCAN only points to a single table block, only single block reads ("db file sequential read") on the TABLE ACCESS BY INDEX ROWID BATCHED can be seen.

In my tests this plan shape didn't provide index prefetching / I/O optimizations and performed single block reads ("db file sequential read") on the INDEX RANGE SCAN operation.

For a two table join it needs explicit hinting in 12c to arrive at this plan shape, but it is the default plan shape for the "inner" joins in case of nested Nested Loop joins (multiple, consecutive Nested Loop joins in a row), see below for more information.

In 12c, given a two table join between alias A and B, the following hints would be required to arrive at this plan shape:

leading(a b)
use_nl(b)
index(b)
opt_param('_nlj_batching_enabled', 0)
no_nlj_prefetch(b)
--batch_table_access_by_rowid(b)
See my older posts regarding Nested Loop join logical I/O optimizations for why the combination of OPT_PARAM and NO_NLJ_PREFETCH is required to arrive at this plan shape - in short, specifying the obvious NO_NLJ_PREFETCH plus NO_NLJ_BATCHING doesn't work. Other variants as above.

5. Classic Nested Loop Join plan shape

-----------------------------------------------------------
| Id | Operation | Name |
-----------------------------------------------------------
| 0 | SELECT STATEMENT | |
| 1 | SORT AGGREGATE | |
| 2 | NESTED LOOPS | |
| 3 | INDEX FULL SCAN | SYS_IOT_TOP_97716 |
| 4 | TABLE ACCESS BY INDEX ROWID| T_I5 |
|* 5 | INDEX RANGE SCAN | T_I5_IDX |
-----------------------------------------------------------
This is the classic Nested Loop join plan shape (pre-9i).

Interestingly, in my tests, at least in 12c, it also provides "intra-loop" table prefetching: if the INDEX RANGE SCAN of a single loop iteration points to different table blocks, the TABLE ACCESS BY INDEX ROWID can make use of "db file parallel read" to submit multiple of these I/O requests to the table blocks in a single call. However, the session statistics don't show the "Batched I/O" counters increasing, and the maximum number of requests seems to be 39, so it is less aggressive and looks very similar to the internal implementation used for the classic "table prefetching" plan shape above.

There is no sign of "inter-loop" optimizations and the INDEX RANGE SCAN seems to make use of single block reads only ("db file sequential read").

In pre-12c this is the default plan shape used for the "inner" joins in case of nested Nested Loop joins (multiple Nested Loop joins in a row), see below for more information.

In 12c, given a two table join between alias A and B, the following hints would be required to arrive at this plan shape:

leading(a b)
use_nl(b)
index(b)
opt_param('_nlj_batching_enabled', 0)
no_nlj_prefetch(b)
no_batch_table_access_by_rowid(b)
Other variants as above.

Multiple, consecutive Nested Loop Joins
Another point that I think is important to mention when describing these "inter-loop" prefetching and batching Nested Loop join optimization techniques is that they only apply to the "outer-most" Nested Loop join in case of multiple, consecutive Nested Loop joins, which to me makes the "game changing" character that Nikolay mentions in his posts less strong.

For example, this is the 12c default shape of a four table join using three Nested Loop joins:

----------------------------------------------------------------------
| Id | Operation | Name |
----------------------------------------------------------------------
| 0 | SELECT STATEMENT | |
| 1 | SORT AGGREGATE | |
| 2 | NESTED LOOPS | |
| 3 | NESTED LOOPS | |
| 4 | NESTED LOOPS | |
| 5 | NESTED LOOPS | |
| 6 | INDEX FULL SCAN | SYS_IOT_TOP_97632 |
| 7 | TABLE ACCESS BY INDEX ROWID BATCHED| T_I1 |
|* 8 | INDEX RANGE SCAN | T_I1_IDX |
| 9 | TABLE ACCESS BY INDEX ROWID BATCHED | T_I2 |
|* 10 | INDEX RANGE SCAN | T_I2_IDX |
|* 11 | INDEX RANGE SCAN | T_I3_IDX |
| 12 | TABLE ACCESS BY INDEX ROWID | T_I3 |
----------------------------------------------------------------------
As can be seen, only the outer-most Nested Loop join gets the "batching" plan shape and can benefit from the optimizations described above. The inner Nested Loop joins show the classic plan shape, in case of 12c with the ROWID BATCHED option, so they can only benefit from the "intra-loop" optimizations described above, if the INDEX RANGE SCAN of a single loop iteration points to several table blocks.

Note that even when using explicit, additional NLJ_BATCHING hints for the inner tables joined I wasn't able to enforce any other plan shape.

If there are multiple blocks of (consecutive) Nested Loop joins (for example a HASH JOIN in between), then in each block the outer-most Nested Loop join gets the optimization, so multiple of those are possible per execution plan:

---------------------------------------------------------------------
| Id | Operation | Name |
---------------------------------------------------------------------
| 0 | SELECT STATEMENT | |
| 1 | SORT AGGREGATE | |
| 2 | NESTED LOOPS | |
| 3 | NESTED LOOPS | |
|* 4 | HASH JOIN | |
| 5 | TABLE ACCESS BY INDEX ROWID BATCHED| T_I2 |
| 6 | INDEX FULL SCAN | T_I2_IDX |
| 7 | NESTED LOOPS | |
| 8 | NESTED LOOPS | |
| 9 | INDEX FULL SCAN | SYS_IOT_TOP_97632 |
|* 10 | INDEX RANGE SCAN | T_I1_IDX |
| 11 | TABLE ACCESS BY INDEX ROWID | T_I1 |
|* 12 | INDEX RANGE SCAN | T_I3_IDX |
| 13 | TABLE ACCESS BY INDEX ROWID | T_I3 |
---------------------------------------------------------------------
As can be seen, now that T_I2 is joined via a HASH JOIN, both Nested Loop joins to T_I1 and T_I3 get the "batched" optimization plan shape.

I don't know why there is this limitation that only the outer-most join in a block of Nested Loop joins gets the optimized plan shape, but it certainly can reduce the possible benefit.

Since I don't think there is a costing difference on CBO level for the different Nested Loop join plan shapes (so the decision which shape to use isn't cost-driven) it might be possible to see a significant performance difference depending on which join gets the optimization - there might be room for improvement by manually influencing the join order in case of multiple, consecutive Nested Loop joins.

Oracle Database Cloud (DBaaS) Performance Consistency - Part 2

Sun, 2016-08-07 06:30
This is the second part of this series, comparing the performance consistency of the DBaaS cloud offering with a dedicated physical host. This time, instead of burning CPU using a trivial PL/SQL loop (see part 1), the test harness executes a SQL statement that performs logical I/O only, so no physical I/O is involved.

In order to achieve that a variation of Jonathan Lewis' good old "kill_cpu" script got executed. In principle each thread performed the following:

define tabname = &1

define thread_id = &1;

alter session set "_old_connect_by_enabled" = true;

declare
  n number;
begin
  loop
    select
   count(*) X
    into    n
    from
   kill_cpu&tabname
    connect by
   n > prior n
    start with
   n = 1;
    insert into timings(testtype, thread_id, ts) values ('SQLLIO', &thread_id, systimestamp);
    commit;
  end loop;
end;
/

Each thread got its own, exclusive "kill_cpu<n>" table, so this setup attempted to avoid competition for buffer cache latches. The tables (in fact IOTs) were created like the following:

create table kill_cpu<n>(n primary key)
organization index
as
select
rownum n
from
all_objects
where
rownum <= 25
;

Again there were as many threads started as CPUs available, which meant 8 threads for the DBaaS environment and 4 threads for the physical host, and again this was left running for several days.
The overall results look like this:
DBaaS:
Physical host:
Again the graph is supposed to show how many of the runs deviated how much from the overall median runtime. In contrast to the previous, simple PL/SQL test this time the DBaaS service shows a significantly different profile with a larger spread of deviation up to 3 percent from the median runtime, whereas the physical host only shows significant deviation up to 1.5 percent.
It's interesting to note however, that the physical host this time shows more extreme outliers than the DBaaS service, which was the other way around in the previous PL/SQL test.

The same graph on a per day basis doesn't show too significant differences between the days for either environment (here DBaaS):
Physical host:
Looking at the individual performance of each thread, again the DBaaS shows a similar behaviour to last time - the different threads show slightly different performance, and they also get slightly slower towards the end of the measurement:
And again the physical host shows a more consistent performance between threads:
The next test round will be a physical I/O bound setup.

DBMS_STATS - Gather statistics on tables with many columns - 12c update

Wed, 2016-08-03 12:33
This is just a short 12c update on my post about gathering statistics on tables with many columns from some time ago.

I'm currently investigating the "Incremental Statistics" feature in 12.1.0.2 for a client, which probably will be worth one or more other posts, but since we're still in the process of evaluating and installing various patches it's too early to write about that.

As part of the investigation I've noticed a significant change in behaviour in 12.1.0.2 compared to previous versions when it comes to gathering statistics on tables with many columns, hence this post here.

The key message of the original post was that DBMS_STATS potentially needs several passes when gathering statistics on tables with many columns, which meant a significant increase in overall work and resource consumption, exacerbated by the fact that tables with that many columns consist of multiple row pieces.

Along with other significant improvements in 12c (among others new types of histograms and the capability to derive some of these histogram types from a single pass, with no need to run separate passes for each histogram), multiple passes are obviously no longer required for such tables - Oracle can now cope with up to and including 1000 columns in a single pass.

Repeating the test case from the original post for a table with 1,000 columns / 10,000 blocks and using a default METHOD_OPT setting of "FOR ALL COLUMNS SIZE AUTO" results in the following single query executed by the DBMS_STATS call:

SQL ID: 98828wcrfyn0c Plan Hash: 1962185829

select /*+ full(t) no_parallel(t) no_parallel_index(t) dbms_stats
cursor_sharing_exact use_weak_name_resl dynamic_sampling(0) no_monitoring
xmlindex_sel_idx_tbl no_substrb_pad */to_char(count("COL968")),
substrb(dump(min("COL968"),16,0,64),1,240),substrb(dump(max("COL968"),16,0,
64),1,240),to_char(count("COL969")),substrb(dump(min("COL969"),16,0,64),1,
240),substrb(dump(max("COL969"),16,0,64),1,240),to_char(count("COL970")),
substrb(dump(min("COL970"),16,0,64),1,240),substrb(dump(max("COL970"),16,0,
64),1,240),to_char(count("COL971")),substrb(dump(min("COL971"),16,0,64),1,
240),substrb(dump(max("COL971"),16,0,64),1,240),to_char(count("COL972")),
.
.
.
from
"CBO_TEST"."MANY_X_COLS" t /* NDV,NIL,NIL,NDV,NIL,NIL,NDV,NIL,NIL,NDV,NIL,
NIL,NDV,NIL,NIL,NDV,NIL,NIL,NDV,NIL,NIL,NDV,NIL,NIL,NDV,NIL,NIL,NDV,NIL,NIL,
NDV,NIL,NIL,NDV,NIL,NIL,NDV,NIL,NIL,NDV,NIL,NIL,NDV,NIL,NIL,NDV,NIL,NIL,NDV,
NIL,NIL,NDV,NIL,NIL,NDV,NIL,NIL,NDV,NIL,NIL,NDV,NIL,NIL,NDV,NIL,NIL,NDV,NIL,
NIL,NDV,NIL,NIL,NDV,NIL,NIL,NDV,NIL,NIL,NDV,NIL,NIL,NDV,NIL,NIL,NDV,NIL,NIL,
.
.
.
NDV,NIL,NIL,NDV,NIL,NIL*/


call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.00          0          0          0           0
Execute      1      0.00       0.00          0          0          0           0
Fetch        1      1.95       1.97          0      40011          0           1
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total        3      1.95       1.98          0      40011          0           1

Misses in library cache during parse: 0
Optimizer mode: ALL_ROWS
Parsing user id: 111 (recursive depth: 1)
Number of plan statistics captured: 1

Rows (1st)  Rows (avg)  Rows (max)  Row Source Operation
----------  ----------  ----------  ---------------------------------------------------
         1           1           1  SORT AGGREGATE (cr=40011 pr=0 pw=0 time=1969578 us)
      9995        9995        9995   OPTIMIZER STATISTICS GATHERING (cr=40011 pr=0 pw=0 time=751494 us cost=2717 size=240000 card=10000)
     10000       10000       10000    TABLE ACCESS FULL MANY_X_COLS (cr=40011 pr=0 pw=0 time=413062 us cost=2717 size=240000 card=10000)
As can be seen, the "APPROXIMATE NDV AGGREGATE" operation introduced in 11g for the new statistics gathering mode has been renamed in 12c to "OPTIMIZER STATISTICS GATHERING".

Apart from that this is good news as it minimizes the work required to gather statistics for such tables - it takes the minimum of logical / physical I/Os to process the data. And as already mentioned 12c is also capable of more, like generating synopses for incremental statistics and histograms from such a single pass.
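
In case you want to reproduce the trace output above yourself, a simple sketch is to enable SQL trace for the session around the statistics gathering call, for example like this (schema and table name as in the original test case):

exec dbms_monitor.session_trace_enable(waits => true, binds => false)

exec dbms_stats.gather_table_stats('CBO_TEST', 'MANY_X_COLS', method_opt => 'for all columns size auto')

exec dbms_monitor.session_trace_disable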

Video Tutorial: XPLAN_ASH Active Session History - Part 9

Sun, 2016-07-24 09:21
The next part of the video tutorial explaining the XPLAN_ASH Active Session History functionality, continuing the actual walk-through of the script output.

More parts to follow.

Video Tutorial: XPLAN_ASH Active Session History - Part 8

Sun, 2016-07-17 09:54
The next part of the video tutorial explaining the XPLAN_ASH Active Session History functionality, continuing the actual walk-through of the script output.

More parts to follow.


Comparing Columns Containing NULL Values

Sun, 2016-07-10 06:32
Prompted by a (not really that) recent discussion on the OTN forum I've decided to publish this note.

Sometimes you have the task of comparing column values, and handling the NULL value cases correctly makes this rather cumbersome for columns that are allowed to be NULL.

The "official" SQL way of comparing two column values and to find out whether they are equal or not - under the assumption that having NULL in both columns should be treated as equal (a point that can be argued) would read in SQL something like the following for the "unequal" case:

column1 != column2 or (column1 is null and column2 is not null) or (column1 is not null and column2 is null)
and

column1 = column2 or (column1 is null and column2 is null)
for the "equal" case.

Imagine this being done for a lot of columns and it becomes obvious why people might be looking for more concise alternatives to express this.

Sometimes you might have come across code that uses NVL or a similar function with some "special" value that is not supposed to be part of the column data to be compared - like -1 for (positive) integer values for example, but this has the inherent risk of producing wrong results if the "special" value is (or might become in future) not that special after all.

The not really documented SYS_OP_MAP_NONNULL function could come in handy if it were official, because it covers the NULL cases automatically.

Note: I still would call it undocumented even if it shows up in some "Materialized View" related part of the official 12c documentation, because it is not part of the official SQL language manual

SYS_OP_MAP_NONNULL(column1) != SYS_OP_MAP_NONNULL(column2)
is all that is needed. More concise than above, but undocumented.

And: There is one obvious drawback to SYS_OP_MAP_NONNULL, since it adds one byte to the input value to ensure the mapped NULL value differs from any other value: It doesn't work properly with maxed out columns, for example this:

SQL> create table t
2 as
3 select rpad('x', 4000) as col1, rpad('x', 4000) as col2 from dual
4 union all
5 select rpad('x', 4000) as col1, rpad('y', 4000) as col2 from dual
6 union all
7 select null as col1, rpad('y', 4000) as col2 from dual
8 union all
9 select rpad('x', 4000) as col1, null from dual
10 union all
11 select null as col1, null as col2 from dual
12 ;

Table created.

SQL>
SQL> exec dbms_stats.gather_table_stats(null, 't')

PL/SQL procedure successfully completed.

SQL>
SQL> select count(*) from t where sys_op_map_nonnull(col1) = sys_op_map_nonnull(col2);
select count(*) from t where sys_op_map_nonnull(col1) = sys_op_map_nonnull(col2)
*
ERROR at line 1:
ORA-01706: user function result value was too large
errors out with "ORA-01706: user function result value was too large".

DECODE
DECODE is an exception to the rule of how Oracle treats NULLs in SQL - it returns the "equal" value if both operands to compare are NULL, so it allows the following construct to achieve the above comparison (see the OTN forum discussion mentioned above for credits where this expression appeared first, more recently also here on Stew Ashton's blog):

For checking for inequality:

decode(column1, column2, 0, 1) = 1
For equality:

decode(column1, column2, 0, 1) = 0
Very concise, and officially documented.
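
As an example of how this could be applied in practice, here is a sketch comparing the columns of two hypothetical tables T_OLD and T_NEW to find rows whose values differ, treating NULL = NULL as equal:

select o.id
from   t_old o
     , t_new n
where  o.id = n.id
and    (   decode(o.col1, n.col1, 0, 1) = 1
        or decode(o.col2, n.col2, 0, 1) = 1
       );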

Interestingly, starting with 11.2.0.2 there is an implicit rewrite, enabled by default, that transforms the above expression (and only this variation, not the one testing for inequality):

decode(column1, column2, 0, 1) = 0
into

sys_op_map_nonnull(column1) = sys_op_map_nonnull(column2)
This is controlled via fix control 8551880 ("Rewrite decode predicate to join").
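
The current setting of that fix control can be checked like this (assuming access to V$SYSTEM_FIX_CONTROL; VALUE = 1 means the rewrite is enabled):

select bugno, value, description, optimizer_feature_enable
from   v$system_fix_control
where  bugno = 8551880;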

Guess what happens when running the following query in 11.2.0.2 or later against above data set (the fix control is enabled by default in these versions):

SQL> select /*+ opt_param('_fix_control', '8551880:0') */ count(*) from t where decode(col1, col2, 0, 1) = 0;

COUNT(*)
----------
2

SQL>
SQL> select /*+ opt_param('_fix_control', '8551880:1') */ count(*) from t where decode(col1, col2, 0, 1) = 0;
select /*+ opt_param('_fix_control', '8551880:1') */ count(*) from t where decode(col1, col2, 0, 1) = 0
*
ERROR at line 1:
ORA-01706: user function result value was too large
Finally, what about performance? For that purpose I created the following compressed 50M rows table, covering different cases:

create table t1 compress
as
select case mod(rownum, 5) + 1
when 1 then cast('C1' as varchar2(20))
when 2 then cast(null as varchar2(20))
when 3 then cast('C3' as varchar2(20))
when 4 then cast(null as varchar2(20))
when 5 then cast('C5' as varchar2(20))
end as col1,
case mod(rownum, 5) + 1
when 1 then cast('C2' as varchar2(20))
when 2 then cast(null as varchar2(20))
when 3 then cast('C3' as varchar2(20))
when 4 then cast('C4' as varchar2(20))
when 5 then cast(null as varchar2(20))
end as col2
from
(select /*+ cardinality(1e4) */ null from dual connect by level <= 1e4)
, (select /*+ cardinality(5e3) */ null from dual connect by level <= 5e3)
;
and got the following results from 11.2.0.4 (ignoring the different cardinality estimates for the moment, since I'm only interested in the elapsed time here):

SQL> select count(*) from t1;

Elapsed: 00:00:00.94

Execution Plan
----------------------------------------------------------
Plan hash value: 3724264953

-------------------------------------------------------------------
| Id | Operation | Name | Rows | Cost (%CPU)| Time |
-------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 18743 (2)| 00:03:45 |
| 1 | SORT AGGREGATE | | 1 | | |
| 2 | TABLE ACCESS FULL| T1 | 50M| 18743 (2)| 00:03:45 |
-------------------------------------------------------------------


Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
68456 consistent gets
0 physical reads
0 redo size
346 bytes sent via SQL*Net to client
364 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed

SQL>
SQL> select count(*) from t1 where decode(col1, col2, 0, 1) = 1;

Elapsed: 00:00:02.55

Execution Plan
----------------------------------------------------------
Plan hash value: 3724264953

---------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 4 | 19176 (4)| 00:03:51 |
| 1 | SORT AGGREGATE | | 1 | 4 | | |
|* 2 | TABLE ACCESS FULL| T1 | 500K| 1953K| 19176 (4)| 00:03:51 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - filter(DECODE("COL1","COL2",0,1)=1)


Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
68456 consistent gets
0 physical reads
0 redo size
346 bytes sent via SQL*Net to client
364 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed

SQL>
SQL> select count(*) from t1 where SYS_OP_MAP_NONNULL("COL1")!=SYS_OP_MAP_NONNULL("COL2");

Elapsed: 00:00:03.04

Execution Plan
----------------------------------------------------------
Plan hash value: 3724264953

---------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 4 | 19515 (6)| 00:03:55 |
| 1 | SORT AGGREGATE | | 1 | 4 | | |
|* 2 | TABLE ACCESS FULL| T1 | 15M| 60M| 19515 (6)| 00:03:55 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - filter(SYS_OP_MAP_NONNULL("COL1")<>SYS_OP_MAP_NONNULL("COL2"))


Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
68456 consistent gets
0 physical reads
0 redo size
346 bytes sent via SQL*Net to client
364 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed

SQL>
SQL> select count(*) from t1 where (col1 != col2 or (col1 is null and col2 is not null) or (col1 is not null and col2 is null));

Elapsed: 00:00:03.51

Execution Plan
----------------------------------------------------------
Plan hash value: 3724264953

---------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 4 | 18876 (2)| 00:03:47 |
| 1 | SORT AGGREGATE | | 1 | 4 | | |
|* 2 | TABLE ACCESS FULL| T1 | 30M| 115M| 18876 (2)| 00:03:47 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - filter("COL1" IS NULL AND "COL2" IS NOT NULL OR "COL2" IS NULL
AND "COL1" IS NOT NULL OR "COL1"<>"COL2")



Statistics
----------------------------------------------------------
1 recursive calls
0 db block gets
68456 consistent gets
0 physical reads
0 redo size
346 bytes sent via SQL*Net to client
364 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed

SQL>
SQL> select /*+ opt_param('_fix_control', '8551880:0') */ count(*) from t1 where decode(col1, col2, 0, 1) = 0;

Elapsed: 00:00:02.49

Execution Plan
----------------------------------------------------------
Plan hash value: 3724264953

---------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 4 | 19176 (4)| 00:03:51 |
| 1 | SORT AGGREGATE | | 1 | 4 | | |
|* 2 | TABLE ACCESS FULL| T1 | 500K| 1953K| 19176 (4)| 00:03:51 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - filter(DECODE("COL1","COL2",0,1)=0)


Statistics
----------------------------------------------------------
1 recursive calls
0 db block gets
68456 consistent gets
0 physical reads
0 redo size
346 bytes sent via SQL*Net to client
364 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed

SQL>
SQL> select /*+ opt_param('_fix_control', '8551880:1') */ count(*) from t1 where decode(col1, col2, 0, 1) = 0;

Elapsed: 00:00:03.04

Execution Plan
----------------------------------------------------------
Plan hash value: 3724264953

---------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 4 | 19515 (6)| 00:03:55 |
| 1 | SORT AGGREGATE | | 1 | 4 | | |
|* 2 | TABLE ACCESS FULL| T1 | 6000K| 22M| 19515 (6)| 00:03:55 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - filter(SYS_OP_MAP_NONNULL("COL1")=SYS_OP_MAP_NONNULL("COL2"))


Statistics
----------------------------------------------------------
1 recursive calls
0 db block gets
68456 consistent gets
0 physical reads
0 redo size
346 bytes sent via SQL*Net to client
364 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed

SQL>
SQL> select count(*) from t1 where col1 = col2 or (col1 is null and col2 is null);

Elapsed: 00:00:02.30

Execution Plan
----------------------------------------------------------
Plan hash value: 3724264953

---------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 4 | 18865 (2)| 00:03:47 |
| 1 | SORT AGGREGATE | | 1 | 4 | | |
|* 2 | TABLE ACCESS FULL| T1 | 13M| 49M| 18865 (2)| 00:03:47 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - filter("COL1" IS NULL AND "COL2" IS NULL OR "COL1"="COL2")


Statistics
----------------------------------------------------------
1 recursive calls
0 db block gets
68456 consistent gets
0 physical reads
0 redo size
346 bytes sent via SQL*Net to client
364 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed
Surprisingly, in my tests SYS_OP_MAP_NONNULL always performed worst, and the DECODE expression performed pretty close to the more verbose SQL expression - depending on the data set it was sometimes even faster.

So using SYS_OP_MAP_NONNULL, and in particular the implicit rewrite, might not be such a good idea after all. And now you know how the automatic rewrite can be disabled if desired.

New Version Of XPLAN_ASH Utility

Tue, 2016-06-28 18:23
A new version 4.23 of the XPLAN_ASH utility is available for download.

As usual the latest version can be downloaded here.

This version comes only with minor changes, see the change log below.

Here are the notes from the change log:

- Finally corrected the very old and wrong description of "wait times" in the script comments, where it was talking about "in-flight" wait events but that is not correct. ASH performs a "fix-up" of the last 255 samples or so and updates them with the time waited, so these wait events are not "in-flight"

- Removed some of the clean up code added in 4.22 to the beginning of the script, because it doesn't really help much but spooled script output always contained these error messages about non-existent column definitions being cleared

- The "Concurrent I/O" sections will now also be skipped in LIMITED_ASH mode

- Some more fixes to the I/O figures in the "Activity Timeline based on ASH" - the spreading introduced in 4.22 needed some further refinement (see 4.22 change log for more details)

Combining Features - Wrong Results With Scalar Subquery Caching

Wed, 2016-06-08 15:39
Quite often you can get into trouble with Oracle when you start combining different features.

In this case at one of my clients it was the combination of user-defined PL/SQL functions that can raise exceptions (think of a currency conversion function where a non-existent currency code gets passed in), DML error logging, and the attempt to improve performance by wrapping the PL/SQL function call in a scalar subquery to benefit from the built-in scalar subquery caching feature of the SQL runtime engine.
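
To illustrate the pattern (using a hypothetical conversion function CONVERT_AMOUNT and table T, not the actual client code), the idea is simply to wrap the function call in a scalar subquery so the SQL engine can cache the result per distinct input value:

-- plain call: the function is potentially executed once per row
select t.amount * convert_amount(t.currency_code) as conv_amount from t;

-- wrapped in a scalar subquery: the result can be cached per distinct currency code
select t.amount * (select convert_amount(t.currency_code) from dual) as conv_amount from t;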

As long as the scalar subquery wasn't used everything worked as expected, but after adding the scalar subquery it became obvious after some while that wrong results occurred. In this particular case it meant that rows that should have been rejected and written to the error logging table, due to the exception raised in the user-defined PL/SQL function, suddenly showed up in the target table. What was even more worrying, they included a converted amount, obviously using some arbitrary conversion rate, although the PL/SQL function didn't actually return any value but raised an exception for unknown currency codes.

The interesting point here is that everything works as expected if the error logging feature doesn't get used, or if the scalar subquery doesn't get used.

In case of no error logging the whole statement will roll back if an exception occurs, and that holds true and works even with scalar subquery caching in place.

Without scalar subquery caching the correct result gets produced, all rows that are supposed to be rejected are written to the error logging table and don't show up in the target table.

And if no exception gets raised (only valid and existing currency codes get used), again the results are correct in the target table even when using scalar subquery caching - so the caching of the values in principle works correctly.

Only when combining those features and dealing with invalid data the problem shows up, and as it can be demonstrated from a simple test case, even then it only occurs under certain circumstances.

Here is a simple test case for reproducing the issue:

set echo on timing on time on

drop table t_source purge;

drop table t_dest purge;

drop table ERR$_T_DEST purge;

drop package test_plsql_caching_with_except;

create table t_source
as
select
rownum as id
, cast('C' || to_char(round(trunc(rownum / 100000)) + 1, 'FM00') as varchar2(3)) as currency_code_clust
, cast('C' || to_char(mod(rownum, 10) + 1, 'FM00') as varchar2(3)) as currency_code_scat
, 100 as amount
from
(select /*+ cardinality(1e3) */ null from dual connect by level <= 1e3) gen1
, (select /*+ cardinality(1e3) */ null from dual connect by level <= 1e3) gen2
;

exec dbms_stats.gather_table_stats(null, 't_source')

create or replace package test_plsql_caching_with_except
as
function test_plsql_caching_with_except(s_currency_code in varchar2, s_do_excpt_on in varchar2 default 'C10') return number parallel_enable;
end;
/

create or replace package body test_plsql_caching_with_except
as
function test_plsql_caching_with_except(s_currency_code in varchar2, s_do_excpt_on in varchar2 default 'C10') return number parallel_enable
is
begin
if s_currency_code = s_do_excpt_on then
raise_application_error(-20001, 'exception encountered');
end if;
--
return to_number(substr(s_currency_code, 2));
end;
end;
/

create table t_dest (id number(*, 0), currency_code varchar2(3), org_amount number, conv_amount number);

exec DBMS_ERRLOG.create_error_log (dml_table_name => 't_dest')

truncate table t_dest;

truncate table ERR$_T_DEST;

alter session enable parallel dml;

-- Using "currency_code_scat" as function parameter gives correct result in T_DEST
-- But column CONV_AMOUNT in ERR$_T_DEST shows some function result that shouldn't be there since the function raised an exception (reproduces in 11.2.0.4 but not 12.1.0.2)
--
-- define ccy_code = currency_code_scat

-- Using "currency_code_clust" as function parameter gives wrong result in T_DEST - rows that should go to ERR$_T_DEST
-- due to exception being raised in PL/SQL function end up in T_DEST (with some "cached" function result applied)
-- There shouldn't be any rows in T_DEST with "currency_code = 'C10'"
-- Applies to both serial and Parallel Execution, conventional and direct path insert
define ccy_code = currency_code_clust

-- Use commented, non-cached function call to see correct results
insert /*+
          append
          no_parallel
          -- parallel(n)
       */
into t_dest (id, currency_code, org_amount, conv_amount)
select id, &ccy_code, amount,
       --test_plsql_caching_with_except.test_plsql_caching_with_except(&ccy_code) * amount
       (select test_plsql_caching_with_except.test_plsql_caching_with_except(&ccy_code) * amount from dual)
from t_source
log errors
reject limit unlimited;

commit;

-- This is supposed to be 900K
select count(*) from t_dest;

-- This is supposed to be 0
select count(*) from t_dest where currency_code = 'C10';

-- This is supposed to be 100K
select count(*) from ERR$_T_DEST;

-- This is supposed to be 0
select count(*) from ERR$_T_DEST where conv_amount is not null;

The second parameter to the PL/SQL function determines for which code an exception should be raised (or whether one is raised at all), and hence allows reproducing different scenarios. By default it will raise an exception for code "C10", which represents 10% (100K rows) of the data in T_SOURCE. So 900K rows should end up in the destination table and those 100K rows having code "C10" should go into the error logging table.

Running the test case without the scalar subquery gets the correct result:

SQL> -- This is supposed to be 900K
SQL> select count(*) from t_dest;

COUNT(*)
----------
900000

Elapsed: 00:00:00.70
SQL>
SQL> -- This is supposed to be 0
SQL> select count(*) from t_dest where currency_code = 'C10';

COUNT(*)
----------
0

Elapsed: 00:00:00.21
SQL>
SQL> -- This is supposed to be 100K
SQL> select count(*) from ERR$_T_DEST;

COUNT(*)
----------
100000

Elapsed: 00:00:00.70
SQL>
SQL> -- This is supposed to be 0
SQL> select count(*) from ERR$_T_DEST where conv_amount is not null;

COUNT(*)
----------
0

Elapsed: 00:00:00.04
SQL>

Use the scalar subquery, and suddenly, instead of 100K rows that should be rejected, only a single one gets rejected and 99,999 make it into the target table with some converted amount:

SQL> -- This is supposed to be 900K
SQL> select count(*) from t_dest;

COUNT(*)
----------
999999

Elapsed: 00:00:00.23
SQL>
SQL> -- This is supposed to be 0
SQL> select count(*) from t_dest where currency_code = 'C10';

COUNT(*)
----------
99999

Elapsed: 00:00:00.20
SQL>
SQL> -- This is supposed to be 100K
SQL> select count(*) from ERR$_T_DEST;

COUNT(*)
----------
1

Elapsed: 00:00:00.00
SQL>
SQL> -- This is supposed to be 0
SQL> select count(*) from ERR$_T_DEST where conv_amount is not null;

COUNT(*)
----------
1

Elapsed: 00:00:00.01
SQL>
Now the critical point is the order of the column values in the column passed into the PL/SQL function. The problem can only be reproduced when using the "clustered" variant of the currency codes (column CURRENCY_CODE_CLUST in T_SOURCE). Switch to the "scattered" variant (CURRENCY_CODE_SCAT), and the (almost) correct result gets produced, at least in the target table. I say "almost" because looking closer at the rows written to the error logging table it becomes obvious that they contain a converted amount that definitely shouldn't be there - so the wrong result error already shows up here, since the row was written to the error logging table due to the PL/SQL function raising an exception, hence the converted amount should be NULL. This "wrong converted amount in the error logging table" no longer occurs in 12c (12.1.0.2), so something has changed but the problem of wrong results (rows with errors / exceptions go into the target table) is still reproducible in 12c.

The problem seems to be related to the way the scalar subquery caching works when the same value comes again - Jonathan Lewis described this feature in "Cost Based Oracle" a long time ago. There seems to be a special optimisation that avoids the cache lookup in case the same value as before is used again, like a "single value cache" on top of the actual cache. And it looks like it is this special treatment that doesn't correctly handle the case of an exception being raised in the PL/SQL function.

The test case also allows checking whether other caching mechanisms suffer from the same problem - declaring the function "deterministic" (which it isn't really in the real-life case and hence shouldn't be used, since it opens up other possibilities like building a Materialized View or a function-based index based on the PL/SQL function that could then again lead to wrong results) or making use of the PL/SQL function result cache. They don't - both produce the correct results under all scenarios tested.
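
For reference, a minimal sketch of how the two alternative declarations could look for such a conversion function (hypothetical standalone functions for illustration, not part of the test case above):

-- DETERMINISTIC variant (not really appropriate here, since real conversion rates can change)
create or replace function convert_amount_determ(s_currency_code in varchar2, s_do_excpt_on in varchar2 default 'C10') return number
deterministic
is
begin
  if s_currency_code = s_do_excpt_on then
    raise_application_error(-20001, 'exception encountered');
  end if;
  return to_number(substr(s_currency_code, 2));
end;
/

-- PL/SQL function result cache variant (11g and later)
create or replace function convert_amount_rc(s_currency_code in varchar2, s_do_excpt_on in varchar2 default 'C10') return number
result_cache
is
begin
  if s_currency_code = s_do_excpt_on then
    raise_application_error(-20001, 'exception encountered');
  end if;
  return to_number(substr(s_currency_code, 2));
end;
/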

I have a more complex variant of the test case that also allows spotting more clearly the critical difference between the "scalar subquery / deterministic" caching, which obviously happens in the SQL engine, and the PL/SQL function result cache, which is implemented in the PL/SQL engine: it counts the number of times the function code actually gets executed and how many times the PL/SQL code block got executed according to the Library Cache information.
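
A stripped-down sketch of that counting idea, using a hypothetical package EXEC_COUNTER that is not part of the original test case:

set serveroutput on

create or replace package exec_counter
as
  g_count number := 0;
  function conv(s_currency_code in varchar2) return number;
end;
/

create or replace package body exec_counter
as
  function conv(s_currency_code in varchar2) return number
  is
  begin
    -- count every actual execution of the function body
    g_count := g_count + 1;
    return to_number(substr(s_currency_code, 2));
  end;
end;
/

-- Plain function call: executed once per row of T_SOURCE
exec exec_counter.g_count := 0
select sum(exec_counter.conv(currency_code_scat)) from t_source;
exec dbms_output.put_line('function executions without caching: ' || exec_counter.g_count)

-- Wrapped in a scalar subquery: executed far less often thanks to the SQL engine caching
exec exec_counter.g_count := 0
select sum((select exec_counter.conv(currency_code_scat) from dual)) from t_source;
exec dbms_output.put_line('function executions with scalar subquery caching: ' || exec_counter.g_count)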

In this case here, where the overhead of the actual function call is very small but the function is called very frequently, the PL/SQL function result cache adds much more overhead than the SQL engine caching, because the context switch to PL/SQL still needs to be done for every function invocation, only to be resolved in the PL/SQL function result cache instead of actually executing the function code.

The "scalar subquery / deterministic" cache, in contrast, avoids the context switch to PL/SQL altogether and serves values from a cache maintained in the SQL engine.

So the smaller the overhead of the actual PL/SQL function code and the more frequently the function gets called, the more overhead the usage of the PL/SQL function result cache will produce in comparison to the SQL engine based caching mechanisms.

Of course the PL/SQL function result cache comes with other features that the others don't offer - like cross-session / cross-statement caching, which might make a big difference if a single function call causes a lot of overhead.

Oracle Database Cloud (DBaaS) Performance Consistency - Part 1

Sun, 2016-06-05 15:58
As Oracle ACE Director I got an extended trial license for Oracle's Cloud offerings, in particular the "Database as a Service" offering. As part of the (ongoing) evaluation I try to get an idea how consistent the performance of such a service is, which might be one of the concerns one might have when considering cloud offerings in general.

For my tests I've set up a 11.2.0.4 single instance database using "4 OCPUs" (Oracle CPUs) which ends up as an Oracle Linux 6 system showing 8 CPUs *and* 8 cores of type "Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz".

As edition for the database I've chosen the "Extreme Performance" Enterprise Edition, which also shows up in the version banner (note the difference to regular database installations - not sure whether this might break some applications that don't know this banner):

BANNER
-------------------------------------------------------------------------
Oracle Database 11g EE Extreme Perf Release 11.2.0.4.0 - 64bit Production

I personally find the names of the different available editions not really a good choice - as far as I understand, the different editions, although being called "Standard Edition", "Enterprise Edition", "Enterprise Edition - High Performance" and "Enterprise Edition - Extreme Performance", only differ in the number of available and usable options, like the Diagnostics and Tuning Pack license or the In-Memory Column Store (12c), but not in the basic performance profile, like CPU speed or I/O performance (although Oracle also has an Exadata Service Cloud offering for higher demands in terms of performance). Note that Oracle also offers a "High Memory" configuration that doubles the available memory for the same number of available CPU cores.
For evaluating the performance consistency I plan to run different flavors of tests on this test environment - the first and most simple one is a straight CPU burning PL/SQL loop spawned as many times as CPUs available (so eight times for this test system here). I deliberately disabled the database Resource Manager for this test.
For comparison I ran exactly the same test on a physical test system of my own to see how the two systems compare in terms of performance consistency. The physical test system is older and slower and only has four cores available, so the test script was only spawned four times in this case.
The loop executed n times concurrently was simply this, and this was kept running for several days to get also an idea if there are noticeable differences on different days of the week:

set echo on

define thread_id = &1;

declare
  n number;
begin
  loop
    n := 0;
    for i in 1..1000000000 loop
      n := n + 1;
    end loop;
    insert into timings(testtype, thread_id, ts) values ('PLSQL', &thread_id, systimestamp);
    commit;
  end loop;
end;
/
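
The per-iteration runtime and its deviation from the overall median can then be derived from the TIMINGS table, for example along these lines (a sketch only - the actual evaluation behind the graphs below may differ):

with iteration_times as (
  select
         thread_id
       , ts
         -- runtime of each iteration = gap to the previous timestamp of the same thread, in seconds
       , (cast(ts as date) - cast(lag(ts) over (partition by thread_id order by ts) as date)) * 86400 as duration_secs
  from
         timings
  where
         testtype = 'PLSQL'
)
select
       thread_id
     , count(duration_secs)                                             as runs
     , round(median(duration_secs))                                     as median_secs
     , round(100 * (max(duration_secs) / median(duration_secs) - 1), 2) as max_deviation_pct
from
       iteration_times
group by
       thread_id
order by
       thread_id;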

These were the results for the DBaaS service:
And these the results for the physical host:

The graph is supposed to show how many of the runs deviated how much from the overall median runtime: the Y axis represents the percentage of runs, and the X axis represents the deviation from the median runtime, so for example 0.5 on the X axis means 0.5 percent deviation from the median value.
Looking at the results the following becomes obvious:
- The physical host provides a slightly more consistent performance (as expected): 73% of its runs deviated 0.5 percent or less from the median runtime, whereas the DBaaS service reached "only" 64% - and it should be noted that on the physical system I tried to minimize any "noise" generated by other processes, whereas I didn't attempt to do the same for the DBaaS service, although in principle I could have, because you get root access for this kind of service and can do whatever you want. Interestingly, the physical host shows a much more significant number of runs deviating more than one percent from the median value, which is not the case for the DBaaS service. This also means that if I change the above analysis to the number of runs with 1 percent or less deviation from the median runtime, the DBaaS service (85%) actually is more consistent than my physical host (still 73%).
- The DBaaS service has a few more, and more extreme, outliers, but nothing really significant
- The overall performance consistency of the DBaaS service is pretty close to the physical host
The same graph on a per day basis (DBaaS service):
and physical host:

- The DBaaS service after a couple of days showed a very stable consistency pattern, only during the first four days the pattern was different:
First days:
Remaining days:
- For this test run no significant differences between different days could be spotted, except for the noticeable difference between the first few days and the remaining days
One interesting point is that in the DBaaS service the different threads showed consistent, but different runtimes, something that couldn't be seen on the physical host, where all threads showed similar runtimes. Whether this might have to do with some NUMA configuration or similar I can't tell, but it is quite obvious for this test run - the Y axis represents the median duration per run per day:
DBaas service:
 Physical host:
The next test will be a CPU bound SQL operation that performs logical I/Os only, so no physical I/O involved, and as the final test run I plan a physical I/O bound setup.

SOUG Training Day May 2016 in Switzerland

Wed, 2016-02-24 12:27
I will be one of the speakers at the first SOUG performance training day, together with Christian Antognini, Franck Pachot and Clemens Bleile.

The event will take place in May this year in Switzerland at two different locations / days (one in German language, one in French, except mine, which will be in English).

My presentations will be:

- Analyzing and Troubleshooting Oracle Parallel Execution

- Advanced Oracle Troubleshooting

Hope to see you there!

Update: Here is the link to the official landing page of the event on the SOUG website.

Below is the official announcement published by SOUG with more details:


"Block the date ! SOUG performance training day the 18th and 19th of May exclusively in Switzerland"

In 2016 the Swiss Oracle User Group (SOUG) will introduce a new kind of event called the "SOUG training day".

During a whole day several world class speakers will teach the participants on a particular topic.

This year, Christian Antognini (Trivadis), Randolf Geist, Franck Pachot (dbi services) and Clemens Bleile (dbi services) will help the participant to manage the performances of their Oracle databases. The following topics will be addressed and explained in depth :

Strategies for Keeping Object Statistics Up-to-Date
Next-Generation Oracle Database - New Performance Features
Analyzing and Troubleshooting Oracle Parallel Execution
Advanced Oracle Troubleshooting
Interpreting AWR Reports - Straight to the Goal
All About Table Locks: DML, DDL, Foreign Key, Online Operations in regards to performance

The SOUG invites you to block the following dates to be sure that you do not miss such an incredible opportunity to meet the cracks :

- 18th of May in Olten - presentations in German
- 19th of May in Geneva - presentations in French (Mr Randolf Geist will present 2h in english)

This training day will be proposed to very attractive conditions :
- CHF 500.- for SOUG members
- CHF 1000.- for non-SOUG members

Advanced Oracle Troubleshooting - One Day Seminar, Moscow

Sun, 2016-02-14 16:05
I got invited by Luxoft Training to deliver my one day seminar "Advanced Oracle Troubleshooting" in Moscow end of March.

More details about what I cover in this seminar can be found here.

If you're interested, Luxoft Training has set up a page in Russian where you can find more details - but note that the seminar language will be English.

Big Nodes, Concurrent Parallel Execution And High System/Kernel Time

Sat, 2016-02-06 17:47
The following is probably only relevant for customers that run Oracle on big servers with lots of cores in single instance mode (this specific problem here doesn't reproduce in a RAC environment, see below for an explanation why), like one of my clients that makes use of the Exadata Xn-8 servers, for example an X4-8 with 120 cores / 240 CPUs per node (but also reproduced on older and smaller boxes with 64 cores / 128 CPUs per node).

They recently came up with a re-write of a core piece of application functionality. Part of this code started the same code path for different data sets, potentially several times concurrently, ending up with many sessions making use of Parallel Execution. In addition a significant part of the queries used by this code made questionable use of Parallel Execution, in the sense that queries of very short duration used Parallel Execution, hence ending up with several Parallel Execution starts per second. You could see this pattern in the AWR reports like this, showing several "DFO trees" parallelized per second on average over an hour period:



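For reference, such a per-second rate can be derived from AWR with a query along these lines (a sketch, assuming a Diagnostics Pack license and the instance statistic name "DFO trees parallelized"; it ignores instance restarts between snapshots):

select
       sn.snap_id
     , sn.end_interval_time
       -- delta of the cumulative statistic divided by the snapshot interval in seconds
     , round((s.value - lag(s.value) over (order by sn.snap_id))
             / ((cast(sn.end_interval_time as date) - cast(sn.begin_interval_time as date)) * 86400), 2) as dfo_trees_per_sec
from
       dba_hist_sysstat  s
     , dba_hist_snapshot sn
where
       s.snap_id         = sn.snap_id
and    s.dbid            = sn.dbid
and    s.instance_number = sn.instance_number
and    s.stat_name       = 'DFO trees parallelized'
order by
       sn.snap_id;
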
When the new code was tested with production-like data volumes and patterns, in the beginning the CPU profile of such a big node (running in single instance mode) looked like this, when nothing else was running on that box:



As you can see, the node was completely CPU bound, spending most of the time in System/Kernel time. The AWR reports showed some pretty unusual PX wait events as significant:



"PX Deq: Slave Session Stats" shouldn't be a relevant wait event since it is about the PX slaves at the end of a PX execution passing an array of session statistics to the PX coordinator for aggregating the statistics on coordinator level. So obviously something was slowing down this PX communication significantly (and the excessive usage of Parallel Execution was required to see this happen).

Also some of the more complex Parallel Execution queries performing many joins and ending up with a significant number of data redistributions ran like in slow motion, although claiming to spend 100% of their time on CPU, but according to Active Session History almost 90% of that time was spent on the redistribution operations:

SQL statement execution ASH Summary
-----------------------------------------------

              |               |               |
              |PX SEND/RECEIVE|PX SEND/RECEIVE|
PERCENTAGE_CPU|        PERCENT|    CPU PERCENT|
--------------|---------------|---------------|
            98|             86|             87|


Running the same query with the same execution plan on the same data and the same box during idle times showed an almost 20 times better performance, and less than 40% of the time spent on redistribution:

SQL statement execution ASH Summary
-----------------------------------------------

              |               |               |
              |PX SEND/RECEIVE|PX SEND/RECEIVE|
PERCENTAGE_CPU|        PERCENT|    CPU PERCENT|
--------------|---------------|---------------|
            96|             38|             37|

So it looked like those queries ran into some kind of contention that wasn't instrumented in Oracle but happened outside on O/S level, showing up as CPU Kernel time - similar to what could be seen in previous versions of Oracle when spinning on mutexes.
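
The ASH based summaries shown above can be approximated with a query along these lines (a simplified sketch, not the actual script used; it requires a Diagnostics Pack license, and SQL_ID / SQL_EXEC_ID need to be filled in for the execution of interest):

select
       round(100 * sum(case when session_state = 'ON CPU' then 1 else 0 end) / count(*))                        as percentage_cpu
     , round(100 * sum(case when sql_plan_operation in ('PX SEND', 'PX RECEIVE') then 1 else 0 end) / count(*)) as px_send_receive_percent
     , round(100 * sum(case when session_state = 'ON CPU'
                             and sql_plan_operation in ('PX SEND', 'PX RECEIVE') then 1 else 0 end)
             / nullif(sum(case when session_state = 'ON CPU' then 1 else 0 end), 0))                            as px_send_receive_cpu_percent
from
       v$active_session_history
where
       sql_id      = '&sql_id'
and    sql_exec_id = &sql_exec_id;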

Reducing the excessive usage of Parallel Execution showed a significant reduction in CPU time, but still the high System/Kernel time was rather questionable:



So the big question was: where exactly was that time spent in the kernel - knowing this might give further clues.

Analysis
Running "perf top" on the node during such a run showed this profile:

  PerfTop:  129074 irqs/sec  kernel:76.4%  exact:  0.0% [1000Hz cycles],  (all, 128 CPUs)
-------------------------------------------------------------------------------------------------------------------

             samples  pcnt function                 DSO
             _______ _____ ________________________ ___________________________________________________________

          1889395.00 67.8% __ticket_spin_lock       /usr/lib/debug/lib/modules/2.6.39-400.128.17.el5uek/vmlinux
            27746.00  1.0% ktime_get                /usr/lib/debug/lib/modules/2.6.39-400.128.17.el5uek/vmlinux
            24622.00  0.9% weighted_cpuload         /usr/lib/debug/lib/modules/2.6.39-400.128.17.el5uek/vmlinux
            23169.00  0.8% find_busiest_group       /usr/lib/debug/lib/modules/2.6.39-400.128.17.el5uek/vmlinux
            17243.00  0.6% pfrfd1_init_locals       /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
            16961.00  0.6% sxorchk                  /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
            15434.00  0.6% kafger                   /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
            11531.00  0.4% try_atomic_semop         /usr/lib/debug/lib/modules/2.6.39-400.128.17.el5uek/vmlinux
            11006.00  0.4% __intel_new_memcpy       /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
            10557.00  0.4% kaf_typed_stuff          /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
            10380.00  0.4% idle_cpu                 /usr/lib/debug/lib/modules/2.6.39-400.128.17.el5uek/vmlinux
             9977.00  0.4% kxfqfprFastPackRow       /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
             9070.00  0.3% pfrinstr_INHFA1          /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
             8905.00  0.3% kcbgtcr                  /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
             8757.00  0.3% ktime_get_update_offsets /usr/lib/debug/lib/modules/2.6.39-400.128.17.el5uek/vmlinux
             8641.00  0.3% kgxSharedExamine         /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
             7487.00  0.3% update_queue             /usr/lib/debug/lib/modules/2.6.39-400.128.17.el5uek/vmlinux
             7233.00  0.3% kxhrPack                 /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
             6809.00  0.2% rworofprFastUnpackRow    /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
             6581.00  0.2% ksliwat                  /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
             6242.00  0.2% kdiss_fetch              /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
             6126.00  0.2% audit_filter_syscall     /usr/lib/debug/lib/modules/2.6.39-400.128.17.el5uek/vmlinux
             5860.00  0.2% cpumask_next_and         /usr/lib/debug/lib/modules/2.6.39-400.128.17.el5uek/vmlinux
             5618.00  0.2% kaf4reasrp1km            /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
             5482.00  0.2% kaf4reasrp0km            /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
             5314.00  0.2% kopp2upic                /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
             5129.00  0.2% find_next_bit            /usr/lib/debug/lib/modules/2.6.39-400.128.17.el5uek/vmlinux
             4991.00  0.2% kdstf01001000000km       /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
             4842.00  0.2% ktrgcm                   /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
             4762.00  0.2% evadcd                   /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
             4580.00  0.2% kdiss_mf_sc              /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle

Running "perf" on a number of Parallel Slaves being busy on CPU showed this profile:

     0.36%     ora_xxxx  [kernel.kallsyms]             [k] 

__ticket_spin_lock
               |
               --- __ticket_spin_lock
                  |          
                  |--99.98%-- _raw_spin_lock
                  |          |          
                  |          |--100.00%-- ipc_lock
                  |          |          ipc_lock_check
                  |          |          |          
                  |          |          |--99.83%-- semctl_main
                  |          |          |          sys_semctl
                  |          |          |          system_call
                  |          |          |          __semctl
                  |          |          |          |          
                  |          |          |           --100.00%-- skgpwpost
                  |          |          |                     kslpsprns
                  |          |          |                     kskpthr
                  |          |          |                     ksl_post_proc
                  |          |          |                     kxfprienq
                  |          |          |                     kxfpqrenq
                  |          |          |                     |          
                  |          |          |                     |--98.41%-- kxfqeqb
                  |          |          |                     |          kxfqfprFastPackRow
                  |          |          |                     |          kxfqenq
                  |          |          |                     |          qertqoRop
                  |          |          |                     |          kdstf01001010000100km
                  |          |          |                     |          kdsttgr
                  |          |          |                     |          qertbFetch
                  |          |          |                     |          qergiFetch
                  |          |          |                     |          rwsfcd
                  |          |          |                     |          qertqoFetch
                  |          |          |                     |          qerpxSlaveFetch
                  |          |          |                     |          qerpxFetch
                  |          |          |                     |          opiexe
                  |          |          |                     |          kpoal8

Running "strace" on those Parallel Slaves showed this:

.
.
.
semctl(1347842, 397, SETVAL, 0x1)       = 0
semctl(1347842, 388, SETVAL, 0x1)       = 0
semctl(1347842, 347, SETVAL, 0x1)       = 0
semctl(1347842, 394, SETVAL, 0x1)       = 0
semctl(1347842, 393, SETVAL, 0x1)       = 0
semctl(1347842, 392, SETVAL, 0x1)       = 0
semctl(1347842, 383, SETVAL, 0x1)       = 0
semctl(1347842, 406, SETVAL, 0x1)       = 0
semctl(1347842, 389, SETVAL, 0x1)       = 0
semctl(1347842, 380, SETVAL, 0x1)       = 0
semctl(1347842, 395, SETVAL, 0x1)       = 0
semctl(1347842, 386, SETVAL, 0x1)       = 0
semctl(1347842, 385, SETVAL, 0x1)       = 0
semctl(1347842, 384, SETVAL, 0x1)       = 0
semctl(1347842, 375, SETVAL, 0x1)       = 0
semctl(1347842, 398, SETVAL, 0x1)       = 0
semctl(1347842, 381, SETVAL, 0x1)       = 0
semctl(1347842, 372, SETVAL, 0x1)       = 0
semctl(1347842, 387, SETVAL, 0x1)       = 0
semctl(1347842, 378, SETVAL, 0x1)       = 0
semctl(1347842, 377, SETVAL, 0x1)       = 0
semctl(1347842, 376, SETVAL, 0x1)       = 0
semctl(1347842, 367, SETVAL, 0x1)       = 0
semctl(1347842, 390, SETVAL, 0x1)       = 0
semctl(1347842, 373, SETVAL, 0x1)       = 0
semctl(1347842, 332, SETVAL, 0x1)       = 0
semctl(1347842, 379, SETVAL, 0x1)       = 0
semctl(1347842, 346, SETVAL, 0x1)       = 0
semctl(1347842, 369, SETVAL, 0x1)       = 0
semctl(1347842, 368, SETVAL, 0x1)       = 0
semctl(1347842, 359, SETVAL, 0x1)       = 0
.
.
.

So the conclusion was: A lot of CPU time is spent spinning on the "spin lock" (critical code section) - caused by calls to "semctl" (semaphores), which are part of the PX code path and come from "ipc_lock"->"raw_lock". "strace" shows that all of the calls to "semctl" make use of the same semaphore set (first parameter), which explains the contention on that particular semaphore set (indicating that the locking granule is the semaphore set, not the semaphore).

Solution
Based on the "perf" results an Oracle engineer found a suitable, unfortunately unpublished and closed bug from 2013 for 12.1.0.2 that comes up with three different ways how to address the problem:

- Run with "cluster_database" = true: This will take a different code path which simply reduces the number of semaphore calls by two orders of magnitude. We tested this approach and it showed immediate relief on kernel time - that is the explanation why in a RAC environment this specific issue doesn't reproduce.

- Run with different "kernel.sem" settings: The Exadata boxes came with the following predefined semaphore configuration:

kernel.sem = 2048 262144 256 256

"ipcs" showed the following semaphore arrays with this configuration when starting the Oracle instance:

------ Semaphore Arrays --------
key        semid      owner     perms      nsems    
.
.
.
0xd87a8934 12941057   oracle    640        1502     
0xd87a8935 12973826   oracle    640        1502     
0xd87a8936 12006595   oracle    640        1502    

By reducing the number of semaphores per set and increasing the number of sets, like this:

kernel.sem = 100 262144 256 4096

the following "ipcs" output could be seen:

------ Semaphore Arrays --------
key        semid      owner     perms      nsems    
.
.
.
0xd87a8934 13137665   oracle    640        93       
0xd87a8935 13170434   oracle    640        93       
0xd87a8936 13203203   oracle    640        93       
0xd87a8937 13235972   oracle    640        93       
0xd87a8938 13268741   oracle    640        93       
0xd87a8939 13301510   oracle    640        93       
0xd87a893a 13334279   oracle    640        93       
0xd87a893b 13367048   oracle    640        93       
0xd87a893c 13399817   oracle    640        93       
0xd87a893d 13432586   oracle    640        93       
0xd87a893e 13465355   oracle    640        93       
0xd87a893f 13498124   oracle    640        93       
0xd87a8940 13530893   oracle    640        93       
0xd87a8941 13563662   oracle    640        93       
0xd87a8942 13596431   oracle    640        93       
0xd87a8943 13629200   oracle    640        93       
0xd87a8944 13661969   oracle    640        93       
0xd87a8945 13694738   oracle    640        93       
0xd87a8946 13727507   oracle    640        93       
0xd87a8947 13760276   oracle    640        93       
0xd87a8948 13793045   oracle    640        93       
0xd87a8949 13825814   oracle    640        93       
0xd87a894a 13858583   oracle    640        93       
0xd87a894b 13891352   oracle    640        93       
0xd87a894c 13924121   oracle    640        93       
0xd87a894d 13956890   oracle    640        93       
0xd87a894e 13989659   oracle    640        93       
0xd87a894f 14022428   oracle    640        93       
0xd87a8950 14055197   oracle    640        93       
0xd87a8951 14087966   oracle    640        93       
0xd87a8952 14120735   oracle    640        93       
0xd87a8953 14153504   oracle    640        93       
0xd87a8954 14186273   oracle    640        93       
0xd87a8955 14219042   oracle    640        93


So Oracle now allocated a lot more sets with fewer semaphores per set. We tested this configuration instead of using "cluster_database = TRUE" and got the same low kernel CPU times.

- The bug comes up with a third option how to fix this, which has the advantage that the host configuration doesn't need to be changed and the configuration can be done per instance: There is an undocumented parameter "_sem_per_sem_id" that defines the upper limit of semaphores to allocate per set. By setting this parameter to some comparable value like 100 or 128 the net result ought to be the same - Oracle allocates more sets with fewer semaphores per set - but we haven't tested this option.
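
A minimal sketch of how that third option could be configured - again untested here and, being an undocumented parameter, something to confirm with Oracle Support first:

-- Limit the number of semaphores Oracle requests per semaphore set
-- (takes effect at the next instance startup)
alter system set "_sem_per_sem_id" = 128 scope = spfile;

shutdown immediate
startup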

Conclusion
So the bottom line was this: Certain usage patterns of the Oracle instance lead to contention on spin locks on Linux O/S level if Oracle runs in single instance mode and uses the so far recommended semaphore settings, which result in all semaphore calls going for the same semaphore set. By having Oracle allocate more semaphore sets the calls were spread over more sets, hence significantly reducing the contention.

There is probably some internal note available at Oracle that indicates that the default semaphore settings recommended for big nodes are not optimal for running single instance mode under certain circumstances, but I don't know if there is a definitive, official guide available yet.

This is the CPU profile of exactly the same test workload as before using the changed "kernel.sem" settings:



Also in the AWR report the unusual PX related wait events went away and performance improved significantly, in particular also for those complex queries mentioned above.

DML Operations On Partitioned Tables Can Restart On Invalidation

Sun, 2016-01-17 13:12
It's probably not that well known that Oracle can actually roll back / re-start the execution of a DML statement should the cursor become invalidated. By rollback / re-start I mean that Oracle actually performs a statement level rollback (so any modification already performed by that statement until that point gets rolled back), performs another optimization phase of the statement on re-start (due to the invalidation) and begins the execution of the statement from scratch. Note that this can happen multiple times - actually it's possible to end up in a kind of infinite loop when this happens, leading to statements that can run for very, very long (I've seen statements on Production environments executing for several days although a single execution would only take minutes).

The pre-requisites to meet for this to happen are not that complex or exotic:

- The target table to manipulate needs to be partitioned

- The cursor currently executing gets invalidated - either by running DDL (typically think of partition related operations) - or simply by gathering statistics on one of the objects involved in the statement

- The DML statement hasn't yet touched one of the partitions of the target table but attempts to do so after the cursor got invalidated

When the last condition is met, the statement performs a rollback, and since it got invalidated - which is one of the conditions to be met - another optimization phase happens, meaning that it's also possible to get different execution plans for the different execution attempts. When the execution plan is ready the execution begins from scratch.

According to my tests the issue described here applies to both conventional and direct-path inserts, merge statements (insert / update / delete) as well as serial and parallel execution. I haven't explicitly tested UPDATE and DELETE statements, but the assumption is that they are affected, too.

The behaviour is documented in the following note on MOS: "Insert Statement On Partitioned Tables Is RE-Started After Invalidation (Doc ID 1462003.1)" which links to Bug "14102209 : INSERT STATEMENT' IS RESTARTING BY ITSELF AFTER INVALIDATION" where you can also find some more comments on this behaviour. The issue seems to be that Oracle at that point is no longer sure if the partition information compiled into the cursor for the partitioned target table is still correct or not (and internally raises and catches a corresponding error, like "ORA-14403: Cursor invalidation detected after getting DML partition lock", leading to the re-try), so it needs to refresh that information, hence the re-optimization and re-start of the cursor.

Note that this also means that the DML statement might already have performed modifications to other partitions and then, after being invalidated, attempts to modify another partition it hasn't touched yet - it just takes an attempt to modify a partition not yet touched by that statement.

It's also kind of nasty that the statement keeps running the potentially lengthy query part after being invalidated, only to find out that it needs to re-start when the first row is about to be applied to a target table partition not yet touched.

Note that applications typically run into this problem, when they behave like the following:

- There are longer running DML statements that typically take several seconds / minutes until they attempt to actually perform a modification to a partitioned target table

- They either use DBMS_STATS to gather stats on one of the involved tables, typically using NO_INVALIDATE=>FALSE, which leads to an immediate invalidation of all affected cursors

- And/Or they perform partition related operations on one of the tables involved, like truncating, creating or exchanging partitions. Note that it is important to point out that it doesn't matter which object gets DDL / stats applied, so it's not limited to activity on the partitioned target table being modified - any object involved in the query can cause the cursor invalidation

In principle this is another variation of the general theme "Don't mix concurrent DDL with DML/queries on the same objects". Doing so is something that leads to all kinds of side effects, and the way the Oracle engine is designed means that it doesn't cope very well with doing so.

Here is a simple test case for reproducing the issue, using INSERTs in this case here (either via INSERT or MERGE statement):

create table t_target (
id number(*, 0) not null,
pkey number(*, 0) not null,
filler varchar2(500)
)
--segment creation immediate
partition by range (pkey) --interval (1)
(
partition pkey_0 values less than (1)
, partition pkey_1 values less than (2)
, partition pkey_2 values less than (3)
, partition pkey_3 values less than (4)
);

create table t_source
compress
as
select 1 as id, rpad('x', 100) as filler
from
(select /*+ cardinality(1e3) */ null from dual connect by level <= 1e3),
(select /*+ cardinality(1e0) */ null from dual connect by level <= 1e0)
union all
select 1 as id, rpad('y', 100) as filler from dual;

-- Run this again once the DML statement below got started
exec dbms_stats.gather_table_stats(null, 't_source', no_invalidate=>false)

exec dbms_stats.gather_table_stats(null, 't_target', no_invalidate=>false)

----------------------------------------------------------------------------------------------------------------------------------
-- INSERT example --
-- Run above DBMS_STATS calls or any other command that invalidates the cursor during execution to force re-start of the cursor --
----------------------------------------------------------------------------------------------------------------------------------

set echo on timing on time on

-- alter session set tracefile_identifier = 'insert_restart';

-- alter session set events '10046 trace name context forever, level 12';

-- exec sys.dbms_monitor.session_trace_enable(waits => true, binds => true/*, plan_stat => 'all_executions'*/)

insert /* append */ into t_target (id, pkey, filler)
select * from (
select /*+
use_hash(a b)
no_eliminate_oby
*/
a.id, 1 as pkey, a.filler
from t_source a, t_source b
where a.id = b.id
and (
regexp_replace(a.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'c') != regexp_replace(b.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'c')
--or regexp_replace(a.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'i') != regexp_replace(b.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'i')
--or regexp_replace(a.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'm') != regexp_replace(b.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'm')
or (b.filler = rpad('y', 100) and a.filler = rpad('y', 100))
)
order by a.id
)
union all
select * from (
select /*+
use_hash(a b)
no_eliminate_oby
*/
a.id, 2 as pkey, a.filler
from t_source a, t_source b
where a.id = b.id
and (
regexp_replace(a.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'c') != regexp_replace(b.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'c')
--or regexp_replace(a.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'i') != regexp_replace(b.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'i')
--or regexp_replace(a.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'm') != regexp_replace(b.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'm')
or (b.filler = rpad('y', 100) and a.filler = rpad('y', 100))
)
order by a.id
)
union all
select * from (
select /*+
use_hash(a b)
no_eliminate_oby
*/
a.id, 3 as pkey, a.filler
from t_source a, t_source b
where a.id = b.id
and (
regexp_replace(a.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'c') != regexp_replace(b.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'c')
--or regexp_replace(a.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'i') != regexp_replace(b.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'i')
--or regexp_replace(a.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'm') != regexp_replace(b.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'm')
or (b.filler = rpad('y', 100) and a.filler = rpad('y', 100))
)
order by a.id
)
;

-- exec sys.dbms_monitor.session_trace_disable

----------------------------------------------------------------------------------------------------------------------------------
-- MERGE example --
-- Run above DBMS_STATS calls or any other command that invalidates the cursor during execution to force re-start of the cursor --
----------------------------------------------------------------------------------------------------------------------------------

set echo on timing on time on

merge /* append */ into t_target t
using (
select * from (
select /*+
use_hash(a b)
no_eliminate_oby
*/
a.id, 1 as pkey, a.filler
from t_source a, t_source b
where a.id = b.id
and (
regexp_replace(a.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'c') != regexp_replace(b.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'c')
--or regexp_replace(a.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'i') != regexp_replace(b.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'i')
--or regexp_replace(a.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'm') != regexp_replace(b.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'm')
or (b.filler = rpad('y', 100) and a.filler = rpad('y', 100))
)
order by a.id
)
union all
select * from (
select /*+
use_hash(a b)
no_eliminate_oby
*/
a.id, 2 as pkey, a.filler
from t_source a, t_source b
where a.id = b.id
and (
regexp_replace(a.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'c') != regexp_replace(b.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'c')
--or regexp_replace(a.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'i') != regexp_replace(b.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'i')
--or regexp_replace(a.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'm') != regexp_replace(b.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'm')
or (b.filler = rpad('y', 100) and a.filler = rpad('y', 100))
)
order by a.id
)
union all
select * from (
select /*+
use_hash(a b)
no_eliminate_oby
*/
a.id, 3 as pkey, a.filler
from t_source a, t_source b
where a.id = b.id
and (
regexp_replace(a.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'c') != regexp_replace(b.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'c')
--or regexp_replace(a.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'i') != regexp_replace(b.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'i')
--or regexp_replace(a.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'm') != regexp_replace(b.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'm')
or (b.filler = rpad('y', 100) and a.filler = rpad('y', 100))
)
order by a.id
)
) s
on (s.id = t.id)
when not matched then
insert (id, pkey, filler) values (s.id, s.pkey, s.filler)
;

The idea of the test case is to maximise the time until each UNION ALL branch produces data to insert by performing an inefficient HASH JOIN (that in fact generates a Cartesian product and needs to apply a costly REGEXP filter on that huge intermediate result) and forcing a sort on the join result, so rows will only be handed over to the parent operations after all rows were processed in the join operation - and each branch generates data for a different partition of the target table. Typically it should take several seconds per branch to execute (if you need more time just un-comment the additional REGEXP_REPLACE filters), so you should have plenty of time to cause the invalidation from another session.

This means during the execution of each branch invalidating the cursor (for example by executing either of the two DBMS_STATS calls on the source or target table using NO_INVALIDATE=>FALSE) will lead to a re-start of the statement at the next attempt to write into a new target partition, possibly rolling back rows already inserted into other partitions.

Diagnostics
If you run the provided INSERT or MERGE statement on newer versions of Oracle that include the SQL_EXEC_START and SQL_EXEC_ID in V$ACTIVE_SESSION_HISTORY (or V$SESSION for that matter) and invalidate the cursor during execution and before a partition of the target table gets inserted for the first time then you can see that these entries change as the statement restarts.

In such cases the INVALIDATIONS and LOADS increase in V$SQL accordingly and the OBJECT_STATUS changes from INVALID_UNAUTH to VALID again with each re-start attempt. In newer versions where you can configure the "plan_stat" information for SQL trace to "all_executions" you'll find STAT lines for each execution attempt dumped to the trace file, but only a single final EXEC line, where the elapsed time covers all execution attempts.
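
A simple way to watch these indicators from another session could look like this (a sketch - replace &sql_id with the SQL_ID of the DML statement in question):

-- Invalidation / reload counters and status of the DML cursor
select sql_id, child_number, invalidations, loads, object_status
from   v$sql
where  sql_id = '&sql_id';

-- Execution attempts become visible as changing SQL_EXEC_START / SQL_EXEC_ID (11g and later)
select sid, sql_id, sql_exec_start, sql_exec_id
from   v$session
where  sql_id = '&sql_id';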

The oldest version I've tested was 10.2.0.4, and that one already showed the re-start behaviour, although I would be inclined to think that this wasn't the case with older versions. So if anybody still runs versions older than 10.2.0.4 I would be interested to hear whether the behaviour reproduces or not.

Video Tutorial: XPLAN_ASH Active Session History - Part 7

Mon, 2016-01-04 03:00
The next part of the video tutorial explaining the XPLAN_ASH Active Session History functionality continues the actual walk-through of the script output.

More parts to follow.

12c Parallel Execution New Features: Parallel FILTER Subquery Evaluation - Part 3: The Optimizer And Distribution Methods

Wed, 2015-12-30 13:45
As mentioned in the first and second part of this instalment the different available distribution methods of the new parallel FILTER are selected automatically by the optimizer - in this last post of this series I want to focus on that optimizer behaviour.

It looks like there are two new optimizer related parameters that control the behaviour of the new feature: "_px_filter_parallelized" is the overall switch to enable/disable the new parallel filter capability - and defaults to "true" in 12c, and "_px_filter_skew_handling" influences how the optimizer determines the distribution methods - the parameter naming suggests that it somehow has to do with some kind of "skew" - note that the internal parameter that handles the new automatic join skew handling is called "_px_join_skew_handling" - rather similar in name.
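
The current settings of the two parameters can be checked with the usual X$ based query when connected as SYSDBA (a sketch; the X$ tables are not accessible to regular users):

select
       a.ksppinm  as parameter_name
     , b.ksppstvl as session_value
     , c.ksppstvl as instance_value
from
       x$ksppi  a
     , x$ksppcv b
     , x$ksppsv c
where
       a.indx = b.indx
and    a.indx = c.indx
and    a.ksppinm in ('_px_filter_parallelized', '_px_filter_skew_handling');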

But even after playing around with the feature for quite a while I couldn't come up with a good test case where the optimizer chose a different distribution method based on the typical data distribution skew patterns - so that the expression used for the FILTER lookup had some more popular values than others. So I got in touch with Yasin Baskan - product manager for Parallel Execution at Oracle, asking what kind of skew is meant to see a difference in behaviour.

As it turns out "skew" means something different in this context here. When the mentioned parameter "_px_filter_skew_handling" is set to "true" (default value in 12c) the optimizer will choose a different distribution method based on the size of object driving the filter. According to my tests this effectively means: If the object is such small that only one granule (usually 13 blocks) per PX slave can be assigned the optimizer will use automatically a HASH distribution, otherwise - if the object is larger than this threshold - no re-distribution will be selected. I wasn't able to come up with an example where the optimizer automatically comes up with the other available distribution method, which is RANDOM / ROUND-ROBIN (see previous post). To demonstrate the point, here is a small example:

create table t2 as select * from dba_objects where rownum <= 90000;

exec dbms_stats.gather_table_stats(null, 't2')

create table t3 as select * from dba_objects where rownum <= 90000;

exec dbms_stats.gather_table_stats(null, 't3')

explain plan for
select /*+ monitor
parallel(4)
--opt_param('_px_filter_skew_handling' 'false')
*/ count(*) from
t3 t
--(select /*+ no_merge */ a.* from t3 a) t
--(select a.* from t3 a, t3 b where a.object_id = b.object_id) t
where exists (select /*+ no_unnest */ 1 from t2 where t.object_id=t2.object_id);

-- Default plan, no redistribution before parallel FILTER
-----------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | TQ |IN-OUT| PQ Distrib |
-----------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 5 | 32M (1)| 00:21:13 | | | |
| 1 | SORT AGGREGATE | | 1 | 5 | | | | | |
| 2 | PX COORDINATOR | | | | | | | | |
| 3 | PX SEND QC (RANDOM) | :TQ10000 | 1 | 5 | | | Q1,00 | P->S | QC (RAND) |
| 4 | SORT AGGREGATE | | 1 | 5 | | | Q1,00 | PCWP | |
|* 5 | FILTER | | | | | | Q1,00 | PCWC | |
| 6 | PX BLOCK ITERATOR | | 90000 | 439K| 114 (0)| 00:00:01 | Q1,00 | PCWC | |
| 7 | TABLE ACCESS FULL| T3 | 90000 | 439K| 114 (0)| 00:00:01 | Q1,00 | PCWP | |
|* 8 | TABLE ACCESS FULL | T2 | 1 | 5 | 412 (1)| 00:00:01 | | | |
-----------------------------------------------------------------------------------------------------------------

exec dbms_stats.set_table_stats(null, 't3', numblks => 52)

-- Setting stats of T3 to 52 (13 * DOP) blocks or smaller - HASH distribution will be used, 53 blocks or greater => no redistribution
-------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | TQ |IN-OUT| PQ Distrib |
-------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 5 | 32M (1)| 00:21:13 | | | |
| 1 | SORT AGGREGATE | | 1 | 5 | | | | | |
| 2 | PX COORDINATOR | | | | | | | | |
| 3 | PX SEND QC (RANDOM) | :TQ10001 | 1 | 5 | | | Q1,01 | P->S | QC (RAND) |
| 4 | SORT AGGREGATE | | 1 | 5 | | | Q1,01 | PCWP | |
|* 5 | FILTER | | | | | | Q1,01 | PCWP | |
| 6 | PX RECEIVE | | 90000 | 439K| 5 (20)| 00:00:01 | Q1,01 | PCWP | |
| 7 | PX SEND HASH | :TQ10000 | 90000 | 439K| 5 (20)| 00:00:01 | Q1,00 | P->P | HASH |
| 8 | PX BLOCK ITERATOR | | 90000 | 439K| 5 (20)| 00:00:01 | Q1,00 | PCWC | |
| 9 | TABLE ACCESS FULL| T3 | 90000 | 439K| 5 (20)| 00:00:01 | Q1,00 | PCWP | |
|* 10 | TABLE ACCESS FULL | T2 | 1 | 5 | 412 (1)| 00:00:01 | | | |
-------------------------------------------------------------------------------------------------------------------
So this example shows that the HASH distribution will be used by the optimizer if the object T3 driving the FILTER operation is 52 blocks or smaller, which corresponds to 13 blocks per PX slave at a degree of 4.

Now I find this behaviour pretty odd to explain - since usually you wouldn't want to use Parallel Execution on such a small object anyway. But things become even worse: Not only is the "skew" handling based on the object size questionable to me, the behaviour can also become a potential threat if the row source driving the FILTER operator is no longer a plain table but the result of a more complex operation, which can simply be a join or a non-mergeable view:

-- Resetting stats to true size of table - this would mean no redistribution at a DOP of 4, see above
exec dbms_stats.gather_table_stats(null, 't3')

explain plan for
select /*+ monitor
parallel(4)
--opt_param('_px_filter_skew_handling' 'false')
*/ count(*) from
--t3 t
(select /*+ no_merge */ a.* from t3 a) t
--(select a.* from t3 a, t3 b where a.object_id = b.object_id) t
where exists (select /*+ no_unnest */ 1 from t2 where t.object_id=t2.object_id);

-- But simply using a NO_MERGE hint on the select from the simple T3 row source results in an unnecessary HASH re-distribution
--------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | TQ |IN-OUT| PQ Distrib |
--------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 13 | 9755K (1)| 00:06:22 | | | |
| 1 | SORT AGGREGATE | | 1 | 13 | | | | | |
| 2 | PX COORDINATOR | | | | | | | | |
| 3 | PX SEND QC (RANDOM) | :TQ10001 | 1 | 13 | | | Q1,01 | P->S | QC (RAND) |
| 4 | SORT AGGREGATE | | 1 | 13 | | | Q1,01 | PCWP | |
|* 5 | FILTER | | | | | | Q1,01 | PCWP | |
| 6 | PX RECEIVE | | 90000 | 1142K| 114 (0)| 00:00:01 | Q1,01 | PCWP | |
| 7 | PX SEND HASH | :TQ10000 | 90000 | 1142K| 114 (0)| 00:00:01 | Q1,00 | P->P | HASH |
| 8 | VIEW | | 90000 | 1142K| 114 (0)| 00:00:01 | Q1,00 | PCWP | |
| 9 | PX BLOCK ITERATOR | | 90000 | 439K| 114 (0)| 00:00:01 | Q1,00 | PCWC | |
| 10 | TABLE ACCESS FULL| T3 | 90000 | 439K| 114 (0)| 00:00:01 | Q1,00 | PCWP | |
|* 11 | TABLE ACCESS FULL | T2 | 1 | 5 | 114 (0)| 00:00:01 | | | |
--------------------------------------------------------------------------------------------------------------------

explain plan for
select /*+ monitor
parallel(4)
--opt_param('_px_filter_skew_handling' 'false')
*/ count(*) from
--t3 t
--(select /*+ no_merge */ a.* from t3 a) t
(select a.* from t3 a, t3 b where a.object_id = b.object_id) t
where exists (select /*+ no_unnest */ 1 from t2 where t.object_id=t2.object_id);

-- If we use a simple join as driving row source again a HASH re-distribution before the FILTER gets added
-- As a result the dreaded HASH JOIN BUFFERED will be used instead of the plain HASH JOIN
-------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | TQ |IN-OUT| PQ Distrib |
-------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 10 | 32M (1)| 00:21:13 | | | |
| 1 | SORT AGGREGATE | | 1 | 10 | | | | | |
| 2 | PX COORDINATOR | | | | | | | | |
| 3 | PX SEND QC (RANDOM) | :TQ10003 | 1 | 10 | | | Q1,03 | P->S | QC (RAND) |
| 4 | SORT AGGREGATE | | 1 | 10 | | | Q1,03 | PCWP | |
|* 5 | FILTER | | | | | | Q1,03 | PCWP | |
| 6 | PX RECEIVE | | 90000 | 878K| 229 (1)| 00:00:01 | Q1,03 | PCWP | |
| 7 | PX SEND HASH | :TQ10002 | 90000 | 878K| 229 (1)| 00:00:01 | Q1,02 | P->P | HASH |
|* 8 | HASH JOIN BUFFERED | | 90000 | 878K| 229 (1)| 00:00:01 | Q1,02 | PCWP | |
| 9 | PX RECEIVE | | 90000 | 439K| 114 (0)| 00:00:01 | Q1,02 | PCWP | |
| 10 | PX SEND HYBRID HASH | :TQ10000 | 90000 | 439K| 114 (0)| 00:00:01 | Q1,00 | P->P | HYBRID HASH|
| 11 | STATISTICS COLLECTOR | | | | | | Q1,00 | PCWC | |
| 12 | PX BLOCK ITERATOR | | 90000 | 439K| 114 (0)| 00:00:01 | Q1,00 | PCWC | |
| 13 | TABLE ACCESS FULL | T3 | 90000 | 439K| 114 (0)| 00:00:01 | Q1,00 | PCWP | |
| 14 | PX RECEIVE | | 90000 | 439K| 114 (0)| 00:00:01 | Q1,02 | PCWP | |
| 15 | PX SEND HYBRID HASH | :TQ10001 | 90000 | 439K| 114 (0)| 00:00:01 | Q1,01 | P->P | HYBRID HASH|
| 16 | PX BLOCK ITERATOR | | 90000 | 439K| 114 (0)| 00:00:01 | Q1,01 | PCWC | |
| 17 | TABLE ACCESS FULL | T3 | 90000 | 439K| 114 (0)| 00:00:01 | Q1,01 | PCWP | |
|* 18 | TABLE ACCESS FULL | T2 | 1 | 5 | 412 (1)| 00:00:01 | | | |
-------------------------------------------------------------------------------------------------------------------------

explain plan for
select /*+ monitor
parallel(4)
opt_param('_px_filter_skew_handling' 'false')
*/ count(*) from
--t3 t
--(select /*+ no_merge */ a.* from t3 a) t
(select a.* from t3 a, t3 b where a.object_id = b.object_id) t
where exists (select /*+ no_unnest */ 1 from t2 where t.object_id=t2.object_id);

-- Disabling the FILTER skew handling behaviour means no re-distribution before the FILTER, and hence no HASH JOIN BUFFERED
-----------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | TQ |IN-OUT| PQ Distrib |
-----------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 10 | 32M (1)| 00:21:13 | | | |
| 1 | SORT AGGREGATE | | 1 | 10 | | | | | |
| 2 | PX COORDINATOR | | | | | | | | |
| 3 | PX SEND QC (RANDOM) | :TQ10002 | 1 | 10 | | | Q1,02 | P->S | QC (RAND) |
| 4 | SORT AGGREGATE | | 1 | 10 | | | Q1,02 | PCWP | |
|* 5 | FILTER | | | | | | Q1,02 | PCWC | |
|* 6 | HASH JOIN | | 90000 | 878K| 229 (1)| 00:00:01 | Q1,02 | PCWP | |
| 7 | PX RECEIVE | | 90000 | 439K| 114 (0)| 00:00:01 | Q1,02 | PCWP | |
| 8 | PX SEND HYBRID HASH | :TQ10000 | 90000 | 439K| 114 (0)| 00:00:01 | Q1,00 | P->P | HYBRID HASH|
| 9 | STATISTICS COLLECTOR | | | | | | Q1,00 | PCWC | |
| 10 | PX BLOCK ITERATOR | | 90000 | 439K| 114 (0)| 00:00:01 | Q1,00 | PCWC | |
| 11 | TABLE ACCESS FULL | T3 | 90000 | 439K| 114 (0)| 00:00:01 | Q1,00 | PCWP | |
| 12 | PX RECEIVE | | 90000 | 439K| 114 (0)| 00:00:01 | Q1,02 | PCWP | |
| 13 | PX SEND HYBRID HASH | :TQ10001 | 90000 | 439K| 114 (0)| 00:00:01 | Q1,01 | P->P | HYBRID HASH|
| 14 | PX BLOCK ITERATOR | | 90000 | 439K| 114 (0)| 00:00:01 | Q1,01 | PCWC | |
| 15 | TABLE ACCESS FULL | T3 | 90000 | 439K| 114 (0)| 00:00:01 | Q1,01 | PCWP | |
|* 16 | TABLE ACCESS FULL | T2 | 1 | 5 | 412 (1)| 00:00:01 | | | |
-----------------------------------------------------------------------------------------------------------------------

So it looks like if the row source driving the parallel FILTER operator is complex (by complex I mean here: not a simple table) the optimizer will always add a HASH distribution unconditionally before the FILTER. It is obvious that such a re-distribution adds overhead - it requires resources to perform. What is even worse is that in general the rule is: the more redistributions, the more likely the dreaded buffering will be added to the execution plans, as can be seen from the example above, where the HASH JOIN turns into a HASH JOIN BUFFERED due to the HASH distribution added by default by the optimizer after the join and before the FILTER. By disabling the filter "skew" handling this - in my opinion unnecessary - redistribution doesn't show up, and hence the HASH JOIN without buffering can be used in this example.

Summary
The new parallel FILTER operator comes with different distribution methods available to the optimizer. However, at present the way the optimizer automatically determines if and how to re-distribute the data seems questionable to me.

The skew handling is based on the size of the driving object - for very small objects a re-distribution gets added before the FILTER. For row sources driving the FILTER that are not simple tables the skew handling seems to add a re-distribution unconditionally.

For the reasons outlined, at present I would recommend considering disabling the FILTER skew handling by setting the parameter "_px_filter_skew_handling" to "false" - of course not without getting the blessing from Oracle Support before doing so. This should minimise the number of re-distributions added to an execution plan. Losing the capability of handling the "skew" caused by very small objects is in my opinion negligible in most cases.
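
For reference, a minimal sketch of how this could look in practice - assuming, as with other underscore parameters, that it can be set at session level, or alternatively per statement via the OPT_PARAM hint as used in the examples above:

-- Session-wide - check with Oracle Support first, as mentioned above
alter session set "_px_filter_skew_handling" = false;

-- Per statement, following the pattern of the test queries above
select /*+ monitor
           parallel(4)
           opt_param('_px_filter_skew_handling' 'false')
       */ count(*)
from   t3 t
where  exists (select /*+ no_unnest */ 1 from t2 where t.object_id = t2.object_id);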

New Version Of XPLAN_ASH Utility

Mon, 2015-12-28 13:35
A new version 4.22 of the XPLAN_ASH utility is available for download.

As usual the latest version can be downloaded here.

This version primarily addresses an issue with 12c - if the HIST mode gets used to pull ASH information from AWR in 12c, it turns out that Oracle forgot to add the new "DELTA_READ_MEM_BYTES" column to DBA_HIST_ACTIVE_SESS_HISTORY - although it officially got added to V$ACTIVE_SESSION_HISTORY in 12c. So now I had to add several additional if/then/else constructs to the script to handle this inconsistency. It's the first time that the HIST view doesn't seem to reflect all columns from the V$ view - very likely an oversight rather than by design, I assume.
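
If you want to check quickly whether your 12c installation shows this inconsistency, a simple dictionary query like the following should do - just a sketch, not part of XPLAN_ASH itself:

-- Returns no row if the column is missing from the AWR view (the 12c inconsistency described above)
select column_name
from   dba_tab_columns
where  owner = 'SYS'
and    table_name = 'DBA_HIST_ACTIVE_SESS_HISTORY'
and    column_name = 'DELTA_READ_MEM_BYTES';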

Apart from that the I/O figures (Read Bytes / Write Bytes etc.) in the "Activity Timeline" make more sense for those cases where a process hasn't been sampled for several sample points (see below for more details).

Also, in case an execution plan could not be found, a corresponding message now makes it more obvious that you might be able to pull the execution plan from AWR by using different ASH modes (MIXED / HIST).
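
Independently of XPLAN_ASH, a quick way to check whether AWR still holds a plan for a given SQL_ID is DBMS_XPLAN.DISPLAY_AWR - just a sketch, replace the SQL_ID accordingly:

select * from table(dbms_xplan.display_awr('&sql_id'));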

Here are the notes from the change log:

- Fixed a funny bug: in 12c Oracle forgot to add the DELTA_READ_MEM_BYTES column to DBA_HIST_ACTIVE_SESS_HISTORY, so in HIST mode with 12c prior XPLAN_ASH versions could error out with an invalid column name

- Changed the way the I/O figures are treated in the "Activity Timeline based on ASH". Now the I/O per second is spread over the (previous) samples covered by DELTA_TIME. This should give a smoother representation of the I/O performed, much closer to what you see in Real-Time SQL Monitoring reports. The difference to prior versions is only visible in cases where a session wasn't sampled for quite a while and hence has a DELTA_TIME spanning multiple previous sample points. This also means that the I/O related columns in the "Activity Timeline based on ASH" now show only the PER SECOND values, no longer the totals as in prior versions (see the small worked example after these notes)

- Added a SET NULL "" in the configuration and initialization section for SQL*Plus environments that use a non-default SET NULL setting. Such a setting screwed up some internal switches, so that XPLAN_ASH for example thought it was running against a S-ASH repository

- Added a note to the end of the output if no execution plan could be found and the script falls back to retrieving plan operation details from ASH. Also added a note to use the MIXED or HIST ASH source option if no execution plan could be found in CURR mode, in case the execution plan has been purged from the Shared Pool in the meantime

- Cloned the "cleanup" section from the end to the beginning of the script to ensure no current SQL*Plus environment settings influence the script execution. This is particularly relevant if the script execution gets cancelled before the final cleanup section is reached or some other, previous scripts left a mess behind

IT Tage 2015 - "Analysing and troubleshooting Parallel Execution" presentation material

Sun, 2015-12-20 12:24
Thanks to all who attended my presentation "Analysing and troubleshooting Parallel Execution" at the IT Tage conference 2015 in Frankfurt, Germany. You can download the presentation material here in Powerpoint or PDF format, as well as check the Slideshare upload.

Note that the Powerpoint format adds value in the sense that many of the slides come with additional explanations in the notes section.

If you are interested in more details I recommend visiting this post which links to many other posts describing the different new features in greater depth.

DOAG 2015 - "Oracle 12c Parallel Execution New Features" presentation material

Sun, 2015-11-22 13:16
Thanks to the many attendees who came to my presentation "Oracle 12c Parallel Execution New Features" at the DOAG conference 2015. You can download the presentation material here in Powerpoint or PDF format, as well as check the Slideshare upload.

Note that the Powerpoint format adds value in the sense that many of the slides come with additional explanations in the notes section.

If you are interested in more details I recommend visiting this post which links to many other posts describing the different new features in greater depth.
