Jonathan Lewis

Just another Oracle weblog

Lock Types

Wed, 2018-03-28 06:24

Every now and again I have to check what a particular lock (or enqueue) type is for and what the associated parameter values represent. This often means I have to think about the names of a couple of views and a collection of columns – then create a few column formats to make the output readable (though sometimes I can take advantage of the “print_table()” procedure that Tom Kyte published a long time ago). It only takes a little time to get the code right, but it’s a nuisance when I’m in a hurry, so I’ve just scribbled out a few lines of a script that takes a lock type as an input parameter and reports all the information I want.

rem
rem     Script:         lock_types.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Mar 2018
rem     Usage:          start lock_types {lock type}
rem

define m_lock_type='&1'

column  display new_value m_display

select
        case when substr(version,1,2) = '12'
        then
                'display_name'
        else
                'name'
        end                     display
from
        v$instance
;

set linesize 160
set pagesize 60
set trimspool on

column type             format a4
column name             format a32
column description      format a132
column id1_tag          format a32
column id2_tag          format a32
column is_user          format a4 heading "User"
column is_recycle       format a4 heading "Rcyc"

set feedback off
break on report skip 1

spool lock_types

select  *
from    V$lock_type
where   type = upper('&m_lock_type')
order by
        type
;

column  name                    format a42
column  parameter1              format a9
column  parameter2              format a24
column  parameter3              format a22
column  wait_class              format a14
column  display_name            format a42

select
        eve.name,
        eve.parameter1,
        eve.parameter2,
        eve.parameter3,
        eve.wait_class,
        nullif(eve.&m_display, eve.name) display_name
from
        v$event_name    eve
where
        eve.name like 'enq: ' || upper('&m_lock_type') || '%'
order by
        nullif(eve.wait_class,'Other'),
        eve.name
;

set feedback on

I’ve included a check (and hack) on the value of the major version because 12c introduced a “display_name” as well as a “name” for events, and the latter is sometimes a little more descriptive than the former, so it’s nice to have a single script that could print two different values for the versions that have them.
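
The substr() test only matches versions whose banner starts with ‘12’, so it would fall back to “name” on 18c and later even though those versions also have the display_name column. A more general test (my variation, not part of the original script) might compare the leading numeric part of the version instead:

select
        case
                when to_number(regexp_substr(version, '^\d+')) >= 12
                then    'display_name'
                else    'name'
        end                     display
from
        v$instance
;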

Here’s a sample of the output when I pass ‘IV’ as an input parameter:

TYPE NAME                             ID1_TAG                          ID2_TAG                          User Rcyc
---- -------------------------------- -------------------------------- -------------------------------- ---- ----
DESCRIPTION                                                                                                                              CON_ID
------------------------------------------------------------------------------------------------------------------------------------ ----------
IV   Library Cache Invalidation       object #                         time stamp                       NO   NO
Synchronizes library cache object invalidations across instances                                                                              0

NAME                                       PARAMETER PARAMETER2               PARAMETER3             WAIT_CLASS     DISPLAY_NAME
------------------------------------------ --------- ------------------------ ---------------------- -------------- ------------------------------------------
enq: IV -  contention                      type|mode id1                      id2                    Other

As you can see from the presence of the con_id column in v$lock_type this output came from a 12c instance. I picked the IV lock because that’s the one that prompted me to check the meanings of the id[12] and parameter[123] columns when a question about waits for the IV lock appeared recently on Oracle-L. I’ve got two reasons for carrying on with this particular example – first that it demonstrates that the descriptions can be wrong, second that it allows me to demonstrate a quick tip on translation.

The question on Oracle-L related to a 4-node RAC system and reported one instance suffering long waits on the IV enqueue on a fairly regular basis when running a particular batch task. The OP reported the following values as the p1, p2, p3 values from v$session while the wait was going on:


P1    type|mode 1230372869
P2    id1       1398361667
P3    id2                3

According to the details in v$lock_type the enqueue is about library cache invalidation across instances – and that fits the OP’s complaint because the system is a RAC system. The id1 value is supposed to be an obj# (object_id), but the OP said it wasn’t; and the id2 value is supposed to be a timestamp, but 3 is an odd value for a timestamp (though it might represent – for example – the 3 second wait that is a common time-out interval for enqueues). So, clearly, the descriptions can be wrong.

Translation

Take another look at p1 and p2, and turn them into Hexadecimal:


1230372869 (dec) = 0x49560005 (hex)
1398361667 (dec) = 0x53594E43 (hex)

If you happen to be good with hex and ASCII codes you’ll know that byte values in the range 0x41 – 0x5F are mostly the capital letters of the Roman alphabet (while 0x61 – 0x7F are mostly the lower case letters), so a second translation step gives us:


1230372869 (dec) = 0x49560005 (hex) = 'IV'  5
1398361667 (dec) = 0x53594E43 (hex) = 'SYNC'

The p1 parameter is described (correctly) as “type|mode” – this is an IV enqueue held or requested in mode 5; the p2 parameter is not an object number, it looks more like a text description of why the enqueue is being requested (the enqueue is, after all, described as being used to “synchronize library cache object invalidation”).

I still don’t know what the final parameter represents – I doubt if it’s really about a three second wait (but that could be checked by examining v$session_wait over a period of several seconds or v$session_wait_history), it might be an indication of the instance that the session is trying to synchronize with (though, again, that seems a long shot), or it might just be a “reason-code” describing why the synchronisation is necessary.

Whenever you’re in doubt about the meaning of the various parameters it’s always worth a quick dec -> hex -> ASCII check, just in case it offers some clues about the function of the enqueue.
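
For convenience, here’s the sort of one-off query I’d use for that check (a sketch only, not part of the lock_types.sql script above – pass the parameter value in as the single argument):

define m_value='&1'

select
        to_char(&m_value, 'fmXXXXXXXX')                 hex_value,
        chr(trunc(&m_value/power(256,3)))            ||
        chr(mod(trunc(&m_value/power(256,2)), 256))  ||
        chr(mod(trunc(&m_value/256), 256))           ||
        chr(mod(&m_value, 256))                         ascii_value,
        mod(&m_value, power(256,2))                     low_order_16
from
        dual
;

For a “type|mode” value like p1 the last two bytes are non-printing characters, so it’s the hex_value and low_order_16 columns you want; for a value like p2 the ascii_value column gives you the text (‘SYNC’) directly.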

 

Reference Costs

Mon, 2018-03-19 02:44

The partitioning option “partition by reference” is a very convenient option which keeps acquiring more cute little features, such as cascading truncates and cascading splits, as time passes – but what does it cost, and would you use it if you don’t really need to?

When reference partitioning came into existence many years ago, I had already seen several performance disasters created by people’s enthusiasm for surrogate keys and the difficulties this introduced for partition elimination; so my first thought was that this was a mechanism that would have a hugely beneficial effect on systems which (in 20:20 – or 6:6 if you’re European – hindsight) had been badly designed and would otherwise need a lot of re-engineering to use partitioning effectively.

(Side note: Imagine you have partitioned an orders table on colX which is a column in the real (business-oriented) candidate key, but you’ve created a surrogate key which is used as the target for a foreign key from the order_lines tables – how do you get partition-wise joins between orders and order_lines if you haven’t got the partitioning column in the order_lines table ?)

So ref partitioning was a good way to work around a big existing problem and, whatever overheads it introduced, the benefit was potentially so huge that you wouldn’t care (or, probably, notice) that your system was less efficient than it ought to be. But what if you’re working on a new project and still have control of the physical design – how does that change the cost/benefit analysis?

It’s actually taken me several years to get round to producing a little demonstration to highlight one of the obvious costs of reference partitioning – even though it’s a very simple demo raising the obvious question: ‘how much work does Oracle have to do to find the right partition when inserting a “child” row ?’ If you chose to implement reference partitioning without asking that simple question you may be using a lot more machine resources than you really need to, although you may not actually be heading for a disastrous performance problem.

As a demonstration of the issue I’m going to set up something that approximates an order/order_lines model in two ways, one using reference partitioning and one using a copied column, to see what differences show up when you start loading data.

rem
rem     Script:         pt_ref.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Mar 2018
rem     Purpose:
rem
rem     Last tested
rem             12.2.0.1
rem             12.1.0.2
rem

create table orders (
        id              number(10,0) not null,
        id_cust         number(10,0) not null,
        date_ordered    date not null,
        padding         varchar2(150)
)
partition by range (date_ordered)
(
        partition p201801       values less than (to_date('01-Feb-2018')),
        partition p201802       values less than (to_date('01-Mar-2018')),
        partition p201803       values less than (to_date('01-Apr-2018')),
        partition p201804       values less than (to_date('01-May-2018')),
        partition p201805       values less than (to_date('01-Jun-2018')),
        partition p201806       values less than (to_date('01-Jul-2018')),
        partition p201807       values less than (to_date('01-Aug-2018'))
);

create unique index ord_pk on orders (id);
alter table orders add constraint ord_pk primary key(id);

create table order_lines (
        id_ord          number(10,0) not null,
        line_number     number(4,0)  not null,
        id_product      number(6,0)  not null,
        qty             number(6,0)  not null,
        value           number(10,2) not null,
        padding         varchar2(150),
        constraint orl_fk_ord foreign key (id_ord) references orders
                on delete cascade
)
partition by reference (orl_fk_ord)
;

create unique index orl_pk on order_lines (id_ord, line_number);
alter table order_lines add constraint orl_pk primary key (id_ord, line_number);

create table order_lines_2 (
        date_ordered    date,
        id_ord          number(10,0) not null,
        line_number     number(4,0)  not null,
        id_product      number(6,0)  not null,
        qty             number(6,0)  not null,
        value           number(10,2) not null,
        padding         varchar2(150),
        constraint orl2_fk_ord foreign key (id_ord) references orders
                on delete cascade
)
partition by range (date_ordered)
(
        partition p201801       values less than (to_date('01-Feb-2018')),
        partition p201802       values less than (to_date('01-Mar-2018')),
        partition p201803       values less than (to_date('01-Apr-2018')),
        partition p201804       values less than (to_date('01-May-2018')),
        partition p201805       values less than (to_date('01-Jun-2018')),
        partition p201806       values less than (to_date('01-Jul-2018')),
        partition p201807       values less than (to_date('01-Aug-2018'))
)
;

create unique index orl2_pk on order_lines_2 (id_ord, line_number);
alter table order_lines_2 add constraint orl2_pk primary key (id_ord, line_number);

It’s a bit of a bodge job as far as modelling goes, but that’s to keep workload comparisons easy and make a point without writing too much code. All I’ve got is an orders table partitioned by date and an order_lines table that I want partitioned the same way. I’ve handled the requirement for partitioning order_lines in two ways, one is partition by reference and the other is to copy down the partitioning column from the orders table. (In my view the “real” key for an orders table should be (customer identifier, order date, counter) and if I thought efficient partitioning was going to be a necessary feature for scalability I would copy down all three columns. Depending on the nature of the business I would compress the primary key index on orders on one or two of the columns, and the foreign key index on order_lines on one, two, or three of its columns)
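
As an illustration of that last point (a sketch only – the table, column and index names here are hypothetical, matching the “real key” design I’ve just described rather than the test tables above):

rem     orders keyed on (id_cust, date_ordered, order_seq), all three columns copied down

create unique index ord_pk_real on orders_real      (id_cust, date_ordered, order_seq) compress 2;
create        index orl_fk_real on order_lines_real (id_cust, date_ordered, order_seq) compress 3;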

Now all I have to do is load some data into the tables. First the orders table:

insert into orders(
        id, id_cust, date_ordered, padding
)
with g as (
        select rownum id from dual
        connect by level <= 1e4
)
select
        rownum                                  id,
        trunc(dbms_random.value(10000,20000))   id_cust,
        to_date('01-Jan-2018') +
                trunc((rownum-1)/100)           date_ordered,
        rpad('x',40)                            padding
from
        g,g
where
        rownum <= 2e4
;

commit;
execute dbms_stats.gather_table_stats(user,'orders')

This produces 100 orders per day, for 200 days which fits within the seven months of pre-declared partitions. I’ve gathered table stats on the table because that’s probably the best way to deal with any requirements for block cleanout after the insert. (Note: I’m avoiding interval partitioning in this example because that’s just another complication to add to the comparison and, as I reported a few days ago, introduces another massive inefficiency on data loading.)

Now I’ll insert some order_lines rows at 5 lines per order into the two versions of the order_lines tables. One of them, of course, has to have a date generated using the same algorithm that I used for the orders table. Note that I’ve made a call to dbms_random.seed(0) before each insert to guarantee that the same “random” values will be inserted in both tables.

execute dbms_random.seed(0)

insert into order_lines_2(
        date_ordered, id_ord, line_number, id_product, qty, value, padding
)
with g as (
        select rownum id from dual
        connect by level <= 1e4
)
select
        to_date('01-Jan-2018') +
                trunc((rownum-1)/500)           date_ordered,
        1 + trunc((rownum-1)/5)                 id_ord,
        1 + mod(rownum,5)                       line_number,
        trunc(dbms_random.value(10000,20000))   id_product,
        1 qty,
        1 value,
        rpad('x',80)                            padding
from
        g,g
where
        rownum <= 10e4
;

commit;


execute dbms_random.seed(0)

insert into order_lines(
        id_ord, line_number, id_product, qty, value, padding
)
with g as (
        select rownum id from dual
        connect by level <= 1e4
)
select
        1 + trunc((rownum-1)/5)                 id_ord,
        1 + mod(rownum,5)                       line_number,
        trunc(dbms_random.value(10000,20000))   id_product,
        1 qty,
        1 value,
        rpad('x',80)                            padding
from
        g,g
where
        rownum <= 10e4
;

commit;

What I haven’t shown in the code is the snapshot wrapping I used to check the session stats, system latch activity and system rowcache activity – which I thought would give me the best indication of any variation in workload. In fact, of course, the first and simplest variation was the elapsed time: 4.5 seconds for the ref partitioned table, 2.5 seconds for the explicitly created table (regardless of which insert I did first), and it was nearly all pure CPU time.
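
(The snapshot code isn’t important to the argument; a minimal sketch of the idea – my own illustration, not the wrapper I actually used – is simply to copy v$mystat before the insert and report the deltas afterwards:

create table my_stats_before as
select  sn.name, ms.value
from    v$mystat ms, v$statname sn
where   sn.statistic# = ms.statistic#
;

rem     run one of the inserts, then:

select
        sn.name, ms.value - sb.value            delta
from
        v$mystat        ms,
        v$statname      sn,
        my_stats_before sb
where
        sn.statistic# = ms.statistic#
and     sb.name       = sn.name
and     ms.value - sb.value > 0
order by
        delta desc
;

The same before/after pattern over v$latch and v$rowcache covers the latch and rowcache figures.)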

It turned out that the rowcache stats showed virtually no variation, and the latch stats only showed significant variation in the latches that I could have predicted from the session stats, and here are the most significant session stats that highlight and explain the difference in times:

Explicitly Created
------------------
CPU used by this session                                                   231
DB time                                                                    242
db block gets                                                          219,471
db block changes                                                        27,190
redo entries                                                            15,483
redo size                                                           24,790,224
HSC Heap Segment Block Changes                                           2,944

Ref partitioned
---------------
CPU used by this session                                                   515
DB time                                                                    532
db block gets                                                          615,979
db block changes                                                       418,025
redo entries                                                           209,918
redo size                                                           70,043,676
HSC Heap Segment Block Changes                                         100,048

These results were from 12.1.0.2, but the figures from 12.2.0.1 and 11.2.0.4 were similar though the CPU time dropped as the version number went up: what you’re seeing is the effect of turning an array insert (for the precreated table) into single row processing for the ref partitioned table. Basically it seems that for every row inserted Oracle has to do something to work out which partition the row should go into, and while it does that work it releases any pins of buffers it would have been holding from the previous row’s insert; in other words, various optimisations relating to array inserts are not taking place.

Looking in more detail at the figures for the ref partitioned insert:

  • The 100,000 “HSC Heap Segment Block Changes” equate to the 100,000 rows inserted into the table.
  • Add the single row index updates to the primary key and you get 200,000 redo entries.
  • For every individual row inserted Oracle has to do a current mode (db block gets) check against the primary key of the orders table – but when array processing the root block can be pinned.

We can get a closer look at the differences by taking snapshots of v$segstat (or v$segment_statistics), to see the following (pre-created table on the left):


ORD_PK                                |   ORD_PK
  logical reads          199,440      |     logical reads          300,432
                                      |
ORDER_LINES_2 - P201801               |   ORDER_LINES - P201801
  logical reads            2,112      |     logical reads           16,960
  db block changes         1,280      |     db block changes        16,944
                                      |
ORDER_LINES_2 - P201802               |   ORDER_LINES - P201802
  logical reads            2,256      |     logical reads           16,144
  db block changes         1,248      |     db block changes        15,088
                                      |
ORDER_LINES_2 - P201803               |   ORDER_LINES - P201803
  logical reads            2,288      |     logical reads           17,264
  db block changes         1,376      |     db block changes        16,560
                                      |
ORDER_LINES_2 - P201804               |   ORDER_LINES - P201804
  logical reads            2,672      |     logical reads           16,768
  db block changes         1,280      |     db block changes        16,144
                                      |
ORDER_LINES_2 - P201805               |   ORDER_LINES - P201805
  logical reads            2,224      |     logical reads           17,472
  db block changes         1,264      |     db block changes        16,528
                                      |
ORDER_LINES_2 - P201806               |   ORDER_LINES - P201806
  logical reads            2,624      |     logical reads           16,800
  db block changes         1,328      |     db block changes        16,160
                                      |
ORDER_LINES_2 - P201807               |   ORDER_LINES - P201807
  logical reads            1,376      |     logical reads           10,368
  db block changes           864      |     db block changes        10,752
                                      |
ORL2_PK                               |   ORL_PK
  logical reads           10,640      |     logical reads          206,352
  db block changes         7,024      |     db block changes       104,656
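
(For reference, the per-segment figures are deltas across each insert; a sketch of the underlying query – my own, not the original snapshot code – is simply:

select
        object_name, subobject_name, statistic_name, value
from
        v$segment_statistics
where
        owner           = user
and     object_name     like 'OR%'
and     statistic_name  in ('logical reads', 'db block changes')
and     value           > 0
order by
        object_name, subobject_name, statistic_name
;

run before and after each insert, with the differences reported.)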

The right hand data set does an extra 100,000 logical reads on the ORD_PK index (top set of lines), which I think correspond to the gets on the index root block that the left hand table avoided by keeping the root block pinned during the array insert – the numbers don’t quite add up, so there’s some extra complexity that I haven’t guessed correctly.

The insert into the ORL[2]_PK index (bottom set of lines) is single row processed for the right hand table – with, I think, the logical reads recording two current gets per insert.

Every partition of the table, except the last, shows 15,000 db block changes, totalling a difference of about 100,000 db block changes corresponding to the single rows being inserted. Then ORL[2]_PK shows another 100,000 db block changes, giving us the 200,000 we saw as redo entries and 400,000 (when doubled up to allow for the undo) db block changes that we saw in total.

Finally we need to explain the difference of 400,000 db block gets between the two sets of session stats – and I think this is the extra 100,000 for ORD_PK, the 100,000 for the table inserts, and 200,000 for the ORL[2]_PK index, which I think might be explained as 100,000 as a current get that checks for “duplicate key” and 100,000 gets to do the actual insert.

Bottom Line, though – if you use reference partitioning every array insert seems to turn into single row processing with the attendant increase in buffer gets, undo and redo generated,  latch activity, and CPU used as Oracle checks for every single row which partition it should go into: and there doesn’t seem to be any optimisation that caters for “this row belongs in the same partition as the previous row”. You may decide that this extra cost due to reference partitioning is worth it for the benefits that reference partitioning supplies – it’s all down to what your application does, especially in terms of aging data perhaps – but it’s nice to know that this cost is there so that you can do a better cost/benefit analysis.

Footnote:

Interested readers might like to extend this test to a multi-layered set of ref-partitioned tables to see if the increase in overheads is linear or geometric.
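
As a starting point, a third layer might look something like this (a sketch only, with hypothetical table, column and constraint names – I haven’t benchmarked it here):

create table order_line_notes (
        id_ord          number(10,0)    not null,
        line_number     number(4,0)     not null,
        note_number     number(4,0)     not null,
        note_text       varchar2(200),
        constraint oln_pk primary key (id_ord, line_number, note_number),
        constraint oln_fk_orl foreign key (id_ord, line_number)
                references order_lines
                on delete cascade
)
partition by reference (oln_fk_orl)
;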

 

Keeping Intervals

Thu, 2018-03-15 03:03

I’ve recently been reminded of a blog post I wrote a couple of years ago that discussed the issue of running into the hard limit of 2^20 -1 as the number of segments for a (composite) partitioned table – a problem that could arise in a relatively short time if you used a large number of hash subpartitions in an interval/hash composite partitioned table (you get about 2 years and 10 months of daily partitions at 1,024 subpartitions per day, for example).

A natural follow-on from that article is to think through a strategy for dropping old partitions sufficiently early that you don’t hit the limit as new partitions are created. This, of course, pretty much defeats the point of interval partitioning – instead of planning to add partitions “just in time” you now have to eliminate them “just in time”. Amongst other issues, we’re going to find that interval partitioning manages to re-introduce a problem with range partitioning that Oracle got rid of in Oracle 10g.

So let’s test the obvious option: drop the oldest partition(s) in time to keep head-room for new partitions; for convenience we’ll start with a simple interval partitioned table with a few pre-declared range partitions and a few automatically generated interval partitions. All the examples here were run under 12.1.0.2:


rem
rem     Script:         pt_merge.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Feb 2018
rem

create table t1(id, v1, padding)
partition by range (id) interval (1e4)
(
        partition p10000 values less than (1e4),
        partition p20000 values less than (2e4),
        partition p30000 values less than (3e4),
        partition p40000 values less than (4e4),
        partition p50000 values less than (5e4)
)
nologging
as
with generator as (
        select
                rownum id
        from dual
        connect by
                level <= 1e4 -- > comment to avoid WordPress format issue
)
select
        rownum                          id,
        lpad(rownum,10,'0')             v1,
        lpad('x',100,'x')               padding
from
        generator       v1,
        generator       v2
where
        rownum <= 1e5 -- > comment to avoid WordPress format issue
;


begin
        dbms_stats.gather_table_stats(
                ownname     => user,
                tabname     => 'T1',
                method_opt  => 'for all columns size 1'
        );
end;
/

I’ve created 100,000 rows and since the partitions I’ve pre-declared have an (unreachable) upper bound of only 50,000 Oracle will have added a further 6 partitions to the table to hold the data for values up to 110,000 (with just one row in the last partition). For testing purposes I’ve created the table in an otherwise empty tablespace so when I check the block address of each segment I can see the location (and size) of the segments so far. So here’s the list of names and locations:

SEGMENT_NAME              PARTITION_NAME         HEADER_BLOCK     BLOCKS
------------------------- ---------------------- ------------ ----------
T1                        P10000                          128        256
T1                        P20000                          384        256
T1                        P30000                          640        256
T1                        P40000                          896        256
T1                        P50000                         1152        256
T1                        SYS_P69838                     1408        256
T1                        SYS_P69839                     1664        256
T1                        SYS_P69840                     1920        256
T1                        SYS_P69841                     2176        256
T1                        SYS_P69842                     2432        256
T1                        SYS_P69843                     2688        128

11 rows selected.
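
The listing came from a query against dba_segments along these lines (my reconstruction – the original script isn’t shown):

select
        segment_name, partition_name, header_block, blocks
from
        dba_segments
where
        owner           = user
and     segment_name    = 'T1'
order by
        header_block
;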

No surprises so far. So let’s pretend we know the dreaded ORA-14299 or ORA-14300 will be arriving soon and try to drop the first 5 partitions to keep the partition count below the limit. Here’s a cut-n-paste from an SQL*Plus session that tries to do that one partition at a time:

SQL> alter table t1 drop partition p10000;

Table altered.

SQL> alter table t1 drop partition p20000;

Table altered.

SQL> alter table t1 drop partition p30000;

Table altered.

SQL> alter table t1 drop partition p40000;

Table altered.

SQL> alter table t1 drop partition p50000;
alter table t1 drop partition p50000
                              *
ERROR at line 1:
ORA-14758: Last partition in the range section cannot be dropped

We can’t drop partition p50000 – it’s the highest partition that wasn’t created automatically, and we have to leave an “anchor” partition in place for interval partitioning to work from. By querying user_tab_partitions we can even see that this partition is flagged a little differently from the others:


select
        partition_name, interval, high_value 
from
        user_tab_partitions
where
        table_name = 'T1'
order by
        partition_position
;


PARTITION_NAME         INT HIGH_VALUE
---------------------- --- --------------------------
P50000                 NO  5e4
SYS_P69844             YES 60000
SYS_P69845             YES 70000
SYS_P69846             YES 80000
SYS_P69847             YES 90000
SYS_P69848             YES 100000
SYS_P69849             YES 110000

7 rows selected.

So, at first sight, we’re stuck. If we’re dropping old partitions we will eventually get to a point where there’s only one “real” range partition at the bottom and then we can’t drop any more historic partitions. There are two solutions to this problem, explained a long time ago here and here by Harald van Breederode.

Option 1

Convert the interval partitioned table to a range partitioned table and back again, and if you know the interval (and you can always look it up in the data dictionary) there’s a quick and dirty way of doing that. Here’s a cut-n-paste demonstrating the method and effect:


SQL> alter table t1 set interval (10000);

Table altered.

SQL> select partition_name, interval, high_value from user_tab_partitions where table_name = 'T1' order by partition_position ; 

PARTITION_NAME         INT HIGH_VALUE
---------------------- --- --------------------------
P10000                 NO  1e4
P20000                 NO  2e4
P30000                 NO  3e4
P40000                 NO  4e4
P50000                 NO  5e4
SYS_P69850             NO  60000
SYS_P69851             NO  70000
SYS_P69852             NO  80000
SYS_P69853             NO  90000
SYS_P69854             NO  100000
SYS_P69855             NO  110000

11 rows selected.

SQL> select table_name, partitioning_type, interval from user_part_tables;

TABLE_NAME           PARTITION INTERVAL
-------------------- --------- --------------------
T1                   RANGE     1E4

1 row selected.

Every single partition has just become a range-based partition, but the table is still interval partitioned. This is a tidy solution, but there’s one obvious, generic, drawback to the method.  The “theory” of interval partitioning is that you don’t have to pre-create partitions in anticipation of the data arriving – so what will happen if a (possibly bad) row arrives weeks ahead of schedule and you find that Oracle has created (say) partition 85,001 with a gap of 12,000 partitions between the current high partition and the new one. If you use this “convert to range and back” trick then you’ll have a single partition covering the entire range where you were expecting (eventually) to have 12,000 partitions. Every time you convert from interval to range and back you’d better have code that checks if there are any gaps first, and then does loads of “split partition” –  or comes up with some other strategy – to address the side effects.

Option 2

When you’ve got just one range partition left, merge the bottom two partitions – this makes the next partition up a range partition without affecting any other partitions. After recreating the original table and dropping the first 4 partitions this is how things go:


SQL> alter table t1 drop partition p50000;
alter table t1 drop partition p50000
                              *
ERROR at line 1:
ORA-14758: Last partition in the range section cannot be dropped


SQL> alter table t1 merge partitions for (45000), for (55000) into partition p_low;

Table altered.

SQL> select partition_name, interval, high_value from user_tab_partitions where table_name = 'T1' order by partition_position;

PARTITION_NAME         INTERVAL             HIGH_VALUE
---------------------- -------------------- --------------------------
P_LOW                  NO                   60000
SYS_P69863             YES                  70000
SYS_P69864             YES                  80000
SYS_P69865             YES                  90000
SYS_P69866             YES                  100000
SYS_P69867             YES                  110000

6 rows selected.

Is this too good to be true ? Of course it is, but you may have to pause for a moment to think why. When you merge two partitions Oracle copies the contents of the two segments into a new segment – always; even if one of the two segments is empty. When you do a “split partition” Oracle runs a check to see if the split would leave all the data in a single segment and if it would then Oracle doesn’t do any copying but simply plays clever games in the data dictionary – unfortunately Oracle doesn’t use the same sort of trick to optimise a merge.

So the merge partition mechanism carries less risk than the “interval/range/interval”, but you either pay the cost of the merge or you carefully code the mechanism so that the bottom two partitions are always empty when you merge: for example you might always leave the bottom (range) partition empty and use your scheduled code to truncate (or exchange out) the lowest interval partition, then do the merge.
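
In outline that housekeeping might look something like this (a sketch using the same “partition for (value)” syntax as the merge above, and assuming – as in this example – that the lowest interval partition has a high value of 60,000):

alter table t1 truncate partition for (55000) drop storage;
alter table t1 merge partitions for (45000), for (55000) into partition p_low;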

The good news

When you upgrade to 12.2.0.1 you can drop the lowest partition – and Oracle will simply turn the lowest interval partition currently in existence into a range partition. (That may be a bit of a nuisance if there’s a gap between the range partition and the current lowest interval partition.)

The Bad News

It doesn’t really matter which strategy you use to deal with this problem (even if you’ve upgraded to 12.2) – you still pay one other penalty for both mechanisms. And that’s the bit which re-introduces a problem that last existed in 9i.

Ask yourself “How does Oracle know which interval a partition is for and what the limit is on the partitioning key ?” Then look at the data dictionary, or maybe build a very simple model and trace what happens when you use either of the methods above – but in your model create a significant number of partitions first. I’m going to take the data dictionary method – starting from the point where I’ve created and populated the table. Again this is cut-n-paste, and do note that I switch to the sys account after creating the table:


SQL> select object_id, object_name, subobject_name from user_objects;

 OBJECT_ID OBJECT_NAME          SUBOBJECT_NAME
---------- -------------------- ----------------------
    185164 T1
    185165 T1                   P10000
    185166 T1                   P20000
    185167 T1                   P30000
    185168 T1                   P40000
    185169 T1                   P50000
    185170 T1                   SYS_P69868
    185171 T1                   SYS_P69869
    185172 T1                   SYS_P69870
    185173 T1                   SYS_P69871
    185174 T1                   SYS_P69872
    185175 T1                   SYS_P69873

12 rows selected.

SQL> connect / as sysdba
Connected.

SQL> select obj#, dataobj#, part# from tabpart$ where bo# = 185164 order by part#;

      OBJ#   DATAOBJ#      PART#
---------- ---------- ----------
    185165     185165         10
    185166     185166         20
    185167     185167         30
    185168     185168         40
    185169     185169         50
    185170     185170 2147483648
    185171     185171 2147483649
    185172     185172 2147483650
    185173     185173 2147483651
    185174     185174 2147483652
    185175     185175 2147483653

11 rows selected.

I’ve queried user_objects to find the object_id of the table then used that as the “base object number” (bo#) to query tabpart$, which holds the table partition definitions. Note how there are 5 partitions where the partition number goes up 10 at a time, and 6 where it goes up one at a time. Prior to 10g (and interval partitions, of course) the stored partition number would increase in steps of 1, but if you wanted to do a split, merge or drop partition (and the last of the three was the most significant one) every single partition position above the split/merge/drop point would have to be renumbered, and that was done by a single row update to the data dictionary to keep the numbering intact. The steps of 10 were introduced in 10g to deal with the inherent performance problems – particularly the shared pool catastrophe that this could cause.

The steps of 1 for interval partitions allow Oracle to keep track (easily) of what high_value each partition represents, and the highest legal partition. Try inserting the value 1,000,000 into the table and re-run the query against tabpart$ and you’ll see Oracle adding part# = 2147483743. So what do you think is going to happen if you try to apply the two mechanisms ?

If you do the interval/range/interval switch every interval part# will be renumbered to follow the “increment by 10” pattern. If you drop partitions p10000 to p40000 nothing happens to the existing part# values until you get to the command to merge p50000 with the next partition up, and then you see this:


SQL> alter table test_user.t1 merge partitions for (45000), for (55000) into partition p_low;

Table altered.

SQL> select obj#, dataobj#, part# from tabpart$ where bo# = 185164 order by part#;

      OBJ#   DATAOBJ#      PART#
---------- ---------- ----------
    185177     185177         10
    185171     185171 2147483648
    185172     185172 2147483649
    185173     185173 2147483650
    185174     185174 2147483651
    185175     185175 2147483652
    185176     185176 2147483742

7 rows selected.


The newly merged partition is a new object, of course, so has a completely new obj# and dataobj#, and it’s been given the part# of 10 (the lowest value for a clean range-partitioned object). Every single interval partition has had its part# decreased by one. The lowest possible interval partition is always given the part# of 2147483648 (0x80000000) and the partition numbering increments by 1 from there onwards. (The numbering gets a little more subtle when you have composite partitioning but a similar approach takes place in tabcompart$).

Pause for thought – if you’re thinking of creating an interval partitioned table that could get close to a running level of 1 million partitions and you start to get rid of old partitions in any version of Oracle then each “drop/merge” partition will update about 1 million rows in the data dictionary – and that’s assuming you don’t have any local indexes that will need to be renumbered in the same way!

Here’s a critical part of the output from tkprof when I recreated the table with 1,000,000 rows – which means 101 partitions – and created a local index on it, before dropping the first 4 partitions and then enabled tracing just before merging the bottom interval partition with the anchor range partition.


update indpart$ set dataobj# = :1, part# = :2, flags = :3, ts# = :4, file# =
  :5, block# = :6, pctfree$ = :7, initrans = :8, maxtrans = :9, analyzetime =
  :10, samplesize = :11, rowcnt = :12, blevel = :13, leafcnt = :14, distkey =
  :15, lblkkey = :16, dblkkey = :17, clufac = :18, pctthres$ = :19
where
 obj# = :20


call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse       94      0.00       0.00          0          0          0           0
Execute     94      0.00       0.01          0         94        480          94
Fetch        0      0.00       0.00          0          0          0           0
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total      188      0.01       0.01          0         94        480          94


update tabpart$ set dataobj# = :1, part# = :2, ts# = :3, file# = :4, block# =
  :5, pctfree$ = :6, pctused$ = :7, initrans = :8, maxtrans = :9, flags = :10,
   analyzetime = :11, samplesize = :12, rowcnt = :13, blkcnt = :14, empcnt =
  :15, avgspc = :16, chncnt = :17, avgrln = :18
where
 obj# = :19


call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse       94      0.00       0.00          0          0          0           0
Execute     94      0.00       0.00          0        188        489          94
Fetch        0      0.00       0.00          0          0          0           0
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total      188      0.00       0.00          0        188        489          94

That’s not a lot of work for my little example with only a hundred or so partitions – but when you’ve got a million of them, with a handful of indexes, and the partitions have been created over time that’s going to turn into a lot of work that’s going to disrupt the shared pool for a long time, generate a lot of redo, and do a lot of disk reads and writes.

So be cautious with interval partitioning – even in 12.2 (and 18.1, possibly) the ease of use may disappear if you realise too late that you’re going to get into a cycle of partition maintenance.

Footnote for composite partitioning – the limit of 2^20-1 segments (hence subpartitions) still applies, but the necessary update is relevant only at the partition level, not at the subpartition level. The objects updated are tabcompart$ and indcompart$.

Update (included for ironic effect)

The day I posted this note my “Oracle Support Hot Topics” email arrived with a report of the following bug:

Bug 19294302 : DBMS_REDEFINITION DOES NOT WORK WITH INTERVAL PARTITIONS

This was reported for 11.2.0.4, fixed in 12.2. The rediscovery information is:

ORA-14024 during copy_table_dep when the interim table is interval partitioned.

The problem arises if you change a table from simple range partitioned to range with interval – so might be relevant if you have a strategy of doing the interval/range/interval trick.

 

 

Deferred Invalidation

Tue, 2018-03-13 13:30

I was going to write an article on the way 12.2 has introduced the option for “deferred invalidation” for a number of DDL operations, but I did a quick google search before I started writing and found that both Franck Pachot and Richard Foote (yes, rebuild index is one of the operations) had got there long ago, so here are a couple of links – as much for my own benefit as anything else:

Richard Foote:

Franck Pachot:

Franck’s 2nd example may be particularly relevant to some clients of mine who had problems with SQL queries that were crashing (slowly and randomly) instead of running very efficiently because they were running queries against one subpartition of a table while another subpartition of the same table was subject to exchange. With a little bad luck in the timing an exchange that took place between a parse and an execute would cause a query to have its cursor invalidated and re-parsed in a way that failed to do (sub-)partition elimination the way it should have because the local indexes were in an indeterminate state.
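
For reference (a minimal illustration of my own, with an assumed index name, rather than an extract from either post), the 12.2 syntax looks like this:

alter index t1_i1 rebuild online deferred invalidation;

alter session set cursor_invalidation = deferred;

The first statement defers invalidation for that one rebuild; the second makes deferred invalidation the session default for the DDL that supports it.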

 

Comparing Plans

Mon, 2018-03-12 03:01

When you want to find out why the optimizer has changed its choice of plan, and what may have happened to cause the change, it can be difficult to spot the critical differences between the two execution plans, and even the various diff_plan_xxx() functions in dbms_xplan don’t help very much, so I thought I’d write up an example that appeared recently on the ODC database forum to give people some ideas about how to approach the problem. There is, however, no simple algorithm that you can apply to narrow your focus down to the most probable cause of change; there are simply a few methods that have to be applied with a little flair and imagination.

We start with a query that has a “bad” plan and a “good” plan, with the implied questions: “What changed?” and “How do we get the good plan back?” If you’ve managed to capture the good plan the instant answer to “getting it back” is to create an SQL Plan Baseline for the query – but that’s really not a desirable thing to do every time a plan changes, and it’s not necessarily possible if the query keeps changing slightly over time. Ideally you should try to identify why the plan can vary and work out how to manage that variation.
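
If you do decide to capture the good plan, a minimal sketch of loading it from the cursor cache looks like this (the sql_id and plan_hash_value substitution variables are placeholders for the values you’ve identified):

declare
        n       pls_integer;
begin
        n := dbms_spm.load_plans_from_cursor_cache(
                sql_id          => '&m_sql_id',
                plan_hash_value => to_number('&m_plan_hash_value')
        );
        dbms_output.put_line(n || ' plan(s) loaded');
end;
/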

In theory you could collect all the statistics for all the tables and indexes involved in the query, making sure you’ve got both the “before” and “after” statistics; then you need to check for the actual values used for any bind variables just in case some specific values have a particular impact on the optimizer’s calculations; but gathering all that data is a lot of work, and having all the raw data that lets you (in theory) deduce why the plan has changed will probably not help very much unless you’re very lucky.

If you have to go through an exhaustive analysis to solve the problem it’s a lot of tedious work. (This, in part, is why tools like SQLd360 and SQLTXPLAIN are so helpful – as a first step they make it easy to collect a lot of the information you might need.) But with a simple text editor and a purely visual approach it’s often quite easy to narrow the focus down to the most relevant part of the plan – and that’s what I’m going to demonstrate with this example.

This posting is going to get quite long because the good and bad plans are 108 and 110 lines respectively, so I’m going to use the “hide/reveal” html codes inline as we work through a few steps of analysis that eliminate sections of the plan and make it easier to read and understand. So here’s the bad plan:

------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                  | Name                          | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                           |                               |       |       |       |   244K(100)|          |
|   1 |  TABLE ACCESS BY INDEX ROWID               | ACCOUNT                       |     1 |    25 |       |     5  (20)| 00:00:01 |
|   2 |   BITMAP CONVERSION TO ROWIDS              |                               |       |       |       |            |          |
|   3 |    BITMAP AND                              |                               |       |       |       |            |          |
|   4 |     BITMAP INDEX SINGLE VALUE              | IDX_ACCOUNT_OBJ_SOURCE        |       |       |       |            |          |
|   5 |     BITMAP INDEX SINGLE VALUE              | IDX_ACCOUNT_OBJ_TYPE          |       |       |       |            |          |
|   6 |     BITMAP CONVERSION FROM ROWIDS          |                               |       |       |       |            |          |
|   7 |      SORT ORDER BY                         |                               |       |       |       |            |          |
|   8 |       INDEX RANGE SCAN                     | UQ_ACCOUNT                    |     5 |       |       |     2   (0)| 00:00:01 |
|   9 |   TABLE ACCESS BY INDEX ROWID              | ACCOUNT                       |     1 |    25 |       |     5  (20)| 00:00:01 |
|  10 |    BITMAP CONVERSION TO ROWIDS             |                               |       |       |       |            |          |
|  11 |     BITMAP AND                             |                               |       |       |       |            |          |
|  12 |      BITMAP INDEX SINGLE VALUE             | IDX_ACCOUNT_OBJ_SOURCE        |       |       |       |            |          |
|  13 |      BITMAP INDEX SINGLE VALUE             | IDX_ACCOUNT_OBJ_TYPE          |       |       |       |            |          |
|  14 |      BITMAP CONVERSION FROM ROWIDS         |                               |       |       |       |            |          |
|  15 |       SORT ORDER BY                        |                               |       |       |       |            |          |
|  16 |        INDEX RANGE SCAN                    | UQ_ACCOUNT                    |     5 |       |       |     2   (0)| 00:00:01 |
|  17 |  HASH JOIN RIGHT OUTER                     |                               | 93634 |    28M|       |   244K  (1)| 00:49:00 |
|  18 |   VIEW                                     | X_SECURITY_ACCOUNTS_V         |   434 | 12152 |       |   368   (2)| 00:00:05 |
|  19 |    HASH GROUP BY                           |                               |   434 |  7812 |       |   368   (2)| 00:00:05 |
|  20 |     VIEW                                   |                               |   434 |  7812 |       |   367   (2)| 00:00:05 |
|  21 |      SORT UNIQUE                           |                               |   434 |   101K|       |   367   (2)| 00:00:05 |
|  22 |       UNION-ALL                            |                               |       |       |       |            |          |
|  23 |        CONNECT BY WITH FILTERING (UNIQUE)  |                               |       |       |       |            |          |
|  24 |         CONCATENATION                      |                               |       |       |       |            |          |
|  25 |          INLIST ITERATOR                   |                               |       |       |       |            |          |
|  26 |           TABLE ACCESS BY INDEX ROWID      | ACCOUNT                       |     1 |    69 |       |     7   (0)| 00:00:01 |
|  27 |            INDEX RANGE SCAN                | UQ_ACCOUNT                    |     1 |       |       |     6   (0)| 00:00:01 |
|  28 |          INLIST ITERATOR                   |                               |       |       |       |            |          |
|  29 |           TABLE ACCESS BY INDEX ROWID      | ACCOUNT                       |     1 |    69 |       |     7   (0)| 00:00:01 |
|  30 |            INDEX RANGE SCAN                | UQ_ACCOUNT                    |     1 |       |       |     6   (0)| 00:00:01 |
|  31 |         NESTED LOOPS                       |                               |    36 |  4752 |       |    34   (0)| 00:00:01 |
|  32 |          CONNECT BY PUMP                   |                               |       |       |       |            |          |
|  33 |          TABLE ACCESS BY INDEX ROWID       | ACCOUNT                       |    18 |  1368 |       |    10   (0)| 00:00:01 |
|  34 |           INDEX RANGE SCAN                 | IDX_ACCOUNT_PARENT            |    18 |       |       |     2   (0)| 00:00:01 |
|  35 |        CONNECT BY WITH FILTERING (UNIQUE)  |                               |       |       |       |            |          |
|  36 |         CONCATENATION                      |                               |       |       |       |            |          |
|  37 |          TABLE ACCESS BY INDEX ROWID       | ACCOUNT                       |     1 |    69 |       |     4   (0)| 00:00:01 |
|  38 |           INDEX RANGE SCAN                 | IDX_ACCOUNT_PARENT            |     1 |       |       |     3   (0)| 00:00:01 |
|  39 |          INLIST ITERATOR                   |                               |       |       |       |            |          |
|  40 |           TABLE ACCESS BY INDEX ROWID      | ACCOUNT                       |    20 |  1380 |       |    47   (0)| 00:00:01 |
|  41 |            INDEX RANGE SCAN                | IDX_ACCOUNT_PARENT            |    20 |       |       |    39   (0)| 00:00:01 |
|  42 |         NESTED LOOPS                       |                               |   374 | 49368 |       |   261   (0)| 00:00:04 |
|  43 |          CONNECT BY PUMP                   |                               |       |       |       |            |          |
|  44 |          TABLE ACCESS BY INDEX ROWID       | ACCOUNT                       |    18 |  1368 |       |    10   (0)| 00:00:01 |
|  45 |           INDEX RANGE SCAN                 | IDX_ACCOUNT_PARENT            |    18 |       |       |     2   (0)| 00:00:01 |
|  46 |        TABLE ACCESS BY INDEX ROWID         | ACCOUNT                       |     1 |    26 |       |     4   (0)| 00:00:01 |
|  47 |         INDEX RANGE SCAN                   | IDX_ACCOUNT_PARENT            |     1 |       |       |     3   (0)| 00:00:01 |
|  48 |   HASH JOIN RIGHT OUTER                    |                               | 52211 |    14M|       |   244K  (1)| 00:48:55 |
|  49 |    VIEW                                    | X_SECURITY_OBJECTS_V          |  1976 | 55328 |       |  3303   (1)| 00:00:40 |
|  50 |     HASH GROUP BY                          |                               |  1976 | 35568 |       |  3303   (1)| 00:00:40 |
|  51 |      VIEW                                  |                               |  1976 | 35568 |       |  3302   (1)| 00:00:40 |
|  52 |       SORT UNIQUE                          |                               |  1976 |   482K|       |  3302   (1)| 00:00:40 |
|  53 |        UNION-ALL                           |                               |       |       |       |            |          |
|  54 |         CONNECT BY WITH FILTERING (UNIQUE) |                               |       |       |       |            |          |
|  55 |          TABLE ACCESS FULL                 | ORGANIZATION                  |  1965 |   132K|       |   551   (2)| 00:00:07 |
|  56 |          HASH JOIN                         |                               |     5 |   675 |       |  1099   (1)| 00:00:14 |
|  57 |           CONNECT BY PUMP                  |                               |       |       |       |            |          |
|  58 |           TABLE ACCESS FULL                | ORGANIZATION                  | 76990 |  5939K|       |   547   (1)| 00:00:07 |
|  59 |         FILTER                             |                               |       |       |       |            |          |
|  60 |          CONNECT BY WITH FILTERING (UNIQUE)|                               |       |       |       |            |          |
|  61 |           TABLE ACCESS FULL                | ORGANIZATION                  |     4 |   272 |       |   548   (1)| 00:00:07 |
|  62 |           HASH JOIN                        |                               |     1 |   124 |       |  1096   (1)| 00:00:14 |
|  63 |            CONNECT BY PUMP                 |                               |       |       |       |            |          |
|  64 |            TABLE ACCESS FULL               | ORGANIZATION                  | 76990 |  5112K|       |   548   (1)| 00:00:07 |
|  65 |         INLIST ITERATOR                    |                               |       |       |       |            |          |
|  66 |          TABLE ACCESS BY INDEX ROWID       | ORGANIZATION                  |     1 |    37 |       |     5   (0)| 00:00:01 |
|  67 |           INDEX RANGE SCAN                 | UQ_ORGANIZATION_NATURAL       |     1 |       |       |     4   (0)| 00:00:01 |
|  68 |    HASH JOIN                               |                               | 52211 |    12M|       |   241K  (1)| 00:48:16 |
|  69 |     TABLE ACCESS FULL                      | CURRENCY                      |    40 |  2560 |       |     3   (0)| 00:00:01 |
|  70 |     HASH JOIN                              |                               | 52211 |  9993K|  3072K|   241K  (1)| 00:48:16 |
|  71 |      TABLE ACCESS FULL                     | ACCOUNT                       | 45570 |  2536K|       |   274   (1)| 00:00:04 |
|  72 |      HASH JOIN                             |                               | 52211 |  7087K|       |   240K  (1)| 00:48:06 |
|  73 |       TABLE ACCESS FULL                    | COMPANY                       |   112 |  1008 |       |     3   (0)| 00:00:01 |
|  74 |       HASH JOIN RIGHT OUTER                |                               | 52211 |  6628K|       |   240K  (1)| 00:48:06 |
|  75 |        VIEW                                |                               |    19 |   247 |       |    18  (12)| 00:00:01 |
|  76 |         FILTER                             |                               |       |       |       |            |          |
|  77 |          CONNECT BY WITH FILTERING         |                               |       |       |       |            |          |
|  78 |           TABLE ACCESS BY INDEX ROWID      | ACCOUNT                       |     1 |    69 |       |     3   (0)| 00:00:01 |
|  79 |            INDEX RANGE SCAN                | UQ_ACCOUNT                    |     1 |       |       |     2   (0)| 00:00:01 |
|  80 |           NESTED LOOPS                     |                               |    18 |  2376 |       |    13   (0)| 00:00:01 |
|  81 |            CONNECT BY PUMP                 |                               |       |       |       |            |          |
|  82 |            TABLE ACCESS BY INDEX ROWID     | ACCOUNT                       |    18 |  1368 |       |    10   (0)| 00:00:01 |
|  83 |             INDEX RANGE SCAN               | IDX_ACCOUNT_PARENT            |    18 |       |       |     2   (0)| 00:00:01 |
|  84 |        HASH JOIN RIGHT OUTER               |                               | 52211 |  5965K|       |   240K  (1)| 00:48:06 |
|  85 |         VIEW                               |                               |     2 |    26 |       |    30   (7)| 00:00:01 |
|  86 |          FILTER                            |                               |       |       |       |            |          |
|  87 |           CONNECT BY WITH FILTERING        |                               |       |       |       |            |          |
|  88 |            TABLE ACCESS BY INDEX ROWID     | ORGANIZATION                  |     1 |    69 |       |    12   (0)| 00:00:01 |
|  89 |             BITMAP CONVERSION TO ROWIDS    |                               |       |       |       |            |          |
|  90 |              BITMAP AND                    |                               |       |       |       |            |          |
|  91 |               BITMAP INDEX SINGLE VALUE    | IDX_ORGANIZATION_OBJ_TYPE     |       |       |       |            |          |
|  92 |               BITMAP OR                    |                               |       |       |       |            |          |
|  93 |                BITMAP INDEX SINGLE VALUE   | IDX_ORGANIZATION_OBJ_COMPANY  |       |       |       |            |          |
|  94 |                BITMAP INDEX SINGLE VALUE   | IDX_ORGANIZATION_OBJ_COMPANY  |       |       |       |            |          |
|  95 |                BITMAP INDEX SINGLE VALUE   | IDX_ORGANIZATION_OBJ_COMPANY  |       |       |       |            |          |
|  96 |                BITMAP INDEX SINGLE VALUE   | IDX_ORGANIZATION_OBJ_COMPANY  |       |       |       |            |          |
|  97 |                BITMAP INDEX SINGLE VALUE   | IDX_ORGANIZATION_OBJ_COMPANY  |       |       |       |            |          |
|  98 |            NESTED LOOPS                    |                               |     1 |   135 |       |    15   (0)| 00:00:01 |
|  99 |             CONNECT BY PUMP                |                               |       |       |       |            |          |
| 100 |             TABLE ACCESS BY INDEX ROWID    | ORGANIZATION                  |     1 |    79 |       |     3   (0)| 00:00:01 |
| 101 |              INDEX RANGE SCAN              | IDX_ORGANIZATION_PARENT       |     1 |       |       |     2   (0)| 00:00:01 |
| 102 |         HASH JOIN                          |                               | 52211 |  5302K|       |   240K  (1)| 00:48:06 |
| 103 |          TABLE ACCESS FULL                 | TIME                          |     9 |   171 |       |     3   (0)| 00:00:01 |
| 104 |          NESTED LOOPS                      |                               |   634K|    51M|       |   240K  (1)| 00:48:06 |
| 105 |           NESTED LOOPS                     |                               |   634K|    51M|       |   240K  (1)| 00:48:06 |
| 106 |            NESTED LOOPS                    |                               |  5756 |   185K|       |   552   (1)| 00:00:07 |
| 107 |             FAST DUAL                      |                               |     1 |       |       |     2   (0)| 00:00:01 |
| 108 |             TABLE ACCESS FULL              | ORGANIZATION                  |  5756 |   185K|       |   550   (1)| 00:00:07 |
| 109 |            INDEX RANGE SCAN                | IDX_FACT_PLAN_SUPP_K_COMP_ORG |    65 |       |       |    19   (0)| 00:00:01 |
| 110 |           TABLE ACCESS BY INDEX ROWID      | FACT_PLAN_SUPP                |   110 |  5720 |       |    82   (0)| 00:00:01 |
------------------------------------------------------------------------------------------------------------------------------------

and the good plan

-----------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                  | Name                         | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                           |                              |       |       |       |   545K(100)|          |
|   1 |  TABLE ACCESS BY INDEX ROWID               | ACCOUNT                      |     1 |    25 |       |     5  (20)| 00:00:01 |
|   2 |   BITMAP CONVERSION TO ROWIDS              |                              |       |       |       |            |          |
|   3 |    BITMAP AND                              |                              |       |       |       |            |          |
|   4 |     BITMAP INDEX SINGLE VALUE              | IDX_ACCOUNT_OBJ_SOURCE       |       |       |       |            |          |
|   5 |     BITMAP INDEX SINGLE VALUE              | IDX_ACCOUNT_OBJ_TYPE         |       |       |       |            |          |
|   6 |     BITMAP CONVERSION FROM ROWIDS          |                              |       |       |       |            |          |
|   7 |      SORT ORDER BY                         |                              |       |       |       |            |          |
|   8 |       INDEX RANGE SCAN                     | UQ_ACCOUNT                   |     5 |       |       |     2   (0)| 00:00:01 |
|   9 |   TABLE ACCESS BY INDEX ROWID              | ACCOUNT                      |     1 |    25 |       |     5  (20)| 00:00:01 |
|  10 |    BITMAP CONVERSION TO ROWIDS             |                              |       |       |       |            |          |
|  11 |     BITMAP AND                             |                              |       |       |       |            |          |
|  12 |      BITMAP INDEX SINGLE VALUE             | IDX_ACCOUNT_OBJ_SOURCE       |       |       |       |            |          |
|  13 |      BITMAP INDEX SINGLE VALUE             | IDX_ACCOUNT_OBJ_TYPE         |       |       |       |            |          |
|  14 |      BITMAP CONVERSION FROM ROWIDS         |                              |       |       |       |            |          |
|  15 |       SORT ORDER BY                        |                              |       |       |       |            |          |
|  16 |        INDEX RANGE SCAN                    | UQ_ACCOUNT                   |     5 |       |       |     2   (0)| 00:00:01 |
|  17 |  HASH JOIN RIGHT OUTER                     |                              |   539K|   162M|       |   545K  (2)| 01:49:03 |
|  18 |   VIEW                                     | X_SECURITY_ACCOUNTS_V        |   434 | 12152 |       |   368   (2)| 00:00:05 |
|  19 |    HASH GROUP BY                           |                              |   434 |  7812 |       |   368   (2)| 00:00:05 |
|  20 |     VIEW                                   |                              |   434 |  7812 |       |   367   (1)| 00:00:05 |
|  21 |      SORT UNIQUE                           |                              |   434 |   101K|       |   367   (1)| 00:00:05 |
|  22 |       UNION-ALL                            |                              |       |       |       |            |          |
|  23 |        CONNECT BY WITH FILTERING (UNIQUE)  |                              |       |       |       |            |          |
|  24 |         CONCATENATION                      |                              |       |       |       |            |          |
|  25 |          INLIST ITERATOR                   |                              |       |       |       |            |          |
|  26 |           TABLE ACCESS BY INDEX ROWID      | ACCOUNT                      |     1 |    69 |       |     7   (0)| 00:00:01 |
|  27 |            INDEX RANGE SCAN                | UQ_ACCOUNT                   |     1 |       |       |     6   (0)| 00:00:01 |
|  28 |          INLIST ITERATOR                   |                              |       |       |       |            |          |
|  29 |           TABLE ACCESS BY INDEX ROWID      | ACCOUNT                      |     1 |    69 |       |     7   (0)| 00:00:01 |
|  30 |            INDEX RANGE SCAN                | UQ_ACCOUNT                   |     1 |       |       |     6   (0)| 00:00:01 |
|  31 |         NESTED LOOPS                       |                              |    36 |  4752 |       |    34   (0)| 00:00:01 |
|  32 |          CONNECT BY PUMP                   |                              |       |       |       |            |          |
|  33 |          TABLE ACCESS BY INDEX ROWID       | ACCOUNT                      |    18 |  1368 |       |    10   (0)| 00:00:01 |
|  34 |           INDEX RANGE SCAN                 | IDX_ACCOUNT_PARENT           |    18 |       |       |     2   (0)| 00:00:01 |
|  35 |        CONNECT BY WITH FILTERING (UNIQUE)  |                              |       |       |       |            |          |
|  36 |         CONCATENATION                      |                              |       |       |       |            |          |
|  37 |          TABLE ACCESS BY INDEX ROWID       | ACCOUNT                      |     1 |    69 |       |     4   (0)| 00:00:01 |
|  38 |           INDEX RANGE SCAN                 | IDX_ACCOUNT_PARENT           |     1 |       |       |     3   (0)| 00:00:01 |
|  39 |          INLIST ITERATOR                   |                              |       |       |       |            |          |
|  40 |           TABLE ACCESS BY INDEX ROWID      | ACCOUNT                      |    20 |  1380 |       |    47   (0)| 00:00:01 |
|  41 |            INDEX RANGE SCAN                | IDX_ACCOUNT_PARENT           |    20 |       |       |    39   (0)| 00:00:01 |
|  42 |         NESTED LOOPS                       |                              |   374 | 49368 |       |   261   (0)| 00:00:04 |
|  43 |          CONNECT BY PUMP                   |                              |       |       |       |            |          |
|  44 |          TABLE ACCESS BY INDEX ROWID       | ACCOUNT                      |    18 |  1368 |       |    10   (0)| 00:00:01 |
|  45 |           INDEX RANGE SCAN                 | IDX_ACCOUNT_PARENT           |    18 |       |       |     2   (0)| 00:00:01 |
|  46 |        TABLE ACCESS BY INDEX ROWID         | ACCOUNT                      |     1 |    26 |       |     4   (0)| 00:00:01 |
|  47 |         INDEX RANGE SCAN                   | IDX_ACCOUNT_PARENT           |     1 |       |       |     3   (0)| 00:00:01 |
|  48 |   HASH JOIN RIGHT OUTER                    |                              |   300K|    82M|       |   544K  (2)| 01:48:58 |
|  49 |    VIEW                                    | X_SECURITY_OBJECTS_V         |  1920 | 53760 |       |  3303   (1)| 00:00:40 |
|  50 |     HASH GROUP BY                          |                              |  1920 | 34560 |       |  3303   (1)| 00:00:40 |
|  51 |      VIEW                                  |                              |  1920 | 34560 |       |  3302   (1)| 00:00:40 |
|  52 |       SORT UNIQUE                          |                              |  1920 |   468K|       |  3302   (1)| 00:00:40 |
|  53 |        UNION-ALL                           |                              |       |       |       |            |          |
|  54 |         CONNECT BY WITH FILTERING (UNIQUE) |                              |       |       |       |            |          |
|  55 |          TABLE ACCESS FULL                 | ORGANIZATION                 |  1909 |   128K|       |   551   (2)| 00:00:07 |
|  56 |          HASH JOIN                         |                              |     5 |   675 |       |  1099   (1)| 00:00:14 |
|  57 |           CONNECT BY PUMP                  |                              |       |       |       |            |          |
|  58 |           TABLE ACCESS FULL                | ORGANIZATION                 | 76836 |  5927K|       |   547   (1)| 00:00:07 |
|  59 |         FILTER                             |                              |       |       |       |            |          |
|  60 |          CONNECT BY WITH FILTERING (UNIQUE)|                              |       |       |       |            |          |
|  61 |           TABLE ACCESS FULL                | ORGANIZATION                 |     4 |   272 |       |   548   (1)| 00:00:07 |
|  62 |           HASH JOIN                        |                              |     1 |   124 |       |  1096   (1)| 00:00:14 |
|  63 |            CONNECT BY PUMP                 |                              |       |       |       |            |          |
|  64 |            TABLE ACCESS FULL               | ORGANIZATION                 | 76836 |  5102K|       |   548   (1)| 00:00:07 |
|  65 |         INLIST ITERATOR                    |                              |       |       |       |            |          |
|  66 |          TABLE ACCESS BY INDEX ROWID       | ORGANIZATION                 |     1 |    37 |       |     5   (0)| 00:00:01 |
|  67 |           INDEX RANGE SCAN                 | UQ_ORGANIZATION_NATURAL      |     1 |       |       |     4   (0)| 00:00:01 |
|  68 |    HASH JOIN                               |                              |   300K|    74M|       |   541K  (2)| 01:48:18 |
|  69 |     TABLE ACCESS FULL                      | CURRENCY                     |    40 |  2560 |       |     3   (0)| 00:00:01 |
|  70 |     HASH JOIN RIGHT OUTER                  |                              |   300K|    56M|       |   541K  (2)| 01:48:18 |
|  71 |      VIEW                                  |                              |    19 |   247 |       |    18  (12)| 00:00:01 |
|  72 |       FILTER                               |                              |       |       |       |            |          |
|  73 |        CONNECT BY WITH FILTERING           |                              |       |       |       |            |          |
|  74 |         TABLE ACCESS BY INDEX ROWID        | ACCOUNT                      |     1 |    69 |       |     3   (0)| 00:00:01 |
|  75 |          INDEX RANGE SCAN                  | UQ_ACCOUNT                   |     1 |       |       |     2   (0)| 00:00:01 |
|  76 |         NESTED LOOPS                       |                              |    18 |  2376 |       |    13   (0)| 00:00:01 |
|  77 |          CONNECT BY PUMP                   |                              |       |       |       |            |          |
|  78 |          TABLE ACCESS BY INDEX ROWID       | ACCOUNT                      |    18 |  1368 |       |    10   (0)| 00:00:01 |
|  79 |           INDEX RANGE SCAN                 | IDX_ACCOUNT_PARENT           |    18 |       |       |     2   (0)| 00:00:01 |
|  80 |      HASH JOIN                             |                              |   300K|    52M|  3072K|   541K  (2)| 01:48:18 |
|  81 |       TABLE ACCESS FULL                    | ACCOUNT                      | 45564 |  2536K|       |   274   (1)| 00:00:04 |
|  82 |       HASH JOIN                            |                              |   300K|    36M|  2576K|   539K  (2)| 01:47:50 |
|  83 |        TABLE ACCESS FULL                   | ORGANIZATION                 | 58577 |  1887K|       |   550   (1)| 00:00:07 |
|  84 |        HASH JOIN                           |                              |  6984K|   619M|       |   503K  (2)| 01:40:44 |
|  85 |         TABLE ACCESS FULL                  | COMPANY                      |   112 |  1008 |       |     3   (0)| 00:00:01 |
|  86 |         HASH JOIN RIGHT OUTER              |                              |  6984K|   559M|       |   503K  (2)| 01:40:44 |
|  87 |          VIEW                              |                              |     2 |    26 |       |    29   (7)| 00:00:01 |
|  88 |           FILTER                           |                              |       |       |       |            |          |
|  89 |            CONNECT BY WITH FILTERING       |                              |       |       |       |            |          |
|  90 |             TABLE ACCESS BY INDEX ROWID    | ORGANIZATION                 |     1 |    69 |       |    12   (0)| 00:00:01 |
|  91 |              BITMAP CONVERSION TO ROWIDS   |                              |       |       |       |            |          |
|  92 |               BITMAP AND                   |                              |       |       |       |            |          |
|  93 |                BITMAP INDEX SINGLE VALUE   | IDX_ORGANIZATION_OBJ_TYPE    |       |       |       |            |          |
|  94 |                BITMAP OR                   |                              |       |       |       |            |          |
|  95 |                 BITMAP INDEX SINGLE VALUE  | IDX_ORGANIZATION_OBJ_COMPANY |       |       |       |            |          |
|  96 |                 BITMAP INDEX SINGLE VALUE  | IDX_ORGANIZATION_OBJ_COMPANY |       |       |       |            |          |
|  97 |                 BITMAP INDEX SINGLE VALUE  | IDX_ORGANIZATION_OBJ_COMPANY |       |       |       |            |          |
|  98 |                 BITMAP INDEX SINGLE VALUE  | IDX_ORGANIZATION_OBJ_COMPANY |       |       |       |            |          |
|  99 |                 BITMAP INDEX SINGLE VALUE  | IDX_ORGANIZATION_OBJ_COMPANY |       |       |       |            |          |
| 100 |             NESTED LOOPS                   |                              |     1 |   135 |       |    15   (0)| 00:00:01 |
| 101 |              CONNECT BY PUMP               |                              |       |       |       |            |          |
| 102 |              TABLE ACCESS BY INDEX ROWID   | ORGANIZATION                 |     1 |    79 |       |     3   (0)| 00:00:01 |
| 103 |               INDEX RANGE SCAN             | IDX_ORGANIZATION_PARENT      |     1 |       |       |     2   (0)| 00:00:01 |
| 104 |          HASH JOIN                         |                              |  6984K|   472M|       |   503K  (2)| 01:40:43 |
| 105 |           NESTED LOOPS                     |                              |     9 |   171 |       |     5   (0)| 00:00:01 |
| 106 |            FAST DUAL                       |                              |     1 |       |       |     2   (0)| 00:00:01 |
| 107 |            TABLE ACCESS FULL               | TIME                         |     9 |   171 |       |     3   (0)| 00:00:01 |
| 108 |           TABLE ACCESS FULL                | FACT_PLAN_SUPP               |    84M|  4207M|       |   503K  (2)| 01:40:37 |
-----------------------------------------------------------------------------------------------------------------------------------

You might want to cut and paste these into two separate windows in a text editor – which is how I worked on the plans (using vi on a Linux box).

I’ve highlighted two lines from the bad plan – operations 1 and 17, which are both indented one place; the level of indent tells me that operation 17 is the start of the plan for the main query and operation 1 is the start of the plan for an inline scalar subquery, so my first check is that operations 1 to 16 (highlighted in the good plan) are the same in both plans: they are, so they’re probably not the source of the problem and can be eliminated from the plans. In passing, you’ll notice that operation 9 is a table access which is indented one more place – depending on the exact version of Oracle (and which patches have fixed which bugs) this might mean there are two scalar subqueries in the select list; it might mean there’s one scalar subquery with a “pushed” filter subquery; or (most likely on recent versions of Oracle) it might mean there’s a case or decode() operator that invokes two scalar subqueries.
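
For anyone who hasn’t met that pattern before, here’s a minimal sketch of the third possibility – the table, column and literal names are pure invention, illustrating only the shape of the construct rather than reconstructing the OP’s SQL:

select
        case t.status
                when 'A'
                then (select max(acc.balance) from account acc where acc.id = t.id)
                else (select min(acc.balance) from account acc where acc.id = t.id)
        end     derived_balance,
        t.*
from
        some_table t
;

With a construct like this the two accesses to account can appear as scalar subquery sub-plans at the top of the execution plan, one indented a step further than the other.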

I’ve also highlighted two other chunks in both plans. Looking at the bad plan, operation 75 is an inline view derived from operations 76 – 83; this corresponds to the view at operation 71 in the good plan, derived from operations 72 – 79: a quick check shows that these two sets of lines are identical, so they can be eliminated and replaced with just a single VIEW operation which I’ll identify by adding in an object Name of VIEW_1.

Similarly I’ve highlighted operations 86 – 101 in the bad plan and 88 – 103 in the good plan – a visual check shows that they are identical (with just a tiny difference in the final VIEW cost) so I’ve eliminated those lines and given the controlling VIEW operation an object Name of VIEW_2.

With the three chunks removed, this is what the plans look like:

------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                  | Name                          | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                           |                               |       |       |       |   244K(100)|          |
|  17 |  HASH JOIN RIGHT OUTER                     |                               | 93634 |    28M|       |   244K  (1)| 00:49:00 |
|  18 |   VIEW                                     | X_SECURITY_ACCOUNTS_V         |   434 | 12152 |       |   368   (2)| 00:00:05 |
|  19 |    HASH GROUP BY                           |                               |   434 |  7812 |       |   368   (2)| 00:00:05 |
|  20 |     VIEW                                   |                               |   434 |  7812 |       |   367   (2)| 00:00:05 |
|  21 |      SORT UNIQUE                           |                               |   434 |   101K|       |   367   (2)| 00:00:05 |
|  22 |       UNION-ALL                            |                               |       |       |       |            |          |
|  23 |        CONNECT BY WITH FILTERING (UNIQUE)  |                               |       |       |       |            |          |
|  24 |         CONCATENATION                      |                               |       |       |       |            |          |
|  25 |          INLIST ITERATOR                   |                               |       |       |       |            |          |
|  26 |           TABLE ACCESS BY INDEX ROWID      | ACCOUNT                       |     1 |    69 |       |     7   (0)| 00:00:01 |
|  27 |            INDEX RANGE SCAN                | UQ_ACCOUNT                    |     1 |       |       |     6   (0)| 00:00:01 |
|  28 |          INLIST ITERATOR                   |                               |       |       |       |            |          |
|  29 |           TABLE ACCESS BY INDEX ROWID      | ACCOUNT                       |     1 |    69 |       |     7   (0)| 00:00:01 |
|  30 |            INDEX RANGE SCAN                | UQ_ACCOUNT                    |     1 |       |       |     6   (0)| 00:00:01 |
|  31 |         NESTED LOOPS                       |                               |    36 |  4752 |       |    34   (0)| 00:00:01 |
|  32 |          CONNECT BY PUMP                   |                               |       |       |       |            |          |
|  33 |          TABLE ACCESS BY INDEX ROWID       | ACCOUNT                       |    18 |  1368 |       |    10   (0)| 00:00:01 |
|  34 |           INDEX RANGE SCAN                 | IDX_ACCOUNT_PARENT            |    18 |       |       |     2   (0)| 00:00:01 |
|  35 |        CONNECT BY WITH FILTERING (UNIQUE)  |                               |       |       |       |            |          |
|  36 |         CONCATENATION                      |                               |       |       |       |            |          |
|  37 |          TABLE ACCESS BY INDEX ROWID       | ACCOUNT                       |     1 |    69 |       |     4   (0)| 00:00:01 |
|  38 |           INDEX RANGE SCAN                 | IDX_ACCOUNT_PARENT            |     1 |       |       |     3   (0)| 00:00:01 |
|  39 |          INLIST ITERATOR                   |                               |       |       |       |            |          |
|  40 |           TABLE ACCESS BY INDEX ROWID      | ACCOUNT                       |    20 |  1380 |       |    47   (0)| 00:00:01 |
|  41 |            INDEX RANGE SCAN                | IDX_ACCOUNT_PARENT            |    20 |       |       |    39   (0)| 00:00:01 |
|  42 |         NESTED LOOPS                       |                               |   374 | 49368 |       |   261   (0)| 00:00:04 |
|  43 |          CONNECT BY PUMP                   |                               |       |       |       |            |          |
|  44 |          TABLE ACCESS BY INDEX ROWID       | ACCOUNT                       |    18 |  1368 |       |    10   (0)| 00:00:01 |
|  45 |           INDEX RANGE SCAN                 | IDX_ACCOUNT_PARENT            |    18 |       |       |     2   (0)| 00:00:01 |
|  46 |        TABLE ACCESS BY INDEX ROWID         | ACCOUNT                       |     1 |    26 |       |     4   (0)| 00:00:01 |
|  47 |         INDEX RANGE SCAN                   | IDX_ACCOUNT_PARENT            |     1 |       |       |     3   (0)| 00:00:01 |
|  48 |   HASH JOIN RIGHT OUTER                    |                               | 52211 |    14M|       |   244K  (1)| 00:48:55 |
|  49 |    VIEW                                    | X_SECURITY_OBJECTS_V          |  1976 | 55328 |       |  3303   (1)| 00:00:40 |
|  50 |     HASH GROUP BY                          |                               |  1976 | 35568 |       |  3303   (1)| 00:00:40 |
|  51 |      VIEW                                  |                               |  1976 | 35568 |       |  3302   (1)| 00:00:40 |
|  52 |       SORT UNIQUE                          |                               |  1976 |   482K|       |  3302   (1)| 00:00:40 |
|  53 |        UNION-ALL                           |                               |       |       |       |            |          |
|  54 |         CONNECT BY WITH FILTERING (UNIQUE) |                               |       |       |       |            |          |
|  55 |          TABLE ACCESS FULL                 | ORGANIZATION                  |  1965 |   132K|       |   551   (2)| 00:00:07 |
|  56 |          HASH JOIN                         |                               |     5 |   675 |       |  1099   (1)| 00:00:14 |
|  57 |           CONNECT BY PUMP                  |                               |       |       |       |            |          |
|  58 |           TABLE ACCESS FULL                | ORGANIZATION                  | 76990 |  5939K|       |   547   (1)| 00:00:07 |
|  59 |         FILTER                             |                               |       |       |       |            |          |
|  60 |          CONNECT BY WITH FILTERING (UNIQUE)|                               |       |       |       |            |          |
|  61 |           TABLE ACCESS FULL                | ORGANIZATION                  |     4 |   272 |       |   548   (1)| 00:00:07 |
|  62 |           HASH JOIN                        |                               |     1 |   124 |       |  1096   (1)| 00:00:14 |
|  63 |            CONNECT BY PUMP                 |                               |       |       |       |            |          |
|  64 |            TABLE ACCESS FULL               | ORGANIZATION                  | 76990 |  5112K|       |   548   (1)| 00:00:07 |
|  65 |         INLIST ITERATOR                    |                               |       |       |       |            |          |
|  66 |          TABLE ACCESS BY INDEX ROWID       | ORGANIZATION                  |     1 |    37 |       |     5   (0)| 00:00:01 |
|  67 |           INDEX RANGE SCAN                 | UQ_ORGANIZATION_NATURAL       |     1 |       |       |     4   (0)| 00:00:01 |
|  68 |    HASH JOIN                               |                               | 52211 |    12M|       |   241K  (1)| 00:48:16 |
|  69 |     TABLE ACCESS FULL                      | CURRENCY                      |    40 |  2560 |       |     3   (0)| 00:00:01 |
|  70 |     HASH JOIN                              |                               | 52211 |  9993K|  3072K|   241K  (1)| 00:48:16 |
|  71 |      TABLE ACCESS FULL                     | ACCOUNT                       | 45570 |  2536K|       |   274   (1)| 00:00:04 |
|  72 |      HASH JOIN                             |                               | 52211 |  7087K|       |   240K  (1)| 00:48:06 |
|  73 |       TABLE ACCESS FULL                    | COMPANY                       |   112 |  1008 |       |     3   (0)| 00:00:01 |
|  74 |       HASH JOIN RIGHT OUTER                |                               | 52211 |  6628K|       |   240K  (1)| 00:48:06 |
|  75 |        VIEW                                | --- VIEW_1                    |    19 |   247 |       |    18  (12)| 00:00:01 |
|  84 |        HASH JOIN RIGHT OUTER               |                               | 52211 |  5965K|       |   240K  (1)| 00:48:06 |
|  85 |         VIEW                               | --- VIEW_2                    |     2 |    26 |       |    30   (7)| 00:00:01 |
| 102 |         HASH JOIN                          |                               | 52211 |  5302K|       |   240K  (1)| 00:48:06 |
| 103 |          TABLE ACCESS FULL                 | TIME                          |     9 |   171 |       |     3   (0)| 00:00:01 |
| 104 |          NESTED LOOPS                      |                               |   634K|    51M|       |   240K  (1)| 00:48:06 |
| 105 |           NESTED LOOPS                     |                               |   634K|    51M|       |   240K  (1)| 00:48:06 |
| 106 |            NESTED LOOPS                    |                               |  5756 |   185K|       |   552   (1)| 00:00:07 |
| 107 |             FAST DUAL                      |                               |     1 |       |       |     2   (0)| 00:00:01 |
| 108 |             TABLE ACCESS FULL              | ORGANIZATION                  |  5756 |   185K|       |   550   (1)| 00:00:07 |
| 109 |            INDEX RANGE SCAN                | IDX_FACT_PLAN_SUPP_K_COMP_ORG |    65 |       |       |    19   (0)| 00:00:01 |
| 110 |           TABLE ACCESS BY INDEX ROWID      | FACT_PLAN_SUPP                |   110 |  5720 |       |    82   (0)| 00:00:01 |
------------------------------------------------------------------------------------------------------------------------------------

-----------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                  | Name                         | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                           |                              |       |       |       |   545K(100)|          |
|  17 |  HASH JOIN RIGHT OUTER                     |                              |   539K|   162M|       |   545K  (2)| 01:49:03 |
|  18 |   VIEW                                     | X_SECURITY_ACCOUNTS_V        |   434 | 12152 |       |   368   (2)| 00:00:05 |
|  19 |    HASH GROUP BY                           |                              |   434 |  7812 |       |   368   (2)| 00:00:05 |
|  20 |     VIEW                                   |                              |   434 |  7812 |       |   367   (1)| 00:00:05 |
|  21 |      SORT UNIQUE                           |                              |   434 |   101K|       |   367   (1)| 00:00:05 |
|  22 |       UNION-ALL                            |                              |       |       |       |            |          |
|  23 |        CONNECT BY WITH FILTERING (UNIQUE)  |                              |       |       |       |            |          |
|  24 |         CONCATENATION                      |                              |       |       |       |            |          |
|  25 |          INLIST ITERATOR                   |                              |       |       |       |            |          |
|  26 |           TABLE ACCESS BY INDEX ROWID      | ACCOUNT                      |     1 |    69 |       |     7   (0)| 00:00:01 |
|  27 |            INDEX RANGE SCAN                | UQ_ACCOUNT                   |     1 |       |       |     6   (0)| 00:00:01 |
|  28 |          INLIST ITERATOR                   |                              |       |       |       |            |          |
|  29 |           TABLE ACCESS BY INDEX ROWID      | ACCOUNT                      |     1 |    69 |       |     7   (0)| 00:00:01 |
|  30 |            INDEX RANGE SCAN                | UQ_ACCOUNT                   |     1 |       |       |     6   (0)| 00:00:01 |
|  31 |         NESTED LOOPS                       |                              |    36 |  4752 |       |    34   (0)| 00:00:01 |
|  32 |          CONNECT BY PUMP                   |                              |       |       |       |            |          |
|  33 |          TABLE ACCESS BY INDEX ROWID       | ACCOUNT                      |    18 |  1368 |       |    10   (0)| 00:00:01 |
|  34 |           INDEX RANGE SCAN                 | IDX_ACCOUNT_PARENT           |    18 |       |       |     2   (0)| 00:00:01 |
|  35 |        CONNECT BY WITH FILTERING (UNIQUE)  |                              |       |       |       |            |          |
|  36 |         CONCATENATION                      |                              |       |       |       |            |          |
|  37 |          TABLE ACCESS BY INDEX ROWID       | ACCOUNT                      |     1 |    69 |       |     4   (0)| 00:00:01 |
|  38 |           INDEX RANGE SCAN                 | IDX_ACCOUNT_PARENT           |     1 |       |       |     3   (0)| 00:00:01 |
|  39 |          INLIST ITERATOR                   |                              |       |       |       |            |          |
|  40 |           TABLE ACCESS BY INDEX ROWID      | ACCOUNT                      |    20 |  1380 |       |    47   (0)| 00:00:01 |
|  41 |            INDEX RANGE SCAN                | IDX_ACCOUNT_PARENT           |    20 |       |       |    39   (0)| 00:00:01 |
|  42 |         NESTED LOOPS                       |                              |   374 | 49368 |       |   261   (0)| 00:00:04 |
|  43 |          CONNECT BY PUMP                   |                              |       |       |       |            |          |
|  44 |          TABLE ACCESS BY INDEX ROWID       | ACCOUNT                      |    18 |  1368 |       |    10   (0)| 00:00:01 |
|  45 |           INDEX RANGE SCAN                 | IDX_ACCOUNT_PARENT           |    18 |       |       |     2   (0)| 00:00:01 |
|  46 |        TABLE ACCESS BY INDEX ROWID         | ACCOUNT                      |     1 |    26 |       |     4   (0)| 00:00:01 |
|  47 |         INDEX RANGE SCAN                   | IDX_ACCOUNT_PARENT           |     1 |       |       |     3   (0)| 00:00:01 |
|  48 |   HASH JOIN RIGHT OUTER                    |                              |   300K|    82M|       |   544K  (2)| 01:48:58 |
|  49 |    VIEW                                    | X_SECURITY_OBJECTS_V         |  1920 | 53760 |       |  3303   (1)| 00:00:40 |
|  50 |     HASH GROUP BY                          |                              |  1920 | 34560 |       |  3303   (1)| 00:00:40 |
|  51 |      VIEW                                  |                              |  1920 | 34560 |       |  3302   (1)| 00:00:40 |
|  52 |       SORT UNIQUE                          |                              |  1920 |   468K|       |  3302   (1)| 00:00:40 |
|  53 |        UNION-ALL                           |                              |       |       |       |            |          |
|  54 |         CONNECT BY WITH FILTERING (UNIQUE) |                              |       |       |       |            |          |
|  55 |          TABLE ACCESS FULL                 | ORGANIZATION                 |  1909 |   128K|       |   551   (2)| 00:00:07 |
|  56 |          HASH JOIN                         |                              |     5 |   675 |       |  1099   (1)| 00:00:14 |
|  57 |           CONNECT BY PUMP                  |                              |       |       |       |            |          |
|  58 |           TABLE ACCESS FULL                | ORGANIZATION                 | 76836 |  5927K|       |   547   (1)| 00:00:07 |
|  59 |         FILTER                             |                              |       |       |       |            |          |
|  60 |          CONNECT BY WITH FILTERING (UNIQUE)|                              |       |       |       |            |          |
|  61 |           TABLE ACCESS FULL                | ORGANIZATION                 |     4 |   272 |       |   548   (1)| 00:00:07 |
|  62 |           HASH JOIN                        |                              |     1 |   124 |       |  1096   (1)| 00:00:14 |
|  63 |            CONNECT BY PUMP                 |                              |       |       |       |            |          |
|  64 |            TABLE ACCESS FULL               | ORGANIZATION                 | 76836 |  5102K|       |   548   (1)| 00:00:07 |
|  65 |         INLIST ITERATOR                    |                              |       |       |       |            |          |
|  66 |          TABLE ACCESS BY INDEX ROWID       | ORGANIZATION                 |     1 |    37 |       |     5   (0)| 00:00:01 |
|  67 |           INDEX RANGE SCAN                 | UQ_ORGANIZATION_NATURAL      |     1 |       |       |     4   (0)| 00:00:01 |
|  68 |    HASH JOIN                               |                              |   300K|    74M|       |   541K  (2)| 01:48:18 |
|  69 |     TABLE ACCESS FULL                      | CURRENCY                     |    40 |  2560 |       |     3   (0)| 00:00:01 |
|  70 |     HASH JOIN RIGHT OUTER                  |                              |   300K|    56M|       |   541K  (2)| 01:48:18 |
|  71 |      VIEW                                  | --- VIEW_1                   |    19 |   247 |       |    18  (12)| 00:00:01 |
|  80 |      HASH JOIN                             |                              |   300K|    52M|  3072K|   541K  (2)| 01:48:18 |
|  81 |       TABLE ACCESS FULL                    | ACCOUNT                      | 45564 |  2536K|       |   274   (1)| 00:00:04 |
|  82 |       HASH JOIN                            |                              |   300K|    36M|  2576K|   539K  (2)| 01:47:50 |
|  83 |        TABLE ACCESS FULL                   | ORGANIZATION                 | 58577 |  1887K|       |   550   (1)| 00:00:07 |
|  84 |        HASH JOIN                           |                              |  6984K|   619M|       |   503K  (2)| 01:40:44 |
|  85 |         TABLE ACCESS FULL                  | COMPANY                      |   112 |  1008 |       |     3   (0)| 00:00:01 |
|  86 |         HASH JOIN RIGHT OUTER              |                              |  6984K|   559M|       |   503K  (2)| 01:40:44 |
|  87 |          VIEW                              | --- VIEW_2                   |     2 |    26 |       |    29   (7)| 00:00:01 |
| 104 |          HASH JOIN                         |                              |  6984K|   472M|       |   503K  (2)| 01:40:43 |
| 105 |           NESTED LOOPS                     |                              |     9 |   171 |       |     5   (0)| 00:00:01 |
| 106 |            FAST DUAL                       |                              |     1 |       |       |     2   (0)| 00:00:01 |
| 107 |            TABLE ACCESS FULL               | TIME                         |     9 |   171 |       |     3   (0)| 00:00:01 |
| 108 |           TABLE ACCESS FULL                | FACT_PLAN_SUPP               |    84M|  4207M|       |   503K  (2)| 01:40:37 |
-----------------------------------------------------------------------------------------------------------------------------------

At this point I go into “astronomer” mode with a blink comparator. I’ll open two text files, one for each plan, in separate windows, align the two windows and texts and then “alt-tab” between the windows waiting for the “blink” that highlights where the texts differ.

In the case of these two plans the numeric parts slip sideways by one character on the blink – fortunately I find that to be a small enough step that I don’t worry about doing a manual edit to re-align them.  In the first stage I’m only looking for changes in the operations and objects, with a secondary check on the numbers. For these two plans I can see very quickly that they match up to operation 69 (highlighted in both plans above).

I’ve highlighted a few other lines in the first 69 operations of the plans:

  • Operation 17 has significantly different rows/costs in the two plans: 93,634/244K vs. 539K/545K
  • Operation 48 also has significant differences: 52,211/244K vs. 300K/544K – and, as the second child of the hash join at operation 17, it explains the previous differences
  • Operations 49 and 68 are the two child operations of the hash join at operation 48, with operation 68 carrying the differences
  • Operation 70 (where our plan shapes start to differ) is the second child of the hash join at operation 68 and introduces the differences

Although there are a few other differences in the cardinality (rows) and cost estimates in the first 70 lines of the plans, they are small and (probably) ignorable by comparison.

The upshot of this is that we can assume (reasonably confidently) that the critical differences occur in the section of the plan that is the hash join [right outer] from operation 70 and its descendants, so our plans reduce to the following (which emulates in text the type of appearance you would get from OEM, or TOAD, or SQL*Developer by collapsing parts of plans):

------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                  | Name                          | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                           |                               |       |       |       |   244K(100)|          |
...
|  17 |  HASH JOIN RIGHT OUTER                     |                               | 93634 |    28M|       |   244K  (1)| 00:49:00 |
|  18 |   VIEW                                     | X_SECURITY_ACCOUNTS_V         |   434 | 12152 |       |   368   (2)| 00:00:05 |
...
|  48 |   HASH JOIN RIGHT OUTER                    |                               | 52211 |    14M|       |   244K  (1)| 00:48:55 |
|  49 |    VIEW                                    | X_SECURITY_OBJECTS_V          |  1976 | 55328 |       |  3303   (1)| 00:00:40 |
...
|  68 |    HASH JOIN                               |                               | 52211 |    12M|       |   241K  (1)| 00:48:16 |
|  69 |     TABLE ACCESS FULL                      | CURRENCY                      |    40 |  2560 |       |     3   (0)| 00:00:01 |
|  70 |     HASH JOIN                              |                               | 52211 |  9993K|  3072K|   241K  (1)| 00:48:16 |
|  71 |      TABLE ACCESS FULL                     | ACCOUNT                       | 45570 |  2536K|       |   274   (1)| 00:00:04 |
|  72 |      HASH JOIN                             |                               | 52211 |  7087K|       |   240K  (1)| 00:48:06 |
|  73 |       TABLE ACCESS FULL                    | COMPANY                       |   112 |  1008 |       |     3   (0)| 00:00:01 |
|  74 |       HASH JOIN RIGHT OUTER                |                               | 52211 |  6628K|       |   240K  (1)| 00:48:06 |
|  75 |        VIEW                                | --- VIEW_1                    |    19 |   247 |       |    18  (12)| 00:00:01 |
|  84 |        HASH JOIN RIGHT OUTER               |                               | 52211 |  5965K|       |   240K  (1)| 00:48:06 |
|  85 |         VIEW                               | --- VIEW_2                    |     2 |    26 |       |    30   (7)| 00:00:01 |
| 102 |         HASH JOIN                          |                               | 52211 |  5302K|       |   240K  (1)| 00:48:06 |
| 103 |          TABLE ACCESS FULL                 | TIME                          |     9 |   171 |       |     3   (0)| 00:00:01 |
| 104 |          NESTED LOOPS                      |                               |   634K|    51M|       |   240K  (1)| 00:48:06 |
| 105 |           NESTED LOOPS                     |                               |   634K|    51M|       |   240K  (1)| 00:48:06 |
| 106 |            NESTED LOOPS                    |                               |  5756 |   185K|       |   552   (1)| 00:00:07 |
| 107 |             FAST DUAL                      |                               |     1 |       |       |     2   (0)| 00:00:01 |
| 108 |             TABLE ACCESS FULL              | ORGANIZATION                  |  5756 |   185K|       |   550   (1)| 00:00:07 |
| 109 |            INDEX RANGE SCAN                | IDX_FACT_PLAN_SUPP_K_COMP_ORG |    65 |       |       |    19   (0)| 00:00:01 |
| 110 |           TABLE ACCESS BY INDEX ROWID      | FACT_PLAN_SUPP                |   110 |  5720 |       |    82   (0)| 00:00:01 |
------------------------------------------------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                  | Name                         | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                           |                              |       |       |       |   545K(100)|          |
...
|  17 |  HASH JOIN RIGHT OUTER                     |                              |   539K|   162M|       |   545K  (2)| 01:49:03 |
|  18 |   VIEW                                     | X_SECURITY_ACCOUNTS_V        |   434 | 12152 |       |   368   (2)| 00:00:05 |
...
|  48 |   HASH JOIN RIGHT OUTER                    |                              |   300K|    82M|       |   544K  (2)| 01:48:58 |
|  49 |    VIEW                                    | X_SECURITY_OBJECTS_V         |  1920 | 53760 |       |  3303   (1)| 00:00:40 |
...
|  68 |    HASH JOIN                               |                              |   300K|    74M|       |   541K  (2)| 01:48:18 |
|  69 |     TABLE ACCESS FULL                      | CURRENCY                     |    40 |  2560 |       |     3   (0)| 00:00:01 |
|  70 |     HASH JOIN RIGHT OUTER                  |                              |   300K|    56M|       |   541K  (2)| 01:48:18 |
|  71 |      VIEW                                  | --- VIEW_1                   |    19 |   247 |       |    18  (12)| 00:00:01 |
|  80 |      HASH JOIN                             |                              |   300K|    52M|  3072K|   541K  (2)| 01:48:18 |
|  81 |       TABLE ACCESS FULL                    | ACCOUNT                      | 45564 |  2536K|       |   274   (1)| 00:00:04 |
|  82 |       HASH JOIN                            |                              |   300K|    36M|  2576K|   539K  (2)| 01:47:50 |
|  83 |        TABLE ACCESS FULL                   | ORGANIZATION                 | 58577 |  1887K|       |   550   (1)| 00:00:07 |
|  84 |        HASH JOIN                           |                              |  6984K|   619M|       |   503K  (2)| 01:40:44 |
|  85 |         TABLE ACCESS FULL                  | COMPANY                      |   112 |  1008 |       |     3   (0)| 00:00:01 |
|  86 |         HASH JOIN RIGHT OUTER              |                              |  6984K|   559M|       |   503K  (2)| 01:40:44 |
|  87 |          VIEW                              | --- VIEW_2                   |     2 |    26 |       |    29   (7)| 00:00:01 |
| 104 |          HASH JOIN                         |                              |  6984K|   472M|       |   503K  (2)| 01:40:43 |
| 105 |           NESTED LOOPS                     |                              |     9 |   171 |       |     5   (0)| 00:00:01 |
| 106 |            FAST DUAL                       |                              |     1 |       |       |     2   (0)| 00:00:01 |
| 107 |            TABLE ACCESS FULL               | TIME                         |     9 |   171 |       |     3   (0)| 00:00:01 |
| 108 |           TABLE ACCESS FULL                | FACT_PLAN_SUPP               |    84M|  4207M|       |   503K  (2)| 01:40:37 |
-----------------------------------------------------------------------------------------------------------------------------------

The image we get by stripping out (or collapsing) all the bits which look as if they aren’t significant is so much simpler to grasp than the original, and gives us a couple of extra ideas to follow if we were simply doing a generic “optimise this query” exercise starting from a position of ignorance. In fact all we’re after in this particular exercise is a pointer to the most likely place where “something changed”, as this may tell us why the plan changed and what we can do to get back the old plan and stabilise it for the future.

Now that we’ve seen the condensed plans I’m going to reduce them one more time, dispensing with the collapse markers and simply hiding everything above operation 70, because that’s a good focal point for investigating the change:

-----------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                  | Name                          | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------------------------------------------------
|  70 |     HASH JOIN                              |                               | 52211 |  9993K|  3072K|   241K  (1)| 00:48:16 |
|  71 |      TABLE ACCESS FULL                     | ACCOUNT                       | 45570 |  2536K|       |   274   (1)| 00:00:04 |
|  72 |      HASH JOIN                             |                               | 52211 |  7087K|       |   240K  (1)| 00:48:06 |
|  73 |       TABLE ACCESS FULL                    | COMPANY                       |   112 |  1008 |       |     3   (0)| 00:00:01 |
|  74 |       HASH JOIN RIGHT OUTER                |                               | 52211 |  6628K|       |   240K  (1)| 00:48:06 |
|  75 |        VIEW                                | --- VIEW_1                    |    19 |   247 |       |    18  (12)| 00:00:01 |
|  84 |        HASH JOIN RIGHT OUTER               |                               | 52211 |  5965K|       |   240K  (1)| 00:48:06 |
|  85 |         VIEW                               | --- VIEW_2                    |     2 |    26 |       |    30   (7)| 00:00:01 |
| 102 |         HASH JOIN                          |                               | 52211 |  5302K|       |   240K  (1)| 00:48:06 |
| 103 |          TABLE ACCESS FULL                 | TIME                          |     9 |   171 |       |     3   (0)| 00:00:01 |
| 104 |          NESTED LOOPS                      |                               |   634K|    51M|       |   240K  (1)| 00:48:06 |
| 105 |           NESTED LOOPS                     |                               |   634K|    51M|       |   240K  (1)| 00:48:06 |
| 106 |            NESTED LOOPS                    |                               |  5756 |   185K|       |   552   (1)| 00:00:07 |
| 107 |             FAST DUAL                      |                               |     1 |       |       |     2   (0)| 00:00:01 |
| 108 |             TABLE ACCESS FULL              | ORGANIZATION                  |  5756 |   185K|       |   550   (1)| 00:00:07 |
| 109 |            INDEX RANGE SCAN                | IDX_FACT_PLAN_SUPP_K_COMP_ORG |    65 |       |       |    19   (0)| 00:00:01 |
| 110 |           TABLE ACCESS BY INDEX ROWID      | FACT_PLAN_SUPP                |   110 |  5720 |       |    82   (0)| 00:00:01 |
------------------------------------------------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                  | Name                         | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------------------------------------------------
|  70 |     HASH JOIN RIGHT OUTER                  |                              |   300K|    56M|       |   541K  (2)| 01:48:18 |
|  71 |      VIEW                                  | --- VIEW_1                   |    19 |   247 |       |    18  (12)| 00:00:01 |
|  80 |      HASH JOIN                             |                              |   300K|    52M|  3072K|   541K  (2)| 01:48:18 |
|  81 |       TABLE ACCESS FULL                    | ACCOUNT                      | 45564 |  2536K|       |   274   (1)| 00:00:04 |
|  82 |       HASH JOIN                            |                              |   300K|    36M|  2576K|   539K  (2)| 01:47:50 |
|  83 |        TABLE ACCESS FULL                   | ORGANIZATION                 | 58577 |  1887K|       |   550   (1)| 00:00:07 |
|  84 |        HASH JOIN                           |                              |  6984K|   619M|       |   503K  (2)| 01:40:44 |
|  85 |         TABLE ACCESS FULL                  | COMPANY                      |   112 |  1008 |       |     3   (0)| 00:00:01 |
|  86 |         HASH JOIN RIGHT OUTER              |                              |  6984K|   559M|       |   503K  (2)| 01:40:44 |
|  87 |          VIEW                              | --- VIEW_2                   |     2 |    26 |       |    29   (7)| 00:00:01 |
| 104 |          HASH JOIN                         |                              |  6984K|   472M|       |   503K  (2)| 01:40:43 |
| 105 |           NESTED LOOPS                     |                              |     9 |   171 |       |     5   (0)| 00:00:01 |
| 106 |            FAST DUAL                       |                              |     1 |       |       |     2   (0)| 00:00:01 |
| 107 |            TABLE ACCESS FULL               | TIME                         |     9 |   171 |       |     3   (0)| 00:00:01 |
| 108 |           TABLE ACCESS FULL                | FACT_PLAN_SUPP               |    84M|  4207M|       |   503K  (2)| 01:40:37 |
-----------------------------------------------------------------------------------------------------------------------------------

Comparing the two fragments we can see that they both have a funny little nested loop driven by a select from dual: perhaps that’s just there to inject a value like sysdate-N, or a single value from a PL/SQL function, or a call to sys_context() that can act as a “variable constant”. It probably doesn’t matter where that tablescan of dual goes, really, so long as it happens early enough; it’s guaranteed to be only one row so the fact that it drives a nested loop is only a formality. (I’d guess that it’s probably the leading table in the join order, but appears very late in the plan because of “swap_join_inputs” introduced for all the hash joins.)
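
As a purely hypothetical illustration of the general pattern (none of these column names come from the OP’s system, and fact_plan_supp may not even have a date column), here’s how a single row from dual can supply a “variable constant” that then gets joined to everything else:

select
        f.*
from
        (select trunc(sysdate) - 7 as cutoff_date from dual)   d,
        fact_plan_supp                                          f
where
        f.plan_date >= d.cutoff_date
;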

Apart from that one common nested loop (with different tables as the inner table) almost everything in both plans is a hash join with full tablescans – with one exception: in the bad plan the join to fact_plan_supp is an index-access nested loop driven by a tablescan of organization (technically it’s driven by the result of the join between dual and the tablescan of organization, of course); in the good plan we see a full tablescan of fact_plan_supp – which is millions of rows at a huge fraction (503K out of the total 545K) of the cost of the query.

It seems slightly surprising that such a massive and expensive tablescan should produce the good plan, but it does, and we have to ask why. Since the bad plan has the nested loop, and the nested loop is driven by table organization we need to look at that table in both plans. In the bad plan the prediction is 5,756 rows; in the good plan the prediction is 58,577 rows. The dramatic change in that prediction is probably the cause of the change in plan. Imagine scaling up the number of rows in the bad plan by the factor of 10 implied by those figures and the cost of the nested loop (operations 104 and 105) would go up to 240K – vastly more than the cost of the tablescan chosen for the good plan. If the prediction for the good plan is in the right ball park then the error in the cardinality estimate the optimizer has made for the organization table is almost certain to be the cause of the performance problem seen in the bad plan.

The next task, of course, is to find out why the prediction changed – which means looking at the statistics on the organization table and finding out the values used for any predicates against that one table. (Unfortunately these plans report no predicate section – and the absence of any asterisks (“*”) against the operation lines suggests the plans came from the AWR, which doesn’t capture the predicate information.) There are a couple of common possibilities suggested by the scale of the difference:

  • A column that should have a histogram has lost its histogram
  • A column that didn’t have a histogram has acquired one
  • Some column data is skewed and has a histogram, and an unlucky choice of bind variable on the first call produced a very low estimate of cardinality
  • Stats collection for histograms is sampled in 11g (and is still sampled for hybrid histograms in 12c) – an unlucky sample produced a very misleading histogram
  • A bind variable has gone out of range of the low/high values with a significant pro-rated scale-down on the row estimate

After doing the analysis above, I highlighted the organization table to the originator of the question – who came back with the information that a column had recently been added to the table by the process of dropping and recreating the table then gathering stats. If you do that, and have the stats collection method_opt left at its default, you won’t get any histograms initially: the default method_opt is “for all columns size auto” – which lets Oracle decide which columns should be tested for histograms based on historic use and data skew. If it’s a “new” table there will be no historic use, so no attempt to generate histograms. The first option on my list may be the right one.
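
As a rough sketch (the table and column names here are hypothetical, not taken from the OP's system): if you know which column needs a histogram you can ask for it directly rather than depending on "size auto", which has no column usage history to work from immediately after a drop and recreate; and from 11.2.0.2 onwards you can also check the column usage information that "size auto" bases its decisions on.

rem     re-gather stats for just the one column, requesting a histogram
rem     (other columns keep the stats from the previous gather)
begin
        dbms_stats.gather_table_stats(
                ownname     => user,
                tabname     => 'ORGANIZATION',
                method_opt  => 'for columns org_type size 254'
        );
end;
/

rem     report the column usage recorded for the table (11.2.0.2+)
set long 100000
select dbms_stats.report_col_usage(user, 'ORGANIZATION') from dual;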

Footnote:

It took about 15 minutes to copy the plans and do the analysis – it’s taken about 3 hours (plus a few more minutes this morning for a last spelling and grammar check) to explain what I did … and even then the explanation could do with more work.

 

Column Groups

Thu, 2018-03-08 00:54

There’s a question on the ODC database forum about column groups that throws up an interesting side point. The OP is looking at a query like the following and asking about which column groups might help the optimizer get the best plan:

select
        a.*, b.*, c.*
from
        a, b, c
where
        a.id   = b.id
and     a.id1  = b.id1
and     a.id   = c.id
and     b.id2  = c.id2
and     a.id4  = 66
and     b.id7  = 44
and     c.id88 = 88
;

I’m going to start by being a bit boring about presentation and meaning (although this query is fairly obviously intended to be devoid of meaningful content) and rearrange the query because if I had a from clause reading “a, b, c” it would be because I hoped the optimizer would find that to be the best join order – and if that were the case I would have written the predicate section to display the order and timing of the work needed:

select
        a.*, b.*, c.*
from
        a, b, c
where
        a.id4  = 66
--
and     b.id   = a.id
and     b.id1  = a.id1
and     b.id7  = 44
--
and     c.id   = a.id
and     c.id2  = b.id2
and     c.id88 = 88
;

Having cosmetically enhanced (to my mind) the query, I’ll now ask the question: “Would it make sense to create column groups on a(id, id1), b(id, id1) and c(id, id2) ?”

I’ve written various articles on cases where column groups have effects (or not): “out of range” predicates, “is null” predicates, “histogram issues”, “statistics at both ends of the join”, and “multi-column indexes vs. column groups” are just some of the key areas. Assuming there are no reasons to stop a particular column group from working, we can look at the join from table A to table B: it’s a two-column join, so if there’s some strong correlation between the id and id1 columns of these two tables then creating the two column groups (one at each end of the join) can make a difference to the optimizer’s calculations, with the most likely effect that the cardinality estimate on the join will go up and, as a side effect, the join order and join method may change.

If we then consider the join to table C – we note that it involves two columns from table C being joined to one column from table A and one from table B – so while we could create a column group on those two columns as the table C end of the join a column group is simply not possible at the “opposite end” of the join. This means that one end of the join may have a selectivity that is hugely reduced because the column group has quantified the correlation, but the selectivity at the other end is simply based on the two separate selectivities from a.id and b.id2 – and that’s likely to be larger than the selectivity of (c.id, c.id2), and the optimizer will choose the larger join selectivity hence producing a lower cardinality estimate.

This is where a collateral point appears – there is an opportunity for transitive closure that the human eye can see but the optimizer is not coded to manipulate. We have two predicates: “b.id = a.id” and “c.id = a.id”, but they can only both be true when “c.id = b.id”, so let’s replace “c.id = a.id” with “c.id = b.id” and the join predicate to table C becomes:

and     c.id   = b.id
and     c.id2  = b.id2

Both left hand sides reference table C, both right hand sides reference table B – so if we now create a column group on c(id, id2) and an additional column group on b(id, id2) then we may give Oracle some better information about this join as well. In fact, even if we create NO column groups at all this specific change may be enough to result in a change in the selectivity calculations with a subsequent change in cardinality estimate and execution plan.
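
As a minimal sketch of the mechanics (using the b and c tables from the example): a column group is just a dbms_stats extension, and it only becomes useful to the optimizer once it has statistics.

rem     create the column groups - the function returns the generated virtual column name
select dbms_stats.create_extended_stats(user, 'B', '(id, id2)') from dual;
select dbms_stats.create_extended_stats(user, 'C', '(id, id2)') from dual;

rem     gather stats so the new extensions acquire values
begin
        dbms_stats.gather_table_stats(user, 'B', method_opt => 'for all columns size 1');
        dbms_stats.gather_table_stats(user, 'C', method_opt => 'for all columns size 1');
end;
/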

Match_recognise – 2

Tue, 2018-03-06 01:59

In my previous post I presented a warning about the potential cost of sorting and the cost of failing to find a match after each pass of a long search. In a comment on that post Stew Ashton reminded me that the cost of repeatedly trying to find a match starting from “the next row down” could be less of a threat than the cost of “back-tracking” before moving to the next row down.

Taking the example from the previous posting to explain – the requirement was for customers who had executed a transaction in September but not October, and a match_recognize() clause suggested on the ODC (formerly OTN) database forum to implement this requirement was as follows:

match_recognize
(
        partition by cust_id
        order by trans_dt
        measures
                padding as t1_padding
        pattern(x+ y* $) 
        define
                x as mth  = 9,
                y as mth != 10
);

In the blog post I raised the issue of an extreme case where there were 100,000 transactions for a single customer of which all but the last was a September transaction and the last was an October transaction. This would have two effects – first that we could have to sort 100,000 rows, including the cust_id that appeared in the “partition by” clause and the 1000-character padding column that was listed in the measures clause, leading to a huge dump to, and constant re-read of, the temporary tablespace; secondly that having failed to find a match starting from row 1 Oracle would go back to row 2 and try again, then to row 3, and so on.

The point that Stew Ashton made was that Oracle doesn’t just “go back to” row 2, it will be unwinding a stack, or reversing out a recursive descent to get there. What this means is that Oracle will fail as it reaches the October at row 100,000 and say “no more X rows, is this a Y row ? no”, backtrack to row 99,998 and say “what if I stop collecting X rows here and start looking for Y rows?”, so it reads row 99,999 as a Y row (since 9 != 10), then finds row 100,000 and fails the match. So it backtracks again to row 99,997 and says “what if I stop collecting X rows here and start looking for Y rows?”, and this time it identifies 99,998 and 99,999 as Y rows, then fails on row 100,000.

Remember, this is still part of the attempt to match the pattern starting at row 1 – and there are 99,996 more steps backwards still to go, and the further we go back the further we come forward again until we fail – and there are 99,998 steps we have to back-track before we start to consider a pattern starting at row 2.

To demonstrate the costs I’ve got three variants of the original query. First, the query as it was but limited to just 1,000 rows for a single customer; second a changed pattern that highlights the cost of trying to use back-tracking to match the pattern just once, starting from row 1 (the pattern doesn’t actually meet the original requirement because it would only find customers whose first transaction of the year was in September); finally a changed pattern that achieves the required result much more efficiently than the original (but still very slowly) by adding some human intelligence to the implementation.

Here’s version 1 – which took 257 CPU seconds to handle just 1,000 rows:

select  *
from    (
        select
                t1.*,
                extract(year from trans_dt) yr,
                extract(month from trans_dt) mth
        from
                t1
        )
match_recognize
(
        partition by cust_id
        order by trans_dt
        measures
                padding as t1_padding,
                classifier() cl,
                match_number() as mn
        pattern(x+ y* $)
        define
                x as mth  = 9,
                y as mth != 10
);

You’ll see that I’ve included the “debug” functions of classifier() and match_number() in the SQL above – these are particularly useful with the options “all rows per match” and “with unmatched rows” when you’re trying to figure out why your match_recognize() clause is not producing the right results, so I’ve left them there purely for reference.
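
If you do want to run in that debug mode, here’s a minimal sketch (assuming the same t1 and the same pattern as above): every row comes back, matched or not, labelled with its classifier and match number. Given the cost figures above you’d only want to do this against a small sample of the data.

select  cust_id, trans_dt, mn, cl
from    (
        select
                t1.*,
                extract(year from trans_dt) yr,
                extract(month from trans_dt) mth
        from
                t1
        )
match_recognize
(
        partition by cust_id
        order by trans_dt
        measures
                classifier() cl,
                match_number() as mn
        all rows per match with unmatched rows
        pattern(x+ y* $)
        define
                x as mth  = 9,
                y as mth != 10
);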

Then there’s a version where I’ve made the modification suggested by Stew Ashton to demonstrate the full cost of an attempt to match only if the pattern starts on the first row of the partition. This took just 0.83 CPU seconds to complete. This may sound fairly reasonable, but if you compare that to the time it might take simply to sort and walk once through 1,000 rows you’ll realise that it’s actually pretty expensive – and it’s not surprising that when we had to do the same thing 1,000 times (on a slowly decreasing set, of course, as we work our way down the partition) the original task took 257 CPU seconds.

select  *
from    (
        select
                t1.*,
                extract(year from trans_dt) yr,
                extract(month from trans_dt) mth
        from
                t1
        )
match_recognize
(
        partition by cust_id
        order by trans_dt
        measures
                padding as t1_padding,
                classifier() cl,
                match_number() as mn
        pattern(^ x+ y* $)
        define
                x as mth  = 9,
                y as mth != 10
);

You’ll notice the caret “^” at the start of the pattern – this means the pattern must start at the first row of the partition (just as the “$” means the pattern has to end at the end of the partition).

Finally, thinking of a better way of using match_recognize() for this requirement we realise that we know that November comes after October, and December comes after November so (in the context of our example) the predicate “!= 10” is equivalent to “> 10”. With this code change the original query took 0.82 CPU seconds.


select  *
from    (
        select
                t1.*,
                extract(year from trans_dt) yr,
                extract(month from trans_dt) mth
        from
                t1
        )
match_recognize
(
        partition by cust_id
        order by trans_dt
        measures
                padding as t1_padding,
                classifier() cl,
                match_number() as mn
        pattern(x+ y* $)
        define
                x as mth  = 9,
                y as mth  > 10
);

In this case we still have to do a lot of back tracking, but each time we backtrack one step we then check just one row forward for the match to fail (9 is not greater than 10), whereas with the original if we have backtracked 750 steps (for example) we would then have to check 750 rows before we reached the October row for the match to fail.

Bottom line: back-tracking is a massive cost if you have to take a lot of steps backwards to the previous starting row; and you need the match to fail (or succeed) as fast as possible as you start stepping forward again.

Addendum

Since Stew Ashton had highlighted the omission in the previous blog post I passed him a copy of this post before publishing it, asking him to check whether there were any errors or omissions in the way I had described the work Oracle would do back tracking in this example. He said that he couldn’t think of anything to improve the explanation (though I will still claim responsibility for any errors, omissions, or ambiguities) and then suggested another, far more efficient, way of getting the required answer by (again) re-thinking the question before writing the code. His solution looks like this:


select  *
from    (
        select
                t1.*,
                extract(year from trans_dt) yr,
                extract(month from trans_dt) mth
        from
                t1
        )
match_recognize
(
        partition by cust_id
        order by trans_dt nulls first
        measures
                padding as t1_padding,
                classifier() cl,
                match_number() as mn
        pattern(x{1})
        define
                x as mth  = 9 and (
                             next(mth) is null or next(mth) > 10
                     )
)
;

The pattern here simply says: for the partition find the first “X”-row, but an X-row is defined as “month is September and either there are no more rows or the next row is after October”. You’ll notice that I’ve modified the “order by” clause to put nulls first – there are none in the sample data, but if there were, this change to the order would ensure that for a row where “mth = 9” the “next(mth)” could only be null if the current row were the last in the partition.

If you imagine walking through the pattern-matching process now, you start looking at rows and keep going until you reach the first September, and each time you find a September you check the next row to see if it’s past the end of partition, or a November or December; if it is you report the current row and move to the end of the partition, if it isn’t you just walk to the next row and repeat the process – you never back-track. Effectively the workload here is simply to sort then walk non-stop through the whole list – and Oracle even tells us that we are using this optimum strategy in the execution plan:


---------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                       | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
---------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                                |      |      1 |        |      0 |00:00:00.01 |     146 |       |       |          |
|   1 |  VIEW                                           |      |      1 |   1000 |      0 |00:00:00.01 |     146 |       |       |          |
|   2 |   MATCH RECOGNIZE SORT DETERMINISTIC FINITE AUTO|      |      1 |   1000 |      0 |00:00:00.01 |     146 |  1186K|   567K| 1054K (0)|
|   3 |    VIEW                                         |      |      1 |   1000 |   1000 |00:00:00.01 |     146 |       |       |          |
|   4 |     TABLE ACCESS FULL                           | T1   |      1 |   1000 |   1000 |00:00:00.01 |     146 |       |       |          |
---------------------------------------------------------------------------------------------------------------------------------------------

Operation 2 – the Match Recognize Sort operation – is reported as “deterministic finite auto”, which basically means the duration of the process is predictable because Oracle knows it is a straight end to end walk with no back-tracking. This is the ideal thing to see when you try to design code using match_recognize().

Conditional SQL – 5

Fri, 2018-03-02 06:49

Here’s a note that has been sitting around for more than 3 years (the draft date is Jan 2015), waiting for me to finish it off; and in that time we’ve got a new version of Oracle that changes the solution to the problem it presented. (I also managed to write “Conditional SQL – 6” in the intervening period!)

This posting started with a question on the OTN (now ODC) database forum about an execution plan used by 11.2.0.3.  Here’s a model to represent the data and the query:

rem
rem     Script:         null_plan_4.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Jan 2015
rem
rem     Last tested
rem             12.2.0.1
rem             12.1.0.2
rem             11.2.0.4
rem

with generator as (
        select  --+ materialize
                rownum id
        from dual
        connect by
                level <= 1e4 -- > comment here to avoid format issue
)
select
        rownum  catentry_id,
        case
                when mod(rownum-1,100) > 0 then mod(rownum-1,100)
        end     member_id,
        case
                when trunc((rownum-1)/100) > 0 then trunc((rownum-1)/100)
        end     partnumber,
        rpad('x',100)   padding
from
        generator,
        generator
where
        rownum <= 100 * 100 -- > comment here to avoid format issue
;

execute dbms_stats.gather_table_stats(user,'catentry');

create unique index cat_i0 on catentry(member_id, partnumber) compress 1;
--  create        index cat_i1 on catentry(member_id, partnumber, 0) compress 1;
--  create        index cat_i2 on catentry(partnumber, member_id, 0) compress 1;

variable b1 number
variable b2 number
variable b3 number
variable b4 number

begin
        :b1 := 22;
        :b2 := 1;
        :b3 := 44;
        :b4 := 1;
end;
/

select
        catentry_id
from
        catentry
where
        (   partnumber= :b1
         or (0 = :b2 and partnumber is null)
        )
and     (    member_id= :b3
         or (0 = :b4 and member_id is null)
        )
;

select * from table(dbms_xplan.display_cursor);

==============================================================================

------------------------------------------------------------------------------
| Id  | Operation         | Name     | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |          |       |       |    23 (100)|          |
|*  1 |  TABLE ACCESS FULL| CATENTRY |     1 |    10 |    23   (5)| 00:00:01 |
------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter((("PARTNUMBER"=:B1 OR ("PARTNUMBER" IS NULL AND 0=:B2))
              AND ("MEMBER_ID"=:B3 OR ("MEMBER_ID" IS NULL AND 0=:B4))))

The question this example raised was, effectively: “Why didn’t Oracle use bind peeking to work out that the best plan for this query – with these input values – was an index range scan?”

The basic answer to this question is this: “No matter how clever Oracle can be with bind peeking and executions plans it has to produce an execution plan that will give the right answer whatever the supplied values might be.”

The OP was hoping that the optimizer would see :b2 and :b4 were arriving with the value 1, infer that “0 = 1” is always false, and reduce the query predicate to “partnumber = 22 and member_id = 44” to produce the following plan:


----------------------------------------------------------------------------------------
| Id  | Operation                   | Name     | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |          |       |       |     2 (100)|          |
|   1 |  TABLE ACCESS BY INDEX ROWID| CATENTRY |     1 |    10 |     2   (0)| 00:00:01 |
|*  2 |   INDEX UNIQUE SCAN         | CAT_I0   |     1 |       |     1   (0)| 00:00:01 |
----------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("PARTNUMBER"=22 AND "MEMBER_ID"=44)

But this plan could produce the wrong results if the next execution of the query supplied zeros for :b2 or :b4, so Oracle has to do something more generic. (Bear in mind that adaptive cursor sharing keeps reusing the same execution plan until it detects that the performance for some input values is bad; it doesn’t pre-emptively create new plans based on the incoming values – though in principle it might be possible for the Oracle developers to introduce code that can recognise special cases for predicates of the form “constant1 = constant2”).

If you review the SQL statement you can see that it’s clearly trying to allow the user to find data about member_ids and partnumbers where both, neither, or either value is allowed to be null: a couple of “if – then – else” conditions that should have been handled in the client code have been embedded in the code. As we have seen several times before, if you can’t change the client code then you have to hope that Oracle will use some clever transformation to handle the query in sections.
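
To show what “handle the query in sections” means, here’s a sketch (mine, not the OP’s code) of the rewrite we’d like the optimizer to get to internally. The four branches cover the four legal combinations of “real value” and “null wanted”, and they are mutually exclusive, so the union all returns exactly the same rows as the original query:

select  catentry_id from catentry
where   partnumber = :b1 and member_id = :b3
union all
select  catentry_id from catentry
where   partnumber = :b1 and 0 = :b4 and member_id is null
union all
select  catentry_id from catentry
where   0 = :b2 and partnumber is null and member_id = :b3
union all
select  catentry_id from catentry
where   0 = :b2 and partnumber is null and 0 = :b4 and member_id is null
;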

We can infer from various details of the posting that the member_id and partnumber columns were both allowed to be null, so if we want to make sure that Oracle always uses an indexed access path to acquire data for this query we need to have an index which starts with those two columns and then has at least one column which is guaranteed to be non-null so, for example, we could simply drop the current index and replace it with one that has a fixed zero on the end:

create index cat_i1 on catentry(member_id, partnumber, 0) compress 1;

With my particular data set, query, and version of Oracle this didn’t make any difference to the plan. But then I thought about the data definition and realised (and checked) that the index had a terrible clustering_factor, so I dropped it and created it with the first two columns in the opposite order:

create index cat_i2 on catentry(partnumber, member_id, 0) compress 1;

Side note:
You’ll notice that I’ve replaced the original unique index with a non-unique index. This was necessary because there were many rows where both partnumber and member_id were null, so if I want to maintain the logic of the previous unique index I’ll need to add a unique constraint on (member_id, partnumber). It’s possible, of course, that in similar circumstances I might want both indexes – one for the uniqueness and to access the data using only one of the columns, the other to access the data using only the other column.
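
A minimal sketch of that extra step, if the uniqueness rule still matters (Oracle can police the constraint through an existing index whose leading columns match, otherwise it will build one):

alter table catentry add constraint catentry_uk unique (member_id, partnumber);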

With this index in place, and unhinted, the plan I got from 11.2.0.4 changed to use concatenation with an impressive four-way split:


------------------------------------------------------------------------------------------
| Id  | Operation                     | Name     | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT              |          |       |       |    12 (100)|          |
|   1 |  CONCATENATION                |          |       |       |            |          |
|*  2 |   FILTER                      |          |       |       |            |          |
|   3 |    TABLE ACCESS BY INDEX ROWID| CATENTRY |     1 |    10 |     3   (0)| 00:00:01 |
|*  4 |     INDEX RANGE SCAN          | CAT_I2   |     1 |       |     2   (0)| 00:00:01 |
|*  5 |   FILTER                      |          |       |       |            |          |
|*  6 |    TABLE ACCESS BY INDEX ROWID| CATENTRY |     1 |    10 |     3   (0)| 00:00:01 |
|*  7 |     INDEX RANGE SCAN          | CAT_I2   |     1 |       |     2   (0)| 00:00:01 |
|*  8 |   FILTER                      |          |       |       |            |          |
|*  9 |    TABLE ACCESS BY INDEX ROWID| CATENTRY |     1 |    10 |     3   (0)| 00:00:01 |
|* 10 |     INDEX RANGE SCAN          | CAT_I2   |     1 |       |     2   (0)| 00:00:01 |
|* 11 |   TABLE ACCESS BY INDEX ROWID | CATENTRY |     1 |    10 |     3   (0)| 00:00:01 |
|* 12 |    INDEX RANGE SCAN           | CAT_I2   |     1 |       |     2   (0)| 00:00:01 |
------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter((0=:B2 AND 0=:B4))
   4 - access("PARTNUMBER" IS NULL AND "MEMBER_ID" IS NULL)
       filter("MEMBER_ID" IS NULL)
   5 - filter(0=:B2)
   6 - filter((LNNVL("MEMBER_ID" IS NULL) OR LNNVL(0=:B4)))
   7 - access("PARTNUMBER" IS NULL AND "MEMBER_ID"=:B3)
       filter("MEMBER_ID"=:B3)
   8 - filter(0=:B4)
   9 - filter((LNNVL("PARTNUMBER" IS NULL) OR LNNVL(0=:B2)))
  10 - access("PARTNUMBER"=:B1 AND "MEMBER_ID" IS NULL)
  11 - filter(((LNNVL("MEMBER_ID" IS NULL) OR LNNVL(0=:B4)) AND
              (LNNVL("PARTNUMBER" IS NULL) OR LNNVL(0=:B2))))
  12 - access("PARTNUMBER"=:B1 AND "MEMBER_ID"=:B3)


To execute this plan the run-time engine works as follows:

  • Operation 2: If :b2 and :b4 are both zero we use the index to find the rows where member_id and partnumber are null (the filter “member_id is null” seems to be redundant)
  • Operation 5: if :b2 is zero we use the index to find rows where the partnumber is null and the member_id is the supplied value (and if that’s null the access will immediately return zero rows because of the equality predicate), and discard any rows that have already been returned by operation 2
  • Operation 8: if :b4 is zero we will use the index to find rows where the partnumber is the supplied value (and if the partnumber is null, that access will immediately return zero rows because of the equality predicate) and the member_id is null, and discard any rows that have already been returned by operation 2.
  • Operations 11 and 12 will always run – using the index to find rows that match with equality on both the incoming member_id and partnumber, discarding any rows already returned by the previous operations, and obviously not matching any rows where either column “IS” null.

The critical feature of this plan, of course, is that we got it because we have given Oracle an efficient option to find the rows where both member_id and partnumber are null – and that allows the rest of the concatenation options to take place.

Hints and Upgrades

Interestingly, after the clue that 11g would happily use concatenation with a “good enough” index I went back to the example where I’d just added a zero to the existing index and checked to see what would happen if I added a /*+ use_concat */ hint (without any of the qualifying parameters that the hint can now use) and got the same path with concatenation. The fact that the path appeared wasn’t the interesting bit – see if you can spot what is the interesting bit:

------------------------------------------------------------------------------------------
| Id  | Operation                     | Name     | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT              |          |       |       |    12 (100)|          |
|   1 |  CONCATENATION                |          |       |       |            |          |
|*  2 |   FILTER                      |          |       |       |            |          |
|   3 |    TABLE ACCESS BY INDEX ROWID| CATENTRY |     1 |    10 |     3   (0)| 00:00:01 |
|*  4 |     INDEX RANGE SCAN          | CAT_I1   |     1 |       |     2   (0)| 00:00:01 |
|*  5 |   FILTER                      |          |       |       |            |          |
|*  6 |    TABLE ACCESS BY INDEX ROWID| CATENTRY |     1 |    10 |     3   (0)| 00:00:01 |
|*  7 |     INDEX RANGE SCAN          | CAT_I1   |     1 |       |     2   (0)| 00:00:01 |
|*  8 |   FILTER                      |          |       |       |            |          |
|*  9 |    TABLE ACCESS BY INDEX ROWID| CATENTRY |     1 |    10 |     3   (0)| 00:00:01 |
|* 10 |     INDEX RANGE SCAN          | CAT_I1   |     1 |       |     2   (0)| 00:00:01 |
|* 11 |   TABLE ACCESS BY INDEX ROWID | CATENTRY |     1 |    10 |     3   (0)| 00:00:01 |
|* 12 |    INDEX RANGE SCAN           | CAT_I1   |     1 |       |     2   (0)| 00:00:01 |
------------------------------------------------------------------------------------------

Check the cost, and compare it with the cost of the full tablescan. The hinted path has a lower cost than the default path. I think this may be another case of an “unknowable” range scan being ignored in favour of a known alternative.

Finally, we get to today – when I tested the code against 12.1.0.2 and 12.2.0.1. Nothing exciting happened in 12.1.0.2 – the plans were just like the 11g plans, but here’s the plan I got in 12.2 with the “bad” indexing (original index with added zero column) without any hints in the SQL:


------------------------------------------------------------------------------------------------------------
| Id  | Operation                                 | Name            | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                          |                 |       |       |     9 (100)|          |
|   1 |  VIEW                                     | VW_ORE_37A7142B |     4 |    52 |     9   (0)| 00:00:01 |
|   2 |   UNION-ALL                               |                 |       |       |            |          |
|   3 |    TABLE ACCESS BY INDEX ROWID            | CATENTRY        |     1 |    10 |     2   (0)| 00:00:01 |
|*  4 |     INDEX UNIQUE SCAN                     | CAT_I1          |     1 |       |     1   (0)| 00:00:01 |
|*  5 |    FILTER                                 |                 |       |       |            |          |
|   6 |     TABLE ACCESS BY INDEX ROWID BATCHED   | CATENTRY        |     1 |    10 |     2   (0)| 00:00:01 |
|*  7 |      INDEX RANGE SCAN                     | CAT_I1          |     1 |       |     1   (0)| 00:00:01 |
|*  8 |    FILTER                                 |                 |       |       |            |          |
|   9 |     TABLE ACCESS BY INDEX ROWID BATCHED   | CATENTRY        |     1 |    10 |     2   (0)| 00:00:01 |
|* 10 |      INDEX RANGE SCAN                     | CAT_I1          |     1 |       |     1   (0)| 00:00:01 |
|* 11 |    FILTER                                 |                 |       |       |            |          |
|  12 |     TABLE ACCESS BY INDEX ROWID BATCHED   | CATENTRY        |     1 |    10 |     3   (0)| 00:00:01 |
|* 13 |      INDEX RANGE SCAN                     | CAT_I1          |     1 |       |     2   (0)| 00:00:01 |
------------------------------------------------------------------------------------------------------------

Outline Data
-------------
  /*+
      BEGIN_OUTLINE_DATA
      ...
      OR_EXPAND(@"SEL$1" (1) (2) (3) (4))
      ...
      END_OUTLINE_DATA
  */

Predicate Information (identified by operation id):
---------------------------------------------------
   4 - access("MEMBER_ID"=:B3 AND "PARTNUMBER"=:B1)
   5 - filter(0=:B4)
   7 - access("MEMBER_ID" IS NULL AND "PARTNUMBER"=:B1)
       filter(("PARTNUMBER"=:B1 AND LNNVL("MEMBER_ID"=:B3)))
   8 - filter(0=:B2)
  10 - access("MEMBER_ID"=:B3 AND "PARTNUMBER" IS NULL)
       filter(LNNVL("PARTNUMBER"=:B1))
  11 - filter((0=:B4 AND 0=:B2))
  13 - access("MEMBER_ID" IS NULL AND "PARTNUMBER" IS NULL)
       filter(("PARTNUMBER" IS NULL AND LNNVL("PARTNUMBER"=:B1) AND LNNVL("MEMBER_ID"=:B3)))

The full tablescan didn’t appear – but this time it wasn’t eliminated by concatenation, it was eliminated by the “new” 12.2 variant: “OR EXPANSION”. In this case the net effect is remarkably similar – we still have filter operations comparing :b2 and :b4 with zero, and we still have a scattering of lnnvl() function calls being used to discard rows we’ve already accessed, but the pattern is slightly different and we have a union all.

This change prompted me to go back to testing with just the original index (member_id, partnumber) … which took me back to the full tablescan until I added the hint /*+ or_expand */ to the query to get the following plan:


------------------------------------------------------------------------------------------------------------
| Id  | Operation                                 | Name            | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                          |                 |       |       |    29 (100)|          |
|   1 |  VIEW                                     | VW_ORE_37A7142B |     4 |    52 |    29   (4)| 00:00:01 |
|   2 |   UNION-ALL                               |                 |       |       |            |          |
|   3 |    TABLE ACCESS BY INDEX ROWID            | CATENTRY        |     1 |    10 |     2   (0)| 00:00:01 |
|*  4 |     INDEX UNIQUE SCAN                     | CAT_I0          |     1 |       |     1   (0)| 00:00:01 |
|*  5 |    FILTER                                 |                 |       |       |            |          |
|   6 |     TABLE ACCESS BY INDEX ROWID BATCHED   | CATENTRY        |     1 |    10 |     2   (0)| 00:00:01 |
|*  7 |      INDEX RANGE SCAN                     | CAT_I0          |     1 |       |     1   (0)| 00:00:01 |
|*  8 |    FILTER                                 |                 |       |       |            |          |
|   9 |     TABLE ACCESS BY INDEX ROWID BATCHED   | CATENTRY        |     1 |    10 |     2   (0)| 00:00:01 |
|* 10 |      INDEX RANGE SCAN                     | CAT_I0          |     1 |       |     1   (0)| 00:00:01 |
|* 11 |    FILTER                                 |                 |       |       |            |          |
|* 12 |     TABLE ACCESS FULL                     | CATENTRY        |     1 |    10 |    23   (5)| 00:00:01 |
------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   4 - access("MEMBER_ID"=:B3 AND "PARTNUMBER"=:B1)
   5 - filter(0=:B4)
   7 - access("MEMBER_ID" IS NULL AND "PARTNUMBER"=:B1)
       filter(("PARTNUMBER"=:B1 AND LNNVL("MEMBER_ID"=:B3)))
   8 - filter(0=:B2)
  10 - access("MEMBER_ID"=:B3 AND "PARTNUMBER" IS NULL)
       filter(LNNVL("PARTNUMBER"=:B1))
  11 - filter((0=:B4 AND 0=:B2))
  12 - filter(("PARTNUMBER" IS NULL AND "MEMBER_ID" IS NULL AND LNNVL("PARTNUMBER"=:B1) AND
	      LNNVL("MEMBER_ID"=:B3)))

The plan shows “or expansion”, and highlights the most significant difference between concatenation and expansion – concatenation requires indexed access paths in all branches, or-expansion doesn’t.

At first sight this plan with its full tablescan at operation 12 might seem like a bad idea; but check operation 11, the guarding filter: the tablescan occurs only if both :b2 and :b4 are zero. Perhaps that special condition is never supposed to appear; perhaps, when it does, the query is going to do a lot of work whether or not you can use an index. The fact that you can now handle the original problem without adding or altering existing indexes – provided you can inject this or_expand hint – may be of significant benefit. (Of course, being able to recreate the original index with the columns in the reverse order would even avoid the necessity of worrying about the hint.)

tl;dr

Applications that push “if-then-else” decisions into the SQL and down to the optimizer are a pain in the backside; the performance problems they produce can sometimes be bypassed by the addition of extra indexes that might give you plans (possibly hinted) that report the use of the concatenation operation. In 12cR2 the optimizer has an improved strategy for damage limitation, “Cost-based Or Expansion”, that can produce very similar effects without the addition of extra indexes. These plans will report union all operations, referencing views with names like: VW_ORE_xxxxxxxxx.

When you next upgrade you may find a few cases where you can get rid of indexes that were only created to work around defective application coding strategies. You may also want to hunt down any code where you’ve added use_concat hints and see if they can be removed, or if they should be replaced by or_expand, since the former hint will disable the latter feature.
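
If you want to see whether the feature is already kicking in after the upgrade, a minimal sketch (assuming access to v$sql_plan) is to look for the generated view names in the plans currently in the library cache:

select  distinct sql_id
from    v$sql_plan
where   object_name like 'VW_ORE%'
;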

 

255 Columns

Wed, 2018-02-28 06:27

It’s the gift that keeps on giving – no matter how many problems you find there are always more waiting to be found. It’s been some time since I last wrote about tables with more than 255 columns, and I said then that there was more to come. In the last article I described how adding a few columns to a table, or updating a trailing column in a way that made the table’s used column count exceed 255, could result in some strange row-splitting behaviour – in this article I’m going to look at a critical side effect of that behaviour.

We’ll start with a simple model and a question – I’ll create a table with a thousand rows of data, then I’ll add a few columns to that table and update the last column in every row and check the undo and redo generated by the update.  Eventually I’m going to need a table with more than 255 columns so the script to create the table is rather long and I’ve posted it at the end of the article in case you want to experiment – in the following text I’ve omitted a few hundred lines of column declarations and values.


rem
rem     Script:         wide_table_5.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Feb 2018
rem

create table t1(
        col000,
        col001,
        col002,
        col003,
...
        col199,
/*
        col200,
        col201,
        col202,
        col203,
        col204,
        col205,
        col206,
        col207,
        col208,
        col209,
*/
        col210,
        col220,
...
        col247,
        col248,
        col249
)
as
with generator as (
        select 
                rownum id
        from dual
        connect by
                level <= 1e3 -- > comment to avoid WordPress format issue
)
select
        lpad(000,10,'0'),
        lpad(001,10,'0'),
        lpad(002,10,'0'),
        lpad(003,10,'0'),
...
        lpad(247,10,'0'),
        lpad(248,10,'0'),
        lpad(249,10,'0')
from
        generator       v2
where
        rownum <= 1e4 -- > comment to avoid WordPress format issue
;

begin
        dbms_stats.gather_table_stats(
                ownname     => user,
                tabname     => 'T1',
                method_opt  => 'for all columns size 1'
        );
end;
/

alter table t1 add(
        col250 varchar2(10),
        col251 varchar2(10),
        col252 varchar2(10),
        col253 varchar2(10),
        col254 varchar2(10),
        col255 varchar2(10),
        col256 varchar2(10),
        col257 varchar2(10),
        col258 varchar2(10),
        col259 varchar2(10)
)
;

update t1 set col259 = lpad('259',10,'0');
commit;


The table I’ve created has columns named from col000 to col249 populated with text matching the column’s numeric id – a total of 250 columns, except that for the first part of the test I’ve commented out the creation and population of 10 of those columns, giving me a total of 240 columns. Then I’ve added 10 more columns and updated the last of those 10. So, for the first part of the test, I’ve grown my table from 240 used columns to 250 used columns. Here are a few critical stats for that update from the session statistics view (v$sesstat joined to v$statname):

 

Name                                                 Value
----                                                 -----
db block gets                                        1,023
consistent gets                                        555
db block changes                                     2,012
redo entries                                           907
redo size                                          262,692
undo change vector size                             76,052
table scan rows gotten                               1,002
table scan blocks gotten                               501
HSC Heap Segment Block Changes                       1,000

That’s pretty much the sort of thing we might expect. For a small update to a row it’s reasonable to see 250 – 300 bytes of redo of which about half is the redo for the undo. We can see that we’ve scanned 1,000 rows and made 2,000 block changes (one to the table block and one to an undo block for each row in the table). The table was 500 blocks of data (the avg_row_len is about 2640 bytes from 240 columns at 10 bytes + a length byte, which makes two rows per block with lots of spare space at the default 10 pctfree). You might have been expecting the number of redo entries to be a close match to the number of rows but it’s a little short because the redo records for the first few updates would have been stored in private redo and turned into a single large redo entry.
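
For reference, here’s a minimal sketch (not part of the original test script) of the v$sesstat / v$statname join behind figures like these; to reproduce the test you would capture the values before and after the update and report the difference:

select
        sn.name, ss.value
from
        v$sesstat       ss,
        v$statname      sn
where
        ss.statistic#   = sn.statistic#
and     ss.sid          = sys_context('userenv','sid')
and     sn.name in (
                'db block gets', 'consistent gets', 'db block changes',
                'redo entries', 'redo size', 'undo change vector size',
                'table scan rows gotten', 'table scan blocks gotten',
                'HSC Heap Segment Block Changes'
        )
order by
        sn.statistic#
;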

So what do the stats look like when we start with 250 columns and grow to 260 columns, breaking through the dreaded 255 barrier ?


Name                                                 Value
----                                                 -----
db block gets                                        9,503
consistent gets                                      1,894
db block changes                                     9,384
redo size                                        8,110,584
redo entries                                         5,778
undo change vector size                          3,780,260
table scan rows gotten                               1,002
table scan blocks gotten                               501
HSC Heap Segment Block Changes                       3,000

Such a simple change – with such a large impact.

The average undo is now nearly 3.8KB per row (and the rows were only about 2,860 bytes each anyway), the number of redo entries is up to 6 times the original, we’re averaging 3 “HSC Heap Segment Block Changes” per row instead of 1 and in total we’ve managed to do an extra 7,000 db block changes overall.

To get an idea of what was going on behind the scenes I dumped the redo log file. For the first part of the test most of the redo entries consisted of a pair of redo change vectors with OP codes 5.1 (modify undo block) and 11.5 (update row piece). The 5.1 corresponded to the undo needed to reverse out the effect of its paired 11.5 and both change vectors were pretty small.
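
In case you want to reproduce the dump, a minimal sketch (it needs a suitably privileged account, and the output goes to the session’s trace file):

column member new_value m_logfile

select  lf.member
from    v$logfile       lf,
        v$log           lo
where   lo.group#  = lf.group#
and     lo.status  = 'CURRENT'
and     rownum     = 1
;

alter system dump logfile '&m_logfile';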

For the second part of the test I found a frequently repeated sequence of three consecutive redo records of paired redo vectors: (5.1, 11.4), (5.1, 11.2) and (5.1, 11.6). Again each 5.1 corresponds to the undo needed to reverse out the effect of its associated 11.x, and in this case the three “table” (level 11) OP codes are, in order: “lock row piece”, “insert row piece”, “overwrite row piece”. These 3 pairs occurred 1,000 times each, accounting for roughly 3,000 of the redo entries reported.

On top of this there were 2,500 redo records holding redo change vectors with op code 13.22 (“State change on level 1 bitmap block”), and a few hundred with op code 13.21 (“Format page table block”), with just a few others bringing the total up to the 5,800 redo entries reported. (You might note that the last OP code is a clue that we added quite a lot of new blocks to the table as we executed the update – in fact the number of used table blocks grew by about 50%.)

We also had 500 redo change vectors of type 5.2 (“get undo header”). This number was significantly more than in the first part of the test because we had a lot more undo block changes (with bigger undo change vectors), which means we used a lot more undo blocks, and each time we move to a new undo block we update our transaction slot in the undo header with the current undo block address. I mentioned the pairing of 5.1 and 11.6 above – in fact 50% of those records recorded just two changes (5.1, 11.6) and the other 50% recorded three changes (5.2, 5.1, 11.6) – in effect every other row update resulted in Oracle demanding a new undo block.

I’ll explain in a moment why we have a very large number of “State change on level 1 bitmap block”; first let’s examine the key steps of how Oracle is processing a single row update – the sequence of 11.4, 11.2, 11.6:

  • 11.4: lock the head row piece – this seems to happen when the row is in (or going to turn into) multiple pieces; presumably because the piece that is going to change might not be the head piece. This is a tiny operation that generates undo and redo in the order of tens of bytes.
  • 11.2: our code extends the row to 260 columns, which means Oracle has to split it into two pieces of 5 columns and 255 columns respectively – so one of those row-pieces has to cause an insert to take place. Inserting a row may require a lot of redo, of course, but the undo for a (table-only) insert is, again, tiny.
  • 11.6: When Oracle has to split a wide row (>255 columns) it counts columns from the end, so the first row piece of our row is 5 columns and the second row piece (which is the one inserted by the 11.2 operation) is 255 columns. This means we have to overwrite the original row piece with a much shorter row piece. So we’re replacing 2,750 bytes (250 columns) with 55 bytes (5 columns), which means we have to write the contents of the “deleted” 250 columns to the undo tablespace – and that’s where most of the huge volume of undo comes from.

There are two extra points to note about the way Oracle handles the insert/overwrite steps. The length of our row exaggerates the effect, of course, but when we insert the ending 255 columns the block they go to is probably going to change from nearly empty to half full, or half full to full – which means its bitmap entry has to be updated; similarly when the initial 250 columns is overwritten with just 5 columns a huge amount of free space will appear in the block which, again, means that the block’s bitmap entry has to be updated. This gives us a good idea of why we see so many 13.22 (“L1 bitmap state change”) redo change vectors.

The second point is that the numbers still don’t add up. Allowing a couple of hundred bytes of undo per row for the lock row and insert row, then 2,750 plus a few more for the overwrite, I’ve accounted for about 3,000 bytes per row updated – which leaves me short by about 800 bytes per row.  If I dump undo blocks I can see that the undo change vector for the overwrite is actually 3,628 bytes long rather than being “slightly over” the 2,750 for the before image of the overwritten row. Strangely I can see a couple of hundred bytes of what looks like damage in the row image, and there’s also a couple of hundred bytes of what looks like garbage (but probably isn’t) after the row image, but I’ve got no idea why there’s so much excess data in the record.

One day I (or someone else) may get around to looking very closely at why that particular undo record is a lot bigger than an observer might expect – but at least we can see that the space is there, and even if some of that volume could be made to disappear the problem of the massive undo relating to the overwrite and Oracle’s odd strategy of counting columns from the end is still going to be present, and there are probably some occasions when you need to know about this.

tl;dr

Once the number of columns in a table goes over 255 then a simple update to a “trailing” null column (i.e. one past the current end of the row) is likely to generate a much larger volume of undo and redo than you might expect. In particular the size of the undo record is likely to be equivalent to the volume of the last 255 columns of the table – and then a large fraction more.

The reason why this particular anomaly came to my attention is because a client had a problem upgrading an application that required them to add a few columns to a table and then update a couple of them. The size of the undo tablespace was 300 GB, the update (the only thing running at the time) should have added about 30 bytes to the length of each row, and the update should have affected 250 million rows. The process crashed after a few hours with “ORA-30036: unable to extend segment by %s in undo tablespace ‘%s'”. Even allowing for a “reasonable” overhead it seemed rather surprising that Oracle needed more than 1,200 bytes of undo space per row – but then I found the table was defined with 350 columns.

Solutions for this user: it’s an upgrade that’s allowed to take quite a long time, and the nature of the update is such that it would be possible to update in batches, committing after each batch. It would also be nice to review how the table was used to see if it could be rebuilt with a different column order to move all the unused columns to the end of the row – with a little luck the resulting table might have almost all its rows fitting into a single row piece, even after the upgrade.
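
A minimal sketch of the batched approach, using the t1 / col259 example from this article (the batch size is arbitrary, and it assumes the new column is null before the update so each pass picks up only the remaining rows):

declare
        rows_done       pls_integer;
begin
        loop
                update  t1
                set     col259 = lpad('259',10,'0')
                where   col259 is null
                and     rownum <= 10000;

                rows_done := sql%rowcount;
                commit;

                exit when rows_done = 0;
        end loop;
end;
/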

Footnote

If you want to experiment, here’s the whole script to create the table, insert some rows, then add a few more columns and update one of them. It’s very long, and not in the least bit exciting, but it may save you a little typing time if you want to use it.


create table t1(
        col000,
        col001,
        col002,
        col003,
        col004,
        col005,
        col006,
        col007,
        col008,
        col009,
        col010,
        col011,
        col012,
        col013,
        col014,
        col015,
        col016,
        col017,
        col018,
        col019,
        col020,
        col021,
        col022,
        col023,
        col024,
        col025,
        col026,
        col027,
        col028,
        col029,
        col030,
        col031,
        col032,
        col033,
        col034,
        col035,
        col036,
        col037,
        col038,
        col039,
        col040,
        col041,
        col042,
        col043,
        col044,
        col045,
        col046,
        col047,
        col048,
        col049,
        col050,
        col051,
        col052,
        col053,
        col054,
        col055,
        col056,
        col057,
        col058,
        col059,
        col060,
        col061,
        col062,
        col063,
        col064,
        col065,
        col066,
        col067,
        col068,
        col069,
        col070,
        col071,
        col072,
        col073,
        col074,
        col075,
        col076,
        col077,
        col078,
        col079,
        col080,
        col081,
        col082,
        col083,
        col084,
        col085,
        col086,
        col087,
        col088,
        col089,
        col090,
        col091,
        col092,
        col093,
        col094,
        col095,
        col096,
        col097,
        col098,
        col099,
        col100,
        col101,
        col102,
        col103,
        col104,
        col105,
        col106,
        col107,
        col108,
        col109,
        col110,
        col111,
        col112,
        col113,
        col114,
        col115,
        col116,
        col117,
        col118,
        col119,
        col120,
        col121,
        col122,
        col123,
        col124,
        col125,
        col126,
        col127,
        col128,
        col129,
        col130,
        col131,
        col132,
        col133,
        col134,
        col135,
        col136,
        col137,
        col138,
        col139,
        col140,
        col141,
        col142,
        col143,
        col144,
        col145,
        col146,
        col147,
        col148,
        col149,
        col150,
        col151,
        col152,
        col153,
        col154,
        col155,
        col156,
        col157,
        col158,
        col159,
        col160,
        col161,
        col162,
        col163,
        col164,
        col165,
        col166,
        col167,
        col168,
        col169,
        col170,
        col171,
        col172,
        col173,
        col174,
        col175,
        col176,
        col177,
        col178,
        col179,
        col180,
        col181,
        col182,
        col183,
        col184,
        col185,
        col186,
        col187,
        col188,
        col189,
        col190,
        col191,
        col192,
        col193,
        col194,
        col195,
        col196,
        col197,
        col198,
        col199,
        col200,
        col201,
        col202,
        col203,
        col204,
        col205,
        col206,
        col207,
        col208,
        col209,
        col210,
        col211,
        col212,
        col213,
        col214,
        col215,
        col216,
        col217,
        col218,
        col219,
        col220,
        col221,
        col222,
        col223,
        col224,
        col225,
        col226,
        col227,
        col228,
        col229,
        col230,
        col231,
        col232,
        col233,
        col234,
        col235,
        col236,
        col237,
        col238,
        col239,
        col240,
        col241,
        col242,
        col243,
        col244,
        col245,
        col246,
        col247,
        col248,
        col249
)
as
with generator as (
        select 
                rownum id
        from dual 
        connect by 
                level <= 1e3 -- > comment to avoid WordPress format issue
)
select
        lpad(000,10,'0'),
        lpad(001,10,'0'),
        lpad(002,10,'0'),
        lpad(003,10,'0'),
        lpad(004,10,'0'),
        lpad(005,10,'0'),
        lpad(006,10,'0'),
        lpad(007,10,'0'),
        lpad(008,10,'0'),
        lpad(009,10,'0'),
        lpad(010,10,'0'),
        lpad(011,10,'0'),
        lpad(012,10,'0'),
        lpad(013,10,'0'),
        lpad(014,10,'0'),
        lpad(015,10,'0'),
        lpad(016,10,'0'),
        lpad(017,10,'0'),
        lpad(018,10,'0'),
        lpad(019,10,'0'),
        lpad(020,10,'0'),
        lpad(021,10,'0'),
        lpad(022,10,'0'),
        lpad(023,10,'0'),
        lpad(024,10,'0'),
        lpad(025,10,'0'),
        lpad(026,10,'0'),
        lpad(027,10,'0'),
        lpad(028,10,'0'),
        lpad(029,10,'0'),
        lpad(030,10,'0'),
        lpad(031,10,'0'),
        lpad(032,10,'0'),
        lpad(033,10,'0'),
        lpad(034,10,'0'),
        lpad(035,10,'0'),
        lpad(036,10,'0'),
        lpad(037,10,'0'),
        lpad(038,10,'0'),
        lpad(039,10,'0'),
        lpad(040,10,'0'),
        lpad(041,10,'0'),
        lpad(042,10,'0'),
        lpad(043,10,'0'),
        lpad(044,10,'0'),
        lpad(045,10,'0'),
        lpad(046,10,'0'),
        lpad(047,10,'0'),
        lpad(048,10,'0'),
        lpad(049,10,'0'),
        lpad(050,10,'0'),
        lpad(051,10,'0'),
        lpad(052,10,'0'),
        lpad(053,10,'0'),
        lpad(054,10,'0'),
        lpad(055,10,'0'),
        lpad(056,10,'0'),
        lpad(057,10,'0'),
        lpad(058,10,'0'),
        lpad(059,10,'0'),
        lpad(060,10,'0'),
        lpad(061,10,'0'),
        lpad(062,10,'0'),
        lpad(063,10,'0'),
        lpad(064,10,'0'),
        lpad(065,10,'0'),
        lpad(066,10,'0'),
        lpad(067,10,'0'),
        lpad(068,10,'0'),
        lpad(069,10,'0'),
        lpad(070,10,'0'),
        lpad(071,10,'0'),
        lpad(072,10,'0'),
        lpad(073,10,'0'),
        lpad(074,10,'0'),
        lpad(075,10,'0'),
        lpad(076,10,'0'),
        lpad(077,10,'0'),
        lpad(078,10,'0'),
        lpad(079,10,'0'),
        lpad(080,10,'0'),
        lpad(081,10,'0'),
        lpad(082,10,'0'),
        lpad(083,10,'0'),
        lpad(084,10,'0'),
        lpad(085,10,'0'),
        lpad(086,10,'0'),
        lpad(087,10,'0'),
        lpad(088,10,'0'),
        lpad(089,10,'0'),
        lpad(090,10,'0'),
        lpad(091,10,'0'),
        lpad(092,10,'0'),
        lpad(093,10,'0'),
        lpad(094,10,'0'),
        lpad(095,10,'0'),
        lpad(096,10,'0'),
        lpad(097,10,'0'),
        lpad(098,10,'0'),
        lpad(099,10,'0'),
        lpad(100,10,'0'),
        lpad(101,10,'0'),
        lpad(102,10,'0'),
        lpad(103,10,'0'),
        lpad(104,10,'0'),
        lpad(105,10,'0'),
        lpad(106,10,'0'),
        lpad(107,10,'0'),
        lpad(108,10,'0'),
        lpad(109,10,'0'),
        lpad(110,10,'0'),
        lpad(111,10,'0'),
        lpad(112,10,'0'),
        lpad(113,10,'0'),
        lpad(114,10,'0'),
        lpad(115,10,'0'),
        lpad(116,10,'0'),
        lpad(117,10,'0'),
        lpad(118,10,'0'),
        lpad(119,10,'0'),
        lpad(120,10,'0'),
        lpad(121,10,'0'),
        lpad(122,10,'0'),
        lpad(123,10,'0'),
        lpad(124,10,'0'),
        lpad(125,10,'0'),
        lpad(126,10,'0'),
        lpad(127,10,'0'),
        lpad(128,10,'0'),
        lpad(129,10,'0'),
        lpad(130,10,'0'),
        lpad(131,10,'0'),
        lpad(132,10,'0'),
        lpad(133,10,'0'),
        lpad(134,10,'0'),
        lpad(135,10,'0'),
        lpad(136,10,'0'),
        lpad(137,10,'0'),
        lpad(138,10,'0'),
        lpad(139,10,'0'),
        lpad(140,10,'0'),
        lpad(141,10,'0'),
        lpad(142,10,'0'),
        lpad(143,10,'0'),
        lpad(144,10,'0'),
        lpad(145,10,'0'),
        lpad(146,10,'0'),
        lpad(147,10,'0'),
        lpad(148,10,'0'),
        lpad(149,10,'0'),
        lpad(150,10,'0'),
        lpad(151,10,'0'),
        lpad(152,10,'0'),
        lpad(153,10,'0'),
        lpad(154,10,'0'),
        lpad(155,10,'0'),
        lpad(156,10,'0'),
        lpad(157,10,'0'),
        lpad(158,10,'0'),
        lpad(159,10,'0'),
        lpad(160,10,'0'),
        lpad(161,10,'0'),
        lpad(162,10,'0'),
        lpad(163,10,'0'),
        lpad(164,10,'0'),
        lpad(165,10,'0'),
        lpad(166,10,'0'),
        lpad(167,10,'0'),
        lpad(168,10,'0'),
        lpad(169,10,'0'),
        lpad(170,10,'0'),
        lpad(171,10,'0'),
        lpad(172,10,'0'),
        lpad(173,10,'0'),
        lpad(174,10,'0'),
        lpad(175,10,'0'),
        lpad(176,10,'0'),
        lpad(177,10,'0'),
        lpad(178,10,'0'),
        lpad(179,10,'0'),
        lpad(180,10,'0'),
        lpad(181,10,'0'),
        lpad(182,10,'0'),
        lpad(183,10,'0'),
        lpad(184,10,'0'),
        lpad(185,10,'0'),
        lpad(186,10,'0'),
        lpad(187,10,'0'),
        lpad(188,10,'0'),
        lpad(189,10,'0'),
        lpad(190,10,'0'),
        lpad(191,10,'0'),
        lpad(192,10,'0'),
        lpad(193,10,'0'),
        lpad(194,10,'0'),
        lpad(195,10,'0'),
        lpad(196,10,'0'),
        lpad(197,10,'0'),
        lpad(198,10,'0'),
        lpad(199,10,'0'),
        lpad(200,10,'0'),
        lpad(201,10,'0'),
        lpad(202,10,'0'),
        lpad(203,10,'0'),
        lpad(204,10,'0'),
        lpad(205,10,'0'),
        lpad(206,10,'0'),
        lpad(207,10,'0'),
        lpad(208,10,'0'),
        lpad(209,10,'0'),
        lpad(210,10,'0'),
        lpad(211,10,'0'),
        lpad(212,10,'0'),
        lpad(213,10,'0'),
        lpad(214,10,'0'),
        lpad(215,10,'0'),
        lpad(216,10,'0'),
        lpad(217,10,'0'),
        lpad(218,10,'0'),
        lpad(219,10,'0'),
        lpad(220,10,'0'),
        lpad(221,10,'0'),
        lpad(222,10,'0'),
        lpad(223,10,'0'),
        lpad(224,10,'0'),
        lpad(225,10,'0'),
        lpad(226,10,'0'),
        lpad(227,10,'0'),
        lpad(228,10,'0'),
        lpad(229,10,'0'),
        lpad(230,10,'0'),
        lpad(231,10,'0'),
        lpad(232,10,'0'),
        lpad(233,10,'0'),
        lpad(234,10,'0'),
        lpad(235,10,'0'),
        lpad(236,10,'0'),
        lpad(237,10,'0'),
        lpad(238,10,'0'),
        lpad(239,10,'0'),
        lpad(240,10,'0'),
        lpad(241,10,'0'),
        lpad(242,10,'0'),
        lpad(243,10,'0'),
        lpad(244,10,'0'),
        lpad(245,10,'0'),
        lpad(246,10,'0'),
        lpad(247,10,'0'),
        lpad(248,10,'0'),
        lpad(249,10,'0')
from
        generator       v2
where
        rownum <= 1e4 -- > comment to avoid WordPress format issue
;

begin
        dbms_stats.gather_table_stats(
                ownname     => user,
                tabname     => 'T1',
                method_opt  => 'for all columns size 1'
        );
end;
/

select
        avg_row_len, num_rows,  blocks,
        num_rows / trunc(8000/avg_row_len) estimated_blocks
from
        user_tables
where
        table_name = 'T1'
;

prompt  =================
prompt  Add a few columns
prompt  =================

alter table t1 add(
        col250 varchar2(10),
        col251 varchar2(10),
        col252 varchar2(10),
        col253 varchar2(10),
        col254 varchar2(10),
        col255 varchar2(10),
        col256 varchar2(10),
        col257 varchar2(10),
        col258 varchar2(10),
        col259 varchar2(10)
)
;

-- alter system switch logfile;

update t1 set col259 = lpad('259',10,'0');
commit;

-- execute dump_log

P.S. If you do investigate and solve the question of the excess space in the undo record, and the odd content in the row "before image", then do let me know. (Hint: part of the excess may be a "null columns" map – but that still leaves plenty to account for.)

Match_recognize

Mon, 2018-02-26 08:59

In the spirit of Cary Millsap’s comment: “The fastest way to do anything is to not do it at all”, here’s my take (possibly not an original one) on solving problems:

“The best time to solve a problem is before it has happened.”

I spend quite a lot of my “non-contact” time thinking about boundary cases, feature collisions, contention issues, and any other things that could go wrong when you start to implement real systems with (new) Oracle features. The benefit of doing this, of course, is that when I’m looking at a client’s system I can often solve problems because I recognise symptoms that I’ve previously created “in the lab”. The strange thing about this is that there have been times when I’ve pushed Oracle to a breaking point, documented it, and then dismissed the threat because “no one would do that in real life” only to find that someone has done it in real life.

All this is just a preamble to a demonstration of a threat with a terrific feature that is just beginning to gain greater acceptance as a solution to some interesting problems – and the demonstration is going to exaggerate the problem to a level that (probably) won't appear in a production system. The driving example appeared as a question on the OTN/ODC database forum:

“I need customers who have done a transaction in September but not in October.”

There are obviously many ways to address this type of requirement (my first thought was to use the MINUS operator), and a few questions you might ask before trying to address it, but the OP had supplied some data to play with – just a few rows of a table with three columns, with the dates restricted to a single year – and one solution offered was a very simple query using match_recognize():


CREATE TABLE TEST_TABLE   
  ( T_ID NUMBER, -- trans-id  
    CUST_ID NUMBER,   
    TRANS_DT DATE  
  ) ;  
                  
Insert into TEST_TABLE (T_ID,CUST_ID,TRANS_DT) values (1,100,to_date('12-SEP-17','DD-MON-RR'));  
Insert into TEST_TABLE (T_ID,CUST_ID,TRANS_DT) values (2,100,to_date('12-OCT-17','DD-MON-RR'));  
Insert into TEST_TABLE (T_ID,CUST_ID,TRANS_DT) values (3,200,to_date('12-SEP-17','DD-MON-RR'));  
Insert into TEST_TABLE (T_ID,CUST_ID,TRANS_DT) values (4,300,to_date('12-OCT-17','DD-MON-RR'));  
Insert into TEST_TABLE (T_ID,CUST_ID,TRANS_DT) values (5,400,to_date('12-JAN-17','DD-MON-RR'));  
Insert into TEST_TABLE (T_ID,CUST_ID,TRANS_DT) values (6,500,to_date('12-OCT-17','DD-MON-RR'));  
Insert into TEST_TABLE (T_ID,CUST_ID,TRANS_DT) values (7,500,to_date('12-MAR-17','DD-MON-RR'));  
Insert into TEST_TABLE (T_ID,CUST_ID,TRANS_DT) values (8,600,to_date('12-SEP-17','DD-MON-RR'));  
Insert into TEST_TABLE (T_ID,CUST_ID,TRANS_DT) values (9,600,to_date('12-JUL-17','DD-MON-RR'));  

commit;

select * from test_table
match_recognize
(
  partition by cust_id
  order by trans_dt
  pattern( x+ y* $)
  define
    x as extract(month from trans_dt)  = 9,
    y as extract(month from trans_dt) != 10
);
 
   CUST_ID
----------
       200
       600
      

The obvious benefit of this solution over a solution involving a set-wise MINUS is that it need only scan the data set once (whereas the MINUS strategy will be scanning it twice with a select distinct in each scan) – but it’s a solution that is likely to be unfamiliar to many people and may need a little explanation.
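
For comparison, the MINUS approach I had in mind would look something like this (my sketch, not code from the thread – note that MINUS imposes its own "distinct" on each branch):

select cust_id from test_table where extract(month from trans_dt) = 9
minus
select cust_id from test_table where extract(month from trans_dt) = 10
;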

The partition by cust_id order by trans_dt means we sort the data by those two columns, breaking on cust_id. Then for each cust_id we walk through the data looking for a pattern which is defined as: "one or more rows where the month is September, followed by zero or more rows where the month is NOT October, followed by the end of the set for the customer". The SQL leaves many details to default so the result set is just the cust_id column and only one row per occurrence of the pattern (which, given the data set, can occur at most once per customer).
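
Spelling out the two most important defaults, the query above is equivalent to the following (one row per match and after match skip past last row are what you get if you say nothing):

select  cust_id
from    test_table
match_recognize
(
  partition by cust_id
  order by trans_dt
  one row per match
  after match skip past last row
  pattern( x+ y* $)
  define
    x as extract(month from trans_dt)  = 9,
    y as extract(month from trans_dt) != 10
);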

For a cust_id that shows a matching pattern the work we will have done is:

  • Walk through rows for Jan to Aug until we reach the first September – which is the start of pattern
  • Keep on walking through to the last of the Septembers – which is a partial match
  • One of:
      • Walk through zero rows of November and December and reach the end of cust_id
      • Walk through one or more rows of November and/or December then reach the end of cust_id
  • Record the end of pattern by reporting one row
  • Move on to next cust_id

The excitement starts when we think about a cust_id that doesn’t have a matching pattern – and for that I’m going to generate a new, extreme, data set.


rem
rem     Script:         match_recognize_07.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Feb 2018
rem

create table t1
nologging
as
with generator as (
        select
                rownum id
        from dual
        connect by
                level <= 1e4 -- > comment to avoid WordPress format issue
)
select
        rownum                          id,
        99                              cust_id,
        to_date('01-Sep-2017')          trans_dt,
        lpad(rownum,1000,'0')           padding
from
        generator       v1,
        generator       v2
where
        rownum <= 1e5 -- > comment to avoid WordPress format issue
;

update t1
set
        trans_dt = to_date('01-Oct-2017','dd-mon-yyyy')
where
        rownum = 1
;

begin
        dbms_stats.gather_table_stats(
                ownname     => user,
                tabname     => 'T1',
                method_opt  => 'for all columns size 1'
        );
end;
/

select  *
from    (
        select 
                t1.*,
                extract(year from trans_dt) yr, 
                extract(month from trans_dt) mth
        from
                t1
        )
match_recognize
(
        partition by cust_id
        order by trans_dt
        measures
                padding as t1_padding
        pattern( x+  y*  $ )
        define
                x as mth = 9,
                y as mth != 10
);

I’ve moved the calculation of month number from the define clause into an in-line view purely to make the match_recognize() clause a little tidier.

I’ve created a table with just one customer with 100,000 transactions on 1st September 2017, then I’ve updated one row from September to October. Thanks to that one row Oracle is not going to be able to find the requested pattern. I’ve added a padding column of 1,000 characters to the table and included it in the measures that I want to select, so Oracle will have to sort roughly 100MB of data (100,000 rows at roughly 1KB per row) before it starts walking the data to find matches – and, though it’s not visible in the script, the workarea settings mean the session won’t be allowed to expand its PGA to accommodate the whole 100MB.
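
The workarea settings aren't shown in the script; a sketch of the sort of thing I mean (the sizes here are illustrative assumptions, not the values I actually used) would be:

alter session set workarea_size_policy = manual;
alter session set sort_area_size = 10485760;            -- 10MB: far less than the 100MB to be sorted
alter session set hash_area_size = 10485760;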

Test 1 – comment out the update and see how long it takes to produce a result: 0.67 seconds, and the padding value reported was the last one from the pattern.
Test 2 – put the update back in place and try again:

After running for 46 seconds with no result and interrupting the query, these are some figures from a snapshot of the session stats:

Name                                                 Value
----                                                 -----
CPU used when call started                           3,662
DB time                                              3,711
user I/O wait time                                   1,538
consistent gets                                     14,300
physical reads direct                            1,298,939
physical read IO requests                          736,478
physical read bytes                         10,640,908,288      
physical writes                                     25,228
physical writes direct                              25,228
physical reads direct temporary tablespace       1,298,939
physical writes direct temporary tablespace         25,228
table scan rows gotten                             100,000
table scan blocks gotten                            14,286

  • I’ve scanned a table of 14,286 blocks to find 100,000 rows.
  • I’ve sorted and spilled to disc, using roughly 25,000 blocks of direct path writes and reads to do the sort.
  • Then I've spent the rest of the time burning up CPU and reading 1.27 million blocks from the temporary tablespace trying to find a match.

The way that basic pattern matching works on a match failure is to go back to the row after the one where the current match attempt started, and begin all over again. So in this example, after dumping 100MB of Septembers to temp, Oracle started at row 1, read 99,999 rows, then found the October that failed the match; so it went to row 2, read 99,998 rows, then found the October that failed the match; so it went to row 3 and so on. Every time it went back to (nearly) the beginning it had to start re-reading those 100,000 rows from temp because the session wasn't allowed to keep the whole 100MB in memory.

You need to avoid defining a pattern that has to scan large volumes of data to identify a single occurrence of the pattern if the matching process is likely to fail. Even if you can keep the appropriate volume of data in memory for the entire time and avoid a catastrophic volume of reads from the temporary tablespace you can still see a huge amount of CPU being used to process the data – when I reduced the table from 100,000 rows to 10,000 rows it still took me 99 CPU seconds to run the query.

tl;dr

Match_recognize() is a terrific tool, but you must remember two important details about the default behaviour when you think about using it:

  • You will sort a volume of data that is the number of input rows multiplied by the total length of the measures/partition output.
  • If you have a long sequence of rows that ends up failing to match a pattern Oracle goes back to the row after the start of the previous match attempt.

With the usual proviso that "large", "small" etc. are all relative: keep the data volume small, and try to define patterns that will be short runs of rows.

Do note, however, that I engineered this example to produce a catastrophe. There are many non-default actions you can choose to minimise the workload you’re likely to produce with match_recognize(), and if you just spare a little time to think about worst case events you probably won’t need to face a scenario like this in a real production environment.
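
As a simple example of the sort of defensive step I mean (my sketch, not a full treatment of the options): if the requirement really is only about September and October you could filter the data down to just those months before it reaches the pattern matching, so a failed match can only ever re-walk a couple of months of data per customer:

select  cust_id
from    (
        select  cust_id, trans_dt
        from    test_table
        where   trans_dt >= date '2017-09-01'
        and     trans_dt <  date '2017-11-01'
        )
match_recognize
(
        partition by cust_id
        order by trans_dt
        pattern( x+ y* $ )
        define
                x as extract(month from trans_dt)  = 9,
                y as extract(month from trans_dt) != 10
);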

See also:

Part 6 (which includes a list of earlier installments) of an introductory series to match_recognize() by Keith Laker.

A pdf file of Keith Laker’s presentation on match_recognize(), including some technical implementation details.

 

Huge Pages

Thu, 2018-02-22 03:03

A useful quick summary from Neil Chandler replying to a thread on Oracle-L:

Topic: RAC install on Linux

You should always be using Hugepages.

They give a minor performance improvement and a significant memory saving in terms of the amount of memory needed to handle the pages – fewer Translation Lookaside Buffer (TLB) entries, which also means fewer TLB misses (which are expensive).

You are handling the memory chopped up into 2MB pieces instead of 4K. But you also have a single shared memory TLB for Hugepages.

The kernel has less work to do, bookkeeping fewer pointers in the TLB.

You also have contiguous memory allocation and it can’t be swapped.

If you are having problems with Hugepages, you have probably overallocated them (I've seen this several times at clients so it's not uncommon). Hugepages can *only* be used for your SGAs. All of your SGAs should fit into the Hugepages and that should generally be no more than about 60% of the total server memory (but there are exceptions), leaving plenty of "normal" memory (small pages) for PGA, O/S and other stuff like monitoring agents.

As an added bonus, AMM can't use Hugepages, so you are forced to use ASMM. AMM doesn't work well and has been kind-of deprecated by Oracle anyway – dbca won't let you set up AMM if the server has more than 4GB of memory.
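
For reference – this is my addition, not part of Neil's reply, and the figure below is purely illustrative – the two sides of the configuration are typically a kernel setting and an Oracle parameter:

-- Linux side: reserve enough 2MB pages for all the SGAs on the server,
-- e.g. in /etc/sysctl.conf (the value here is just an example):
--      vm.nr_hugepages = 10240

-- Oracle side: fail at startup rather than silently falling back to small pages
alter system set use_large_pages = 'ONLY' scope = spfile;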

 

Interval Partition Problem

Wed, 2018-02-21 02:40

Assume you've got a huge temporary tablespace, there's plenty of space in your favourite tablespace, you've got a very boring, simple table you want to copy and partition, and no-one and nothing is using the system. Would you really expect a (fairly) ordinary "create table t2 as select * from t1" to end with an Oracle error "ORA-1652: unable to extend temp segment by 128 in tablespace TEMP"? That's the temporary tablespace that's out of space, not the target tablespace for the copy.

Here’s a sample data set (tested on 11.2.0.4 and 12.1.0.2) to demonstrate the surprise – you’ll need about 900MB of space by the time the entire model has run to completion:

rem
rem     Script:         pt_interval_threat_2.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Feb 2018
rem

column tomorrow new_value m_tomorrow
select to_char(sysdate,'dd-mon-yyyy') tomorrow from dual;

create table t1
as
with g as (
        select rownum id
        from dual
        connect by level <= 2e3
)
select
        rownum id,
        trunc(sysdate) + g2.id  created,
        rpad('x',50)            padding
from
        g g1,
        g g2
where
        rownum <= 4e6 -- > comment to avoid WordPress format mess
;

execute dbms_stats.gather_table_stats(user,'t1',method_opt=>'for all columns size 1')

I’ve created a table of 4 million rows, covering 2,000 dates out into the future starting from sysdate+1 (tomorrow). As you can see there’s nothing in the slightest bit interesting, unusual, or exciting about the data types and content of the table.

I said my “create table as select” was fairly ordinary – but it’s actually a little bit out of the way because it’s going to create a partitioned copy of this table.


execute snap_my_stats.start_snap

create table t2
partition by range(created)
interval(numtodsinterval(7, 'day'))
(
        partition p_start       values less than (to_date('&m_tomorrow','dd-mon-yyyy'))
)
storage(initial 1M)
nologging
as
select
        *
from
        t1
;

set serveroutput on
execute snap_my_stats.end_snap

I’ve created the table as a range-partitioned table with an interval() declared. Conveniently I need only mention the partitioning column by name in the declaration, rather than listing all the columns with their types, and I’ve only specified a single starting partition. Since the interval is 7 days and the data spans 2,000 days I’m going to end up with nearly 290 partitions added.

There’s no guarantee that you will see the ORA-01652 error when you run this test – the data size is rather small and your machine may have sufficient other resources to hide the problem even when you’re looking for it – but the person who reported the problem on the OTN/ODC database forum was copying a table of 2.5 Billion rows using about 200 GB of storage, so size is probably important, hence the 4 million rows as a starting point on my small system.

Of course, hitting an ORA-01652 on TEMP when doing a simple “create as select” is such an unlikely sounding error that you don’t necessarily have to see it actually happen; all you need to see (at least as a starting point in a small model) is TEMP being used unexpectedly so, for my first test (on 11.2.0.4), I’ve included some code to calculate and report changes in the session stats – that’s the calls to the package snap_my_stats. Here are some of the more interesting results:


---------------------------------
Session stats - 20-Feb 16:58:24
Interval:-  14 seconds
---------------------------------
Name                                                                     Value
----                                                                     -----
table scan rows gotten                                               4,000,004
table scan blocks gotten                                                38,741

session pga memory max                                             181,338,112

sorts (rows)                                                         2,238,833

physical reads direct temporary tablespace                              23,313
physical writes direct temporary tablespace                             23,313

The first couple of numbers show the 4,000,000 rows being scanned from 38,741 table blocks – and that's not a surprise. But for a simple copy the 181MB of PGA memory we've acquired is a little surprising, though less so when we see that we've sorted 2.2M rows, and then ended up spilling 23,313 blocks to the temporary tablespace. But why are we sorting anything – what are those rows?

My first thought was that there was a bug in some recursive SQL that was trying to define or identify dynamically created partitions, or maybe something in the space management code trying to find free space, so the obvious step was to enable extended tracing and look for any recursive statements that were running a large number of times or doing a lot of work. There weren’t any – and the trace file (particularly the detailed wait events) suggested the problem really was purely to do with the CTAS itself; so I ran the code again enabling events 10032 and 10033 (the sort traces) and found the following:


---- Sort Statistics ------------------------------
Initial runs                              1
Input records                             2140000
Output records                            2140000
Disk blocks 1st pass                      22292
Total disk blocks used                    22294
Total number of comparisons performed     0
Temp segments allocated                   1
Extents allocated                         175
Uses version 1 sort
Uses asynchronous IO

One single operation had resulted in Oracle sorting 2.14 million rows (but not making any comparisons!) – and the only table in the entire system with enough rows to do that was my source table! Oracle seems to be sorting a large fraction of the data for no obvious reason before inserting it.

  • Why, and why only 2.14M out of 4M ?
  • Does it do the same on 12.1.0.2 (yes), what about 12.2.0.1 (no – hurrah: unless it just needs a larger data set!).
  • Is there any clue about this on MoS (yes Bug 17655392 – though that one is erroneously, I think, flagged as “closed not a bug”)
  • Is there a workaround ? (Yes – I think so).

Playing around and trying to work out what's happening, the obvious pointers are the large memory allocation and the "incomplete" spill to disc – so what would happen if I fiddled around with workarea sizing – switching it to manual, say, or setting the pga_aggregate_target to a low value? At one point I got results showing 19M rows (that's not a typo, it really was close to 5 times the number of rows in the table) sorted with a couple of hundred thousand blocks of TEMP used – the 10033 trace showed 9 consecutive passes (that I can't explain) as the code executed, from which I've extracted the row counts, temp blocks used, and number of comparisons made:


Input records                             3988000
Total disk blocks used                    41544
Total number of comparisons performed     0

Input records                             3554000
Total disk blocks used                    37023
Total number of comparisons performed     0

Input records                             3120000
Total disk blocks used                    32502
Total number of comparisons performed     0

Input records                             2672000
Total disk blocks used                    27836
Total number of comparisons performed     0

Input records                             2224000
Total disk blocks used                    23169
Total number of comparisons performed     0

Input records                             1762000
Total disk blocks used                    18357
Total number of comparisons performed     0

Input records                             1300000
Total disk blocks used                    13544
Total number of comparisons performed     0

Input records                             838000
Total disk blocks used                    8732
Total number of comparisons performed     0

Input records                             376000
Total disk blocks used                    3919
Total number of comparisons performed     0

There really doesn't seem to be any good reason why Oracle should do any sorting of the data (and maybe it wasn't, given the total number of comparisons performed in this case) – except, perhaps, to allow it to do bulk inserts into each partition in turn or, possibly, to avoid creating an entire new partition at exactly the moment it finds just the first row that needs to go into a new partition. Thinking along these lines I decided to pre-create all the necessary partitions just in case this made any difference – the code is at the end of the blog note. Another idea was to create the table empty (with, and without, pre-created partitions), then do an "insert /*+ append */" of the data.
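
The "insert /*+ append */" variation would be nothing more sophisticated than this (using the footnote version of the create table statement, which ends with "where rownum <= 0" so that t2 starts empty):

insert /*+ append */ into t2 select * from t1;
commit;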

Nothing changed (much – though the number of rows sorted kept varying).

And then — it all started working perfectly with virtually no rows reported sorted and no I/O to the temporary tablespace !

Fortunately I thought of looking at v$memory_resize_ops and found that the automatic memory management had switched a lot of memory to the PGA, allowing Oracle to do whatever it needed to do completely in memory without reporting any sorting. A quick re-start of the instance fixed that “workaround”.

Still struggling with finding a reasonable workaround I decided to see if the same anomaly would appear if the table were range partitioned but didn’t have an interval clause. This meant I had to precreate all the necessary partitions, of course – which I did by starting with an interval partitioned table, letting Oracle figure out which partitions to create, then disabling the interval feature – again, see the code at the end of this note.

The results: no rows sorted on the insert, no writes to temp. Unless it’s just a question of needing even more data to reproduce the problem with simple range partitioned tables, it looks as if there’s a problem somewhere in the code for interval partitioned tables and all you have to do to work around it is precreate loads of partitions, disable intervals, load, then re-enable the intervals.

Footnote:

Here’s the “quick and dirty” code I used to generate the t2 table with precreated partitions:


create table t2
partition by range(created)
interval(numtodsinterval(7, 'day'))
(
        partition p_start values less than (to_date('&m_tomorrow','dd-mon-yyyy'))
)
storage(initial 1M)
nologging
monitoring
as
select
        *
from
        t1
where
        rownum <= 0
;


declare
        m_max_date      date;
begin
        select  max(created)
        into    m_max_date
        from    t1
        ;

        
        for i in 1..m_max_date - trunc(sysdate) loop
                dbms_output.put(
                        to_char(trunc(sysdate) + i,'dd-mon-yyyy') || chr(9)
                );
                execute immediate
                        'lock table t2 partition for ('''  ||
                        to_char(trunc(sysdate) + i,'dd-mon-yyyy') ||
                        ''') in exclusive mode'
                ;
        end loop;
        dbms_output.new_line();
end;
/

prompt  ========================
prompt  How to disable intervals
prompt  ========================

alter table t2 set interval();

The code causes partitions to be created by locking the relevant partition for each date between the minimum and maximum in the t1 table; locking the partition is enough to create it if it doesn't already exist. The code is a little wasteful since it locks each partition 7 times as we walk through the dates – but it's only a quick demo for a model, and for copying a very large table the wastage would probably be very small compared to the work of doing the actual data copy. Obviously one could be more sophisticated and limit the code to locking and creating only the partitions needed, and only locking them once each.
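
Once the load is complete you can re-enable the interval feature, reinstating the original 7-day interval:

alter table t2 set interval (numtodsinterval(7, 'day'));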

 

Taking Notes – 2

Tue, 2018-02-20 05:08

[Originally written August 2015, but not previously published]

If I’m taking notes in a presentation that you’re giving there are essentially four possible reasons:

  • You’ve said something interesting that I didn’t know and I’m going to check it and think about the consequences
  • You’ve said something that I knew but you’ve said it in a way that made me think of some possible consequences that I need to check
  • You’ve said something that I think is wrong or out of date and I need to check it
  • You’ve said something that has given me a brilliant idea for solving a problem I’ve had to work around in the past and I need to work out the details

Any which way, if I’m taking notes it means I’ve probably just added a few more hours of work to my todo list.

Footnote

“Checking” can include:

  • having a chat
  • reading the manuals
  • finding a recent Oracle white-paper
  • searching MoS
  • building some models

Philosophy

Tue, 2018-02-20 05:03

Here’s a note I’ve just re-discovered – at the time I was probably planning to extend it into a longer article but I’ve decided to publish the condensed form straight away.

In a question to the Oak Table a couple of years ago (May 2015) Cary Millsap asked the following:

If you had an opportunity to tell a wide audience of system owners, users, managers, project leaders, system architects, DBAs, and developers “The most important things you should know about Oracle” what would you tell them?

I imagine that since then Cary has probably discussed the pros and cons of some of the resulting thoughts in one of his excellent presentations on how to do the right things, but this was my quick response:

If I had to address them all at once it would be time to go more philosophical than technical.

The single most important point: Oracle is a very large, complex, and flexible product. It doesn’t matter where you are approaching it from you will not have enough information on your own to make best use of it. You have to talk to your peer group to get alternative ideas, and you have to talk to the people at least one step either side of you on the technology chain (dev to dba, dba to sysadmin, Architect to dev, dba to auditor etc.) to understand options and consequences. Create 4 or 5 scenarios of how your system should behave and then get other people – and not just your peer group – to identify their advantages and threats.

Assumptions

Tue, 2018-02-20 02:57

As the years roll on I’ve found it harder and harder to supply quick answers to “simple” questions on the Oracle-L list server and OTN/ODC forum because things are constantly changing and an answer that may have been right the last time I checked could now be wrong. A simple example of the consequences of change showed up recently on the OTN/ODC forum where one reply to a question started:

Just why do you need distinct in a subquery??? That’s the first thing that appears really shocking to me. If it’s a simple in (select …) adding a distinct to the subquery would just impose a sort unique(as you can see in the explain plan), which may be quite costly.

Three question-marks is already tip-toeing its way to the Pratchett limit – but “really shocking” ? It’s bad enough that the comment goes operatic, but going operatic in order to introduce an error pushes the thing into tragedy (or possibly comedy – or maybe both). To make the self-inflicted injury worse, there were two execution plans supplied in the original post anyway of which only one showed any attempt to achieve uniqueness.

Bottom line – when you're about to correct someone for doing something that is "obviously" wrong, be a little bit kind about it and then be kind to yourself and do a quick sanity check that your attempt at correction is itself correct. A good guideline would be to ask yourself: "How do I know what I know – and am I about to make myself look like an idiot?"

Check It

Question: Does a  “distinct” in a subquery impose a sort (or hash) unique ?

Answer: No – a uniqueness operation may appear, but it’s not guaranteed to appear.

Here’s a quick example which does not result in any attempt at imposing uniqueness (running 11.2.0.4):


drop table t2 purge;
drop table t1 purge;
create table t1 as select * from all_objects where rownum  <= 100;
create table t2 as select * from all_objects where rownum <= 100;

create index t1_i1 on t1(owner);
create index t2_i2 on t2(object_type);

set autotrace traceonly explain

select  * 
from    t1 
where   owner = 'OUTLN' 
and     object_name in (
                select distinct object_name 
                from   t2 
                where  object_type = 'TABLE'
        )
;


Execution Plan
----------------------------------------------------------
Plan hash value: 3169044451

--------------------------------------------------------------------------------------
| Id  | Operation                    | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |       |     3 |   558 |     4   (0)| 00:00:01 |
|*  1 |  HASH JOIN SEMI              |       |     3 |   558 |     4   (0)| 00:00:01 |
|   2 |   TABLE ACCESS BY INDEX ROWID| T1    |     3 |   474 |     2   (0)| 00:00:01 |
|*  3 |    INDEX RANGE SCAN          | T1_I1 |     3 |       |     1   (0)| 00:00:01 |
|   4 |   TABLE ACCESS BY INDEX ROWID| T2    |    12 |   336 |     2   (0)| 00:00:01 |
|*  5 |    INDEX RANGE SCAN          | T2_I2 |    12 |       |     1   (0)| 00:00:01 |
--------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - access("OBJECT_NAME"="OBJECT_NAME")
   3 - access("OWNER"='OUTLN')
   5 - access("OBJECT_TYPE"='TABLE')

Note
-----
   - dynamic sampling used for this statement (level=2)


There’s no sign of a sort unique or hash unique. The optimizer has decided that the IN subquery can be transformed into an EXISTS subquery, which can then be transformed into a semi-join.

I can think of three other execution plan strategies that might have appeared depending on the data, indexing, and statistics:

a) Transform the IN subquery to an EXISTS subquery then operate as a filter subquery (with no uniqueness imposed):


--------------------------------------------------------------------------------------
| Id  | Operation                    | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |       |     1 |   158 |     5   (0)| 00:00:01 |
|*  1 |  FILTER                      |       |       |       |            |          |
|   2 |   TABLE ACCESS BY INDEX ROWID| T1    |     3 |   474 |     2   (0)| 00:00:01 |
|*  3 |    INDEX RANGE SCAN          | T1_I1 |     3 |       |     1   (0)| 00:00:01 |
|*  4 |   TABLE ACCESS BY INDEX ROWID| T2    |     1 |    28 |     2   (0)| 00:00:01 |
|*  5 |    INDEX RANGE SCAN          | T2_I2 |    12 |       |     1   (0)| 00:00:01 |
--------------------------------------------------------------------------------------

b) Simple unnest with sort/hash unique and join


---------------------------------------------------------------------------
| Id  | Operation                      | Name     | Rows  | Bytes | Cost  |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT               |          |     1 |   175 |     9 |
|*  1 |  HASH JOIN                     |          |     1 |   175 |     9 |
|   2 |   TABLE ACCESS BY INDEX ROWID  | T1       |     4 |   632 |     1 |
|*  3 |    INDEX RANGE SCAN            | T1_I1    |     2 |       |     1 |
|   4 |   VIEW                         | VW_NSO_1 |     4 |    68 |     7 |
|   5 |    SORT UNIQUE                 |          |     4 |   112 |     7 |
|   6 |     TABLE ACCESS BY INDEX ROWID| T2       |     4 |   112 |     1 |
|*  7 |      INDEX RANGE SCAN          | T2_I2    |     2 |       |     1 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - access("OBJECT_NAME"="OBJECT_NAME")
   3 - access("OWNER"='OUTLN')
   7 - access("OBJECT_TYPE"='TABLE')

For this data set I actually had to take the optimizer_features_enable back to '8.1.7' to get this plan – and you can see that there's a SORT UNIQUE at operation 5, but that would have been there whether or not the DISTINCT keyword had appeared in the SQL. Effectively the query has been transformed to:

select  t1.*
from    (
                select  distinct t2.object_name object_name
                from    t2
                where   t2.object_type='TABLE'
        )
        vw_nso_1,
        t1
where   t1.owner = 'OUTLN'
and     t1.object_name = vw_nso_1.object_name
/

c) Unnest, then “place group by” so that the distinct is applied after the join

--------------------------------------------------------------------------------------------
| Id  | Operation                      | Name      | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT               |           |     3 |   474 |     5  (20)| 00:00:01 |
|   1 |  VIEW                          | VM_NWVW_1 |     3 |   474 |     5  (20)| 00:00:01 |
|   2 |   HASH UNIQUE                  |           |     3 |   594 |     5  (20)| 00:00:01 |
|*  3 |    HASH JOIN                   |           |     3 |   594 |     4   (0)| 00:00:01 |
|   4 |     TABLE ACCESS BY INDEX ROWID| T1        |     3 |   510 |     2   (0)| 00:00:01 |
|*  5 |      INDEX RANGE SCAN          | T1_I1     |     3 |       |     1   (0)| 00:00:01 |
|   6 |     TABLE ACCESS BY INDEX ROWID| T2        |    12 |   336 |     2   (0)| 00:00:01 |
|*  7 |      INDEX RANGE SCAN          | T2_I2     |    12 |       |     1   (0)| 00:00:01 |
--------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - access("T1"."OBJECT_NAME"="T2"."OBJECT_NAME")
   5 - access("T1"."OWNER"='OUTLN')
   7 - access("T2"."OBJECT_TYPE"='TABLE')

Note
-----
   - dynamic sampling used for this statement (level=2)

Again, the plan would be the same whether or not the original subquery had a redundant DISTINCT. (Again, with this data set, I had to be a bit devious to get this plan.)

The things you think you know may have been true 10 years ago – but maybe they're not true any longer, or maybe they're still true on your version of the database but not on every version of the database. So I often end up looking at a question, thinking the poster's claim can't be right, and then working out and modelling the circumstances that might make the poster's observations appear (and learning something new).

Remember: “I’ve never seen it before” doesn’t mean “It doesn’t happen”.

 

Join Factorization

Wed, 2018-02-14 09:38

This item is, by a roundabout route, a follow-up to yesterday’s note on a critical difference in cardinality estimates that appeared if you used the coalesce() function in its simplest form as a substitute for the nvl() function. Connor McDonald wrote a followup note about how using the nvl() function in a suitable predicate could lead to Oracle splitting a query into a UNION ALL (in version 12.2), which led me to go back to a note I’d written on the same topic about 10 years earlier where the precursor of this feature already existed but used CONCATENATION instead of OR-EXPANSION. The script I’d used for my earlier article was actually one I’d written in February 2003 and tested fairly regularly since – which brings me to this article, because I finally tested my script against 12.2.0.1 to discover a very cute bit of optimisation.

The business of splitting a query into two parts can be used even when the queries are more complex and include joins – this doesn't always happen automatically and sometimes has to be hinted (but that can be a costs/statistics thing). For example, from 12.1.0.2 – a query and its execution plan:


select
        *
from
        t1, t2
where
        t1.v1 = nvl(:v1,t1.v1)
and     t2.n1 = t1.n1
;

---------------------------------------------------------------------------------------------------
| Id  | Operation                               | Name    | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                        |         |  1001 |   228K|    11   (0)| 00:00:01 |
|   1 |  CONCATENATION                          |         |       |       |            |          |
|*  2 |   FILTER                                |         |       |       |            |          |
|*  3 |    HASH JOIN                            |         |  1000 |   228K|     8   (0)| 00:00:01 |
|   4 |     TABLE ACCESS FULL                   | T2      |  1000 |   106K|     4   (0)| 00:00:01 |
|*  5 |     TABLE ACCESS FULL                   | T1      |  1000 |   122K|     4   (0)| 00:00:01 |
|*  6 |   FILTER                                |         |       |       |            |          |
|   7 |    NESTED LOOPS                         |         |     1 |   234 |     3   (0)| 00:00:01 |
|   8 |     NESTED LOOPS                        |         |     1 |   234 |     3   (0)| 00:00:01 |
|   9 |      TABLE ACCESS BY INDEX ROWID BATCHED| T1      |     1 |   125 |     2   (0)| 00:00:01 |
|* 10 |       INDEX RANGE SCAN                  | T1_IDX1 |     1 |       |     1   (0)| 00:00:01 |
|* 11 |      INDEX UNIQUE SCAN                  | T2_PK   |     1 |       |     0   (0)| 00:00:01 |
|  12 |     TABLE ACCESS BY INDEX ROWID         | T2      |     1 |   109 |     1   (0)| 00:00:01 |
---------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter(:V1 IS NULL)
   3 - access("T2"."N1"="T1"."N1")
   5 - filter("T1"."V1" IS NOT NULL)
   6 - filter(:V1 IS NOT NULL)
  10 - access("T1"."V1"=:V1)
  11 - access("T2"."N1"="T1"."N1")

You can see in this plan how Oracle has split the query into two queries combined through concatenation with FILTER operations at lines 2 (:v1 is null) and 6 (:v1 is not null) to allow the runtime engine to execute only the appropriate branch. You’ll also note that each branch can be optimised separately and in this case the two branches get dramatically different paths because of the enormous difference in the estimated volumes of data.

So let’s move up to 12.2.0.1 and see what happens to this query – but first I’m going to execute a naughty “alter session…”:


------------------------------------------------------------------------------------------------------------
| Id  | Operation                                | Name            | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                         |                 |  1001 |   180K|    11   (0)| 00:00:01 |
|   1 |  VIEW                                    | VW_ORE_F79C84EE |  1001 |   180K|    11   (0)| 00:00:01 |
|   2 |   UNION-ALL                              |                 |       |       |            |          |
|*  3 |    FILTER                                |                 |       |       |            |          |
|   4 |     NESTED LOOPS                         |                 |     1 |   234 |     3   (0)| 00:00:01 |
|   5 |      NESTED LOOPS                        |                 |     1 |   234 |     3   (0)| 00:00:01 |
|   6 |       TABLE ACCESS BY INDEX ROWID BATCHED| T1              |     1 |   125 |     2   (0)| 00:00:01 |
|*  7 |        INDEX RANGE SCAN                  | T1_IDX1         |     1 |       |     1   (0)| 00:00:01 |
|*  8 |       INDEX UNIQUE SCAN                  | T2_PK           |     1 |       |     0   (0)| 00:00:01 |
|   9 |      TABLE ACCESS BY INDEX ROWID         | T2              |     1 |   109 |     1   (0)| 00:00:01 |
|* 10 |    FILTER                                |                 |       |       |            |          |
|* 11 |     HASH JOIN                            |                 |  1000 |   228K|     8   (0)| 00:00:01 |
|  12 |      TABLE ACCESS FULL                   | T2              |  1000 |   106K|     4   (0)| 00:00:01 |
|* 13 |      TABLE ACCESS FULL                   | T1              |  1000 |   122K|     4   (0)| 00:00:01 |
------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - filter(:V1 IS NOT NULL)
   7 - access("T1"."V1"=:V1)
   8 - access("T2"."N1"="T1"."N1")
  10 - filter(:V1 IS NULL)
  11 - access("T2"."N1"="T1"."N1")
  13 - filter("T1"."V1" IS NOT NULL)

There's nothing terribly exciting about the change – except for the disappearance of the CONCATENATION operator and the appearance of the VIEW and UNION ALL operators to replace it (plus you'll see that the two branches appear in the opposite order in the plan). But let's try again, without doing that "alter session…":


--------------------------------------------------------------------------------------------------------------
| Id  | Operation                               | Name               | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                        |                    |  1001 |   229K|    10   (0)| 00:00:01 |
|*  1 |  HASH JOIN                              |                    |  1001 |   229K|    10   (0)| 00:00:01 |
|   2 |   TABLE ACCESS FULL                     | T2                 |  1000 |   106K|     4   (0)| 00:00:01 |
|   3 |   VIEW                                  | VW_JF_SET$A2355C8B |  1001 |   123K|     6   (0)| 00:00:01 |
|   4 |    UNION-ALL                            |                    |       |       |            |          |
|*  5 |     FILTER                              |                    |       |       |            |          |
|*  6 |      TABLE ACCESS FULL                  | T1                 |  1000 |   122K|     4   (0)| 00:00:01 |
|*  7 |     FILTER                              |                    |       |       |            |          |
|   8 |      TABLE ACCESS BY INDEX ROWID BATCHED| T1                 |     1 |   125 |     2   (0)| 00:00:01 |
|*  9 |       INDEX RANGE SCAN                  | T1_IDX1            |     1 |       |     1   (0)| 00:00:01 |
--------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - access("T2"."N1"="ITEM_1")
   5 - filter(:V1 IS NULL)
   6 - filter("T1"."V1" IS NOT NULL)
   7 - filter(:V1 IS NOT NULL)
   9 - access("T1"."V1"=:V1)

The plan now shows a VIEW which is a UNION ALL involving only table t1 in both its branches. The result set from the view is then used as the probe table of a hash join with t2. You'll note that the name of the view is now VW_JF_SET$A2355C8B – that's JF for "Join Factorization", and the alter session I executed to get the first plan was to disable the feature: 'alter session set "_optimizer_join_factorization"= false;'.

Join factorization can occur when the optimizer sees a union all view with some tables that are common to both (all) branches of the query, and finds that it can move those tables outside the query while getting the same end result at a lower cost. In this case it happens to be a nice example of how the optimizer can transform and transform again to get to the lowest cost plan.

It’s worth noting that Join Factorization has been around since 11.2.x.x, and Or Expansion has been around for even longer – but it’s not until 12.2 that nvl() transforms through Or Expansion, which allows it to transform through Join Factorization.

You'll note, by the way, that with this plan we always do a full tablescan of t2, whereas with just Or-Expansion it's a potential threat that may never (or hardly ever) be realised. That's a point to check if you find that the transformation starts to appear inappropriately on an upgrade. There is a hint to disable the feature for a query, but it's not trivial to get it right, so if you do need to block the feature the smart hint (or SQL Patch) would be "opt_param('_optimizer_join_factorization' 'false')".
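
In context the hinted version of the test query would look like this (the underscore parameter is, of course, something you'd check with Oracle support before using in production):

select  /*+ opt_param('_optimizer_join_factorization' 'false') */
        *
from
        t1, t2
where
        t1.v1 = nvl(:v1,t1.v1)
and     t2.n1 = t1.n1
;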

Footnote:

If you want to run the experiments yourself, here’s the script I used to generate the data. It’s more complicated than it needs to be because I use the same tables in several different tests:

rem
rem     Script:         null_plan_122.sql
rem     Author:         Jonathan Lewis
rem     Dated:          February 2018
rem     Purpose:
rem
rem     Last tested
rem             12.2.0.1        Join Factorization
rem             12.1.0.2        Concatenation
rem
rem

drop table t2;
drop table t1;

-- @@setup  -- various set commands etc.

create table t1 (
        n1              number(5),
        n2              number(5),
        v1              varchar2(10),
        v2              varchar2(10),
        v3              varchar2(10),
        v4              varchar2(10),
        v5              varchar2(10),
        padding         varchar2(100),
        constraint t1_pk primary key(n1)
);

insert into t1
select
        rownum,
        rownum,
        rownum,
        trunc(100 * dbms_random.value),
        trunc(100 * dbms_random.value),
        trunc(100 * dbms_random.value),
        trunc(100 * dbms_random.value),
        rpad('x',100)
from all_objects
where
        rownum <= 1000 -- > comment to avoid WordPress format mess
;

create unique index t1_n2 on t1(n2);

create index t1_idx1 on t1(v1);
create index t1_idx2 on t1(v2,v1);
create index t1_idx3 on t1(v3,v2,v1);

create table t2 (
        n1              number(5),
        v1              varchar2(10),
        padding         varchar2(100),
        constraint t2_pk primary key(n1)
);

insert into t2
select
        rownum,
        rownum,
        rpad('x',100)
from all_objects
where
        rownum <= 1000     -- > comment to avoid WordPress format mess
;

create index t2_idx on t2(v1);

begin
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          =>'T1',
                method_opt       => 'for all columns size 1'
        );

        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          =>'T2',
                method_opt       => 'for all columns size 1'
        );
end;
/

variable n1 number
variable n2 number
variable v1 varchar2(10)
variable v2 varchar2(10)
variable v3 varchar2(10)

exec :n1 := null
exec :n2 := null
exec :v1 := null
exec :v2 := null
exec :v3 := null

spool null_plan_122

set autotrace traceonly explain

prompt  ============================================
prompt  One colx = nvl(:b1,colx) predicate with join
prompt  ============================================

select
        *
from
        t1, t2
where
        t1.v1 = nvl(:v1,t1.v1)
and     t2.n1 = t1.n1
;

alter session set "_optimizer_join_factorization" = false;

select
        *
from
        t1, t2
where
        t1.v1 = nvl(:v1,t1.v1)
and     t2.n1 = t1.n1
;

alter session set "_optimizer_join_factorization" = true;

set autotrace off

spool off

Coalesce v. NVL

Tue, 2018-02-13 05:23

“Modern” SQL should use the coalesce() function rather than the nvl() function – or so the story goes – but do you always want to do that to an Oracle database ? The answer is “maybe not”. Although the coalesce() function can emulate the nvl() function (in many cases) there are significant differences in behaviour, some that suggest it’s a good idea to use the substitution and others that suggest otherwise. Different decisions may be appropriate for different circumstances, and this note highlights one case against the substitution. We’ll start with a simple data set:

rem
rem     Script:         nvl_coalesce_2.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Feb 2018
rem

create table t1
as
with generator as (
        select
                rownum id
        from dual
        connect by
                level <= 1e4 -- > comment to avoid wordpress format mess
)
select
        rownum                          id,
        case mod(rownum,4)
                when 0  then 'Y'
                        else 'N'
        end                             yes_no,
        case mod(rownum,5)
                when 0  then 'Y'
                when 1  then null
                        else 'N'
        end                             yes_null_no,
        lpad('x',100,'x')               padding
from
        generator
;

begin
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          =>'T1',
                method_opt       => 'for all columns size 1 for columns size 5 yes_no yes_null_no'
        );
end;
/

I've created a table with 10,000 rows which has two columns with a highly skewed data distribution. Because I know that the skew is supposed to have a significant effect I've used a non-standard method_opt when gathering stats – in a production system I would have used the packaged procedure dbms_stats.set_table_prefs() to associate this with the table.
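
The preference call I have in mind would look something like this (standard dbms_stats.set_table_prefs(), with the method_opt above as the preference value):

begin
        dbms_stats.set_table_prefs(
                ownname => user,
                tabname => 'T1',
                pname   => 'METHOD_OPT',
                pvalue  => 'for all columns size 1 for columns size 5 yes_no yes_null_no'
        );
end;
/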

The difference between the yes_no and the yes_null_no columns is that the latter is null for a significant fraction of the rows.

  • yes_no has: 7,500 N, 2,500 Y
  • yes_null_no has: 6,000 N, 2,000 null, 2,000 Y

Let's now try to count the "N or null" rows using two different functions and see what estimates the optimizer produces for the counts. First counting the yes_no column – using nvl() then coalesce():


set autotrace traceonly explain

select * from t1 where nvl(yes_no,'N') = 'N';
select * from t1 where coalesce(yes_no,'N') = 'N';

set autotrace off

--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |  7500 |   798K|    24   (5)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| T1   |  7500 |   798K|    24   (5)| 00:00:01 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter(NVL("YES_NO",'N')='N')

--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |   100 | 10900 |    25   (8)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| T1   |   100 | 10900 |    25   (8)| 00:00:01 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter(COALESCE("YES_NO",'N')='N')

The estimate for the nvl() is accurate; the estimate for the coalesce() query is 100 rows.

Let’s repeat the test using the yes_null_no column, again starting with nvl() followed by coalesce():


set autotrace traceonly explain

select * from t1 where nvl(yes_null_no,'N') = 'N';
select * from t1 where coalesce(yes_null_no,'N') = 'N';

set autotrace off

--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |  8000 |   851K|    24   (5)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| T1   |  8000 |   851K|    24   (5)| 00:00:01 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter(NVL("YES_NULL_NO",'N')='N')

--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |   100 | 10900 |    25   (8)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| T1   |   100 | 10900 |    25   (8)| 00:00:01 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter(COALESCE("YES_NULL_NO",'N')='N')

Again we get the right result for the nvl() estimate (8,000 = 6,000 N + 2,000 null) and 100 for the coalesce() estimate.

By now you’ve probably realised that the coalesce() estimate is simply the “1% guess for equality” that applies to most cases of function(column). So, as we saw in the previous post, coalesce() gives us the benefit of “short-circuiting”, but now we see it also threatens us with damaged cardinality estimates. The latter is probably less important than the former in many cases (especially since we might be able to address the problem very efficiently using virtual columns), but it’s worth remembering.
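
As a sketch of the virtual column idea (the column name here is just an illustrative assumption): if you add a virtual column whose definition matches the predicate expression and gather stats (with a histogram) on it, the optimizer can use those stats when it sees the same expression in a predicate.

alter table t1 add (
        yes_no_fix generated always as (coalesce(yes_no,'N')) virtual
);

begin
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          => 'T1',
                method_opt       => 'for columns yes_no_fix size 5'
        );
end;
/

After that, re-running the query with the predicate coalesce(yes_no,'N') = 'N' should produce a much more realistic cardinality estimate.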

Histogram Threat

Tue, 2018-01-30 02:07

Have you ever seen a result like this:


SQL> select sql_id, count(*) from V$sql group by sql_id having count(*) > 1000;

SQL_ID		COUNT(*)
------------- ----------
1dbzmt8gpg8x7	   30516

A client of mine, who had recently upgraded to 12.2.0.1 RAC and was using DRCP (database resident connection pooling) for a PHP application, was seeing exactly this type of behaviour for a small number of very simple SQL statements and wanted to find out what was going on because they were also seeing an undesirable level of contention in the library cache as the system load increased.

In this note I just want to highlight a particular detail of their problem – with an example – showing how easily histograms can introduce problems if you don’t keep an eye out for the dangers.

One of their queries really was as simple as this:

select count(*), sum(skew2) from t1 where skew = :b1;

And I’m going to use this query to model the problem. All I have to do is arrange for a data set that results in a hybrid (or height-balanced) histogram being created on the skew column, and then run the query lots of times with different input bind values. In the case of the client there were around 18,000 possible values for the column, and the number of rows per value varied from 1 to about 20,000 – but whatever the number of rows selected the optimum execution plan was always going to be an indexed access.


rem
rem     Script:         acs_child_cursors.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Jan 2018
rem

create table t1 (
        id, skew, skew2, padding
)
as
with generator as (
        select
                rownum id
        from dual
        connect by
                level <= 3e3
)
select
        rownum  id,
        g1.id   id1,
        g2.id   id2,
        rpad('x',100)
from
        generator       g1,
        generator       g2
where
        g2.id <= g1.id     -- > comment to avoid WordPress format issue
order by
        g2.id, g1.id
;

alter table t1 modify skew not null;
alter table t1 modify skew2 not null;

create index t1_skew on t1(skew);

begin
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          =>'T1',
                method_opt       => 'for all columns size 1 for columns skew size 254'
--              method_opt       => 'for all columns size 1'
        );
end;
/

variable b1 number
exec :b1 := 0;

set termout off
@start_1000
@start_1000
@start_1000
set termout on

set linesize 120
set trimspool on

column sql_text format a55

select
        child_number, plan_hash_value, executions,
        is_bind_sensitive,
        is_bind_aware,
        is_shareable,
        sql_text
from
        v$sql
where   sql_id = 'b82my582cnvut'
;

The data set contains 3,000 distinct values for skew and the way I’ve generated the rows means that the value N will appear N times – so there’s one row with the value 1 and 3,000 rows with the value 3,000, for a total of 4,501,500 rows (i.e. 3,000 * 3,001 / 2). If you want to run the test the code is likely to take a couple of minutes to complete, requiring roughly 700 MB of disk space.

The mechanism of the script start_1000 is something I published a few years ago; essentially it executes a script called start_1.sql 1,000 times and, for this test, start_1.sql contains the following two lines:


exec :b1 := :b1 + 1

select count(*), sum(skew2) from t1 where skew = :b1;
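
I haven’t reproduced start_1000.sql here, but if you want to build something similar a quick and dirty approach (a sketch, not necessarily the version I originally published) is to spool a driver script that simply calls start_1.sql 1,000 times:

set pagesize 0
set feedback off
set termout off
set trimspool on

spool start_1000.sql
select '@@start_1' from dual connect by level <= 1000;
spool off

set termout on
set feedback on
set pagesize 60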

The net effect of the 3 calls to start_1000.sql is that my simple SQL statement is called once in turn for each value of skew from 1 to 3,000. The SQL_ID of the statement is ‘b82my582cnvut’ which I’ve used to query v$sql when the run is complete, with the following result:


CHILD_NUMBER PLAN_HASH_VALUE EXECUTIONS I I I SQL_TEXT
------------ --------------- ---------- - - - -------------------------------------------------------
	   0	  1041516234	    498 Y N N select count(*), sum(skew2) from t1 where skew = :b1
	   1	  1041516234	     25 Y Y N select count(*), sum(skew2) from t1 where skew = :b1
	   2	  1041516234	    104 Y Y N select count(*), sum(skew2) from t1 where skew = :b1
	   3	  1041516234	    308 Y Y N select count(*), sum(skew2) from t1 where skew = :b1
	   4	  1041516234	    429 Y Y N select count(*), sum(skew2) from t1 where skew = :b1
	   5	  1041516234	    640 Y Y N select count(*), sum(skew2) from t1 where skew = :b1
	   6	  1041516234	     31 Y Y N select count(*), sum(skew2) from t1 where skew = :b1
	   7	  1041516234	    305 Y Y N select count(*), sum(skew2) from t1 where skew = :b1
	   8	  1041516234	    660 Y Y Y select count(*), sum(skew2) from t1 where skew = :b1

9 rows selected.

I’ve got 9 child cursors, all with the same execution plan, all except the last labelled as not shareable (you may find that you don’t get exactly the same result – on repeated tests I got between 5 and 9 cursors). Given enough time all these cursors except the last (shareable) one would be aged out of the library cache. In the case of the client, who had a shared pool that was probably quite a bit larger than needed, the number of non-shareable cursors could get very large and they were hanging around for ages. Typically most of the cursors would report an execution count of less than 30, with many showing just one or two executions and a handful showing execution counts in the region of several hundred (and that handful were the ones that were still marked as shareable).

After eliminating the critical histogram (using dbms_stats.delete_column_stats()) and eliminating the redundant child cursors (using dbms_shared_pool.purge()) the massive proliferation stopped happening and the performance threat disappeared. The only issue then was to change the table preferences for stats collection on this table to add the clause “for columns size 1 skew” so that the histogram would not be recreated on the next gather.
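
A rough sketch of the statistics side of that fix – the owner and the exact preference text are illustrative assumptions rather than the client’s actual code:

begin
        -- get rid of the column stats (and hence the histogram) on skew
        dbms_stats.delete_column_stats(
                ownname => user,
                tabname => 'T1',
                colname => 'SKEW'
        );

        -- stop the next gather from re-creating the histogram
        dbms_stats.set_table_prefs(
                ownname => user,
                tabname => 'T1',
                pname   => 'METHOD_OPT',
                pvalue  => 'for all columns size 1 for columns size 1 skew'
        );
end;
/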

Further Observations.

I suspect that part of the client’s problem – something that exaggerated the count rather than causing it – could be attributed to using DRCP (database resident connection pooling), which probably still has some quirky bits of behaviour. It was also true that the client’s connection pool was probably much bigger than it needed to be, so if there were any funny little bits of emergent behaviour at scale the client would probably have hit them.

The problem of escalating child cursors is one that Oracle has been working on for quite a long time, and there’s a (hidden) parameter that was introduced late in 11gR2 (though I think the 11g mechanism first appeared through a fix control) to allow Oracle to mark a parent cursor obsolete if it acquires too many child cursors. There’s a note on MoS that the client had read on this topic – Doc ID 2298504.1: “Cursor Mutex X Wait Events: After Upgrading To 12.2” – which looked as if it was describing their symptoms, so they had set this parameter (_cursor_obsolete_threshold) from 8192 (the 12.2 default) down to 1024 (the default for 12.1 and earlier versions). This had been of some help with the library cache problem. When the sql_id at the top of this article reported 30,516 child cursors that would be 29 “obsolete” parent cursors with 1,024 child cursors each and one “live” parent cursor with 820 child cursors.
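
For reference, this is roughly the sort of thing the client had done – it’s a hidden parameter, so treat the following as a sketch and check with Oracle Support before copying the idea; the x$ query has to be run as SYS:

select  i.ksppinm  name, v.ksppstvl  value
from    x$ksppi i, x$ksppcv v
where   v.indx = i.indx
and     i.ksppinm = '_cursor_obsolete_threshold'
;

alter system set "_cursor_obsolete_threshold" = 1024 scope = spfile;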

You can appreciate that if Oracle has to find a parent cursor and pin it while walking a chain of 30,516 child cursors that’s likely to take a lot more time than walking a chain of 30 parent cursors (while holding a library cache mutex, perhaps) to find the one non-obsolete parent, then acquiring the parent mutex to walk a chain of 820 child cursors.

I don’t know the exact details of the mechanisms involved in manipulating the various chains – but there are likely to be times when one process needs exclusive mutexes/latches to change a chain while other processes are holding shared mutexes/latches to search the chain. When you’ve got 30,000 child cursors in total the important question becomes: “where’s the trade-off between making the child chains shorter and making the parent search longer?” (I’d expect – or hope – that the Oracle developers had actually found a way to code the obsolescence so that the new parent was first in the search and the rest were never visited, of course.)

One of the suggestions I made to try to alleviate the problem – which I had assumed was due in part to the degree of concurrent execution of the statement – was to mark the cursor as “hot”. This resulted in 36 different sql_ids for the same statement (the client machine had 72 CPUs). This had some effect but ultimately meant that there were 36 chains of parents that would eventually end up with lots of child cursors – the only difference was the rate at which the total cursor count was growing (a lot slower), balanced against the threat that they might all manage to get to 30,000+ child cursors! Instead, as a quick and dirty workaround, I supplied the client with a script that could be run at quieter moments during the day to call dbms_shared_pool.purge() for the sql_id to flush all its cursors from the library cache.
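
The purge call itself is simple enough – a sketch of the sort of thing I mean (the sql_id is just the one from this test, and you need execute privilege on dbms_shared_pool):

begin
        for r in (
                select  address || ', ' || hash_value   purge_name
                from    v$sqlarea
                where   sql_id = 'b82my582cnvut'
        ) loop
                dbms_shared_pool.purge(r.purge_name, 'C');      -- 'C' => cursor
        end loop;
end;
/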

One final oddity – which isn’t from the client site. When I changed my test above to avoid generating the histogram (using the commented out method_opt “for all columns size 1”) I found that I still got two child cursors; the first ended up marked as bind-aware but insensitive and non-shareable, the second appeared (time and time again) when my bind value got to 499 and was marked bind-sensitive, bind-aware and shareable.  I still have no idea why this happened.
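
If you want to chase the reason for an extra child cursor in a case like this, v$sql_shared_cursor is the obvious place to start – a sketch (the reason column, where present, is a CLOB of XML-ish text that takes a little effort to read):

select
        child_number,
        bind_equiv_failure,
        reason
from
        v$sql_shared_cursor
where
        sql_id = 'b82my582cnvut'
order by
        child_number
;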

Footnote:

When repeating the test I started with a “flush shared_pool” – but I’ve removed this line from the code above in case anyone ran it without considering the possible side effects. You do need to clear previous copies of the key statement from the library cache, though, if you want to avoid confusing the issue on repeated runs of the test.


Case Study – 1

Mon, 2018-01-29 04:45

It has been some time since I wrote an article walking through the analysis of information on an AWR report, but a nice example appeared a few weeks ago on Twitter that broke a big AWR picture into a sequence of bite-sized chunks that made a little story, so here it is, replayed in sync with my ongoing thoughts. The problem started with the (highly paraphrased) question – “How could I get these headline figures when all the ‘SQL ordered by’ sections of the report show captured SQL accounting for 0.0% of Total?”. The report was a 30 minute snapshot from 11.2.0.3, and here’s the Load Profile:

As you can see, the database was doing quite a lot of work as measured by the physical reads and writes, the number of block changes and size of redo, and the fact that the average CPU usage by the foreground sessions in this instance accounts for 9.5 CPUs. Generally speaking the 49 seconds per second of DB time is also a fairly good hint, when combined with the other numbers, that the instance is busy – but, in the absence of any other information, that could be one session holding a blocking lock with 48 other sessions waiting for the lock to be released.

There are a couple of unusual (though not impossible) features to this profile. Given the amount of work the figures for user calls and executes look very small – again not technically impossible, just unlikely in “normal” database processing given the apparent workload – and bear in mind that the “SQL ordered by ..” figures seem to account for none of the workload. Similarly the figures for logical reads and block changes are a little unusual (for “normal” processing) – on average this instance seems to have modified every block it visited (without any work showing up in the captured SQL).

Next in the set of images was the Time Model:

As you can see, the foreground time (DB time) is 85,944 seconds, of which foreground CPU usage (DB CPU) accounts for 16,735 seconds – leaving about 69,000 seconds unaccounted for! There’s virtually no time spent on PL/SQL or SQL, and RMAN doesn’t even make an appearance (I mention RMAN specifically because there was at least one version of Oracle where the RMAN time was accidentally lost from this summary).

So what do the Top Timed Events look like:

It’s nice to see that this is consistent with the Load Profile: the DB CPU matches, and there’s a lot of physical reads (and a quick check says that 6,560,642/1,800 = 3,644 … which is pretty close to the 3,746 physical reads per second in the Load Profile). There’s one major anomaly here, though: the huge number of (and time spent on) waits for latch: row cache objects. And even though it’s not the single largest component of time it’s the most obvious one to pursue, so the next bit of the report to check is the Dictionary Cache Statistics, with the Tablespace IO Stats and Segments by Physical Reads to follow. I don’t have an image for the dictionary cache stats, but the answer to the question “What were all the rowcache object gets for?” was: “dc_tablespaces (214,796,434)” – which (probably) told me everything I needed to know.
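
If you were looking at the live system rather than an AWR report you could get the same information directly from v$rowcache – a sketch:

select
        parameter, gets, getmisses, modifications
from
        v$rowcache
order by
        gets desc
;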

I could show you the specific Instance Activity statistic that I wanted to see next, but I’ll postpone that for a moment and jump to the Tablespace IO Stats – which we were planning to do and might have done anyway if we hadn’t known the rocket science behind the massive number of gets on dc_tablespaces.

That’s a horrendous number of (single block) reads of the undo tablespace – and why would they be happening? The instance has been engaged in some massive rollback activity (and the transactions being rolled back are on objects in the GIRO tablespace – which is why it is also suffering a lot of single block reads), and this is the point where we jump to the relevant Instance Activity statistic to confirm the claim:

There are two other “undo records applied” statistics, but we don’t need to think about them – the match between the count of records applied and the number of gets on dc_tablespaces is nearly perfect. Almost everything that this instance is doing is rolling back – there must have been some enormous data changes (possibly only one, possibly parallel-enabled) that failed in the previous half hour and now the entire mess is being cleaned up.
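
Again, if you were on the live system rather than reading the report, a sketch of the equivalent check:

select
        name, value
from
        v$sysstat
where
        name like '%undo records applied%'
;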

One little detail to note – the “undo records applied” per second is 122,355, but the Load Profile reported 247,885 “Block changes” per second. The two figures are consistent with each other: each application of an undo record is two block changes – the first when you modify the original data block, the second when you update the undo record itself to flag it as “applied”. 122,355 * 2 = 244,710, which is a good match for 247,885.

Final Thoughts

There is a second reason why you could see lots of gets on dc_tablespaces – but I think it was a bug in 9i relating to temporary tablespace groups. The phenomenon as a side effect of rolling back was something I discovered in the 8i timeline and I’ve not actually checked what an AWR report really would look like if I forced a massive rollback to take place as the only workload across the interval – so here’s a quick test I constructed and ran to finish the thread off:


set timing on

create table t1 as
with g as (select rownum id from dual connect by level <= 1000)
select rownum id, rownum n1, rpad ('x',150) padding from g,g
;

create index t1_i1 on t1(id);

begin
        for i in 1..1000000 loop
                update t1 set n1 = n1 where id = i;
        end loop;
end;
/

alter system flush buffer_cache;

pause take AWR snapshot from another session

rollback;

prompt take AWR snapshot from another session

On the small VM I used to run the test it took a little over 60 seconds to run the update, and the same again to complete the rollback. The “DB time” shown in the Time Model section of the AWR report was 62.0 seconds, while the “sql execute elapsed time” was 3.9 seconds (which was the SQL executed while capturing the AWR data).

Conclusion

This was a problem where the database seemed to have done a lot of work that couldn’t be attributed to any SQL. While I had a little rocket science up my sleeve that may have allowed me to identify the source more quickly and with more confidence than the average DBA, all I’ve done in this note is highlight a couple of oddities and big numbers that anyone could have spotted, and followed a couple of simple guesses:

a) DB time was large, but sql (and pl/sql) time was non-existent

b) Top timed events were I/O and latches – so identify the latch

c) The specific latch pointed towards tablespaces – so check the Tablespace I/O and note the UNDO tablespace

d) Look at any Instance Activity statistics about “undo”.

e) Job done – but a quick test case is the icing on the cake.

gc buffer busy

Thu, 2018-01-25 08:12

I had to write this post because I can never remember which way round Oracle named the two versions of gc buffer busy waits when it split them. There are two scenarios to cover when my session wants my instance to acquire a global cache lock on a block and some other session is already trying to acquire that lock (or is holding it in an incompatible fashion):

  • The other session is in my instance
  • The other session is in a remote instance

One of these cases is reported as “gc buffer busy acquire”, the other as a “gc buffer busy release” – and I always have to check which is which. I think I usually get it right first time when I see it, but I always manage to convince myself that I might have got it wrong and end up searching the internet for Riyaj Shamsudeen’s blog posting about it.

The “release” is waiting for another instance to surrender the lock to my instance; the “acquire” is waiting for another session in my instance to finish acquiring the lock from the other instance.

I decided to jot down this note so I didn’t have to keep searching for Riyaj’s post, and also because a little problem on OTN at the moment showed a couple of AWR reports with an unlikely combination of acquire waits (180,000,000) and release waits (2,000).

If you’re wondering why this looks odd: if I’m waiting for an acquire, someone else in my instance must be waiting for a release. Obviously many sessions could be waiting for one release, and if acquirers time out very rapidly (though they’re not reported as doing so) then the ratio could get very high – but 90,000 acquires per release doesn’t look right.
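
If you want to check the ratio on a live system (rather than from an AWR report) a sketch of the sort of query to run is:

select
        event, total_waits, time_waited_micro
from
        v$system_event
where
        event like 'gc buffer busy%'
order by
        event
;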
