Jonathan Lewis

Just another Oracle weblog

Massive Delete

Fri, 2018-06-08 03:14

The question of how to delete 25 million rows from a table of one billion came up on the ODC database forum recently. With changes in the numbers of rows involved it’s a question that keeps coming back, and I wrote a short series for AllthingsOracle a couple of years ago that discusses the issue. This note is just a catalogue of links to the articles:

There is an error in part 2 in the closing paragraphs – it says that the number of index entries deleted varies “from just one to 266”; in fact it varies from 181 to 266.

 

Index Bouncy Scan 4

Fri, 2018-06-01 03:19

There’s always another hurdle to overcome. After I’d finished writing up the “index bouncy scan” as an efficient probing mechanism to find the combinations of the first two columns (both declared not null) of a very large index, a follow-up question appeared almost immediately: “what if it’s a partitioned index?”

The problem with “typical” partitioned indexes is that the smallest value of the leading column might appear in any of the partitions, and the combination of that value and the smallest value for the second column might not appear in all the partitions where the smallest value appears. Consider a table of 10 partitions and a locally partitioned index on (val1, val2) where neither column is the partition key. The smallest value of val1 – call it k1 – may appear only in partitions 4, 7, 8, 9 and 10; the lowest combination of (val1, val2) – call it (k1, k2) – may appear only in partitions 8 and 10. In a global (or globally partitioned) index the pair (k1, k2) would be at the low (leftmost) end of the index, but to find the pair in a locally partitioned index we have to probe the leftmost end of 10 separate index partitions – and once we’ve done that, each “bounce” requires us to probe 10 index partitions for the first (val1, val2) pair where val1 = k1 and val2 is just greater than k2, or val1 is just greater than k1 and val2 is the minimum for that value of val1. The more partitions we have, the greater the number of index partitions we have to probe at each step and the more likely it is that we ought to switch to a brute force index fast full scan with aggregate.
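As a sketch of what a single “bounce” costs (the table name t and the bind variables are mine, standing in for the hypothetical 10-partition example above): starting from a known pair (:k1, :k2) we need the least of two candidates, and with a local index each probe has to visit the relevant end of all 10 index partitions before we can take the least of the results:

-- Hypothetical probes for one "bounce" from the known pair (:k1, :k2)
select min(val2) from t where val1 = :k1 and val2 > :k2;        -- next val2 for the current val1
select min(val1) from t where val1 > :k1;                       -- next val1 (its minimum val2 needs a further probe)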

Here’s the starting point for solving the problem (maybe) – I’ll create a simple partitioned table, and use the “bouncy scan” code from the earlier posting with the table and column names adjusted accordingly:


rem
rem     Script:         bouncy_index_3.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Apr 2018
rem     Purpose:
rem
rem     Last tested
rem             12.2.0.1
rem

create table pt1 (
        object_id,
        owner,
        object_type,
        object_name,
        status,
        namespace
)
nologging
partition by hash (object_id) partitions 4
as
select
        object_id,
        owner,
        object_type,
        object_name,
        status,
        namespace
from
        (select * from all_objects),
        (select rownum n1 from dual connect by level <= 10)
;

alter table pt1 modify(status not null);

execute dbms_stats.gather_table_stats(null,'pt1',granularity=>'ALL',method_opt=>'for all columns size 1')

create index pt1_i1 on pt1(status, namespace) nologging local;

prompt  ==================================================
prompt  Make some rows in the last partition have a status
prompt  that won't be found in the first partition.
prompt  ==================================================

column namespace format 99999999
column partition_name new_value m_part

select  partition_name
from    user_tab_partitions
where   table_name = 'PT1'
order by
        partition_position
;

update pt1 partition (&m_part) set status = 'MISSING' where rownum <= 10;

select
        dbms_mview.pmarker(rowid), status, namespace
from    pt1
where   status = 'MISSING'
;

I’ve created a hash partitioned copy of the view all_objects, duplicating its data 10 times, and created a local index on the columns (status, namespace). My data has two values for status – ‘VALID’ and ‘INVALID’ – and about 10 distinct values for namespace. I’ve then updated a few rows in the last partition, giving them a status (‘MISSING’) that sorts between the two current values – this is just one little test case to help me check that my code is going to catch all values even if they don’t appear in the first table partition.
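For reference, a brute force cross-check – effectively the “index fast full scan with aggregate” approach mentioned earlier – produces the answer that the bouncy scan has to reproduce:

select  status, namespace
from    pt1
group by
        status, namespace
order by
        status, namespace
;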

Here’s the query from the earlier posting – and it does get the right results – followed by the execution plan:


alter session set statistics_level = all;

set serveroutput off
set linesize 180
set pagesize 60

prompt  =============================================================
prompt  Original Query, showing expensive access for driving minimums
prompt  =============================================================

with bounce1(status, namespace) as (
        select status, namespace
        from    (
                select
                        /*+ index(pt1) no_index_ffs(pt1) */
                        status, namespace,
                        row_number() over(order by status, namespace) rn
                from    pt1
        )
        where
                rn = 1
        union all
        select
                v1.status, v1.namespace
        from    bounce1,
                lateral (
                              select  /*+ index(pt1) no_index_ffs(pt1) no_decorrelate */
                                      pt1.status, pt1.namespace
                              from    pt1
                              where   pt1.status > bounce1.status
                              and     rownum = 1
                ) v1
        where   bounce1.status is not null
        and     bounce1.namespace is not null
),
bounce2 (status, namespace)
as (
        select  status, namespace
        from    bounce1
        where   bounce1.status is not null
        union all
        select  bounce2.status, (select min(pt1.namespace) namespace from pt1 where pt1.status = bounce2.status and pt1.namespace > bounce2.namespace) namespace
        from    bounce2
        where   bounce2.namespace is not null
        and     bounce2.status is not null
)
select * from bounce2
where
        bounce2.namespace is not null
and     bounce2.status is not null      -- > redundant predicate
order by
        status, namespace
;

select * from table(dbms_xplan.display_cursor(null,null,'cost allstats last outline'));


----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                    | Name            | Starts | E-Rows | Cost (%CPU)| Pstart| Pstop | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                             |                 |      1 |        | 16378 (100)|       |       |     10 |00:00:00.58 |    1869 |       |       |          |
|   1 |  SORT ORDER BY                               |                 |      1 |      4 | 16378   (4)|       |       |     10 |00:00:00.58 |    1869 |  2048 |  2048 | 2048  (0)|
|*  2 |   VIEW                                       |                 |      1 |      4 | 16377   (4)|       |       |     10 |00:00:00.58 |    1869 |       |       |          |
|   3 |    UNION ALL (RECURSIVE WITH) BREADTH FIRST  |                 |      1 |        |            |       |       |     12 |00:00:00.58 |    1869 |  1024 |  1024 |          |
|*  4 |     VIEW                                     |                 |      1 |      2 |  8157   (4)|       |       |      2 |00:00:00.58 |    1747 |       |       |          |
|   5 |      UNION ALL (RECURSIVE WITH) BREADTH FIRST|                 |      1 |        |            |       |       |      2 |00:00:00.58 |    1747 |  1024 |  1024 | 2048  (0)|
|*  6 |       VIEW                                   |                 |      1 |      1 |  4047   (4)|       |       |      1 |00:00:00.58 |    1732 |       |       |          |
|*  7 |        WINDOW SORT PUSHED RANK               |                 |      1 |    617K|  4047   (4)|       |       |      1 |00:00:00.58 |    1732 |  2048 |  2048 | 2048  (0)|
|   8 |         PARTITION HASH ALL                   |                 |      1 |    617K|  1759   (2)|     1 |     4 |    617K|00:00:00.34 |    1732 |       |       |          |
|   9 |          INDEX FULL SCAN                     | PT1_I1          |      4 |    617K|  1759   (2)|     1 |     4 |    617K|00:00:00.15 |    1732 |       |       |          |
|  10 |       NESTED LOOPS                           |                 |      2 |      1 |  4110   (4)|       |       |      1 |00:00:00.01 |      15 |       |       |          |
|  11 |        RECURSIVE WITH PUMP                   |                 |      2 |        |            |       |       |      2 |00:00:00.01 |       0 |       |       |          |
|  12 |        VIEW                                  | VW_LAT_1BBF5C63 |      2 |      1 |     9   (0)|       |       |      1 |00:00:00.01 |      15 |       |       |          |
|* 13 |         COUNT STOPKEY                        |                 |      2 |        |            |       |       |      1 |00:00:00.01 |      15 |       |       |          |
|  14 |          PARTITION HASH ALL                  |                 |      2 |      1 |     9   (0)|     1 |     4 |      1 |00:00:00.01 |      15 |       |       |          |
|* 15 |           INDEX RANGE SCAN                   | PT1_I1          |      5 |      1 |     9   (0)|     1 |     4 |      1 |00:00:00.01 |      15 |       |       |          |
|  16 |     SORT AGGREGATE                           |                 |     10 |      1 |            |       |       |     10 |00:00:00.01 |     122 |       |       |          |
|  17 |      PARTITION HASH ALL                      |                 |     10 |      1 |     9   (0)|     1 |     4 |     27 |00:00:00.01 |     122 |       |       |          |
|  18 |       FIRST ROW                              |                 |     40 |      1 |     9   (0)|       |       |     27 |00:00:00.01 |     122 |       |       |          |
|* 19 |        INDEX RANGE SCAN (MIN/MAX)            | PT1_I1          |     40 |      1 |     9   (0)|     1 |     4 |     27 |00:00:00.01 |     122 |       |       |          |
|  20 |     RECURSIVE WITH PUMP                      |                 |     10 |        |            |       |       |     10 |00:00:00.01 |       0 |       |       |          |
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter(("BOUNCE2"."NAMESPACE" IS NOT NULL AND "BOUNCE2"."STATUS" IS NOT NULL))
   4 - filter("BOUNCE1"."STATUS" IS NOT NULL)
   6 - filter("RN"=1)
   7 - filter(ROW_NUMBER() OVER ( ORDER BY "STATUS","NAMESPACE")<=1)
  13 - filter(ROWNUM=1)
  15 - access("PT1"."STATUS">"BOUNCE1"."STATUS")
  19 - access("PT1"."STATUS"=:B1 AND "PT1"."NAMESPACE">:B2)

In terms of time the query doesn’t seem to have done too badly – but I’m only using a small data set, and we can see from the numbers that we haven’t produced an efficient plan. Operations 8 and 9 tell us that we’ve done an index full scan on every single partition before passing the data up for a window sort operation. That’s clearly a bad thing, but we did have an index() hint on that bit of the code – it worked very well for the simple (global) index – so maybe we should have taken it out before testing. (Except it doesn’t help much to do so: Oracle still scans all 617K rows, simply switching to an index fast full scan.)

Apart from that massive load the rest of the query looks quite efficient. We keep seeing “partition hash all”, of course – whatever we do we tend to do it to 4 separate partitions one after the other – but everything else is cheap. There is another problem, though – and this is where the importance of inserting the rows with status = ‘MISSING’ shows up: this query didn’t find them! We have a predicate “rownum = 1” in the second half of the bounce1 recursive subquery and, because we’re using a partitioned index, we’ve managed to find a row that looks appropriate in an early partition when the row we really needed doesn’t appear until the last partition.
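To see the failure mode in isolation, here’s a sketch of the kind of probe that goes wrong (the literal ‘INVALID’ stands in for the value pumped through the recursive CTE): the “count stopkey” can accept a row from the first partition that supplies one, so this may return a ‘VALID’ row from an early partition and never look at the ‘MISSING’ rows that exist only in the last partition:

select  status, namespace
from    pt1
where   status > 'INVALID'
and     rownum = 1;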

Let’s return to this problem later – first we want to check if the rest of the query will run efficiently and give us the right answer if we can find some way of getting the starting values; so let’s use a strategy we’ve used before – replace the bounce1 subquery with a union all select from dual:


with bounce1(status, namespace) as (
        select status, namespace
        from    (
                select 'INVALID' status, 1 namespace from dual
                union all
                select 'MISSING', 4 from dual
                union all
                select 'VALID', 1 from dual
        )
),
bounce2 (status, namespace)
as (
        select  status, namespace
        from    bounce1
        where   bounce1.status is not null
        union all
        select  bounce2.status, (select min(pt1.namespace) namespace from pt1 where pt1.status = bounce2.status and pt1.namespace > bounce2.namespace) namespace
        from    bounce2
        where   bounce2.namespace is not null
        and     bounce2.status is not null
)
select * from bounce2
where
        bounce2.namespace is not null
and     bounce2.status is not null      -- > redundant predicate
order by
        status, namespace
;

select * from table(dbms_xplan.display_cursor(null,null,'cost allstats last partition outline'));

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                  | Name   | Starts | E-Rows | Cost (%CPU)| Pstart| Pstop | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                           |        |      1 |        |    76 (100)|       |       |     11 |00:00:00.01 |     132 |       |       |          |
|   1 |  SORT ORDER BY                             |        |      1 |      6 |    76   (2)|       |       |     11 |00:00:00.01 |     132 |  2048 |  2048 | 2048  (0)|
|*  2 |   VIEW                                     |        |      1 |      6 |    75   (0)|       |       |     11 |00:00:00.01 |     132 |       |       |          |
|   3 |    UNION ALL (RECURSIVE WITH) BREADTH FIRST|        |      1 |        |            |       |       |     14 |00:00:00.01 |     132 |  1024 |  1024 |          |
|   4 |     VIEW                                   |        |      1 |      3 |     6   (0)|       |       |      3 |00:00:00.01 |       0 |       |       |          |
|   5 |      UNION-ALL                             |        |      1 |        |            |       |       |      3 |00:00:00.01 |       0 |       |       |          |
|   6 |       FAST DUAL                            |        |      1 |      1 |     2   (0)|       |       |      1 |00:00:00.01 |       0 |       |       |          |
|   7 |       FAST DUAL                            |        |      1 |      1 |     2   (0)|       |       |      1 |00:00:00.01 |       0 |       |       |          |
|   8 |       FAST DUAL                            |        |      1 |      1 |     2   (0)|       |       |      1 |00:00:00.01 |       0 |       |       |          |
|   9 |     SORT AGGREGATE                         |        |     11 |      1 |            |       |       |     11 |00:00:00.01 |     132 |       |       |          |
|  10 |      PARTITION HASH ALL                    |        |     11 |      1 |     9   (0)|     1 |     4 |     27 |00:00:00.01 |     132 |       |       |          |
|  11 |       FIRST ROW                            |        |     44 |      1 |     9   (0)|       |       |     27 |00:00:00.01 |     132 |       |       |          |
|* 12 |        INDEX RANGE SCAN (MIN/MAX)          | PT1_I1 |     44 |      1 |     9   (0)|     1 |     4 |     27 |00:00:00.01 |     132 |       |       |          |
|  13 |     RECURSIVE WITH PUMP                    |        |     10 |        |            |       |       |     11 |00:00:00.01 |       0 |       |       |          |
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter(("BOUNCE2"."NAMESPACE" IS NOT NULL AND "BOUNCE2"."STATUS" IS NOT NULL))
  12 - access("PT1"."STATUS"=:B1 AND "PT1"."NAMESPACE">:B2)

This gets us the right answer, very efficiently. There are only 11 rows in the result set and we average 12 buffer visits per row – which is reasonable given that we (probably) have to probe 4 index partitions for every row at about 3 buffer visits per probe. That’s 11 rows * 4 partitions * 3 buffers = 132 buffer visits in total – which seems just about optimal.

The next step is to figure out a way of getting the (three in our case) starting points while using a partitioned index. Here’s a query we can use for bounce1:


with bounce1(status, namespace) as (
        select
                (select min(status) from pt1) status,
                (select /*+ index(pt1) */ min(namespace) from pt1 where status = (select min(status) from pt1)) namespace
        from
                dual
        union all
        select
                v1.status, v2.namespace
        from    bounce1,
                lateral(
                        (select /*+ index(pt1) */ min(pt1.status) status from pt1 where pt1.status > bounce1.status)
                )       v1,
                lateral(
                        select /*+ index(pt1) */ min(pt1.namespace) namespace
                        from pt1
                        where pt1.status =  (select min(pt2.status) from pt1 pt2 where pt2.status > bounce1.status)
                )       v2
        where
                bounce1.status is not null
        and     bounce1.namespace is not null
)
select * from bounce1
;

select * from table(dbms_xplan.display_cursor(null,null,'allstats last cost partition outline'));

It looks a little convoluted with all the inline select statements, but they all do very small amounts of work and they’re only reading the index leaf blocks that you have to read. We know from yesterday’s post that Oracle can execute the scalar subqueries at lines 3 and 4 very efficiently; we can hope (and check) that the lateral() subqueries driven by the single values from the recursive row in bounce1 will operate just as efficiently – and here’s the plan:


----------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                 | Name            | Starts | E-Rows | Cost (%CPU)| Pstart| Pstop | A-Rows |   A-Time   | Buffers |
----------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                          |                 |      1 |        |   166 (100)|       |       |      4 |00:00:00.01 |     132 |
|   1 |  VIEW                                     |                 |      1 |      2 |   166   (0)|       |       |      4 |00:00:00.01 |     132 |
|   2 |   UNION ALL (RECURSIVE WITH) BREADTH FIRST|                 |      1 |        |            |       |       |      4 |00:00:00.01 |     132 |
|   3 |    SORT AGGREGATE                         |                 |      1 |      1 |            |       |       |      1 |00:00:00.01 |      12 |
|   4 |     PARTITION HASH ALL                    |                 |      1 |      1 |     9   (0)|     1 |     4 |      4 |00:00:00.01 |      12 |
|   5 |      INDEX FULL SCAN (MIN/MAX)            | PT1_I1          |      4 |      1 |     9   (0)|     1 |     4 |      4 |00:00:00.01 |      12 |
|   6 |    SORT AGGREGATE                         |                 |      1 |      1 |            |       |       |      1 |00:00:00.01 |      24 |
|   7 |     PARTITION HASH ALL                    |                 |      1 |      1 |     9   (0)|     1 |     4 |      4 |00:00:00.01 |      24 |
|   8 |      FIRST ROW                            |                 |      4 |      1 |     9   (0)|       |       |      4 |00:00:00.01 |      24 |
|*  9 |       INDEX RANGE SCAN (MIN/MAX)          | PT1_I1          |      4 |      1 |     9   (0)|     1 |     4 |      4 |00:00:00.01 |      24 |
|  10 |        SORT AGGREGATE                     |                 |      1 |      1 |            |       |       |      1 |00:00:00.01 |      12 |
|  11 |         PARTITION HASH ALL                |                 |      1 |      1 |     9   (0)|     1 |     4 |      4 |00:00:00.01 |      12 |
|  12 |          INDEX FULL SCAN (MIN/MAX)        | PT1_I1          |      4 |      1 |     9   (0)|     1 |     4 |      4 |00:00:00.01 |      12 |
|  13 |    FAST DUAL                              |                 |      1 |      1 |     2   (0)|       |       |      1 |00:00:00.01 |       0 |
|  14 |    NESTED LOOPS                           |                 |      4 |      1 |   146   (0)|       |       |      3 |00:00:00.01 |      96 |
|  15 |     NESTED LOOPS                          |                 |      4 |      1 |   137   (0)|       |       |      3 |00:00:00.01 |      36 |
|  16 |      RECURSIVE WITH PUMP                  |                 |      4 |        |            |       |       |      3 |00:00:00.01 |       0 |
|  17 |      VIEW                                 | VW_LAT_C2D92EFA |      3 |      1 |     9   (0)|       |       |      3 |00:00:00.01 |      36 |
|  18 |       SORT AGGREGATE                      |                 |      3 |      1 |            |       |       |      3 |00:00:00.01 |      36 |
|  19 |        PARTITION HASH ALL                 |                 |      3 |      1 |     9   (0)|     1 |     4 |      8 |00:00:00.01 |      36 |
|  20 |         FIRST ROW                         |                 |     12 |      1 |     9   (0)|       |       |      8 |00:00:00.01 |      36 |
|* 21 |          INDEX RANGE SCAN (MIN/MAX)       | PT1_I1          |     12 |      1 |     9   (0)|     1 |     4 |      8 |00:00:00.01 |      36 |
|  22 |     VIEW                                  | VW_LAT_C2D92EFA |      3 |      1 |     9   (0)|       |       |      3 |00:00:00.01 |      60 |
|  23 |      SORT AGGREGATE                       |                 |      3 |      1 |            |       |       |      3 |00:00:00.01 |      60 |
|  24 |       PARTITION HASH ALL                  |                 |      3 |      1 |     9   (0)|     1 |     4 |      5 |00:00:00.01 |      60 |
|  25 |        FIRST ROW                          |                 |     12 |      1 |     9   (0)|       |       |      5 |00:00:00.01 |      60 |
|* 26 |         INDEX RANGE SCAN (MIN/MAX)        | PT1_I1          |     12 |      1 |     9   (0)|     1 |     4 |      5 |00:00:00.01 |      60 |
|  27 |          SORT AGGREGATE                   |                 |      3 |      1 |            |       |       |      3 |00:00:00.01 |      36 |
|  28 |           PARTITION HASH ALL              |                 |      3 |      1 |     9   (0)|     1 |     4 |      8 |00:00:00.01 |      36 |
|  29 |            FIRST ROW                      |                 |     12 |      1 |     9   (0)|       |       |      8 |00:00:00.01 |      36 |
|* 30 |             INDEX RANGE SCAN (MIN/MAX)    | PT1_I1          |     12 |      1 |     9   (0)|     1 |     4 |      8 |00:00:00.01 |      36 |
----------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   9 - access("STATUS"=)
  21 - access("PT1"."STATUS">"BOUNCE1"."STATUS")
  26 - access("PT1"."STATUS"=)
  30 - access("PT2"."STATUS">:B1)

Although we have done lots of individual probes into the index they have all been very efficient, using a min/max access and an average of about 3 buffer visits per probe. So we can now insert this new bounce1 subquery into the previous query in place of the “union all from dual”, and check that the two pieces of the query cooperate.


with bounce1(status, namespace) as (
        select
                (select min(status) from pt1) status,
                (select /*+ index(pt1) */ min(namespace) from pt1 where status = (select min(status) from pt1)) namespace
        from
                dual
        union all
        select
                v1.status, v2.namespace
        from    bounce1,
                lateral(
                        (select /*+ index(pt1) */ min(pt1.status) status from pt1 where pt1.status > bounce1.status)
                )       v1,
                lateral(
                        select /*+ index(pt1) */ min(pt1.namespace) namespace
                        from pt1
                        where pt1.status =  (select min(pt2.status) from pt1 pt2 where pt2.status > bounce1.status)
                )       v2
        where
                bounce1.status is not null
        and     bounce1.namespace is not null
),
bounce2 (status, namespace)
as (
        select  status, namespace from bounce1
        union all
        select  bounce2.status, (select min(t.namespace) namespace from pt1 t where t.namespace > bounce2.namespace and status=bounce2.status) namespace
        from    bounce2
        where   bounce2.status is not null
        and     bounce2.namespace is not null
)
select  *
from    bounce2
where   namespace is not null
;

select * from table(dbms_xplan.display_cursor(null,null,'allstats last cost partition outline'));

------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                   | Name            | Starts | E-Rows | Cost (%CPU)| Pstart| Pstop | A-Rows |   A-Time   | Buffers |
------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                            |                 |      1 |        |   396 (100)|       |       |     11 |00:00:00.01 |     266 |
|*  1 |  VIEW                                       |                 |      1 |      4 |   396   (1)|       |       |     11 |00:00:00.01 |     266 |
|   2 |   UNION ALL (RECURSIVE WITH) BREADTH FIRST  |                 |      1 |        |            |       |       |     15 |00:00:00.01 |     266 |
|   3 |    VIEW                                     |                 |      1 |      2 |   166   (0)|       |       |      4 |00:00:00.01 |     132 |
|   4 |     UNION ALL (RECURSIVE WITH) BREADTH FIRST|                 |      1 |        |            |       |       |      4 |00:00:00.01 |     132 |
|   5 |      SORT AGGREGATE                         |                 |      1 |      1 |            |       |       |      1 |00:00:00.01 |      12 |
|   6 |       PARTITION HASH ALL                    |                 |      1 |      1 |     9   (0)|     1 |     4 |      4 |00:00:00.01 |      12 |
|   7 |        INDEX FULL SCAN (MIN/MAX)            | PT1_I1          |      4 |      1 |     9   (0)|     1 |     4 |      4 |00:00:00.01 |      12 |
|   8 |      SORT AGGREGATE                         |                 |      1 |      1 |            |       |       |      1 |00:00:00.01 |      24 |
|   9 |       PARTITION HASH ALL                    |                 |      1 |      1 |     9   (0)|     1 |     4 |      4 |00:00:00.01 |      24 |
|  10 |        FIRST ROW                            |                 |      4 |      1 |     9   (0)|       |       |      4 |00:00:00.01 |      24 |
|* 11 |         INDEX RANGE SCAN (MIN/MAX)          | PT1_I1          |      4 |      1 |     9   (0)|     1 |     4 |      4 |00:00:00.01 |      24 |
|  12 |          SORT AGGREGATE                     |                 |      1 |      1 |            |       |       |      1 |00:00:00.01 |      12 |
|  13 |           PARTITION HASH ALL                |                 |      1 |      1 |     9   (0)|     1 |     4 |      4 |00:00:00.01 |      12 |
|  14 |            INDEX FULL SCAN (MIN/MAX)        | PT1_I1          |      4 |      1 |     9   (0)|     1 |     4 |      4 |00:00:00.01 |      12 |
|  15 |      FAST DUAL                              |                 |      1 |      1 |     2   (0)|       |       |      1 |00:00:00.01 |       0 |
|  16 |      NESTED LOOPS                           |                 |      4 |      1 |   146   (0)|       |       |      3 |00:00:00.01 |      96 |
|  17 |       NESTED LOOPS                          |                 |      4 |      1 |   137   (0)|       |       |      3 |00:00:00.01 |      36 |
|  18 |        RECURSIVE WITH PUMP                  |                 |      4 |        |            |       |       |      3 |00:00:00.01 |       0 |
|  19 |        VIEW                                 | VW_LAT_C2D92EFA |      3 |      1 |     9   (0)|       |       |      3 |00:00:00.01 |      36 |
|  20 |         SORT AGGREGATE                      |                 |      3 |      1 |            |       |       |      3 |00:00:00.01 |      36 |
|  21 |          PARTITION HASH ALL                 |                 |      3 |      1 |     9   (0)|     1 |     4 |      8 |00:00:00.01 |      36 |
|  22 |           FIRST ROW                         |                 |     12 |      1 |     9   (0)|       |       |      8 |00:00:00.01 |      36 |
|* 23 |            INDEX RANGE SCAN (MIN/MAX)       | PT1_I1          |     12 |      1 |     9   (0)|     1 |     4 |      8 |00:00:00.01 |      36 |
|  24 |       VIEW                                  | VW_LAT_C2D92EFA |      3 |      1 |     9   (0)|       |       |      3 |00:00:00.01 |      60 |
|  25 |        SORT AGGREGATE                       |                 |      3 |      1 |            |       |       |      3 |00:00:00.01 |      60 |
|  26 |         PARTITION HASH ALL                  |                 |      3 |      1 |     9   (0)|     1 |     4 |      5 |00:00:00.01 |      60 |
|  27 |          FIRST ROW                          |                 |     12 |      1 |     9   (0)|       |       |      5 |00:00:00.01 |      60 |
|* 28 |           INDEX RANGE SCAN (MIN/MAX)        | PT1_I1          |     12 |      1 |     9   (0)|     1 |     4 |      5 |00:00:00.01 |      60 |
|  29 |            SORT AGGREGATE                   |                 |      3 |      1 |            |       |       |      3 |00:00:00.01 |      36 |
|  30 |             PARTITION HASH ALL              |                 |      3 |      1 |     9   (0)|     1 |     4 |      8 |00:00:00.01 |      36 |
|  31 |              FIRST ROW                      |                 |     12 |      1 |     9   (0)|       |       |      8 |00:00:00.01 |      36 |
|* 32 |               INDEX RANGE SCAN (MIN/MAX)    | PT1_I1          |     12 |      1 |     9   (0)|     1 |     4 |      8 |00:00:00.01 |      36 |
|  33 |    SORT AGGREGATE                           |                 |     11 |      1 |            |       |       |     11 |00:00:00.01 |     134 |
|  34 |     PARTITION HASH ALL                      |                 |     11 |      1 |     9   (0)|     1 |     4 |     27 |00:00:00.01 |     134 |
|  35 |      FIRST ROW                              |                 |     44 |      1 |     9   (0)|       |       |     27 |00:00:00.01 |     134 |
|* 36 |       INDEX RANGE SCAN (MIN/MAX)            | PT1_I1          |     44 |      1 |     9   (0)|     1 |     4 |     27 |00:00:00.01 |     134 |
|  37 |    RECURSIVE WITH PUMP                      |                 |     10 |        |            |       |       |     11 |00:00:00.01 |       0 |
------------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("NAMESPACE" IS NOT NULL)
  11 - access("STATUS"=)
  23 - access("PT1"."STATUS">"BOUNCE1"."STATUS")
  28 - access("PT1"."STATUS"=)
  32 - access("PT2"."STATUS">:B1)
  36 - access("STATUS"=:B1 AND "T"."NAMESPACE">:B2)

Job done. We’ve found the distinct set of pairs without having to scan the entire index. We’ve found 11 pairs at a total cost of 266 buffer gets. For comparative purposes the query totalled 56 buffer visits when I recreated the table as a non-partitioned table (again updating a few rows to status = ‘MISSING’).
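(For reference, recreating the table as non-partitioned would simply mean dropping the “partition by hash” clause and the “local” keyword from the original script – along these lines:)

create table pt1 (
        object_id, owner, object_type, object_name, status, namespace
)
nologging
as
select
        object_id, owner, object_type, object_name, status, namespace
from
        (select * from all_objects),
        (select rownum n1 from dual connect by level <= 10)
;

create index pt1_i1 on pt1(status, namespace) nologging;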

It’s important to note that this query can only work this efficiently in 12.2 (and possibly in a suitably patched 11.2.0.4) because of the optimizer’s ability to use the min/max operation for queries like “select max(col2) from t where col1 = (select max(col1) from t)”. When I ran the final query on 12.1.0.2 the execution plan changed around lines 11 and 28: where 12.2.0.1 could use the aggregate subquery to drive the min/max scan, 12.1.0.2 did a real range scan with aggregate (which was extremely expensive at one point).
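In concrete terms that’s exactly the pattern of the third query in the “Min/Max upgrade” post below:

select  min(namespace) from pt1 where status = (select min(status) from pt1);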

------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                   | Name            | Starts | E-Rows | Cost (%CPU)| Pstart| Pstop | A-Rows |   A-Time   | Buffers |
------------------------------------------------------------------------------------------------------------------------------------------------------
|  23 |       VIEW                                  | VW_LAT_C2D92EFA |      3 |      1 |  1206   (2)|       |       |      3 |00:00:05.43 |    2414 |
|  24 |        SORT AGGREGATE                       |                 |      3 |      1 |            |       |       |      3 |00:00:05.43 |    2414 |
|  25 |         PARTITION HASH ALL                  |                 |      3 |    422K|  1206   (2)|     1 |     4 |    845K|00:00:05.84 |    2414 |
|* 26 |          INDEX RANGE SCAN                   | PT1_I1          |     12 |    422K|  1206   (2)|     1 |     4 |    845K|00:00:02.03 |    2414 |
|  27 |           SORT AGGREGATE                    |                 |      3 |      1 |            |       |       |      3 |00:00:00.01 |      36 |
|  28 |            PARTITION HASH ALL               |                 |      3 |      1 |     9   (0)|     1 |     4 |      8 |00:00:00.01 |      36 |
|  29 |             FIRST ROW                       |                 |     12 |      1 |     9   (0)|       |       |      8 |00:00:00.01 |      36 |
|* 30 |              INDEX RANGE SCAN (MIN/MAX)     | PT1_I1          |     12 |      1 |     9   (0)|     1 |     4 |      8 |00:00:00.01 |      36 |
------------------------------------------------------------------------------------------------------------------------------------------------------

As you can see, this makes a dramatic difference to the work Oracle has to do – in this case 2,414 buffer gets and 845K rows examined. As I said in yesterday’s post – there’s a patch for 11.2.0.4, so there could be a patch for 12.1.0.2 if you ask for it, but it looks like no-one has done so yet.

Footnote:

I could have used a lateral() view in the first half of bounce1 to reduce the reported number of probes of pt1_i1 in the plan – but it made the code extremely messy, I had to include a /*+ no_decorrelate */ hint in it, and it increased the number of buffer visits slightly because the optimizer seemed to lose the option for a min/max scan in this particular lateral join.

 

Min/Max upgrade

Thu, 2018-05-31 08:13

Here’s a nice little optimizer enhancement that appeared in 12.2 to make min/max range scans (and full scans) available in more circumstances. Rather than talk through it, here’s a little demonstration:

rem
rem     Script:         122_minmax.sql
rem     Author:         Jonathan Lewis
rem     Dated:          May 2018
rem     Purpose:
rem
rem     Last tested
rem             12.2.0.1        Good path
rem             12.1.0.2        Bad path

create table pt1 (
        object_id,
        owner,
        object_type,
        object_name,
        status,
        namespace
)
nologging
partition by hash (object_id) partitions 4
as
select
        object_id,
        owner,
        object_type,
        object_name,
        status,
        namespace
from
        (select * from all_objects),
        (select rownum n1 from dual connect by level <= 10)  -- > comment to avoid WordPress format issue
;

alter table pt1 modify(status not null);

execute dbms_stats.gather_table_stats(null,'pt1',granularity=>'ALL',method_opt=>'for all columns size 1')

create index pt1_i1 on pt1(status, namespace) nologging local;

alter session set statistics_level = all;
set serveroutput off
set linesize 156
set pagesize 60
set trimspool on

select  min(status) from pt1;
select * from table(dbms_xplan.display_cursor(null,null,'allstats last cost partition'));

select  min(namespace) from pt1 where status = 'INVALID';
select * from table(dbms_xplan.display_cursor(null,null,'allstats last cost partition'));

select  min(namespace) from pt1 where status = (select min(status) from pt1);
select * from table(dbms_xplan.display_cursor(null,null,'allstats last cost partition'));

The basic “min/max” optimisation allows Oracle to avoid a massive sort aggregate – Oracle doesn’t need to acquire a lot of data and sort it when it knows that the “left hand” end of an index holds the low values and the “right hand” end the high values. So, for example, in the first query above the optimizer could simply walk down the index branches to the left hand leaf and look at the single lowest entry in the leaf block to determine the lowest value for status … if the index had been a global index.
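As a sketch of that variation (not what the script above does – you’d have to drop the local index first; without the “local” keyword the index defaults to global):

drop index pt1_i1;
create index pt1_i1 on pt1(status, namespace) nologging;        -- no "local" keyword, so a global index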

Things get a little messy, though, when the index is locally partitioned and your query isn’t about the partition key and there’s no suitable global index. Once upon a time (IIRC) Oracle would simply have to do an index fast full scan across all index partitions to handle such a query, but some time ago it got a lot cleverer and was enhanced to do a min/max scan on each partition in turn getting one value per partition very efficiently, then aggregating across those values to find the global minimum.
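Conceptually that’s equivalent to a hand-written rewrite like the following sketch (the partition names are hypothetical – the script’s hash partitions get system-generated names), where each branch can do its own cheap min/max scan on a single index partition:

select  min(s)
from    (
        select min(status) s from pt1 partition (sys_p1)
        union all
        select min(status)   from pt1 partition (sys_p2)
        union all
        select min(status)   from pt1 partition (sys_p3)
        union all
        select min(status)   from pt1 partition (sys_p4)
        )
;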

Here are the three execution plans (with rowsource execution stats pulled from memory) taken from 12.1.0.2 for the queries above:


-----------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                   | Name   | Starts | E-Rows | Cost (%CPU)| Pstart| Pstop | A-Rows |   A-Time   | Buffers |
-----------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |        |      1 |        |     9 (100)|       |       |      1 |00:00:00.01 |      12 |
|   1 |  SORT AGGREGATE             |        |      1 |      1 |            |       |       |      1 |00:00:00.01 |      12 |
|   2 |   PARTITION HASH ALL        |        |      1 |      1 |     9   (0)|     1 |     4 |      4 |00:00:00.01 |      12 |
|   3 |    INDEX FULL SCAN (MIN/MAX)| PT1_I1 |      4 |      1 |     9   (0)|     1 |     4 |      4 |00:00:00.01 |      12 |
-----------------------------------------------------------------------------------------------------------------------------


-------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                     | Name   | Starts | E-Rows | Cost (%CPU)| Pstart| Pstop | A-Rows |   A-Time   | Buffers |
-------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT              |        |      1 |        |     9 (100)|       |       |      1 |00:00:00.01 |      12 |
|   1 |  SORT AGGREGATE               |        |      1 |      1 |            |       |       |      1 |00:00:00.01 |      12 |
|   2 |   PARTITION HASH ALL          |        |      1 |      1 |     9   (0)|     1 |     4 |      1 |00:00:00.01 |      12 |
|   3 |    FIRST ROW                  |        |      4 |      1 |     9   (0)|       |       |      1 |00:00:00.01 |      12 |
|*  4 |     INDEX RANGE SCAN (MIN/MAX)| PT1_I1 |      4 |      1 |     9   (0)|     1 |     4 |      1 |00:00:00.01 |      12 |
-------------------------------------------------------------------------------------------------------------------------------


-----------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                      | Name   | Starts | E-Rows | Cost (%CPU)| Pstart| Pstop | A-Rows |   A-Time   | Buffers | Reads  |
-----------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT               |        |      1 |        |   337 (100)|       |       |      1 |00:00:00.07 |    2402 |   2242 |
|   1 |  SORT AGGREGATE                |        |      1 |      1 |            |       |       |      1 |00:00:00.07 |    2402 |   2242 |
|   2 |   PARTITION HASH ALL           |        |      1 |    422K|   328  (10)|     1 |     4 |     10 |00:00:00.07 |    2402 |   2242 |
|*  3 |    INDEX FAST FULL SCAN        | PT1_I1 |      4 |    422K|   328  (10)|     1 |     4 |     10 |00:00:00.07 |    2402 |   2242 |
|   4 |     SORT AGGREGATE             |        |      1 |      1 |            |       |       |      1 |00:00:00.01 |      12 |      0 |
|   5 |      PARTITION HASH ALL        |        |      1 |      1 |     9   (0)|     1 |     4 |      4 |00:00:00.01 |      12 |      0 |
|   6 |       INDEX FULL SCAN (MIN/MAX)| PT1_I1 |      4 |      1 |     9   (0)|     1 |     4 |      4 |00:00:00.01 |      12 |      0 |
-----------------------------------------------------------------------------------------------------------------------------------------

In the first plan Oracle has done an “index full scan (min/max)” across each of the four partitions in turn to return one row very cheaply from each, then aggregated to find the overall minimum.

In the second plan Oracle has done an “index range scan (min/max)” in exactly the same way, since it was able to find the start point in the index for the status ‘INVALID’ very efficiently.

In the third plan Oracle has been able to find the minimum value for the status (‘INVALID’) very efficiently in the subquery, and has passed that single value up to the main query, which has then used a brute force approach to search the whole of every partition of the index for every occurrence (all 10 of them) of the value ‘INVALID’ and then aggregated them to find the minimum namespace. Despite “knowing”, by the time the main query runs, that there will be a single value to probe for the status, the optimizer has not anticipated the fact that the final query will effectively become the same as the preceding one. As a result we’ve read 2,242 data blocks into the cache.

Turn, then, to the execution plan from 12.2.0.1 for this last query:


---------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                       | Name   | Starts | E-Rows | Cost (%CPU)| Pstart| Pstop | A-Rows |   A-Time   | Buffers |
---------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                |        |      1 |        |     9 (100)|       |       |      1 |00:00:00.01 |      24 |
|   1 |  SORT AGGREGATE                 |        |      1 |      1 |            |       |       |      1 |00:00:00.01 |      24 |
|   2 |   PARTITION HASH ALL            |        |      1 |      1 |     9   (0)|     1 |     4 |      4 |00:00:00.01 |      24 |
|   3 |    FIRST ROW                    |        |      4 |      1 |     9   (0)|       |       |      4 |00:00:00.01 |      24 |
|*  4 |     INDEX RANGE SCAN (MIN/MAX)  | PT1_I1 |      4 |      1 |     9   (0)|     1 |     4 |      4 |00:00:00.01 |      24 |
|   5 |      SORT AGGREGATE             |        |      1 |      1 |            |       |       |      1 |00:00:00.01 |      12 |
|   6 |       PARTITION HASH ALL        |        |      1 |      1 |     9   (0)|     1 |     4 |      4 |00:00:00.01 |      12 |
|   7 |        INDEX FULL SCAN (MIN/MAX)| PT1_I1 |      4 |      1 |     9   (0)|     1 |     4 |      4 |00:00:00.01 |      12 |
---------------------------------------------------------------------------------------------------------------------------------

In 12.2 you can see that the main query is now doing an “index range scan (min/max)” on each index partition in turn, based on the incoming (though unknown at parse time) single value from the subquery. As a result the total work done is a mere 24 buffer visits.

There have been a couple of occasions in the past where I’ve had to write some PL/SQL to work around little details like this. It’s nice to know that simple tables and partitioned tables with local indexes can now behave the same way. I also wonder whether there may be sites that could drop (or drop columns from, or make local) some indexes that they’ve previously created to handle queries of the “most recent occurrence” type.

If, for any reason, you need to disable this enhancement, it’s controlled by fix_control (v$system_fix_control) “18915345 Allow MIN/MAX optimization for pred having single row subquery” which can be set in the startup file, at the system level, or in the session.
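For example, a sketch of disabling it at the session level (fix_control values use the “bug_number:value” format, with 0 meaning off and 1 meaning on):

alter session set "_fix_control" = '18915345:0';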

Update

Checking MoS for the bug number I found that the limitation had been reported for 11.2.0.3, with “Fixed in product version” reported as 12.2. There are patches for various releases of 11.2.0.4, though none yet for 12.1.0.2 – but if you think you need one you can always try raising an SR.

 

Index Bouncy Scan 3

Wed, 2018-05-30 07:15

This is a follow-up to a problem I had with yesterday’s example of using recursive CTEs to “bounce” along a multi-column index to pick out the unique set of combinations of the first two columns. Part of the resulting query used a pair of aggregate scalar subqueries in a select list – and Andrew Sayer improved on my query by introducing a “cross apply” (which I simply hadn’t thought of) which the optimizer transformed into a lateral view (which I had thought of, but couldn’t get to work).

After seeing what Andrew and the optimizer had done I looked a little more closely at my lateral view experiment and modified it so that it worked. Here are the three critical versions of the relevant code fragment; first is my original code, then Andrew’s cross apply, then my working lateral view version:

select
        (select min(t1.val1) val1 from t1 where t1.val1 > bounce1.val1) val1,
        (select min(t1.val2) val2 from t1 where t1.val1 > bounce1.val1 and rownum = 1) val2
from    bounce1
where   bounce1.val1 is not null
 
 
select
        ca.val1 ,ca.val2
from    bounce1
cross  apply (select val1, val2
              from  (select /*+ index(t1) no_index_ffs(t1) */
                             val1, val2
                     from    t1
                     where   t1.val1 > bounce1.val1
                     and     rownum = 1
                    )
             ) ca
where  bounce1.val1 is not null
 
----

select
        ca.val1 ,ca.val2
from    bounce1, 
        lateral(select val1, val2
              from  (select /*+ index(t1) no_index_ffs(t1) */
                             val1, val2
                     from    t1
                     where   t1.val1 > bounce1.val1
                     and     rownum = 1
                    )
             ) ca
where  bounce1.val1 is not null

All I’ve done to modify Andrew’s code is put a comma after the table (actually CTE) bounce1, then change “cross apply” to “lateral”. Compare the resulting text with the following lateral version that doesn’t work:


select
        ca.val1 ,ca.val2
from    bounce1, 
        lateral (
                   select /*+ index(t1) no_index_ffs(t1) */
                             val1, val2
                     from    t1
                     where   t1.val1 > bounce1.val1
                     and     rownum = 1
             ) ca
where  bounce1.val1 is not null

To get from not working to working all I’ve done is wrap the text in my lateral() subquery inside one more (apparently redundant) layer of “select * from ()”!

In fact my assumption that my code wasn’t working was incorrect – what was really going on was that the code I had written was valid but producing the wrong results; I took the wrong output to mean I’d made a mistake in the way I was writing the query, and couldn’t figure out what I had done wrong.

Problem Solving:

To get a better idea of what’s going on, I took a closer look at the execution plans. Here are the plans (main body only) for the two variants of using the lateral() view – the first from the SQL with the “redundant” select, the second as I originally wrote it. Notice that the number of rows (A-Rows) returned in the first case is the expected 30, while in the second case it’s only 10.


---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                    | Name            | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers | Reads  |  OMem |  1Mem | Used-Mem |
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                             |                 |      1 |        |   125 (100)|     30 |00:00:00.01 |      40 |     28 |       |       |          |
|   1 |  SORT ORDER BY                               |                 |      1 |      4 |   125   (3)|     30 |00:00:00.01 |      40 |     28 |  2048 |  2048 | 2048  (0)|
|*  2 |   VIEW                                       |                 |      1 |      4 |   124   (2)|     30 |00:00:00.01 |      40 |     28 |       |       |          |
|   3 |    UNION ALL (RECURSIVE WITH) BREADTH FIRST  |                 |      1 |        |            |     33 |00:00:00.01 |      40 |     28 |  1024 |  1024 |          |
|*  4 |     VIEW                                     |                 |      1 |      2 |    61   (2)|      3 |00:00:00.01 |       8 |      4 |       |       |          |
|   5 |      UNION ALL (RECURSIVE WITH) BREADTH FIRST|                 |      1 |        |            |      3 |00:00:00.01 |       8 |      4 |  1024 |  1024 |          |
|*  6 |       VIEW                                   |                 |      1 |      1 |    29   (0)|      1 |00:00:00.01 |       2 |      1 |       |       |          |
|*  7 |        WINDOW NOSORT STOPKEY                 |                 |      1 |  10000 |    29   (0)|      1 |00:00:00.01 |       2 |      1 | 73728 | 73728 |          |
|   8 |         INDEX FULL SCAN                      | T1_PK           |      1 |  10000 |    29   (0)|      2 |00:00:00.01 |       2 |      1 |       |       |          |
|   9 |       NESTED LOOPS                           |                 |      3 |      1 |    31   (0)|      2 |00:00:00.01 |       6 |      3 |       |       |          |
|  10 |        RECURSIVE WITH PUMP                   |                 |      3 |        |            |      3 |00:00:00.01 |       0 |      0 |       |       |          |
|  11 |        VIEW                                  | VW_LAT_1BBF5C63 |      3 |      1 |     2   (0)|      2 |00:00:00.01 |       6 |      3 |       |       |          |
|  12 |         VIEW                                 |                 |      3 |      1 |     2   (0)|      2 |00:00:00.01 |       6 |      3 |       |       |          |
|* 13 |          COUNT STOPKEY                       |                 |      3 |        |            |      2 |00:00:00.01 |       6 |      3 |       |       |          |
|* 14 |           INDEX RANGE SCAN                   | T1_PK           |      3 |      1 |     2   (0)|      2 |00:00:00.01 |       6 |      3 |       |       |          |
|  15 |     SORT AGGREGATE                           |                 |     30 |      1 |            |     30 |00:00:00.01 |      32 |     24 |       |       |          |
|  16 |      FIRST ROW                               |                 |     30 |      1 |     2   (0)|     27 |00:00:00.01 |      32 |     24 |       |       |          |
|* 17 |       INDEX RANGE SCAN (MIN/MAX)             | T1_PK           |     30 |      1 |     2   (0)|     27 |00:00:00.01 |      32 |     24 |       |       |          |
|  18 |     RECURSIVE WITH PUMP                      |                 |     11 |        |            |     30 |00:00:00.01 |       0 |      0 |       |       |          |
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------


------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                    | Name            | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                             |                 |      1 |        |   125 (100)|     10 |00:00:00.01 |      16 |       |       |          |
|   1 |  SORT ORDER BY                               |                 |      1 |      4 |   125   (3)|     10 |00:00:00.01 |      16 |  2048 |  2048 | 2048  (0)|
|*  2 |   VIEW                                       |                 |      1 |      4 |   124   (2)|     10 |00:00:00.01 |      16 |       |       |          |
|   3 |    UNION ALL (RECURSIVE WITH) BREADTH FIRST  |                 |      1 |        |            |     11 |00:00:00.01 |      16 |  1024 |  1024 |          |
|*  4 |     VIEW                                     |                 |      1 |      2 |    61   (2)|      1 |00:00:00.01 |       4 |       |       |          |
|   5 |      UNION ALL (RECURSIVE WITH) BREADTH FIRST|                 |      1 |        |            |      1 |00:00:00.01 |       4 |  1024 |  1024 |          |
|*  6 |       VIEW                                   |                 |      1 |      1 |    29   (0)|      1 |00:00:00.01 |       2 |       |       |          |
|*  7 |        WINDOW NOSORT STOPKEY                 |                 |      1 |  10000 |    29   (0)|      1 |00:00:00.01 |       2 | 73728 | 73728 |          |
|   8 |         INDEX FULL SCAN                      | T1_PK           |      1 |  10000 |    29   (0)|      2 |00:00:00.01 |       2 |       |       |          |
|   9 |       NESTED LOOPS                           |                 |      1 |      1 |    31   (0)|      0 |00:00:00.01 |       2 |       |       |          |
|  10 |        RECURSIVE WITH PUMP                   |                 |      1 |        |            |      1 |00:00:00.01 |       0 |       |       |          |
|* 11 |        VIEW                                  | VW_DCL_1BBF5C63 |      1 |      1 |     2   (0)|      0 |00:00:00.01 |       2 |       |       |          |
|* 12 |         COUNT STOPKEY                        |                 |      1 |        |            |      1 |00:00:00.01 |       2 |       |       |          |
|  13 |          INDEX FULL SCAN                     | T1_PK           |      1 |      1 |     2   (0)|      1 |00:00:00.01 |       2 |       |       |          |
|  14 |     SORT AGGREGATE                           |                 |     10 |      1 |            |     10 |00:00:00.01 |      12 |       |       |          |
|  15 |      FIRST ROW                               |                 |     10 |      1 |     2   (0)|      9 |00:00:00.01 |      12 |       |       |          |
|* 16 |       INDEX RANGE SCAN (MIN/MAX)             | T1_PK           |     10 |      1 |     2   (0)|      9 |00:00:00.01 |      12 |       |       |          |
|  17 |     RECURSIVE WITH PUMP                      |                 |     11 |        |            |     10 |00:00:00.01 |       0 |       |       |          |
------------------------------------------------------------------------------------------------------------------------------------------------------------------

Most importantly we can see that the optimizer has used two different transformations. For the working query we see the view name VW_LAT_xxxxxxxx at operation 11, which is Oracle implementing a lateral view; for the problem query we see the view name VW_DCL_xxxxxxxx at operation 11, which is Oracle implementing a transformation to a “decorrelated lateral view”.

My first test after noting this difference was to see what would happen if I added the hint /*+ no_query_transformation */ to the query: it resulted in the VW_DCL_xxxxxxxx view name changing to VW_LAT_xxxxxxxx and the query producing the right result. Andrew Sayer, on the ODC thread, then pointed out that he’d done a couple more experiments and used the /*+ no_decorrelate() */ hint, so I tried that with my query, adding it (with no parameters) to the subquery inside the lateral() clause: again the plan changed from using VW_DCL to VW_LAT and the results were correct.

Test Case

Bottom line on this – it looks like the optimizer is decorrelating a subquery when it shouldn’t, leading to wrong results. To make it easier to see this anomaly I stripped the original sample down to a basic test case starting with the table that I used in the previous posting:

rem
rem     Script:         decorralate.sql
rem     Author:         Jonathan Lewis
rem     Dated:          May 2018
rem
rem     Last tested 
rem             18.1.0.0  -- via liveSQL
rem             12.2.0.1
rem             12.1.0.2
rem

create table t1
segment creation immediate
nologging
as
with generator as (
        select
                rownum id
        from dual
        connect by
                level <= 1e4 -- > comment to avoid WordPress format issue
)
select
        rownum                          id,
        mod(rownum-1,3)                 val1,
        mod(rownum-1,10)                val2,
        lpad('x',100,'x')               padding
from
        generator       v1
order by
        dbms_random.value
;

begin
        dbms_stats.gather_table_stats(
                ownname     => user,
                tabname     => 'T1',
                method_opt  => 'for all columns size 1'
        );
end;
/

alter table t1 add constraint t1_pk primary key(val1, val2, id);

Now here are two versions of a simplified piece of code that should select the distinct values of val1 greater than the lowest value (each row in the UNION ALL of dual is emulating the way in which yesterday’s recursive CTE was effectively saying “this is a current known value, find the next higher”):


prompt  =============
prompt  Right results
prompt  =============

select
        v1.val1, v1.val2
from    (
        select  0 val1, 0 val2 from dual
        union all
        select 1,0 from dual
        union all
        select 2,0 from dual
        ) bounce1,
        lateral (
            select val1, val2 from (
              select  /*+ index(t1) no_index_ffs(t1) */
                      t1.val1, t1.val2
              from    t1
              where   t1.val1 > bounce1.val1
              and     rownum = 1
            )
        ) v1
;

prompt  ===========================================
prompt  Wrong results -- "redundant" select removed
prompt  ===========================================

select
        v1.val1, v1.val2
from    (
        select  0 val1, 0 val2 from dual
        union all
        select 1,0 from dual
        union all
        select 2,0 from dual
        ) bounce1,
        lateral (
            -- select val1, val2 from (
              select  /*+ index(t1) no_index_ffs(t1) */
                      t1.val1, t1.val2
              from    t1
              where   t1.val1 > bounce1.val1
              and     rownum = 1
            -- )
        ) v1
;

Here’s a cut-n-paste from running the two queries:


=============
Right results
=============

      VAL1       VAL2
---------- ----------
         1          0
         2          0

2 rows selected.

============================================
Wrong results  -- "redundant" select removed
============================================

no rows selected
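
For the record, here’s a sketch of the same “wrong results” query with the /*+ no_decorrelate */ hint mentioned above added to the lateral subquery; on the versions I’ve listed this should block the decorrelation, bringing back the VW_LAT plan and the two correct rows:

select
        v1.val1, v1.val2
from    (
        select  0 val1, 0 val2 from dual
        union all
        select 1,0 from dual
        union all
        select 2,0 from dual
        ) bounce1,
        lateral (
              select  /*+ no_decorrelate index(t1) no_index_ffs(t1) */
                      t1.val1, t1.val2
              from    t1
              where   t1.val1 > bounce1.val1
              and     rownum = 1
        ) v1
;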

Finally, to get an idea of what’s gone wrong – and to show that the optimizer has done something wrong when attempting to decorrelate – we can take a look at the optimizer trace file to see the final transformed SQL that the optimizer has produced a plan for. (I enabled the trace with the command “alter session set events ‘trace [rdbms.SQL_Transform.*]’;” to limit the trace to just the information about optimizer transformations.) This – cosmetically altered – is the final “unparsed” query:

select 
        vw_dcl_a18161ff.val1 val1,
        vw_dcl_a18161ff.val2 val2 
from    ( 
                (select 0 val1 from sys.dual dual) 
                union all  
                (select 1 1 from sys.dual dual) 
                union all  
                (select 2 2 from sys.dual dual)
        ) bounce1, 
        (
        select
                 /*+ no_index_ffs (t1) index (t1) */ 
                t1.val1 val1_0,
                t1.val2 val2_1 
        from
                test_user.t1 t1
        where 
                rownum = 1
        ) vw_dcl_a18161ff 
where 
        vw_dcl_a18161ff.val1 > bounce1.val1

As you can see, the lateral view has turned into a non-mergeable inline view which selects the first row available from t1 by following the supplied hints, and joins that single row result set to bounce1. I have a suspicion that lateral views which include rownum predicates should not be decorrelated. I have looked on MoS to see if I can find any bugs related to decorrelating lateral views, but either there are none or my search terms weren’t good enough.

 

Upgrades

Wed, 2018-05-30 04:08

One of my maxims for Oracle performance is: “Don’t try to be too clever”. Apart from the obvious reason that no-one else may be able to understand how to modify your code if the requirements change at a future date, there’s always the possibility that an Oracle upgrade will mean some clever trick you implemented will simply stop working.

While searching for information about a possible Oracle bug recently I noticed the following fix control (v$system_fix_control) in 12.2.0.1:


     BUGNO OPTIMIZE SQL_FEATURE                        DESCRIPTION                                                             VALUE
---------- -------- ---------------------------------- ---------------------------------------------------------------- ------------
  18385778          QKSFM_CARDINALITY_18385778         avoid virtual col usage if FI is unusable or invisible 

Maybe that’s just invalidated an idea I published 12 years ago.

I haven’t researched the bug or any underlying SR, but I can think of valid arguments both for and against the fix as described.
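
For reference, here’s a minimal sketch of how you could check the fix on your own system and (strictly in a sandbox, since “_fix_control” is a hidden parameter) temporarily revert to the pre-fix behaviour at session level:

select  bugno, value, sql_feature, description
from    v$system_fix_control
where   bugno = 18385778
;

alter session set "_fix_control" = '18385778:0';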

 

 

Index Bouncy Scan 2

Tue, 2018-05-29 06:27

I wrote a note some time last year about taking advantage of the “index range scan (min/max)” operation in a PL/SQL loop to find the small number of distinct values in a large single column index efficiently (for example an index that was not very efficient but existed to avoid the “foreign key locking” problem). The resulting comments included pointers to other articles that showed pure SQL solutions to the same problem using recursive CTEs (“with” subqueries) from Markus Winand and Sayan Malakshinov: both writers also show examples of extending the technique to cover more cases than the simple list of distinct values.
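
For anyone who hasn’t read the original note, the PL/SQL strategy is just a loop of min/max probes. Here’s a minimal sketch, assuming a table t1 with a suitable index whose leading column is val1; each select should appear as an “index full/range scan (min/max)”:

set serveroutput on

declare
        m_val   t1.val1%type;
begin
        select min(val1) into m_val from t1;
        while m_val is not null loop
                dbms_output.put_line(m_val);
                select min(val1) into m_val from t1 where val1 > m_val;
        end loop;
end;
/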

The topic came up again on the ODC (OTN) database forum a couple of days ago; one of the replies linked back to my original posting, another gave the recursive solution for a single column index – so I ended up seeing the following question twice, once as a comment on my blog, once in the forum: “Can you extend this method to a two column index, what about an N column index ?”

Here’s a walk-through of working out one possible solution for the two-column requirement – how to find all the distinct combinations for the first two columns of a very large index without having to scan and aggregate the whole index. We start with a suitable table and index.


rem
rem     Script:         bouncy_index.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Apr 2018
rem     Purpose:
rem
rem     Last tested
rem             12.2.0.1
rem             12.1.0.2
rem             11.2.0.4
rem

create table t1
segment creation immediate
nologging
as
with generator as (
        select 
                rownum id
        from dual 
        connect by 
                level <= 1e4 -- > comment to avoid WordPress format issue
)
select
        rownum                          id,
        mod(rownum-1,3)                 val1,
        mod(rownum-1,10)                val2,
        lpad('x',100,'x')               padding
from
        generator       v1
order by
        dbms_random.value
;

begin
        dbms_stats.gather_table_stats(
                ownname     => user,
                tabname     => 'T1',
                method_opt  => 'for all columns size 1'
        );
end;
/

alter table t1 add constraint t1_pk primary key(val1, val2, id);

I’ve created a table with 3 values for val1, 10 values for val2, with a total of 30 combinations. The addition of the primary key starting with (val1, val2) is just a lazy way to ensure that I have a suitable index AND val1 and val2 are both declared not null.
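
(For a baseline: the brute-force version of the requirement is just the aggregate below, which would typically run as an index fast full scan with a hash group by. That’s perfectly adequate for this tiny data set, but the whole point of the exercise is to avoid doing it on a very large index.)

select
        val1, val2, count(*)
from    t1
group by
        val1, val2
order by
        val1, val2
;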

With this data my first step will be to demonstrate the recursive CTE (“with” subquery) used by Andrew Sayer in the ODC posting to get the distinct values for val1 using three “index range scan (min/max)” probes. I’ve included the in-memory execution plan with rowsource execution stats to show that this does a minimal amount of work.

The results in this note come from 12.2.0.1:


set serveroutput off
alter session set statistics_level = all;

with bouncy (val1)
as (
        select  min(val1) val1
        from    t1
        union all
        select  (select min(t1.val1) val1 from t1 where t1.val1 > bouncy.val1) val1
        from    bouncy
        where   bouncy.val1 is not null
    )
select  *
from    bouncy
where   bouncy.val1 is not null
order by
        val1
;

select * from table(dbms_xplan.display_cursor(null,null,'cost allstats last'));

---------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                  | Name  | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers | Reads  |  OMem |  1Mem | Used-Mem |
---------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                           |       |      1 |        |    19 (100)|      3 |00:00:00.01 |       7 |      4 |       |       |          |
|   1 |  SORT ORDER BY                             |       |      1 |      2 |    19   (6)|      3 |00:00:00.01 |       7 |      4 |  2048 |  2048 | 2048  (0)|
|*  2 |   VIEW                                     |       |      1 |      2 |    18   (0)|      3 |00:00:00.01 |       7 |      4 |       |       |          |
|   3 |    UNION ALL (RECURSIVE WITH) BREADTH FIRST|       |      1 |        |            |      4 |00:00:00.01 |       7 |      4 |  1024 |  1024 |          |
|   4 |     SORT AGGREGATE                         |       |      1 |      1 |            |      1 |00:00:00.01 |       2 |      1 |       |       |          |
|   5 |      INDEX FULL SCAN (MIN/MAX)             | T1_PK |      1 |      1 |     2   (0)|      1 |00:00:00.01 |       2 |      1 |       |       |          |
|   6 |     SORT AGGREGATE                         |       |      3 |      1 |            |      3 |00:00:00.01 |       5 |      3 |       |       |          |
|   7 |      FIRST ROW                             |       |      3 |      1 |     2   (0)|      2 |00:00:00.01 |       5 |      3 |       |       |          |
|*  8 |       INDEX RANGE SCAN (MIN/MAX)           | T1_PK |      3 |      1 |     2   (0)|      2 |00:00:00.01 |       5 |      3 |       |       |          |
|   9 |     RECURSIVE WITH PUMP                    |       |      4 |        |            |      3 |00:00:00.01 |       0 |      0 |       |       |          |
---------------------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("BOUNCY"."VAL1" IS NOT NULL)
   8 - access("T1"."VAL1">:B1)

As you can see I’ve done an “index full scan (min/max)” as the first step of the recursive query, visiting just two buffered blocks (the index leaf-block count is 27 – roughly 9 per value of val1 – so Oracle is clearly doing an efficient access for that value; it’s not really a “full” scan). We then see 3 “index range scan (min/max)” probes at roughly 2 buffer visits each to collect the remaining values. (There’s probably a small saving in buffer gets due to the pinning that takes place.)

So we can get the val1 values very easily and efficiently with this recursive CTE technology. Let’s write some code that uses the same technology to find the val2 values for each possible val1 value in turn:

with bounce2 (val1, val2)
as (
        select val1, val2 from (
                select  0 val1, 0 val2 from dual
                union all
                select 1,0 from dual
                union all
                select 2,0 from dual
        )
        union all
        select  bounce2.val1, (select min(t1.val2) val2 from t1 where t1.val1 = bounce2.val1 and t1.val2 > bounce2.val2) val2
        from    bounce2
        where   bounce2.val2 is not null
--      and     bounce2.val1 is not null
)
select * from bounce2
where
        bounce2.val2 is not null
and     bounce2.val1 is not null        -- > redundant predicate
order by
        val1, val2
;

select * from table(dbms_xplan.display_cursor(null,null,'cost allstats last'));

---------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                  | Name  | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers | Reads  |  OMem |  1Mem | Used-Mem |
---------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                           |       |      1 |        |    27 (100)|     30 |00:00:00.01 |      32 |     24 |       |       |          |
|   1 |  SORT ORDER BY                             |       |      1 |      6 |    27   (4)|     30 |00:00:00.01 |      32 |     24 |  2048 |  2048 | 2048  (0)|
|*  2 |   VIEW                                     |       |      1 |      6 |    26   (0)|     30 |00:00:00.01 |      32 |     24 |       |       |          |
|   3 |    UNION ALL (RECURSIVE WITH) BREADTH FIRST|       |      1 |        |            |     33 |00:00:00.01 |      32 |     24 |  1024 |  1024 |          |
|   4 |     VIEW                                   |       |      1 |      3 |     6   (0)|      3 |00:00:00.01 |       0 |      0 |       |       |          |
|   5 |      UNION-ALL                             |       |      1 |        |            |      3 |00:00:00.01 |       0 |      0 |       |       |          |
|   6 |       FAST DUAL                            |       |      1 |      1 |     2   (0)|      1 |00:00:00.01 |       0 |      0 |       |       |          |
|   7 |       FAST DUAL                            |       |      1 |      1 |     2   (0)|      1 |00:00:00.01 |       0 |      0 |       |       |          |
|   8 |       FAST DUAL                            |       |      1 |      1 |     2   (0)|      1 |00:00:00.01 |       0 |      0 |       |       |          |
|   9 |     SORT AGGREGATE                         |       |     30 |      1 |            |     30 |00:00:00.01 |      32 |     24 |       |       |          |
|  10 |      FIRST ROW                             |       |     30 |      1 |     2   (0)|     27 |00:00:00.01 |      32 |     24 |       |       |          |
|* 11 |       INDEX RANGE SCAN (MIN/MAX)           | T1_PK |     30 |      1 |     2   (0)|     27 |00:00:00.01 |      32 |     24 |       |       |          |
|  12 |     RECURSIVE WITH PUMP                    |       |     11 |        |            |     30 |00:00:00.01 |       0 |      0 |       |       |          |
---------------------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter(("BOUNCE2"."VAL2" IS NOT NULL AND "BOUNCE2"."VAL1" IS NOT NULL))
  11 - access("T1"."VAL1"=:B1 AND "T1"."VAL2">:B2)


In this example of the code the second half of the CTE looks remarkably similar to the previous statement – except I now have a two-column CTE and I’ve included an equality predicate against val1 based on the first of the two columns. In the first half of the code I’ve cheated (as a temporary measure) and supplied three rows of data which list the three distinct values of val1 with their associated minimum values for val2.

The execution plan shows that I’ve done 30 “index range scan (min/max)” of the index with 32 buffer visits. And that’s exactly the right number of probes to return my result set. So if I can manage to generate the starting values efficiently I can execute the whole query efficiently. So let’s find a way of changing that “union all on dual” fudge into a generic statement. Let’s replace it with a recursive CTE:


with bounce1(val1, val2) as (
        select val1, val2 
        from    (
                select
                        /*+ index(t1) */
                        val1, val2,
                        row_number() over(order by val1, val2) rn
                from    t1
        )
        where
                rn = 1
        union all
        select
                (select min(t1.val1) val1 from t1 where t1.val1 > bounce1.val1) val1,
                (select min(t1.val2) val2 from t1 where t1.val1 > bounce1.val1 and rownum = 1) val2
        from    bounce1
        where   bounce1.val1 is not null
),
bounce2 (val1, val2)
as (
        select  val1, val2 
        from    bounce1
--      where   bounce1.val1 is not null
        union all
        select  bounce2.val1, (select min(t1.val2) val2 from t1 where t1.val1 = bounce2.val1 and t1.val2 > bounce2.val2) val2
        from    bounce2
        where   bounce2.val2 is not null
--      and     bounce2.val1 is not null
)
select * from bounce2
where
        bounce2.val2 is not null
and     bounce2.val1 is not null        -- > redundant predicate
order by
        val1, val2
;

select * from table(dbms_xplan.display_cursor(null,null,'cost allstats last'));

--------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                    | Name  | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
--------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                             |       |      1 |        |   189 (100)|     30 |00:00:00.01 |      45 |       |       |          |
|   1 |  SORT ORDER BY                               |       |      1 |      4 |   189   (2)|     30 |00:00:00.01 |      45 |  2048 |  2048 | 2048  (0)|
|*  2 |   VIEW                                       |       |      1 |      4 |   188   (2)|     30 |00:00:00.01 |      45 |       |       |          |
|   3 |    UNION ALL (RECURSIVE WITH) BREADTH FIRST  |       |      1 |        |            |     34 |00:00:00.01 |      45 |  1024 |  1024 |          |
|   4 |     VIEW                                     |       |      1 |      2 |    87   (2)|      4 |00:00:00.01 |      13 |       |       |          |
|   5 |      UNION ALL (RECURSIVE WITH) BREADTH FIRST|       |      1 |        |            |      4 |00:00:00.01 |      13 |  1024 |  1024 |          |
|*  6 |       VIEW                                   |       |      1 |      1 |    29   (0)|      1 |00:00:00.01 |       2 |       |       |          |
|*  7 |        WINDOW NOSORT STOPKEY                 |       |      1 |  10000 |    29   (0)|      1 |00:00:00.01 |       2 | 73728 | 73728 |          |
|   8 |         INDEX FULL SCAN                      | T1_PK |      1 |  10000 |    29   (0)|      2 |00:00:00.01 |       2 |       |       |          |
|   9 |       SORT AGGREGATE                         |       |      3 |      1 |            |      3 |00:00:00.01 |       5 |       |       |          |
|  10 |        FIRST ROW                             |       |      3 |      1 |     2   (0)|      2 |00:00:00.01 |       5 |       |       |          |
|* 11 |         INDEX RANGE SCAN (MIN/MAX)           | T1_PK |      3 |      1 |     2   (0)|      2 |00:00:00.01 |       5 |       |       |          |
|  12 |       SORT AGGREGATE                         |       |      3 |      1 |            |      3 |00:00:00.01 |       6 |       |       |          |
|* 13 |        COUNT STOPKEY                         |       |      3 |        |            |      2 |00:00:00.01 |       6 |       |       |          |
|* 14 |         INDEX RANGE SCAN                     | T1_PK |      3 |    500 |     2   (0)|      2 |00:00:00.01 |       6 |       |       |          |
|  15 |       RECURSIVE WITH PUMP                    |       |      4 |        |            |      3 |00:00:00.01 |       0 |       |       |          |
|  16 |     SORT AGGREGATE                           |       |     30 |      1 |            |     30 |00:00:00.01 |      32 |       |       |          |
|  17 |      FIRST ROW                               |       |     30 |      1 |     2   (0)|     27 |00:00:00.01 |      32 |       |       |          |
|* 18 |       INDEX RANGE SCAN (MIN/MAX)             | T1_PK |     30 |      1 |     2   (0)|     27 |00:00:00.01 |      32 |       |       |          |
|  19 |     RECURSIVE WITH PUMP                      |       |     11 |        |            |     30 |00:00:00.01 |       0 |       |       |          |
--------------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter(("BOUNCE2"."VAL2" IS NOT NULL AND "BOUNCE2"."VAL1" IS NOT NULL))
   6 - filter("RN"=1)
   7 - filter(ROW_NUMBER() OVER ( ORDER BY "VAL1","VAL2")<=1)
  11 - access("T1"."VAL1">:B1)
  13 - filter(ROWNUM=1)
  14 - access("T1"."VAL1">:B1)
  18 - access("T1"."VAL1"=:B1 AND "T1"."VAL2">:B2)


Again we see 30 probes using “index range scan (min/max)” with 32 buffer gets to get 30 rows; plus a further 13 buffer gets to generate the three driving rows. The 13 buffer gets break down to: 2 to get the minimum (val1, val2) combination using an “index full scan (min/max)”, then 5 for the probes to get the three minimum values for val1, and 6 for the probes to get the three corresponding minimum values of val2.

You’ll notice that I’ve got various “is not null” predicates scattered throughout the code. In some cases this is to stop Oracle from running into an infinite loop and reporting “ORA-32044: cycle detected while executing recursive WITH query”. This would occur because the “(select min()…)” inline scalar subqueries return a null if there is no data found, which would lead to the next cycle of the recursive descent taking that null as an input – hence starting the infinite recursion. In some cases the “is not null” predicates are my default pattern for recursive CTEs and some of them could probably be removed with no change in meaning (or workload).

The /*+ index() */ hint in the starting point for bounce1 was necessary to avoid an “index fast full scan” in 12.2; but that was purely a case of the statistics – number of distinct values, leaf_block count, etc – making the optimizer pick an option that was appropriate for this tiny data set, but not appropriate for the demonstration.  In fact this looks like the side effect of two defects in the 12.1 optimizer code, of which only one has been fixed in 12.2.

Optimizer Limitations

Here’s an extract from the execution plan for the final query with an /*+ index(t1) */ hint in place. The extract is identical for 12.1.0.2 and 12.2.0.1:

--------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                    | Name  | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
--------------------------------------------------------------------------------------------------------------------------------------------------------
...
|*  6 |       VIEW                                   |       |      1 |      1 |    29   (0)|      1 |00:00:00.01 |       2 |       |       |          |
|*  7 |        WINDOW NOSORT STOPKEY                 |       |      1 |  10000 |    29   (0)|      1 |00:00:00.01 |       2 | 73728 | 73728 |          |
|   8 |         INDEX FULL SCAN                      | T1_PK |      1 |  10000 |    29   (0)|      2 |00:00:00.01 |       2 |       |       |          |

You’ll notice the Cost at operation 8 is appropriate for a real (i.e. all leaf blocks) full scan of the index. (The leaf_block value was 27 as I mentioned earlier on). You’ll also see that the OMem (PGA requirement for optimum workarea operation) figure is consistent with Oracle processing 10,000 rows in the index. Since the optimizer managed to work out that it could do a full scan with nosort and stopkey it looks a little surprising that the algorithms didn’t manage to make some allowance for the limited access that would occur. (I’d view this as a current limitation, rather than a bug, though).

Now compare the equivalent extracts when we hint an index fast full scan, 12.1.0.2 first, then 12.2.0.1:

12.1.0.2
--------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                    | Name  | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
--------------------------------------------------------------------------------------------------------------------------------------------------------
...
|*  6 |       VIEW                                   |       |      1 |      1 |    39   (8)|      1 |00:00:00.03 |      32 |       |       |          |
|*  7 |        WINDOW SORT PUSHED RANK               |       |      1 |  10000 |    39   (8)|      1 |00:00:00.03 |      32 |  2048 |  2048 | 2048  (0)|
|   8 |         INDEX FAST FULL SCAN                 | T1_PK |      1 |  10000 |     5   (0)|  10000 |00:00:00.01 |      32 |       |       |          |

12.2.0.1
--------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                    | Name  | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
--------------------------------------------------------------------------------------------------------------------------------------------------------
...
|*  6 |       VIEW                                   |       |      1 |      1 |     7  (29)|      1 |00:00:00.01 |      34 |       |       |          |
|*  7 |        WINDOW SORT PUSHED RANK               |       |      1 |  10000 |     7  (29)|      1 |00:00:00.01 |      34 |  2048 |  2048 | 2048  (0)|
|   8 |         INDEX FAST FULL SCAN                 | T1_PK |      1 |  10000 |     5   (0)|  10000 |00:00:00.01 |      34 |       |       |          |

In both cases the cost of the index fast full scan is the same – and much cheaper; but in 12.1.0.2 the cost of the query looks as if it is allowing for sorting (and spilling) the entire 10,000 rows returned from the index fast full scan (even though the OMem indicates otherwise), while the cost in 12.2.0.1 looks as if it recognises that it just has to do a running comparison through the data set as it returns, keeping only the current minimum in memory at any one moment. This clearly matches our expectations of how Oracle ought to behave, which is why I’d call this a bug in 12.1, fixed by 12.2.

The dramatic change in cost of operation 7 on the upgrade explains the change in plan and the necessity for the /*+ index(t1) */ hint – but if the “first row” predicate were also reflected in the costing then the cost of the “stopkey” index full scan would drop to 2 (probably) and the original 12.1 path would re-appear.

Footnote

I don’t think there’s a lot of scope for improving the efficiency of this query for getting the (relatively) small number of distinct combinations from the first two columns of a very large index – but there are some very clever SQL bunnies on the ODC forum, so I won’t be surprised if someone comes up with a better solution.

Filtering LOBs

Mon, 2018-05-28 02:25

A two-part question about the FILTER operation appeared on the Oracle-L list server a couple of days ago. The first part was a fairly common question – one that’s often prompted by the way the optimizer used to behave in older versions of Oracle. Paraphrased, it was: “Why is the total cost of the query so high compared to the sum of its parts?”

Here’s the query, and the execution plan.

 INSERT INTO TEMP
  SELECT DISTINCT 'id',
    PHT.emplid
  FROM PHOTO PHT
  WHERE 1               =1
  AND PHT.PHOTO IS NOT NULL
  AND NOT EXISTS
    (SELECT 'x'
    FROM TEMP TMP
    WHERE PHT.EMPLID=TMP.EMPLID_SRCH
    AND TMP.OPRID  = 'id'
    )
  ;  

  
-------------------------------------------------------------------------------------
| Id  | Operation                | Name     | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------
|   0 | INSERT STATEMENT         |          | 21210 |  3334K|  5802K  (2)| 00:03:47 |
|   1 |  LOAD TABLE CONVENTIONAL | TEMP     |       |       |            |          |
|*  2 |   FILTER                 |          |       |       |            |          |
|*  3 |    TABLE ACCESS FULL     | PHOTO    | 21211 |  3334K|   313   (1)| 00:00:01 |
|*  4 |    TABLE ACCESS FULL     | TEMP     |     1 |    17 |   380   (2)| 00:00:01 |
-------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter( NOT EXISTS (SELECT 0 FROM " TEMP" "TMP" WHERE
              "TMP"."EMPLID_SRCH"=:B1 AND "TMP"."OPRID"='id'))
   3 - filter("PHT"."PHOTO" IS NOT NULL)
   4 - filter("TMP"."EMPLID_SRCH"=:B1 AND "TMP"."OPRID"='id')

Note that the “not exists” subquery against temp runs as a filter subquery with a cost of 380 for the tablescan. Combine that with the cost of 313 for the driving tablescan of photo and you might wonder why the resulting cost isn’t something like 693 – and in some old versions of Oracle that’s probably how it would be reported.

Historically the optimizer has been very bad about producing a final cost when queries have included subqueries – whether as filter subqueries in the predicate section or as scalar subqueries in the select list. Sometimes the cost would simply vanish from the final cost, sometimes it would be added just once to the final cost regardless of how many times the subquery might actually execute.

In this example the subquery against temp is a correlated subquery and might have to run once for every row in photo where the column photo was not null. At best it would have to run once for every distinct value of the photo.emplid column (the correlation column) found in those rows. In recent versions of Oracle the optimizer has tried to introduce some estimate of how many times the subquery would run as part of its calculation of the total cost. So (to a crude approximation) 5802K = 313 + N * 380. Unfortunately if we try to work backwards to N we find it would be about 15,267, which is about 72% of the 21,211 rows estimated as the result of the tablescan of photo – I haven’t tried to investigate the algorithms yet but presumably the optimizer makes some allowances somewhere for “self caching” as the subquery runs.
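
(A quick sanity check of that arithmetic, using the figures reported in the plan:)

select
        round((5802000 - 313) / 380)                    estimated_subquery_runs,
        round(100 * (5802000 - 313) / (380 * 21211), 1) pct_of_tablescan_rows
from    dual
;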

The more interesting part of the question came when the OP decided to test the effect of getting rid of the subquery. Check the costs in the resulting plan:


  INSERT INTO TEMP
  SELECT DISTINCT 'id',
    PHT.emplid
  FROM PHOTO PHT
  WHERE 1               =1
  AND PHT.UC_PBI_PHOTO IS NOT NULL;

  
-------------------------------------------------------------------------------------
| Id  | Operation                | Name     | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------
|   0 | INSERT STATEMENT         |          | 21211 |  3334K|  3659   (1)| 00:00:01 |
|   1 |  LOAD TABLE CONVENTIONAL | TEMP     |       |       |            |          |
|*  2 |   TABLE ACCESS FULL      | PHOTO    | 21211 |  3334K|  3659   (1)| 00:00:01 |
-------------------------------------------------------------------------------------
  
Predicate Information (identified by operation id):
---------------------------------------------------
     2 - filter("PHT"."PHOTO" IS NOT NULL)

Note how the cost of the tablescan of photo has gone up from 313 in the previous query to 3,659 in the simpler query! How can a tablescan that drives a subquery have a lower cost than the tablescan on its own? Bear in mind that in both cases the Cost attributed to the operation “Table Access Full” is purely about scanning the rows in the photo table and is (or should be) entirely disconnected from the cost and frequency of the subquery.

The clue is in the table definition. The column photo.photo is a BLOB.

Models

I think there are potentially two errors in the optimizer displayed by this example. The first is that it’s adding in a cost that it shouldn’t even be considering; the second is that it’s being inconsistent in the way that it’s deriving that cost.

To demonstrate what I think is happening, I built a variant of the OP’s example as follows:


rem
rem     Script:         optimizer_lob_costs.sql
rem     Author:         Jonathan Lewis
rem     Dated:          May 2018
rem     Purpose:
rem
rem     Last tested
rem             12.2.0.1
rem             12.1.0.2
rem

create table photo (
        emplid          varchar2(11) not null,
        photo           clob,
        other_col       varchar2(1000)
)
lob (photo) 
store as
        photo_lob(
        disable storage in row 
        cache
        logging
)
;

create unique index ph_uk on photo(emplid);

insert /*+ append */ into photo
select
        lpad(2 * rownum,10,0),
        rpad('x',1000),
        rpad('x',1000)
from
        all_objects
where
        rownum <= 10000 -- > comment to avoid wordpress format issue
;

commit;

create table temp(
        oprid           varchar2(30),
        emplid_srch     varchar2(11)
)
;

insert /*+ append */ into temp
select
        'id',
        lpad(2 * rownum,10,0)
from
        all_objects
where
        rownum <= 1000 -- > comment to avoid wordpress format issue
;

commit;

execute dbms_stats.gather_table_stats(user,'photo',method_opt=>'for all columns size 1', cascade=>true)
execute dbms_stats.gather_table_stats(user,'temp', method_opt=>'for all columns size 1', cascade=>true)


I’ve changed the BLOB to a CLOB defined with storage in row disabled, and I’ve introduced a varchar2() column of the same size as the CLOB column. I’ve declared the correlating column not null and created a unique index on it. Here are the two queries I want to review – slightly simplified versions of the original:


explain plan for
insert into temp(emplid_srch)
select 
        distinct pht.emplid
from 
        photo pht
where 
        1 = 1
and  pht.photo is not null
-- and     pht.other_col is not null
and     not exists (
                select /*+ no_unnest */
                        null
                from 
                        temp tmp
                where 
                        pht.emplid=tmp.emplid_srch
        )
;  

select * from table(dbms_xplan.display);

explain plan for
insert into temp(emplid_srch)
select
        distinct pht.emplid
from    photo pht
where   1               =1
and  pht.photo is not null
-- and     pht.other_col is not nulL
;  

select * from table(dbms_xplan.display);

As you can see I’ve had to include a /*+ no_unnest */ hint in my SQL to get the FILTER operation to appear in the plan (the OP had the hidden parameter “_unnest_subquery” set to false); I’ve also allowed for two variants of each query, one referencing the CLOB column, the other referencing the varchar2() column. The only results I’ll show are for the queries accessing the CLOB, and here are the plans, first with, then without, the subquery. Check the cost of the tablescan of the photo table in the two cases:


----------------------------------------------------------------------------------
| Id  | Operation                | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------
|   0 | INSERT STATEMENT         |       |  9999 |   956K| 10458   (3)| 00:00:01 |
|   1 |  LOAD TABLE CONVENTIONAL | TEMP  |       |       |            |          |
|*  2 |   FILTER                 |       |       |       |            |          |
|*  3 |    TABLE ACCESS FULL     | PHOTO | 10000 |   957K|   216   (1)| 00:00:01 |
|*  4 |    TABLE ACCESS FULL     | TEMP  |     1 |    11 |     2   (0)| 00:00:01 |
----------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter( NOT EXISTS (SELECT /*+ NO_UNNEST */ 0 FROM "TEMP" "TMP"
              WHERE "TMP"."EMPLID_SRCH"=:B1))
   3 - filter("PHT"."PHOTO" IS NOT NULL)
   4 - filter("TMP"."EMPLID_SRCH"=:B1)


----------------------------------------------------------------------------------
| Id  | Operation                | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------
|   0 | INSERT STATEMENT         |       | 10000 |   957K|   285   (2)| 00:00:01 |
|   1 |  LOAD TABLE CONVENTIONAL | TEMP  |       |       |            |          |
|*  2 |   TABLE ACCESS FULL      | PHOTO | 10000 |   957K|   285   (2)| 00:00:01 |
----------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("PHT"."PHOTO" IS NOT NULL)

With the subquery in place the tablescan of photo reports a cost of 285, in the absence of the subquery it reports a cost of 216, a difference of 69. Repeating the test but using the varchar2() column the cost of the tablescan was 213 in both cases – suggesting that the variation was due to the column being a LOB.

With no further clues in the plan it looked like one of those rare occasions when I have to look at the 10053 (optimizer) trace file – and this is what I got from the 12.1.0.2 trace, looking at the section headed “SINGLE TABLE ACCESS PATH” for the photo table. First the base query without the subquery:


SINGLE TABLE ACCESS PATH
  Single Table Cardinality Estimation for PHOTO[PHT]
  SPD: Return code in qosdDSDirSetup: NOCTX, estType = TABLE
  Column (#2): PHOTO(LOB)
    AvgLen: 87 NDV: 0 Nulls: 0 Density: 0.000000
  Table: PHOTO  Alias: PHT
    Card: Original: 10000.000000  Rounded: 10000  Computed: 10000.000000  Non Adjusted: 10000.000000
  Scan IO  Cost (Disk) =   210.000000
  Scan CPU Cost (Disk) =   13571440.480000
  Total Scan IO  Cost  =   210.000000 (scan (Disk))
                         + 70.000000 (io filter eval) (= 0.007000 (per row) * 10000.000000 (#rows))
                       =   280.000000
  Total Scan CPU  Cost =   13571440.480000 (scan (Disk))
                         + 9138463.200000 (cpu filter eval) (= 913.846320 (per row) * 10000.000000 (#rows))
                       =   22709903.680000

Note that the “Total Scan IO Cost” reported at line 13 includes a component at line 12 labelled “(io filter eval)” – why, for the predicate “photo is not null”, would we do any special I/O when that predicate can be met in the basic table scan?

(Note: a predicate like “lob_column is null” means there is no lob locator in place, so no lob access need be done for that test. In fact the related, but very different, predicate “length(lob_column) = 0” – meaning the lob locator exists but the lob is “empty” – could also be satisfied during the tablescan without reference to the physical lob segment(s), because the length of the lob is included in the lob locator.)
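
(To put that note into code: both of the following predicates against the model photo table created above could, in principle, be evaluated during the tablescan without visiting the lob segment at all:)

select count(*) from photo pht where pht.photo is null;        -- no lob locator stored in the row
select count(*) from photo pht where length(pht.photo) = 0;    -- locator exists but the lob is "empty"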

Let’s assume that the optimizer is incorrectly assuming the run-time engine will have to access the lob in some way to determine whether the lob is null. The worst case scenario is that Oracle will start by accessing the LOBindex – so why don’t we check how big the LOBindex is? The first step I took was to check the object_id of the LOBindex and then do a tree dump (which showed 66 leaf blocks); then I checked the segment header block and dumped it, with the following results:


  Extent Control Header
  -----------------------------------------------------------------
  Extent Header:: spare1: 0      spare2: 0      #extents: 1      #blocks: 127
                  last map  0x00000000  #maps: 0      offset: 4128
      Highwater::  0x01400447  ext#: 0      blk#: 70     ext size: 127
  #blocks in seg. hdr's freelists: 4
  #blocks below: 70
  mapblk  0x00000000  offset: 0
                   Unlocked
     Map Header:: next  0x00000000  #extents: 1    obj#: 194295 flag: 0x40000000
  Extent Map
  -----------------------------------------------------------------
   0x01400401  length: 127

See the “Highwater::” information at line 6 – the allocated space in the segment is the first 70 blocks of the first extent. That’s (almost certainly) where the incremental cost of 70 (single block I/Os) comes from. (And I did a couple of big updates to the LOB, designed to expand the LOBindex without changing the segment size of the underlying table, to corroborate that hypothesis.)
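
(If you want to reproduce the check, here’s a sketch of the two steps I took; the object_id returned by the first query is the value to plug into the treedump event, and 194295 happens to be the obj# from my dump above:)

select  o.object_name, o.object_id
from    user_lobs l, user_objects o
where   l.table_name  = 'PHOTO'
and     l.column_name = 'PHOTO'
and     o.object_name = l.index_name
;

alter session set events 'immediate trace name treedump level 194295';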

This brings us to the question of why the cost of the tablescan drops when the subquery is included. Again we generate the 10053 trace and examine the details under the “SINGLE TABLE ACCESS PATH”:


SINGLE TABLE ACCESS PATH
  Single Table Cardinality Estimation for PHOTO[PHT]
  SPD: Return code in qosdDSDirSetup: NOCTX, estType = TABLE
  Table: PHOTO  Alias: PHT
    Card: Original: 10000.000000  Rounded: 10000  Computed: 10000.000000  Non Adjusted: 10000.000000
  Scan IO  Cost (Disk) =   210.000000
  Scan CPU Cost (Disk) =   13571440.480000
  Total Scan IO  Cost  =   210.000000 (scan (Disk))
                         + 3.500000 (io filter eval) (= 0.000350 (per row) * 10000.000000 (#rows))
                       =   213.500000
  Total Scan CPU  Cost =   13571440.480000 (scan (Disk))
                         + 656923.160000 (cpu filter eval) (= 65.692316 (per row) * 10000.000000 (#rows))
                       =   14228363.640000


In this case the “(io filter eval)” at line 10 is only 3.5 – and if you know your optimizer and how it handles subqueries you’re allowed to guess that could be one of Oracle’s standard guesses of 5% coming into play. (Again, growing the index seemed to corroborate this hypothesis.)

So here’s (possible) bug number 2 – the first bug being the cost added for accessing the LOBindex when there should be no need to access the index at all. The execution plan says we will get 10,000 rows from the table, and the filter predicate reports a cardinality reduced by just 1 on a column that has been declared with a uniqueness constraint, yet a fairly commonly used “guess factor” of 5% is used as an indicator of the number of times the lob predicate will be tested. The various bits of the arithmetic are not consistent with each other.

Summary notes:

If you have a tablescan with a predicate that references a lob column then the cost of the tablescan includes the cost of the lob access – and there are cases where lob access is not needed but still gets costed (this is bug number 1 – the predicates affected are “column is [not] null” and “length(column) = / != 0”).

If the lob data itself does not need to be accessed then the size of the lob index – which you can’t easily find – may have a significant impact on the cost of the tablescan.

If the query also includes predicates that result in the optimizer guessing about cardinality effects (1%, 5%, 0.25% are common guesses) then that guess may be used to scale the assumed (and potentially irrelevant) cost of the lob access. (There is scope for further experimentation in this area to examine the effects of “non-guess” predicates and the assumed order of application of predicates – e.g. are lob predicates costed as the last to be applied, and does the algorithm for costing match the execution order?)

As often happens it’s easy to see that there are oddities in the arithmetic that affect the cost of a query in ways that might make the optimizer pick a silly execution plan. Unfortunately it’s not easy to predict when you’re likely to see the effects of these oddities; the best we can do is remember that there is an anomaly with costing lob-based predicates and hope that we think of it when we see the optimizer picking a bad plan for reasons that initially are not obvious.

Show parameter

Fri, 2018-05-25 08:01

Just a quick little tip for Friday afternoon.

If you use the “show parameter” or “show spparameter” commands from SQL*Plus you’ve probably noticed that the parameter value may be too long for the defined output column, and even the parameter name may occasionally be too long. For example (from 12.2.0.1):

SQL> show spparameter optimizer

SID	 NAME			       TYPE	   VALUE
-------- ----------------------------- ----------- ----------------------------
*	 optimizer_adaptive_plans      boolean
*	 optimizer_adaptive_reporting_ boolean
	 only
*	 optimizer_adaptive_statistics boolean
*	 optimizer_capture_sql_plan_ba boolean
	 selines
*	 optimizer_dynamic_sampling    integer
*	 optimizer_features_enable     string
*	 optimizer_index_caching       integer
*	 optimizer_index_cost_adj      integer
*	 optimizer_inmemory_aware      boolean
*	 optimizer_mode                string
*	 optimizer_secure_view_merging boolean
*	 optimizer_use_invisible_index boolean
	 es
*	 optimizer_use_pending_statist boolean
	 ics
*	 optimizer_use_sql_plan_baseli boolean
	 nes

SQL> show parameter audit

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
audit_file_dest                      string      /u01/app/oracle/admin/orcl12c/
                                                 adump
audit_sys_operations                 boolean     TRUE
audit_syslog_level                   string
audit_trail                          string      NONE
unified_audit_sga_queue_size         integer     1048576

The column definitions for these displays are embedded in the SQL*Plus library code ($ORACLE_HOME/lib/libsqlplus.so), and the effects can be seen by spooling the output from the basic “column” command from SQL*Plus and searching for the word ‘param’ (ignoring case). The relevant column definitions are:


COLUMN   SID_COL_PLUS_SHOW_SPPARAM      ON      FORMAT    a8   HEADING SID
COLUMN   VALUE_COL_PLUS_SHOW_SPPARAM    ON      FORMAT   a28   HEADING VALUE
COLUMN   NAME_COL_PLUS_SHOW_SPPARAM     ON      FORMAT   a29   HEADING NAME

COLUMN   value_col_plus_show_param      ON      FORMAT   a30   HEADING VALUE
COLUMN   name_col_plus_show_param       ON      FORMAT   a36   HEADING NAME

If you want to change a few of the lengths (or even the column headings) you can simply add modified copies of these commands to the glogin script ($ORACLE_HOME/sqlplus/admin/glogin.sql) or to a login.sql script that’s referenced in your defined SQL path (and make sure you take a look at Franck Pachot’s blog to see how that has changed in 12.2).
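
For example, a few lines like the following (the widths are just a suggestion) appended to glogin.sql or login.sql will eliminate most of the wrapping:

column name_col_plus_show_param         format a45      heading NAME
column value_col_plus_show_param        format a60      heading VALUE

column sid_col_plus_show_spparam        format a8       heading SID
column name_col_plus_show_spparam       format a40      heading NAME
column value_col_plus_show_spparam      format a60      heading VALUE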

 

Missing Audit

Thu, 2018-05-24 04:27

Here’s a detail I discovered a long time ago – and rediscovered very recently: it’s possible to delete data from a table which is subject to audit without the delete being audited. I think the circumstances where it would matter are a little peculiar, and I’ve rarely seen systems that use the basic Oracle audit feature anyway, but this may solve a puzzle for someone, somewhere, someday.

The anomaly appears if you create a referential integrity constraint as “on delete cascade”. A delete from the parent table will automatically (pre-)delete matching rows from the child table but the delete on the child table will not be audited. Here’s a demonstration script – note that you will need to have set the parameter audit_trail to ‘DB’ to prove the point.
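
(A reminder before you start: audit_trail is a static parameter, so if it isn’t already set to ‘DB’ you’ll need something like the following, and an instance restart, before the demonstration will produce any audit rows:)

show parameter audit_trail

alter system set audit_trail = 'DB' scope = spfile;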


rem
rem     Script:         del_cascade_2.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Mar 2004
rem
rem     Last tested
rem             12.1.0.2
rem             11.2.0.4
rem

drop table t2 purge;
drop table t1 purge;

create table t1 (
        id              number(6),
        v1              varchar2(10),
        padding         varchar2(100),
        constraint t1_pk 
                primary key (id)
);


create table t2 (
        id_par          number(6),
        id_ch           number(6),
        v1              varchar2(10),
        padding         varchar2(100),
        constraint t2_pk 
                primary key (id_par,id_ch),
        constraint t2_fk_t1 
                foreign key (id_par) references t1 
                on delete cascade
);


insert into t1
select
        rownum,
        rownum,
        rpad('x',100)
from
        all_objects
where
        rownum <= 100 -- > comment to avoid wordpress format issue
;


insert into t2
select
        1+trunc((rownum-1)/5),
        rownum,
        rownum,
        rpad('x',100)
from
        all_objects
where
        rownum <= 500 -- > comment to avoid wordpress format issue
;

commit;

prompt  =================================
prompt  Parent/Child rowcounts for id = 1
prompt  =================================

select count(*) from t1 where id = 1;
select count(*) from t2 where id_par = 1;

column now new_value m_now
select to_char(sysdate,'dd-mon-yyyy hh24:mi:ss') now from dual;

audit delete on t2 by access; 
audit delete on t1 by access; 

prompt  =======================================================
prompt  If you allow the cascade (keep the t2 delete commented)
prompt  then the cascade deletion is not audited.
prompt  =======================================================

-- delete from t2 where id_par = 1;
delete from t1 where id = 1;

noaudit delete on t1; 
noaudit delete on t2; 

column obj_name format a32

select  action_name, obj_name 
from    user_audit_trail
where   timestamp >= to_date('&m_now','dd-mon-yyyy hh24:mi:ss')
;

The script has an optional delete from the child table (t2) just before the delete from the parent table (t1). When you run the script you should see that before the delete t1 reports one row while t2 reports 5 rows. After the delete(s) both tables will report zero rows.

If you leave the t2 delete commented out then the delete from t2 will have been the recursive delete due to the cascade and the number of rows returned from user_audit_trail will be one (the delete from t1). If you allow the explicit delete from t2 to take place then user_audit_trail will report two rows, one each for t1 and t2.

Sample output (with a little cosmetic editing) – when the delete from t2 is commented out:

=================================
Parent/Child rowcounts for id = 1
=================================

  COUNT(*)
----------
         1


  COUNT(*)
----------
         5

Audit succeeded.
Audit succeeded.

=======================================================
If you allow the cascade (keep the t2 delete commented)
then the cascade deletion is not audited.
=======================================================

1 row deleted.


Noaudit succeeded.
Noaudit succeeded.


  COUNT(*)
----------
         0


  COUNT(*)
----------
         0


ACTION_NAME                  OBJ_NAME
---------------------------- --------------------------------
DELETE                       T1

1 row selected.

As you can see, I’ve deleted just one row from one table (the t1 delete), but the final query against t2 shows that the child rows have also been deleted, but the only audit record reported is the one for the parent – despite the fact that if you enable sql_trace before the delete from t1 you will find the actual recursive statement ‘ delete from “TEST_USER”.”T2″ where “ID_PAR” = :1’ in the trace file.

The “recursive” nature of the delete in the trace file might be a bit of a clue – it is reported as being operated by SYS (lid = 0), not by the real end-user, and the parsing_user_id and parsing_schema_id in v$sql are both SYS (i.e. 0). Checking dba_audit_trail and the audit_file_dest for audited SYS operations, though, there was still no sign of any audit record for the child delete.
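
(If you want to see the recursive statement and its parsing schema for yourself, a query along the following lines, run shortly after the delete, should find the cursor in the library cache:)

select  sql_id, parsing_user_id, parsing_schema_id, sql_text
from    v$sql
where   sql_text like 'delete from "TEST_USER"."T2"%'
;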

 

Bitmap Join Indexes

Fri, 2018-05-18 08:29

I’ve been prompted by a recent question on the ODC database forum to revisit a note I wrote nearly five years ago about bitmap join indexes and their failure to help with join cardinalities. At the time I made a couple of unsupported claims and suggestions without supplying any justification or proof. Today’s article finally fills that gap.

The problem is this – I have a column which exhibits an extreme skew in its data distribution, but it’s in a “fact” table where most columns are meaningless ids and I have to join to a dimension table on its primary key to translate an id into a name. While there is a histogram on the column in the fact table the information in the histogram ceases to help if I do the join to the dimension and query by name, and the presence of a bitmap join index doesn’t make any difference. Let’s see this in action – some of the code follows a different pattern and format from my usual style because I started by copying and editing the example supplied in the database forum:


rem
rem     Script:         bitmap_join_4.sql
rem     Author:         Jonathan Lewis
rem     Dated:          May 2018
rem
rem     Last tested 
rem             12.2.0.1
rem             12.1.0.2
rem             11.2.0.4
rem
rem     Notes:
rem     Bitmap join indexes generate virtual columns on the fact table
rem     but you can't get stats on those columns - which means if the
rem     data is skewed you can have a histogram on the raw column but
rem     you don't have a histogram on the bitmap virtual column.
rem

drop table t1;
drop table dim_table;

create table dim_table (type_code number, object_type varchar2(10));

insert into dim_table values (1,'TABLE');
insert into dim_table values (2,'INDEX');
insert into dim_table values (3,'VIEW');
insert into dim_table values (4,'SYNONYM');
insert into dim_table values (5,'OTHER');

alter table dim_table add constraint dim_table_pk primary key (type_code) using index;

exec dbms_stats.gather_table_stats(user,'dim_table',cascade=>true);

create table t1 
nologging
as 
select 
        object_id, object_name, 
        decode(object_type, 'TABLE',1,'INDEX',2,'VIEW',3,'SYNONYM',4,5) type_code 
from 
        all_objects
where
        rownum <= 50000 -- > comment to bypass wordpress format issue
;

insert into t1 select * from t1;
insert into t1 select * from t1;
insert into t1 select * from t1;


create  bitmap index t1_b1 on t1(dt.object_type)
from    t1, dim_table dt
where   t1.type_code = dt.type_code
;

exec dbms_stats.gather_table_stats(null, 't1', cascade=>true, method_opt=>'for all columns size 254');


select
        dt.object_type, count(*)
from
        t1, dim_table  dt
where
        t1.type_code   = dt.type_code
group by
        dt.object_type
order by
        dt.object_type
;

I’ve started with a dimension table that lists 5 type codes and has a primary key on that type code; then I’ve used all_objects to generate a table of 400,000 rows using those type codes, and I’ve created a bitmap join index on the fact (t1) table based on the dimension (dim_table) table column. By choice the distribution of the five codes is massively skewed so after gathering stats (including histograms on all columns) for the table I’ve produced a simple aggregate report of the data showing how many rows there are of each type – by name. Here are the results – with the execution plan from 12.1.0.2 showing the benefit of the “group by placement” transformation:


OBJECT_TYP   COUNT(*)
---------- ----------
INDEX           12960
OTHER          150376
SYNONYM        177368
TABLE           12592
VIEW            46704

5 rows selected.

-------------------------------------------------------------------
| Id  | Operation             | Name      | Rows  | Bytes | Cost  |
-------------------------------------------------------------------
|   0 | SELECT STATEMENT      |           |       |       |   735 |
|   1 |  SORT GROUP BY        |           |     5 |   125 |   735 |
|*  2 |   HASH JOIN           |           |     5 |   125 |   720 |
|   3 |    VIEW               | VW_GBF_7  |     5 |    80 |   717 |
|   4 |     HASH GROUP BY     |           |     5 |    15 |   717 |
|   5 |      TABLE ACCESS FULL| T1        |   400K|  1171K|   315 |
|   6 |    TABLE ACCESS FULL  | DIM_TABLE |     5 |    45 |     2 |
-------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("ITEM_1"="DT"."TYPE_CODE")

Having established the basic result we can now examine some execution plans to see how well the optimizer is estimating cardinality for queries relating to that skewed distribution. I’m going to generate the execution plans for a simple select of all the rows of type ‘TABLE’ – first by code, then by name, showing the execution plan of each query:


explain plan for
select  t1.object_id
from
        t1
where
        t1.type_code = 1
;

select * from table(dbms_xplan.display(null,null,'outline'));


--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      | 12592 |    98K|   281   (8)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| T1   | 12592 |    98K|   281   (8)| 00:00:01 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("T1"."TYPE_CODE"=1)

Thanks to the histogram I generated on the type_code column the optimizer’s estimate of the number of rows is very accurate. So how well does the optimizer handle the statistics across the join:


prompt  =============
prompt  Unhinted join
prompt  =============

explain plan for
select  t1.object_id
from
        t1, dim_table  dt
where
        t1.type_code   = dt.type_code 
and     dt.object_type = 'TABLE'
;

select * from table(dbms_xplan.display(null,null,'outline'));

--------------------------------------------------------------------------------
| Id  | Operation          | Name      | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |           | 80000 |  1328K|   287  (10)| 00:00:01 |
|*  1 |  HASH JOIN         |           | 80000 |  1328K|   287  (10)| 00:00:01 |
|*  2 |   TABLE ACCESS FULL| DIM_TABLE |     1 |     9 |     2   (0)| 00:00:01 |
|   3 |   TABLE ACCESS FULL| T1        |   400K|  3125K|   277   (7)| 00:00:01 |
--------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - access("T1"."TYPE_CODE"="DT"."TYPE_CODE")
   2 - filter("DT"."OBJECT_TYPE"='TABLE')

Taking the default execution path the optimizer’s estimate of rows identified by type name is 80,000 – which is one fifth of the total number of rows. Oracle knows that the type_code is skewed in t1, but at compile time doesn’t have any idea which type_code corresponds to type ‘TABLE’, so it’s basically using the number of distinct values to dictate the estimate.

We could try hinting the query to make sure it uses the bitmap join index – just in case this somehow helps the optimizer (and we’ll see in a moment why we might have this hope, and why it is forlorn):


prompt  ===================
prompt  Hinted index access
prompt  ===================

explain plan for
select 
        /*+ index(t1 t1_b1) */
        t1.object_id
from
        t1, dim_table dt
where
        t1.type_code   = dt.type_code 
and     dt.object_type = 'TABLE'
;

select * from table(dbms_xplan.display(null,null,'outline'));

---------------------------------------------------------------------------------------------
| Id  | Operation                           | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |       | 80000 |   625K|   687   (1)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID BATCHED| T1    | 80000 |   625K|   687   (1)| 00:00:01 |
|   2 |   BITMAP CONVERSION TO ROWIDS       |       |       |       |            |          |
|*  3 |    BITMAP INDEX SINGLE VALUE        | T1_B1 |       |       |            |          |
---------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - access("T1"."SYS_NC00004$"='TABLE')

The plan tells us that the optimizer now realises that it doesn’t need to reference the dimension table at all – all the information it needs is in the t1 table and its bitmap join index – but it still comes up with an estimate of 80,000 for the number of rows. The predicate section tells us what to do next – it identifies a system-generated column, which is the virtual column underlying the bitmap join index: let’s see what the stats on that column look like:


select
        column_name, histogram, num_buckets, num_distinct, num_nulls, sample_size
from
        user_tab_cols
where
        table_name = 'T1'
order by
        column_id
;


COLUMN_NAME          HISTOGRAM       NUM_BUCKETS NUM_DISTINCT  NUM_NULLS SAMPLE_SIZE
-------------------- --------------- ----------- ------------ ---------- -----------
OBJECT_ID            HYBRID                  254        50388          0        5559
OBJECT_NAME          HYBRID                  254        29224          0        5560
TYPE_CODE            FREQUENCY                 5            5          0      400000
SYS_NC00004$         NONE

4 rows selected.

There are no stats on the virtual column – Oracle won’t try to collect any and, even if you write some in (using dbms_stats.set_column_stats), it won’t use them for this query. The optimizer seems to be coded to use the number of distinct keys from the index in this case.
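For reference, here’s a sketch of the sort of hand-crafted call I mean – writing some plausible figures onto the system-generated column reported in the plan above (and, as I’ve said, the optimizer still ignores them for this query):

begin
        dbms_stats.set_column_stats(
                ownname  => user,
                tabname  => 'T1',
                colname  => 'SYS_NC00004$',
                distcnt  => 5,
                density  => 1/5,
                nullcnt  => 0
        );
end;
/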

Workaround

It’s very disappointing that there seems to be no official way to work around this problem – but Oracle has its own (undocumented) solution to the problem that comes into play with OLAP – the hint /*+ precompute_subquery() */. It’s possible to tell the optimizer to execute certain types of subquery as the first stage of optimising a query, then change the query to take advantage of the resulting data:


explain plan for
select
        /*+
                qb_name(main)
                precompute_subquery(@subq)
        */
        t1.object_id
from
        t1
where
        t1.type_code in (
                select
                        /*+
                                qb_name(subq)
                        */
                        dt.type_code
                from    dim_table dt
                where   dt.object_type = 'TABLE'
        )
;

select * from table(dbms_xplan.display(null,null,'outline'));

--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      | 12592 |    98K|   281   (8)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| T1   | 12592 |    98K|   281   (8)| 00:00:01 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("T1"."TYPE_CODE"=1)

Oracle hasn’t optimized the query I wrote; instead it has executed the subquery, derived a (very short, in this case) list of values, then optimized and executed the query I first wrote using the constant(s) returned by the subquery. And you can’t see the original subquery in the execution plan. Of course, with the literal values in place, the cardinality estimate is now correct.

It’s a great pity that this hint is undocumented – and, being undocumented and unsupported, it’s one that you shouldn’t use in production.
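A supported (if clumsy) alternative is to do the precomputation yourself – resolve the dimension code first, then feed it back into the main query as a literal. A sketch using SQL*Plus substitution variables:

column type_code new_value m_type_code

select  type_code
from    dim_table
where   object_type = 'TABLE'
;

select  t1.object_id
from    t1
where   t1.type_code = &m_type_code
;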

 

Skip Scan 3

Fri, 2018-05-11 08:26

If you’ve come across any references to the “index skip scan” operation for execution plans you’ve probably got some idea that this can appear when the number of distinct values for the first column (or columns – since you can skip multiple columns) is small. If so, what do you make of this demonstration:


rem
rem     Script:         skip_scan_cunning.sql
rem     Author:         Jonathan Lewis
rem     Dated:          May 2018
rem

begin
        dbms_stats.set_system_stats('MBRC',16);
        dbms_stats.set_system_stats('MREADTIM',10);
        dbms_stats.set_system_stats('SREADTIM',5);
        dbms_stats.set_system_stats('CPUSPEED',1000);
end;
/

create table t1
nologging
as
with generator as (
        select 
                rownum id
        from dual 
        connect by 
                level <= 1e4 -- > comment to avoid WordPress format issue
)
select
        rownum                          id,
        rownum                          id1,
        rownum                          id2,
        lpad(rownum,10,'0')             v1,
        lpad('x',150,'x')               padding
/*
        cast(rownum as number(8,0))                     id,
        cast(lpad(rownum,10,'0') as varchar2(10))       v1,
        cast(lpad('x',100,'x') as varchar2(100))        padding
*/
from
        generator       v1,
        generator       v2
where
        rownum <= 1e6 -- > comment to avoid WordPress format issue
;

create index t1_i1 on t1(id1, id2);

begin
        dbms_stats.gather_table_stats(
                ownname     => user,
                tabname     => 'T1',
                method_opt  => 'for all columns size 1'
        );
end;
/

For repeatability I’ve set some system statistics, but if you’ve left the system stats to default you should see the same effect. All I’ve done is create a table and an index on that table. The way I’ve defined the id1 and id2 columns means they could individually support unique constraints and the index clearly has 1 million distinct values for id1 in the million index entries. So what execution plan do you think I’m likely to get from the following simple query:


set serveroutput off
alter session set statistics_level = all;

prompt  =======
prompt  Default
prompt  =======

select  id 
from    t1
where   id2 = 999
;

select * from table(dbms_xplan.display_cursor(null,null,'allstats last cost'));

You’re probably not expecting an index skip scan to appear, but given the title of this posting you may have a suspicion that it will; so here’s the plan I got running this test on 12.2.0.1:


SQL_ID  8r5xghdx1m3hn, child number 0
-------------------------------------
select id from t1 where id2 = 999

Plan hash value: 400488565

-----------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                           | Name  | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers | Reads  |
-----------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |       |      1 |        |  2929 (100)|      1 |00:00:00.17 |    2932 |      5 |
|   1 |  TABLE ACCESS BY INDEX ROWID BATCHED| T1    |      1 |      1 |  2929   (1)|      1 |00:00:00.17 |    2932 |      5 |
|*  2 |   INDEX SKIP SCAN                   | T1_I1 |      1 |      1 |  2928   (1)|      1 |00:00:00.17 |    2931 |      4 |
-----------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("ID2"=999)
       filter("ID2"=999)


So an index skip scan doesn’t require a small number of distinct values for the first column of the index – unless you’re running a version older than 11.2.0.2, which is where the relevant code change appeared (a change that can be disabled by setting fix_control 9195582 off).
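For reference, a sketch of the session-level syntax for disabling that fix control:

alter session set "_fix_control" = '9195582:OFF';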

When the optimizer doesn’t do what you expect it’s always worth hinting the code to follow the plan you were expecting – so here’s the effect of hinting a full tablescan (which happened to do direct path reads):

SQL_ID  bxqwhsjwqfm7q, child number 0
-------------------------------------
select  /*+ full(t1) */  id from t1 where id2 = 999

Plan hash value: 3617692013

----------------------------------------------------------------------------------------------------------
| Id  | Operation         | Name | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers | Reads  |
----------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |      1 |        |  3317 (100)|      1 |00:00:00.12 |   25652 |  25635 |
|*  1 |  TABLE ACCESS FULL| T1   |      1 |      1 |  3317   (3)|      1 |00:00:00.12 |   25652 |  25635 |
----------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter("ID2"=999)

Note that the tablescan cost is actually higher than the cost of the indexed access path.  For reference you need to know that the blocks statistic for the table was 25,842 while the number of index leaf blocks was 2,922. The latter figure (combined with a couple of other details regarding the clustering_factor and undeclared uniqueness of the index) explains why the cost of the skip scan was only 2,928: the change that appeared in 11.2.0.2 limited the I/O cost of an index skip scan to the total number of leaf blocks in the index.  The tablescan cost (with my system stats) came, essentially, from dividing the table block count by 16 (to get the number of multi-block reads) and then doubling the result (because the multiblock read time is twice the single block read time).
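To put some numbers to that description – a rough sanity check rather than the exact formula:

-- 25,842 blocks / 16 (MBRC)           ~= 1,615 multi-block reads
-- 1,615 reads * 2 (MREADTIM/SREADTIM)  = 3,230

select round(25842 / 16) * 2  io_cost  from dual;

That’s in the right ballpark for the reported 3,317 once you allow for the CPU component of the cost.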

As a quick demo of how older versions of Oracle would behave after setting “_fix_control”=’9195582:OFF’:


SQL_ID  bn0p9072w9vfc, child number 1
-------------------------------------
select  /*+ index_ss(t1) */  id from t1 where id2 = 999

Plan hash value: 400488565

--------------------------------------------------------------------------------------------------------------------
| Id  | Operation                           | Name  | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |
--------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |       |      1 |        |  1001K(100)|      1 |00:00:00.13 |    2932 |
|   1 |  TABLE ACCESS BY INDEX ROWID BATCHED| T1    |      1 |      1 |  1001K  (1)|      1 |00:00:00.13 |    2932 |
|*  2 |   INDEX SKIP SCAN                   | T1_I1 |      1 |      1 |  1001K  (1)|      1 |00:00:00.13 |    2931 |
--------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("ID2"=999)
       filter("ID2"=999)

The cost of the skip scan is now a little over 1,000,000 – corresponding (approximately) to the 1 million index probes that will have to take place. You’ll notice that the number of buffer visits recorded is 2931 for the index operation, though: this is the result of the run-time optimisation that keeps buffers pinned very aggressively for skip scan – you might expect to see a huge number of visits recorded as “buffer is pinned count”, but for some reason that doesn’t happen. The cost is essentially Oracle calculating (with pinned root and branch) the cost of “id1 = {constant} and id2 = 999” and multiplying by ndv(id1).

Footnote:

Ideally, of course, the optimizer ought to work out that an index fast full scan followed by a table access would have a lower cost (using multi-block reads rather than walking the index in leaf block order one block at a time – which is what this particular skip scan will have to do) – but that’s not (yet) an acceptable execution plan, though it does now appear as a plan for deleting data.

tl;dr

If you have an index that is very much smaller than the table you may find examples where the optimizer does what appears to be an insanely stupid index skip scan when you were expecting a tablescan or, possibly, some other less efficient index to be used. There is a rationale for this, but such a plan may be much more CPU and read intensive than it really ought to be.

 

20 Indexes

Tue, 2018-05-08 06:53

If your system has to do a lot of distributed queries there’s a limit on indexes that might affect performance: when deriving an execution plan for a distributed query the optimizer will consider a maximum of twenty indexes on each remote table. If you have any tables with a ridiculous number of indexes (various 3rd party accounting and CRM systems spring to mind) and if you drop and recreate indexes on those tables in the wrong order then execution plans may change for the simple reason that the optimizer is considering a different subset of the available indexes.

Although the limit is stated in the manuals (a few lines into a section on managing statement transparency) there is no indication about which 20 indexes the optimizer is likely to choose – a couple of experiments, with tracing enabled and shared pool flushes, gives a fairly strong indication that it’s the last 20 indexes created (or, to be more explicit, the ones with the 20 highest object_id values).

Here’s a little code to help demonstrate the point – first just the table and index creation


rem
rem	Script:		indexes_20.sql
rem	Author:		Jonathan Lewis
rem	Dated:		Apr 2008
rem
rem	Last tested 
rem		12.2.0.1
rem

create table t1
as
with generator as (
	select	--+ materialize
		rownum 	id
	from	all_objects 
	where	rownum <= 3000 -- > comment to avoid WordPress format issue
)
select
	mod(rownum,trunc(5000/1))	n01,
	mod(rownum,trunc(5000/2))	n02,
	mod(rownum,trunc(5000/3))	n03,
	mod(rownum,trunc(5000/4))	n04,
	mod(rownum,trunc(5000/5))	n05,
	mod(rownum,trunc(5000/6))	n06,
	mod(rownum,trunc(5000/7))	n07,
	mod(rownum,trunc(5000/8))	n08,
	mod(rownum,trunc(5000/9))	n09,
	mod(rownum,trunc(5000/10))	n10,
	mod(rownum,trunc(5000/11))	n11,
	mod(rownum,trunc(5000/12))	n12,
	mod(rownum,trunc(5000/13))	n13,
	mod(rownum,trunc(5000/14))	n14,
	mod(rownum,trunc(5000/15))	n15,
	mod(rownum,trunc(5000/16))	n16,
	mod(rownum,trunc(5000/17))	n17,
	mod(rownum,trunc(5000/18))	n18,
	mod(rownum,trunc(5000/19))	n19,
	mod(rownum,trunc(5000/20))	n20,
	mod(rownum,trunc(5000/21))	n21,
	mod(rownum,trunc(5000/22))	n22,
	mod(rownum,trunc(5000/23))	n23,
	mod(rownum,trunc(5000/24))	n24,
	rownum				id,
	rpad('x',40)			padding
from
	generator	v1,
	generator	v2
where
	rownum <= 1e5 -- > comment to avoid WordPress format issue
;


alter table t1 add constraint t1_pk primary key(id);

create table t2
as
with generator as (
	select	--+ materialize
		rownum 	id
	from	all_objects 
	where	rownum <= 3000 -- > comment to avoid WordPress format issue
)
select
	mod(rownum,trunc(5000/1))	n01,
	mod(rownum,trunc(5000/2))	n02,
	mod(rownum,trunc(5000/3))	n03,
	mod(rownum,trunc(5000/4))	n04,
	mod(rownum,trunc(5000/5))	n05,
	mod(rownum,trunc(5000/6))	n06,
	mod(rownum,trunc(5000/7))	n07,
	mod(rownum,trunc(5000/8))	n08,
	mod(rownum,trunc(5000/9))	n09,
	mod(rownum,trunc(5000/10))	n10,
	mod(rownum,trunc(5000/11))	n11,
	mod(rownum,trunc(5000/12))	n12,
	mod(rownum,trunc(5000/13))	n13,
	mod(rownum,trunc(5000/14))	n14,
	mod(rownum,trunc(5000/15))	n15,
	mod(rownum,trunc(5000/16))	n16,
	mod(rownum,trunc(5000/17))	n17,
	mod(rownum,trunc(5000/18))	n18,
	mod(rownum,trunc(5000/19))	n19,
	mod(rownum,trunc(5000/20))	n20,
	mod(rownum,trunc(5000/21))	n21,
	mod(rownum,trunc(5000/22))	n22,
	mod(rownum,trunc(5000/23))	n23,
	mod(rownum,trunc(5000/24))	n24,
	rownum				id,
	rpad('x',40)			padding
from
	generator	v1,
	generator	v2
where
	rownum <= 1e5 -- > comment to avoid WordPress format issue
;

create index t2_a21 on t2(n21);
create index t2_a22 on t2(n22);
create index t2_a23 on t2(n23);
create index t2_a24 on t2(n24);

create index t2_z01 on t2(n01);
create index t2_z02 on t2(n02);
create index t2_z03 on t2(n03);
create index t2_z04 on t2(n04);
create index t2_z05 on t2(n05);
create index t2_z06 on t2(n06);
create index t2_z07 on t2(n07);
create index t2_z08 on t2(n08);
create index t2_z09 on t2(n09);
create index t2_z10 on t2(n10);

create index t2_i11 on t2(n11);
create index t2_i12 on t2(n12);
create index t2_i13 on t2(n13);
create index t2_i14 on t2(n14);
create index t2_i15 on t2(n15);
create index t2_i16 on t2(n16);
create index t2_i17 on t2(n17);
create index t2_i18 on t2(n18);
create index t2_i19 on t2(n19);
create index t2_i20 on t2(n20);

alter index t2_a21 rebuild;
alter index t2_a22 rebuild;
alter index t2_a23 rebuild;
alter index t2_a24 rebuild;
 

begin
        dbms_stats.gather_table_stats(
                ownname 	 => user,
		tabname		 =>'t1',
		method_opt 	 => 'for all columns size 1',
		cascade		 => true
	);

	dbms_stats.gather_table_stats(
		ownname		 => user,
		tabname		 =>'t2',
		method_opt 	 => 'for all columns size 1',
		cascade		 => true
	);

end;
/

I’m going to use a loopback database link to join “local” table t1 to “remote” table t2 on all 24 of the nXX columns. I’ve created indexes on all the columns, messing around with index names, order of creation, and rebuilding, to cover possible selection criteria such as alphabetical order, ordering by data_object_id (rather than object_id), even ordering by name of indexed columns(!).

Now the code to run a test:


define m_target=orcl@loopback

alter session set events '10053 trace name context forever';
set serveroutput off

select
	t1.id,
	t2.id,
	t2.padding
from
	t1			t1,
	t2@&m_target		t2
where
	t1.id = 99
and	t2.n01 = t1.n01
and	t2.n02 = t1.n02
and	t2.n03 = t1.n03
and	t2.n04 = t1.n04
and	t2.n05 = t1.n05
and	t2.n06 = t1.n06
and	t2.n07 = t1.n07
and	t2.n08 = t1.n08
and	t2.n09 = t1.n09
and	t2.n10 = t1.n10
/*			*/
and	t2.n11 = t1.n11
and	t2.n12 = t1.n12
and	t2.n13 = t1.n13
and	t2.n14 = t1.n14
and	t2.n15 = t1.n15
and	t2.n16 = t1.n16
and	t2.n17 = t1.n17
and	t2.n18 = t1.n18
and	t2.n19 = t1.n19
and	t2.n20 = t1.n20
/*			*/
and	t2.n21 = t1.n21
and	t2.n22 = t1.n22
and	t2.n23 = t1.n23
and	t2.n24 = t1.n24
;

select * from table(dbms_xplan.display_cursor(null,null,'outline'));

alter session set events '10053 trace name context off';

I’ve used a substitution variable for the name of the database link – it’s a convenience I have with all my distributed tests, a list of possible defines at the top of the script depending on which database I happen to be using at the time – then enabled the optimizer (10053) trace, set serveroutput off so that I can pull the execution plan from memory most easily, then executed the query.

Here’s the execution plan – including the Remote section and Outline.


-------------------------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     | Inst   |IN-OUT|
-------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |       |       |   270 (100)|          |        |      |
|   1 |  NESTED LOOPS      |      |     1 |   243 |   270   (6)| 00:00:01 |        |      |
|*  2 |   TABLE ACCESS FULL| T1   |     1 |   101 |   268   (6)| 00:00:01 |        |      |
|   3 |   REMOTE           | T2   |     1 |   142 |     2   (0)| 00:00:01 | ORCL@~ | R->S |
-------------------------------------------------------------------------------------------


Outline Data
-------------
  /*+
      BEGIN_OUTLINE_DATA
      IGNORE_OPTIM_EMBEDDED_HINTS
      OPTIMIZER_FEATURES_ENABLE('12.2.0.1')
      DB_VERSION('12.2.0.1')
      ALL_ROWS
      OUTLINE_LEAF(@"SEL$1")
      FULL(@"SEL$1" "T1"@"SEL$1")
      FULL(@"SEL$1" "T2"@"SEL$1")
      LEADING(@"SEL$1" "T1"@"SEL$1" "T2"@"SEL$1")
      USE_NL(@"SEL$1" "T2"@"SEL$1")
      END_OUTLINE_DATA
  */

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("T1"."ID"=99)

Remote SQL Information (identified by operation id):
----------------------------------------------------
   3 - SELECT "N01","N02","N03","N04","N05","N06","N07","N08","N09","N10","N11","N1
       2","N13","N14","N15","N16","N17","N18","N19","N20","N21","N22","N23","N24","ID","PA
       DDING" FROM "T2" "T2" WHERE "N01"=:1 AND "N02"=:2 AND "N03"=:3 AND "N04"=:4 AND
       "N05"=:5 AND "N06"=:6 AND "N07"=:7 AND "N08"=:8 AND "N09"=:9 AND "N10"=:10 AND
       "N11"=:11 AND "N12"=:12 AND "N13"=:13 AND "N14"=:14 AND "N15"=:15 AND "N16"=:16
       AND "N17"=:17 AND "N18"=:18 AND "N19"=:19 AND "N20"=:20 AND "N21"=:21 AND
       "N22"=:22 AND "N23"=:23 AND "N24"=:24 (accessing 'ORCL@LOOPBACK' )

There’s a little oddity with the plan – specifically in the Outline: there’s a “full(t2)” hint which is clearly inappropriate and isn’t consistent with the cost of 2 for the REMOTE operation reported in the body of the plan. Fortunately the SQL forwarded to the “remote” database doesn’t include this hint and (you’ll have to take my word for it) used an indexed access path into the table.

Where, though, is the indication that Oracle considered only 20 indexes? It’s in the 10053 trace file under the “Base Statistical Information” section in the subsection headed “Index Stats”:


Index Stats::
  Index: 0  Col#: 20    (NOT ANALYZED)
  LVLS: 1  #LB: 204  #DK: 250  LB/K: 1.00  DB/K: 400.00  CLUF: 2002.00  NRW: 0.00 SSZ: 0.00 LGR: 0.00 CBK: 0.00 GQL: 0.00 CHR: 0.00 KQDFLG: 0 BSZ: 0
  KKEISFLG: 0 
  Index: 0  Col#: 19    (NOT ANALYZED)
  LVLS: 1  #LB: 204  #DK: 263  LB/K: 1.00  DB/K: 380.00  CLUF: 2002.00  NRW: 0.00 SSZ: 0.00 LGR: 0.00 CBK: 0.00 GQL: 0.00 CHR: 0.00 KQDFLG: 0 BSZ: 0
  KKEISFLG: 0 
  Index: 0  Col#: 18    (NOT ANALYZED)
  LVLS: 1  #LB: 205  #DK: 277  LB/K: 1.00  DB/K: 361.00  CLUF: 2002.00  NRW: 0.00 SSZ: 0.00 LGR: 0.00 CBK: 0.00 GQL: 0.00 CHR: 0.00 KQDFLG: 0 BSZ: 0
  KKEISFLG: 0 
  Index: 0  Col#: 17    (NOT ANALYZED)
  LVLS: 1  #LB: 205  #DK: 294  LB/K: 1.00  DB/K: 340.00  CLUF: 2002.00  NRW: 0.00 SSZ: 0.00 LGR: 0.00 CBK: 0.00 GQL: 0.00 CHR: 0.00 KQDFLG: 0 BSZ: 0
  KKEISFLG: 0 
  Index: 0  Col#: 16    (NOT ANALYZED)
  LVLS: 1  #LB: 205  #DK: 312  LB/K: 1.00  DB/K: 320.00  CLUF: 2002.00  NRW: 0.00 SSZ: 0.00 LGR: 0.00 CBK: 0.00 GQL: 0.00 CHR: 0.00 KQDFLG: 0 BSZ: 0
  KKEISFLG: 0 
  Index: 0  Col#: 15    (NOT ANALYZED)
  LVLS: 1  #LB: 205  #DK: 333  LB/K: 1.00  DB/K: 300.00  CLUF: 2002.00  NRW: 0.00 SSZ: 0.00 LGR: 0.00 CBK: 0.00 GQL: 0.00 CHR: 0.00 KQDFLG: 0 BSZ: 0
  KKEISFLG: 0 
  Index: 0  Col#: 14    (NOT ANALYZED)
  LVLS: 1  #LB: 206  #DK: 357  LB/K: 1.00  DB/K: 280.00  CLUF: 2002.00  NRW: 0.00 SSZ: 0.00 LGR: 0.00 CBK: 0.00 GQL: 0.00 CHR: 0.00 KQDFLG: 0 BSZ: 0
  KKEISFLG: 0 
  Index: 0  Col#: 13    (NOT ANALYZED)
  LVLS: 1  #LB: 206  #DK: 384  LB/K: 1.00  DB/K: 260.00  CLUF: 2002.00  NRW: 0.00 SSZ: 0.00 LGR: 0.00 CBK: 0.00 GQL: 0.00 CHR: 0.00 KQDFLG: 0 BSZ: 0
  KKEISFLG: 0 
  Index: 0  Col#: 12    (NOT ANALYZED)
  LVLS: 1  #LB: 206  #DK: 416  LB/K: 1.00  DB/K: 240.00  CLUF: 2002.00  NRW: 0.00 SSZ: 0.00 LGR: 0.00 CBK: 0.00 GQL: 0.00 CHR: 0.00 KQDFLG: 0 BSZ: 0
  KKEISFLG: 0 
  Index: 0  Col#: 11    (NOT ANALYZED)
  LVLS: 1  #LB: 206  #DK: 454  LB/K: 1.00  DB/K: 220.00  CLUF: 2002.00  NRW: 0.00 SSZ: 0.00 LGR: 0.00 CBK: 0.00 GQL: 0.00 CHR: 0.00 KQDFLG: 0 BSZ: 0
  KKEISFLG: 0 
  Index: 0  Col#: 10    (NOT ANALYZED)
  LVLS: 1  #LB: 207  #DK: 500  LB/K: 1.00  DB/K: 200.00  CLUF: 2002.00  NRW: 0.00 SSZ: 0.00 LGR: 0.00 CBK: 0.00 GQL: 0.00 CHR: 0.00 KQDFLG: 0 BSZ: 0
  KKEISFLG: 0 
  Index: 0  Col#: 9    (NOT ANALYZED)
  LVLS: 1  #LB: 207  #DK: 555  LB/K: 1.00  DB/K: 180.00  CLUF: 2002.00  NRW: 0.00 SSZ: 0.00 LGR: 0.00 CBK: 0.00 GQL: 0.00 CHR: 0.00 KQDFLG: 0 BSZ: 0
  KKEISFLG: 0 
  Index: 0  Col#: 8    (NOT ANALYZED)
  LVLS: 1  #LB: 207  #DK: 625  LB/K: 1.00  DB/K: 160.00  CLUF: 2002.00  NRW: 0.00 SSZ: 0.00 LGR: 0.00 CBK: 0.00 GQL: 0.00 CHR: 0.00 KQDFLG: 0 BSZ: 0
  KKEISFLG: 0 
  Index: 0  Col#: 7    (NOT ANALYZED)
  LVLS: 1  #LB: 208  #DK: 714  LB/K: 1.00  DB/K: 140.00  CLUF: 2002.00  NRW: 0.00 SSZ: 0.00 LGR: 0.00 CBK: 0.00 GQL: 0.00 CHR: 0.00 KQDFLG: 0 BSZ: 0
  KKEISFLG: 0 
  Index: 0  Col#: 6    (NOT ANALYZED)
  LVLS: 1  #LB: 208  #DK: 833  LB/K: 1.00  DB/K: 120.00  CLUF: 2002.00  NRW: 0.00 SSZ: 0.00 LGR: 0.00 CBK: 0.00 GQL: 0.00 CHR: 0.00 KQDFLG: 0 BSZ: 0
  KKEISFLG: 0 
  Index: 0  Col#: 5    (NOT ANALYZED)
  LVLS: 1  #LB: 208  #DK: 1000  LB/K: 1.00  DB/K: 100.00  CLUF: 2002.00  NRW: 0.00 SSZ: 0.00 LGR: 0.00 CBK: 0.00 GQL: 0.00 CHR: 0.00 KQDFLG: 0 BSZ: 0
  KKEISFLG: 0 
  Index: 0  Col#: 4    (NOT ANALYZED)
  LVLS: 1  #LB: 208  #DK: 1250  LB/K: 1.00  DB/K: 80.00  CLUF: 2002.00  NRW: 0.00 SSZ: 0.00 LGR: 0.00 CBK: 0.00 GQL: 0.00 CHR: 0.00 KQDFLG: 0 BSZ: 0
  KKEISFLG: 0 
  Index: 0  Col#: 3    (NOT ANALYZED)
  LVLS: 1  #LB: 209  #DK: 1666  LB/K: 1.00  DB/K: 60.00  CLUF: 2002.00  NRW: 0.00 SSZ: 0.00 LGR: 0.00 CBK: 0.00 GQL: 0.00 CHR: 0.00 KQDFLG: 0 BSZ: 0
  KKEISFLG: 0 
  Index: 0  Col#: 2    (NOT ANALYZED)
  LVLS: 1  #LB: 209  #DK: 2500  LB/K: 1.00  DB/K: 40.00  CLUF: 2002.00  NRW: 0.00 SSZ: 0.00 LGR: 0.00 CBK: 0.00 GQL: 0.00 CHR: 0.00 KQDFLG: 0 BSZ: 0
  KKEISFLG: 0 
  Index: 0  Col#: 1    (NOT ANALYZED)
  LVLS: 1  #LB: 209  #DK: 5000  LB/K: 1.00  DB/K: 20.00  CLUF: 2002.00  NRW: 0.00 SSZ: 0.00 LGR: 0.00 CBK: 0.00 GQL: 0.00 CHR: 0.00 KQDFLG: 0 BSZ: 0
  KKEISFLG: 0 

We have 20 indexes listed, and while they’re all called “Index 0” (and reported as “Not Analyzed”) we can see from their column definitions that they are (in reverse order) the indexes on columns n01 through to n20 – i.e. the last 20 indexes created. The optimizer has created its plan based only on its knowledge of these indexes.

We might ask whether this matters or not – after all when the remote SQL gets to the remote database the remote optimizer is going to (re-)optimize it anyway and do the best it can with it, so at run-time Oracle could still end up using remote indexes that the local optimizer didn’t know about. So let’s get nasty and give the local optimizer a problem:


create index t2_id on t2(id);

select
        t1.id,
        t2.id,
        t2.padding
from
        t1                      t1,
        t2@&m_target            t2
where
        t1.id = 99
and     t2.n01 = t1.n01
;

I’ve created one more index on t2, which means the local optimizer is going to “forget” about the index that was the previous 20th index on the most recently created list for t2. That’s the index on (n01), which would have been a very good index for this query. If this query were to run locally the optimizer would do a nested loop from t1 to t2 using the index on (n01) – but the optimizer no longer knows about that index, so we get the following plan:


-------------------------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     | Inst   |IN-OUT|
-------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |       |       |   538 (100)|          |        |      |
|*  1 |  HASH JOIN         |      |    20 |  1140 |   538   (7)| 00:00:01 |        |      |
|*  2 |   TABLE ACCESS FULL| T1   |     1 |     9 |   268   (6)| 00:00:01 |        |      |
|   3 |   REMOTE           | T2   |   100K|  4687K|   268   (6)| 00:00:01 | ORCL@~ | R->S |
-------------------------------------------------------------------------------------------

Outline Data
-------------
  /*+
      BEGIN_OUTLINE_DATA
      IGNORE_OPTIM_EMBEDDED_HINTS
      OPTIMIZER_FEATURES_ENABLE('12.2.0.1')
      DB_VERSION('12.2.0.1')
      ALL_ROWS
      OUTLINE_LEAF(@"SEL$1")
      FULL(@"SEL$1" "T1"@"SEL$1")
      FULL(@"SEL$1" "T2"@"SEL$1")
      LEADING(@"SEL$1" "T1"@"SEL$1" "T2"@"SEL$1")
      USE_HASH(@"SEL$1" "T2"@"SEL$1")
      END_OUTLINE_DATA
  */

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - access("T2"."N01"="T1"."N01")
   2 - filter("T1"."ID"=99)

Remote SQL Information (identified by operation id):
----------------------------------------------------
   3 - SELECT "N01","ID","PADDING" FROM "T2" "T2" (accessing 'ORCL@LOOPBACK' )

Oracle is going to do a hash join and apply the join predicate late. Although the remote optimizer can sometimes rescue us from a mistake made by the local optimizer and use indexes that the local optimizer doesn’t know about, there are times when the remote SQL generated by the local optimizer is so rigidly associated with the expected plan that there’s no way the remote optimizer can work around the assumptions made by the local optimizer.

So when you create (or drop and recreate) an index, it’s just possible that a distributed plan will have to change because the local optimizer is no longer aware of an index that exists at the remote site.

tl;dr

Be very cautious about dropping and recreating indexes if the table in question

  1. has more than 20 indexes
  2. and is used at the remote end of a distributed execution plan

The optimizer will consider only 20 of the indexes on the table, choosing the ones with the highest object_ids. If you drop and recreate an index then it gets a new (highest) object_id and a plan may change because the index that Oracle was previously using is no longer in the top 20.
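One (hypothetical) damage-limitation step follows from this: if a critical distributed plan has changed after index maintenance you could drop and recreate the index that the optimizer ought to be seeing, so that it regains one of the 20 highest object_id values – e.g.:

drop index t2_z01;
create index t2_z01 on t2(n01);

Obviously you’d have to think carefully about the side effects of dropping an index on a production system before trying this.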

FBIs don’t exist

Mon, 2018-05-07 03:24

This is a reprint (of a reprint) of a note I wrote more than 11 years ago on my old website. I’ve decided to republish it on the blog simply because one day I’ll probably decide to stop paying for the website, given how old all the material is, and because this article makes an important point about the need (at least some of the time) for accuracy in the words you use to describe things.

—————————————————————————-

There’s no such thing as a function-based index.

Well, okay, that’s what the manuals call them but it would be so much better if they were called “indexes with virtual columns” – because that’s what they are and that’s a name that would eliminate confusion.

To demonstrate what I mean, ask yourself this question: “Can the rule based optimizer use a function-based index ?”. The answer is ‘Yes’, as the following code fragment demonstrates:


rem
rem     Script:         fbi_rule.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Jan 2005
rem

create table t1 as
select
         rownum                         id,
         dbms_random.value(0,500)       n1,
         rpad('x',10)                   small_vc,
         rpad('x',100)                  padding
from
         all_objects
where
         rownum <= 3000
;
 
create index t1_i1 on t1(id, trunc(n1));
 
set autotrace traceonly explain
 
select
         /*+ rule */
         small_vc
from
         t1
where    id = 55
and      trunc(n1) between 1 and 10
;


set autotrace off
 
Execution Plan
----------------------------------------------------------
0      SELECT STATEMENT Optimizer=HINT: RULE
1   0    TABLE ACCESS (BY INDEX ROWID) OF 'T1'
2   1      INDEX (RANGE SCAN) OF 'T1_I1' (NON-UNIQUE)

Last time I asked an audience if the rule-based optimizer (RBO) could use a function-based index, most of them thought the answer was ‘No’. Even the Oracle manuals make the same mistake – for example in the 10g Release 2 Application Developers Guide p5-8, one of the restrictions on function-based indexes is “Only cost based optimization can use function-based indexes”.

If I had asked the audience “Can the rule-based optimizer use an index which includes a virtual column ?” I wonder how many of them would have paused for thought, then asked themselves what would happen if the index started with “ordinary” columns and the “function-based” bit was later on in the index.

The manuals should, of course, state: “The rule-based optimizer cannot take advantage of any virtual columns in an index, or of any columns that follow the first virtual column”. Given a correct name and a correct description of functionality you can then conclude that if the first column is a virtual column the rule-based optimizer won’t use the index.

I’m not suggesting, by the way, that you should be using the rule-based optimizer, or even that this specific example of functionality is going to be particularly beneficial to many people (RBO still uses the “trunc(n1)” as a filter predicate after reaching the table rather than as an access predicate – or even filter predicate – on the index); but it does demonstrate how easy it is for the wrong name, or terminology, to distract people from the truth.

And here’s another thought for Oracle Corporation. Since it seems to be easy to implement virtual columns (there is a hidden entry for each such column in the data dictionary, and the text of the function defining the column appears as the default value), why should they exist only in indexes? Why can’t we have virtual columns which aren’t indexed, so that we can collect statistics on a virtual column and give the optimizer some information about the data distribution of some commonly used expression that we don’t actually want to build an index on?

(Update Jan 2007 – this is likely to happen in 11g according to ‘sneak preview’ presentations made by Oracle at OW2006.)
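As a sketch of what the feature eventually looked like in 11g – a non-indexed virtual column (the column name here is just illustrative) that you can gather stats on:

alter table t1 add (trunc_n1 generated always as (trunc(n1)) virtual);

begin
        dbms_stats.gather_table_stats(
                ownname    => user,
                tabname    => 'T1',
                method_opt => 'for columns trunc_n1 size 254'
        );
end;
/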

P.S. There really are function-based indexes in Oracle. But Oracle Corp. calls them domain indexes (or co-operative indexes) and tells you that the things you build them with are operators, not functions … which actually makes them operator-based indexes!

—————————————————————————-

Footnote (May 2018)

I’ve updated the reference to the 10g manuals (chapter 5 page 8) to include a URL, but the URL is for 11gR2 since the only 10g manual I could find online was the full pdf download. It’s interesting to note what restrictions on the use of “function-based” indexes are reported in this manual; I’m not sure that all of them were true at the time, and I’m fairly sure that some of them must be false by now – which is why it’s always good to have test scripts that you can run as you upgrade.

There is an interesting variation over time for this example:

  • In 9.2.0.8 and 10.2.0.5 the predicate on trunc(n1) is a filter predicate on the table
  • In 11.1.0.7 the predicate trunc(n1) became an access predicate in the index
  • In 11.2.0.4 the optimizer (finally) declined to use the index under the rule hint (but introduced a strange side effect … more about that later)

Execution plan from 11.1.0.7


Execution Plan
----------------------------------------------------------
Plan hash value: 1429545322

---------------------------------------------
| Id  | Operation                   | Name  |
---------------------------------------------
|   0 | SELECT STATEMENT            |       |
|   1 |  TABLE ACCESS BY INDEX ROWID| T1    |
|*  2 |   INDEX RANGE SCAN          | T1_I1 |
---------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("ID"=55 AND TRUNC("N1")>=1 AND TRUNC("N1")<=10)

Note
-----
   - rule based optimizer used (consider using cbo)

In passing – the change in the execution plan from 10g to 11.1 to 11.2 does mean that anyone still using the rule-based optimizer could find that an upgrade makes a difference to rule-based execution plans.

As well as ignoring the index, 11.2.0.4 did something else that was new. I happened to have a second index on the table defined as (n1, trunc(id)); this had no impact on the execution plan in all the previous versions of Oracle but, apart from switching to a full tablescan, 11.2.0.4 also introduced an extra predicate:


PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------------------
Plan hash value: 3617692013

----------------------------------
| Id  | Operation         | Name |
----------------------------------
|   0 | SELECT STATEMENT  |      |
|*  1 |  TABLE ACCESS FULL| T1   |
----------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter(TRUNC("N1")<=10 AND TRUNC("N1")>=1 AND
              TRUNC("ID")=TRUNC(55) AND "ID"=55)

Note
-----
   - rule based optimizer used (consider using cbo)

Some piece of code somewhere must have been looking at the second “function-based index” – or, at least, it’s virtual column definition – to be able to generate that trunc(id) = trunc(55) predicate. This was a detail introduced in 11.2.0.2, affected by fix control 9263333: “generate transitive predicates for virtual column expressions”. It’s possible that a change like this could result in changes in execution plan due to the extra predicates – even under rule-based optimisation.

FBI Limitation

Fri, 2018-05-04 03:19

A recent question on the ODC (OTN) database forum prompted me to point out that the optimizer doesn’t consider function-based indexes on remote tables in distributed joins. I then spent 20 minutes trying to find the blog note where I had demonstrated this effect, or an entry in the manuals reporting the limitation – but I couldn’t find anything, so I’ve written a quick demo which I’ve run on 12.2.0.1 to show the effect. First, the SQL to create a couple of tables and a couple of indexes:


rem
rem     Script:         fbi_limitation.sql
rem     Author:         Jonathan Lewis
rem     Dated:          May 2018
rem

-- create public database link orcl@loopback using 'orcl'; 
define m_target = orcl@loopback

create table t1
segment creation immediate
nologging
as
with generator as (
        select
                rownum id
        from dual
        connect by
                level <= 1e4 -- > comment to avoid WordPress format issue
)
select
        rownum                          id,
        rownum                          n1,
        lpad(rownum,10,'0')             v1,
        lpad('x',100,'x')               padding
from
        generator       v1,
        generator       v2
where
        rownum <= 1e6 -- > comment to avoid WordPress format issue
;

create table t2
nologging
as
select * from t1
;

alter table t1 add constraint t1_pk primary key(id);
alter table t2 add constraint t2_pk primary key(id);
create unique index t2_f1 on t2(id+1);

begin
        dbms_stats.gather_table_stats(
                ownname     => user,
                tabname     => 'T1',
                cascade     => true,
                method_opt  => 'for all columns size 1'
        );

        dbms_stats.gather_table_stats(
                ownname     => user,
                tabname     => 'T2',
                cascade     => true,
                method_opt  => 'for all columns size 1'
        );
end;
/


The code is very simple: it creates a couple of identical tables with an id column that will produce an index with a very good clustering_factor. You’ll notice that I’ve (previously) created a public database link that is (in my case) a loopback to the current database, and the code defines a variable that I can use as a substitution variable later on. If you want to do further tests with this model you’ll need to make some changes in these two lines.

So now I’m going to execute a query that should result in the optimizer choosing a nested loop between the tables – but I have two versions of the query, one which treats t2 as the local table it really is, and one that pretends (through the loopback) that t2 is remote.


set serveroutput off

select
        t1.v1, t2.v1
from
        t1,
        t2
--      t2@orcl@loopback
where
        t2.id+1 = t1.id
and     t1.n1 between 101 and 110
;


select * from table(dbms_xplan.display_cursor);

select
        t1.v1, t2.v1
from
        t1,
--      t2
        t2@orcl@loopback
where
        t2.id+1 = t1.id
and     t1.n1 between 101 and 110
;

select * from table(dbms_xplan.display_cursor);

Here are the two execution plans, pulled from memory – including the “remote” section in the distributed case:


SQL_ID  fthq1tqthq8js, child number 0
-------------------------------------
select  t1.v1, t2.v1 from  t1,  t2 -- t2@orcl@loopback where  t2.id+1 =
t1.id and t1.n1 between 101 and 110

Plan hash value: 1798294492

--------------------------------------------------------------------------------------
| Id  | Operation                    | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |       |       |       |  2347 (100)|          |
|   1 |  NESTED LOOPS                |       |    11 |   407 |  2347   (3)| 00:00:01 |
|*  2 |   TABLE ACCESS FULL          | T1    |    11 |   231 |  2325   (4)| 00:00:01 |
|   3 |   TABLE ACCESS BY INDEX ROWID| T2    |     1 |    16 |     2   (0)| 00:00:01 |
|*  4 |    INDEX UNIQUE SCAN         | T2_F1 |     1 |       |     1   (0)| 00:00:01 |
--------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter(("T1"."N1"<=110 AND "T1"."N1">=101))
   4 - access("T2"."SYS_NC00005$"="T1"."ID")

Note
-----
   - this is an adaptive plan




SQL_ID  ftnmywddff1bb, child number 0
-------------------------------------
select  t1.v1, t2.v1 from  t1, -- t2  t2@orcl@loopback where  t2.id+1 =
t1.id and t1.n1 between 101 and 110

Plan hash value: 1770389500

-------------------------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     | Inst   |IN-OUT|
-------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |       |       |  4663 (100)|          |        |      |
|*  1 |  HASH JOIN         |      |    11 |   616 |  4663   (4)| 00:00:01 |        |      |
|*  2 |   TABLE ACCESS FULL| T1   |    11 |   231 |  2325   (4)| 00:00:01 |        |      |
|   3 |   REMOTE           | T2   |  1000K|    33M|  2319   (3)| 00:00:01 | ORCL@~ | R->S |
-------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - access("T1"."ID"="T2"."ID"+1)
   2 - filter(("T1"."N1"<=110 AND "T1"."N1">=101))

Remote SQL Information (identified by operation id):
----------------------------------------------------
   3 - SELECT "ID","V1" FROM "T2" "T2" (accessing 'ORCL@LOOPBACK' )

Both plans show that the optimizer has estimated the number of rows that would be retrieved from t1 correctly (very nearly); but while the fully local query does a nested loop join using the high-precision, very efficient function-based index (reporting the internal supporting column in the predicate section) the distributed query seems to have no idea about the remote function-based index, so it selects all the required rows from the remote table and does a hash join.
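In this particular case, of course, the predicate is simple enough to rearrange by hand so that the remote table can use its ordinary primary key index rather than needing the function-based index – a sketch of the rewrite (which I haven’t tested here):

select
        t1.v1, t2.v1
from
        t1,
        t2@orcl@loopback        t2
where
        t2.id = t1.id - 1
and     t1.n1 between 101 and 110
;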

Addendum

After finishing this note, I discovered that I had written a similar note about reverse key indexes nearly five years ago. Arguably a reverse key is just a special case of a function-based index – except it’s not labelled as such in user_tab_cols, and doesn’t depend on a system-generated hidden column.

 

exp catch

Tue, 2018-04-10 11:52

No-one should be using exp/imp to export and import data any more; they should be using the datapump equivalents expdp/impdp. But if you’re on an older (pre-12c) version of Oracle and still using exp/imp to do things like moving tables with their production statistics over to test systems then be careful that you don’t fall into an obsolescence trap when you finally upgrade to 12c (or Oracle 18).

exp/imp will mess up some of your histograms if you’re still using them to move tables/statistics in 12c.

Remember that 12c can create “Top-N” and “hybrid” histograms – and exp/imp were written long before these new histogram types came into existence. The code has not been updated to allow for the new histogram types, so if you happen to generate any histograms of these types in a 12c system and then use exp/imp to move some table stats (and it’s particularly an issue relating to stats) from one system to another, the stats that arrive at the destination system won’t match the stats that left the source system.

Here’s a little sample code to build a model that I can use to demonstrate the problem. It creates a table with three columns that will make it easy for me to create one frequency histogram, one Top-N histogram and one hybrid histogram. I’ve included a couple of substitution variables in the code so that you can specify an Oracle instance to connect to and a directory for the export file that expdp is going to produce. Don’t forget to check that the directory I create in this script doesn’t overwrite a directory that already exists for other reasons on your test system.


rem
rem     Script:         12c_histograms.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Sep 2015
rem
rem     Define m_service to be the service name you connect to
rem     Define m_directory to be the O/S directory you want to use for
rem     the export/import/log files
rem
rem     Make sure this code is not over-writing an existing 
rem     definition for a directory called DMPDIR before you 
rem     start


define m_service = 'orcl'
define m_service = 'or32'

define m_directory = '/mnt/working'

host rm &m_directory/expdat.dmp
host rm &m_directory/expdp.dmp

create or replace directory dmpdir as '&m_directory';

drop table t1 purge;

create table t1
nologging
as
with generator as (
        select  --+ materialize
                rownum id 
        from dual 
        connect by 
                level <= 1e4 -- > comment to avoid wordpress format issue
)
select
        trunc(sqrt(rownum + 0)) frequency,
        trunc(sqrt(rownum + 0)) top_n,
        trunc(sqrt(rownum + 0)) hybrid
from
        generator
;

begin
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          =>'T1',
                method_opt       => 'for columns frequency size 254 for columns top_n size 95 for columns hybrid size 50'
        );
end;
/

select
        column_name,
        num_distinct,
        histogram,
        num_buckets
from
        user_tab_cols
where
        table_name = 'T1'
;

column endpoint_actual_value format a22
break on column_name skip 1

select 
        column_name, 
        endpoint_number, endpoint_value, endpoint_actual_value, 
        endpoint_repeat_count 
from 
        user_tab_histograms 
where 
        table_name = 'T1'
order by 
        column_name, endpoint_number
;

Here’s an extract, from a 12.1.0.2 instance, of the results of the two queries with a large number of the rows from the histogram data deleted:


COLUMN_NAME              Distinct HISTOGRAM          Buckets
-------------------- ------------ --------------- ----------
FREQUENCY                     100 FREQUENCY              100
TOP_N                         100 TOP-FREQUENCY           95
HYBRID                        100 HYBRID                  50

COLUMN_NAME          ENDPOINT_NUMBER ENDPOINT_VALUE ENDPOINT_ACTUAL_VALUE  ENDPOINT_REPEAT_COUNT
-------------------- --------------- -------------- ---------------------- ---------------------
FREQUENCY                          3              1 1                                          0
                                   8              2 2                                          0
                                  15              3 3                                          0
...
                                9800             98 98                                         0
                                9999             99 99                                         0
                               10000            100 100                                        0


HYBRID                             3              1 1                                          3
                                 224             14 14                                        29
                                 440             20 20                                        41
...
                                9603             97 97                                       195
                                9800             98 98                                       197
                               10000            100 100                                        1

TOP_N                              1              1 1                                          0
                                  16              7 7                                          0
                                  33              8 8                                          0
...
                                9753             98 98                                         0
                                9952             99 99                                         0
                                9953            100 100                                        0

The most important detail is the endpoint_repeat_count column of the hybrid histogram, although you should note that the endpoint_actual_value column is populated with a copy of the endpoint_value for all three histograms.

Now I’m going to use exp / drop table / imp (or the datapump equivalents) to export, drop, and re-import the table with (one hopes) the exact same statistics. To do this I’ll be using the imp command with the option “statistics=always” with the intention of copying the stats from the export file into the destination database (you’ll have to substitute your own userid/password, of course):


host exp   userid=test_user/test@&m_service file=expdat.dmp tables='(t1)'
-- host expdp userid=test_user/test@&m_service DIRECTORY=dmpdir DUMPFILE=expdp.dmp TABLES='(t1)'

drop table t1 purge;

host imp   userid=test_user/test@&m_service file=expdat.dmp tables='(t1)' statistics=always
-- host impdp userid=test_user/test@&m_service DIRECTORY=dmpdir DUMPFILE=expdp.dmp TABLES='(t1)'

So what do we see now when we re-run the two queries to report the histogram information:


COLUMN_NAME              Distinct HISTOGRAM          Buckets
-------------------- ------------ --------------- ----------
FREQUENCY                     100 FREQUENCY              100
TOP_N                         100 FREQUENCY               95
HYBRID                        100 FREQUENCY               50

COLUMN_NAME          ENDPOINT_NUMBER ENDPOINT_VALUE ENDPOINT_ACTUAL_VALUE  ENDPOINT_REPEAT_COUNT
-------------------- --------------- -------------- ---------------------- ---------------------
FREQUENCY                          3              1                                            0
                                   8              2                                            0
                                  15              3                                            0
...
                                9800             98                                            0
                                9999             99                                            0
                               10000            100                                            0

HYBRID                             3              1                                            0
                                 224             14                                            0
                                 440             20                                            0
...
                                9603             97                                            0
                                9800             98                                            0
                               10000            100                                            0

TOP_N                              1              1                                            0
                                  16              7                                            0
                                  33              8                                            0
...
                                9753             98                                            0
                                9952             99                                            0
                                9953            100                                            0

  • The histograms on all three columns are now labelled as FREQUENCY.
  • The endpoint_actual_value is null for all three – but that may be a purely cosmetic detail with no side effects.
  • The “hybrid” column really has become a frequency histogram – and that’s the critical one – its endpoint_repeat_count values are all zero.

tl;dr

If you’re still using exp/imp instead of expdp/impdp to move tables (and, more importantly, their statistics) from one database to another then the upgrade to 12c may mean you end up with hybrid histograms on the source system that are “downgraded” to frequency histograms on the destination system, with the effect that execution plans vary between the two systems.
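
A quick sanity check on the destination system – a minimal sketch, assuming the demo table name t1 from above – is to compare the histogram types before and after the transfer:

select
        column_name, num_distinct, histogram, num_buckets
from
        user_tab_cols
where
        table_name = 'T1'
order by
        column_name
;

If a column that reported HYBRID on the source reports FREQUENCY on the destination you’ve hit the problem described above.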

Data Hashing

Mon, 2018-04-09 12:10

Here’s a little-known feature that has been around since at least Oracle 10, though I don’t think I had ever seen it in the wild until today when someone reported on the ODC (OTN) database forum that they had a problem getting repeatable results.  It’s always possible, of course, that failure to get repeatable results is the natural consequence of running queries against a multi-user system, but if we assume that this was not the cause in this case we have to ask why a special hashing function that Oracle supplies to allow you to check that a set of data hasn’t changed gives you different results when “the data hasn’t changed”.

I’m talking about the function dbms_sqlhash.gethash() – a packaged function that exists in the SYS schema and isn’t usually exposed to other users. The function takes as its inputs the text of a query, a selected hashing function, and a “chunk” size. It will run the query and use the hashing function to return a single 16 to 64 byte hash value representing the entire result set. Here’s an example of usage:


begin
        dbms_output.put_line(
                dbms_sqlhash.gethash(
                        sqltext     => 'select n1, d1 from t1 where id > 0',
                        digest_type => dbms_crypto.hash_md5
                        -- chunk_size  => 128*1048576   -- default 128MB
                )
        );
end;
/

6496D2438FECA960B1E916BF8C4BADCA

I haven’t specified a chunk size – the default is 128MB – and Oracle will hash this much of the result set in a single pass. If the result set is larger than this Oracle will hash each chunk in turn then generate a hash of the hash values. (This means, by the way, that changing the chunk size can change the hash value for large data sets).
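
If you want to see this effect for yourself, and you have a result set of a few hundred MB to play with, you can supply an explicit chunk_size – a sketch based on the call above:

begin
        dbms_output.put_line(
                dbms_sqlhash.gethash(
                        sqltext     => 'select n1, d1 from t1 where id > 0',
                        digest_type => dbms_crypto.hash_md5,
                        chunk_size  => 32*1048576    -- 32MB rather than the 128MB default
                )
        );
end;
/

If the result set needs more chunks at one size than at the other, the two calls will report different hash values for exactly the same data.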

There are 6 possible digest types in 12.1.0.2 (listed in the $ORACLE_HOME/rdbms/admin/dbmsobtk.sql script that creates the dbms_crypto package – so you will need the execute privilege on both dbms_sqlhash and dbms_crypto to use the function if you want to code with symbolic constants):

rem         HASH_MD4           CONSTANT PLS_INTEGER            :=     1;
rem         HASH_MD5           CONSTANT PLS_INTEGER            :=     2;
rem         HASH_SH1           CONSTANT PLS_INTEGER            :=     3;
rem         HASH_SH256         CONSTANT PLS_INTEGER            :=     4;
rem         HASH_SH384         CONSTANT PLS_INTEGER            :=     5;
rem         HASH_SH512         CONSTANT PLS_INTEGER            :=     6;

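If you’ve been granted execute on dbms_sqlhash but not on dbms_crypto you can still call the function by passing the literal value of the constant – a sketch using the value 2 (hash_md5) from the comments above:

begin
        dbms_output.put_line(
                dbms_sqlhash.gethash(
                        sqltext     => 'select n1, d1 from t1 where id > 0',
                        digest_type => 2     -- hash_md5, passed as a literal
                )
        );
end;
/
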
Let’s put the whole thing into a demonstration that will allow us to see an important point – you have to be careful with your query:


rem
rem     Script:         gethash.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Feb 2016
rem

execute dbms_random.seed(0)

create table t1
nologging
as
select
        1e4 - rownum                    id,
        trunc(dbms_random.value(0,100)) n1,
        trunc(sysdate)                  d1,
        lpad('x',100,'x')               padding
from
        dual
connect by
        level <= 1e4 -- > comment to avoid WordPress format issue
;

alter table t1 add constraint t1_pk primary key (id);

begin
        dbms_stats.gather_table_stats(
                ownname     => user,
                tabname     => 'T1',
                method_opt  => 'for all columns size 1'
        );
end;
/

set feedback off

alter system flush shared_pool;
alter session set optimizer_mode = first_rows_1;

begin
        dbms_output.put_line(
                dbms_sqlhash.gethash(
                        sqltext     => 'select n1, d1 from t1 where id > 0',
                        digest_type => dbms_crypto.hash_md5
                        -- chunk_size  => 128*1048576   -- default 128MB
                )
        );
end;
/

alter system flush shared_pool;
alter session set optimizer_mode = all_rows;

begin
        dbms_output.put_line(
                dbms_sqlhash.gethash(
                        sqltext     => 'select n1, d1 from t1 where id > 0',
                        digest_type => dbms_crypto.hash_md5
                        -- chunk_size  => 128*1048576   -- default 128MB
                )
        );
end;
/

alter system flush shared_pool;
alter session set nls_date_format='dd-mon-yyyy hh24:mi:ss';

begin
        dbms_output.put_line(
                dbms_sqlhash.gethash(
                        sqltext     => 'select n1, d1 from t1 where id > 0',
                        digest_type => dbms_crypto.hash_md5
                        -- chunk_size  => 128*1048576   -- default 128MB
                )
        );
end;
/

alter session set nls_date_format='DD-MON-RR';

I’ve created a data set, added a primary key, and gathered stats, then I’ve called the same hashing function on the same SQL statement three times in a row. However, I’ve changed the session environment for each call – in the first case I’ve set the optimizer to “first rows(1)” optimization, then I’ve set the optimizer back to all_rows, then I’ve changed the nls_date_format from its default of “DD-MON-RR” (and that’s significant because I’ve got a date column in my query). Here’s the output from running the script:


Table created.


Table altered.


PL/SQL procedure successfully completed.

6496D2438FECA960B1E916BF8C4BADCA
D41D4A2945D0B89A6C5DEB5060189A54
ECC3D2B66CB61821397CD9BD983FD5F4

The query has to return the same data content in all three cases – but the hash value is different each time. The change in the optimizer mode has affected the order in which the data was returned (with first_rows(1) Oracle did a full scan of the primary key index, with all_rows it did a tablescan and sort); the change in the nls_XXX parameter meant the internal representation of the data changed.

You have to be very careful with dbms_sqlhash every time you use it if you want the same data set to produce the same result. First, to be safe, you need to ensure that you always use the same NLS parameters when calling the function; then you need an “order by” clause in the query, and the columns used in the order by clause need to be a candidate key (i.e. unique, not null) for the data – otherwise a change in the optimizer parameters, or the object stats, could result in a change in execution plan, with an ensuing change in the actual order of the data and a different hash value.
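
Putting that advice into practice against the demo data – id is the declared primary key, so it’s a legal candidate key for the ordering – the safer version of the call would look something like this (a sketch):

begin
        dbms_output.put_line(
                dbms_sqlhash.gethash(
                        sqltext     => 'select n1, d1 from t1 where id > 0 order by id',
                        digest_type => dbms_crypto.hash_md5
                )
        );
end;
/

With the order by in place the two optimizer_mode variations should produce the same value – but you would still have to pin down the NLS settings to protect yourself from the third variation.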

 

SQL Monitor

Fri, 2018-04-06 01:50

I’ve mentioned the SQL Monitor report from time to time as a very useful way of reviewing execution plans – the feature is automatically enabled by parallel execution and by queries that are expected to take more than a few seconds to complete, and the inherent overheads of monitoring are less than the impact of enabling the rowsource execution statistics that allow you to use the ‘allstats’ format of dbms_xplan.display_cursor() to get detailed execution information for a query. The drawback to the SQL Monitor feature is that it doesn’t report predicate information. It’s also important to note that it falls under the performance and diagnostic licences: some of the available performance information comes from v$active_session_history, and the report is generated by a call to the dbms_sqltune package.

There are two basic calls – report_sql_monitor_list(), which appeared in 11.2, produces a summary of the statements and their individual executions (from the information that is still in memory, of course) and report_sql_monitor() shows detailed execution plans. Here’s a simple bit of SQL*Plus code showing basic use – it lists a summary of all the statements monitored in the last half hour, then (as it stands at present) the full monitoring details of the most recently completed monitored statement:


set long 250000
set longchunksize 65536

set linesize 254
set pagesize 100
set trimspool on

set heading off

column text_line format a254

spool report_sql_monitor

select 
        dbms_sqltune.report_sql_monitor_list(
                active_since_date       => sysdate - 30 / (24*60),
                type                    => 'TEXT'
        ) text_line 
from    dual
;

select 
        dbms_sqltune.report_sql_monitor(
--              sql_id                  => '&m_sql_id',
--              start_time_filter       => sysdate - 30/(24 * 60),
--              sql_exec_id             => &m_exec_id,
                type                    =>'TEXT'
        ) text_line 
from    dual
;

spool off




Here’s a variation that reports the details of the most recently completed execution of a query with the specified SQL_ID:

set linesize 255
set pagesize 200
set trimspool on
set long 200000

column text_line format a254
set heading off

define m_sql_id = 'fssk2xabr717j'

spool rep_mon

SELECT  dbms_sqltune.report_sql_monitor(
                sql_id=> v.sql_id,
                sql_exec_id => v.max_sql_exec_id
        ) text_line
from     (
        select
                sql_id,
                max(sql_exec_id)        max_sql_exec_id
        from
                v$sql_monitor
        where
                sql_id = '&m_sql_id'
        and     status like 'DONE%'
        group by
                sql_id
        )       v
;

spool off

set heading on
set linesize 132
set pagesize 60

And a sample of the text output, which is the result of monitoring the query “select * from dba_objects” (with an arraysize of 1,000 set in SQL*Plus):


SQL Monitoring Report

SQL Text
------------------------------
select /*+ monitor */ * from dba_objects

Global Information
------------------------------
 Status              :  DONE (ALL ROWS)
 Instance ID         :  1
 Session             :  SYS (262:54671)
 SQL ID              :  7nqa1nnbav642
 SQL Execution ID    :  16777216
 Execution Started   :  04/05/2018 19:43:42
 First Refresh Time  :  04/05/2018 19:43:42
 Last Refresh Time   :  04/05/2018 19:45:04
 Duration            :  82s
 Module/Action       :  sqlplus@linux12 (TNS V1-V3)/-
 Service             :  SYS$USERS
 Program             :  sqlplus@linux12 (TNS V1-V3)
 Fetch Calls         :  93

Global Stats
===========================================================================
| Elapsed |   Cpu   |    IO    |  Other   | Fetch | Buffer | Read | Read  |
| Time(s) | Time(s) | Waits(s) | Waits(s) | Calls |  Gets  | Reqs | Bytes |
===========================================================================
|    0.31 |    0.29 |     0.00 |     0.02 |    93 |   6802 |   18 |   9MB |
===========================================================================

SQL Plan Monitoring Details (Plan Hash Value=2733869014)
=================================================================================================================================================================================
| Id |                Operation                 |       Name       |  Rows   | Cost |   Time    | Start  | Execs |   Rows   | Read | Read  |  Mem  | Activity | Activity Detail |
|    |                                          |                  | (Estim) |      | Active(s) | Active |       | (Actual) | Reqs | Bytes | (Max) |   (%)    |   (# samples)   |
=================================================================================================================================================================================
|  0 | SELECT STATEMENT                         |                  |         |      |        83 |     +0 |     1 |    91314 |      |       |       |          |                 |
|  1 |   VIEW                                   | DBA_OBJECTS      |   91084 | 2743 |        83 |     +0 |     1 |    91314 |      |       |       |          |                 |
|  2 |    UNION-ALL                             |                  |         |      |        83 |     +0 |     1 |    91314 |      |       |       |          |                 |
|  3 |     TABLE ACCESS BY INDEX ROWID          | SUM$             |       1 |      |           |        |       |          |      |       |       |          |                 |
|  4 |      INDEX UNIQUE SCAN                   | I_SUM$_1         |       1 |      |           |        |       |          |      |       |       |          |                 |
|  5 |     TABLE ACCESS FULL                    | USER_EDITIONING$ |       1 |    2 |         1 |     +0 |   872 |        1 |      |       |       |          |                 |
|  6 |      TABLE ACCESS BY INDEX ROWID BATCHED | OBJ$             |       1 |    3 |           |        |       |          |      |       |       |          |                 |
|  7 |       INDEX RANGE SCAN                   | I_OBJ1           |       1 |    2 |           |        |       |          |      |       |       |          |                 |
|  8 |     FILTER                               |                  |         |      |        83 |     +0 |     1 |    91312 |      |       |       |          |                 |
|  9 |      HASH JOIN                           |                  |   91394 |  211 |        83 |     +0 |     1 |    91312 |      |       |    2M |          |                 |
| 10 |       TABLE ACCESS FULL                  | USER$            |     125 |    2 |         1 |     +0 |     1 |      125 |      |       |       |          |                 |
| 11 |       HASH JOIN                          |                  |   91394 |  207 |        83 |     +0 |     1 |    91312 |      |       |    1M |   100.00 | Cpu (1)         |
| 12 |        INDEX FULL SCAN                   | I_USER2          |     125 |    1 |         1 |     +0 |     1 |      125 |      |       |       |          |                 |
| 13 |        TABLE ACCESS FULL                 | OBJ$             |   91394 |  204 |        83 |     +0 |     1 |    91312 |   13 |   9MB |       |          |                 |
| 14 |      TABLE ACCESS FULL                   | USER_EDITIONING$ |       1 |    2 |         1 |     +0 |   872 |        1 |    2 | 16384 |       |          |                 |
| 15 |      NESTED LOOPS SEMI                   |                  |       1 |    2 |           |        |       |          |      |       |       |          |                 |
| 16 |       INDEX SKIP SCAN                    | I_USER2          |       1 |    1 |           |        |       |          |      |       |       |          |                 |
| 17 |       INDEX RANGE SCAN                   | I_OBJ4           |       1 |    1 |           |        |       |          |      |       |       |          |                 |
| 18 |      TABLE ACCESS FULL                   | USER_EDITIONING$ |       1 |    2 |           |        |       |          |      |       |       |          |                 |
| 19 |     HASH JOIN                            |                  |       2 |    4 |         1 |    +82 |     1 |        1 |      |       |       |          |                 |
| 20 |      NESTED LOOPS                        |                  |       2 |    4 |         1 |    +82 |     1 |        2 |      |       |       |          |                 |
| 21 |       STATISTICS COLLECTOR               |                  |         |      |         1 |    +82 |     1 |        2 |      |       |       |          |                 |
| 22 |        TABLE ACCESS FULL                 | LINK$            |       2 |    2 |         1 |    +82 |     1 |        2 |    2 | 16384 |       |          |                 |
| 23 |       TABLE ACCESS CLUSTER               | USER$            |       1 |    1 |         1 |    +82 |     2 |        2 |      |       |       |          |                 |
| 24 |        INDEX UNIQUE SCAN                 | I_USER#          |       1 |      |         1 |    +82 |     2 |        2 |    1 |  8192 |       |          |                 |
| 25 |      TABLE ACCESS FULL                   | USER$            |       1 |    1 |           |        |       |          |      |       |       |          |                 |
=================================================================================================================================================================================


1 row selected.


In a future note I’ll show an example of using one of these reports to identify the critical performance issue with an SQL statement that was raised recently on the ODC (OTN) database forum, but I’ll just point out one detail from this report. The “Time active (s)” says the query ran for about 83 seconds, but the Global Stats section tells us the elapsed time was 0.31 seconds. In this case the difference between these two is the time spent passing the data to the client.

Footnote

It is possible to force monitoring for an SQL statement with the /*+ monitor */ hint. Do be careful with this in production systems; each time the statement is executed the session will try to get the “Real-time descriptor latch”, which is a latch with no latch children, so if you monitor a lightweight statement that is called many times from many sessions you may find you lose a lot of time to latch contention and the attendant CPU spinning.
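
If you suspect you’ve hit this problem a before/after comparison of the latch statistics should show it – a minimal sketch, using the latch name quoted above:

select
        name, gets, misses, sleeps
from
        v$latch
where
        name = 'Real-time descriptor latch'
;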

 

Lock Types

Wed, 2018-03-28 06:24

Every now and again I have to check what a particular lock (or enqueue) type is for and what the associated parameter values represent. This often means I have to think about the names of a couple of views and a collection of columns – then create a few column formats to make the output readable (though sometimes I can take advantage of the “print_table()” procedure that Tom Kyte published a long time ago). It only takes a little time to get the code right, but it’s a nuisance when I’m in a hurry, so I’ve just scribbled out a few lines of a script that takes a lock type as an input parameter and reports all the information I want.

rem
rem     Script:         lock_types.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Mar 2018
rem     Usage:          start lock_types {lock type}
rem

define m_lock_type='&1'

column  display new_value m_display

select
        case when substr(version,1,2) = '12'
        then
                'display_name'
        else
                'name'
        end                     display
from
        v$instance
;

set linesize 160
set pagesize 60
set trimspool on

column type             format a4
column name             format a32
column description      format a132
column id1_tag          format a32
column id2_tag          format a32
column is_user          format a4 heading "User"
column is_recycle       format a4 heading "Rcyc"

set feedback off
break on report skip 1

spool lock_types

select  *
from    V$lock_type
where   type = upper('&m_lock_type')
order by
        type
;

column  name                    format a42
column  parameter1              format a9
column  parameter2              format a24
column  parameter3              format a22
column  wait_class              format a14
column  display_name            format a42

select
        eve.name,
        eve.parameter1,
        eve.parameter2,
        eve.parameter3,
        eve.wait_class,
        nullif(eve.&m_display, eve.name) display_name
from
        v$event_name    eve
where
        eve.name like 'enq: ' || upper('&m_lock_type') || '%'
order by
        nullif(eve.wait_class,'Other'),
        eve.name
;

set feedback on

I’ve included a check (and hack) on the value of the major version because 12c introduced a “display_name” as well as a “name” for events, and the display_name is sometimes a little more descriptive than the name, so it’s nice to have a single script that can print two different values for the versions that have them.

Here’s a sample of the output when I pass ‘IV’ as an input parameter:

TYPE NAME                             ID1_TAG                          ID2_TAG                          User Rcyc
---- -------------------------------- -------------------------------- -------------------------------- ---- ----
DESCRIPTION                                                                                                                              CON_ID
------------------------------------------------------------------------------------------------------------------------------------ ----------
IV   Library Cache Invalidation       object #                         time stamp                       NO   NO
Synchronizes library cache object invalidations across instances                                                                              0

NAME                                       PARAMETER PARAMETER2               PARAMETER3             WAIT_CLASS     DISPLAY_NAME
------------------------------------------ --------- ------------------------ ---------------------- -------------- ------------------------------------------
enq: IV -  contention                      type|mode id1                      id2                    Other

As you can see from the presence of the con_id column in v$lock_type this output came from a 12c instance. I picked the IV lock because that’s the one that prompted me to check the meanings of the id[12] and parameter[123] columns when a question about waits for the IV lock appeared recently on Oracle-L. I’ve got two reasons for carrying on with this particular example – first that it demonstrates that the descriptions can be wrong, second that it allows me to demonstrate a quick tip on translation.

The question on Oracle-L related to a 4-node RAC system and reported one instance suffering long waits on the IV enqueue on a fairly regular basis when running a particular batch task. The OP reported the following values as the p1, p2, p3 values from v$session while the wait was going on:


P1    type|mode 1230372869
P2    id1       1398361667
P3    id2                3

According to the details in v$lock_type the enqueue is about library cache invalidation across instances – and that fits the OP’s complaint because the system is a RAC system. The id1 value is supposed to be an obj# (object_id), but the OP said it wasn’t; and the id2 value is supposed to be a timestamp, but 3 is an odd value for a timestamp (though it might represent – for example – the 3 second wait that is a common time-out interval for enqueues). So, clearly, the descriptions can be wrong.

Translation

Take another look at p1 and p2, and turn them into Hexadecimal:


1230372869 (dec) = 0x49560005 (hex)
1398361667 (dec) = 0x53594E43 (hex)

If you happen to be good with hex and ASCII codes you’ll know that byte values in the range 41 – 5F (hex) are mostly the capital letters of the Roman alphabet (while 61 – 7F are mostly the lower case letters), so a second translation step gives us:


1230372869 (dec) = 0x49560005 (hex) = 'IV'  5
1398361667 (dec) = 0x53594E43 (hex) = 'SYNC'

The p1 parameter is described (correctly) as “type/mode” – this is an IV enqueue held or requested in mode 5; the p2 parameter is not an object number, it looks more like a text description of why the enqueue is being requested (the enqueue is, after all, described as being used to “synchronize library cache object invalidation”).

I still don’t know what the final parameter represents – I doubt if it’s really about a three second wait (but that could be checked by examining v$session_wait over a period of several seconds or v$session_wait_history), it might be an indication of the instance that the session is trying to synchronize with (though, again, that seems a long shot), or it might just be a “reason-code” describing why the synchronisation is necessary.

Whenever in doubt about the meaning for the various parameters, it’s always worth a quick dec -> hex -> ASCII check, just in case it offers some clues about the function of the enqueue.
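
If you don’t fancy doing the conversion by hand, SQL will do it for you – a quick sketch using the two values above:

select
        to_char(1230372869,'fmXXXXXXXX')        p1_hex,
        utl_raw.cast_to_varchar2(
                hextoraw(to_char(1398361667,'fmXXXXXXXX'))
        )                                       p2_ascii
from
        dual
;

The first column gives you 49560005 to decode by eye; the second goes all the way to the string SYNC.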

 

Reference Costs

Mon, 2018-03-19 02:44

The partitioning option “partition by reference” is a very convenient option which keeps acquiring more cute little features, such as cascading truncates and cascading splits, as time passes – but what does it cost, and would you use it if you didn’t really need to?

When reference partitioning came into existence many years ago, I had already seen several performance disasters created by people’s enthusiasm for surrogate keys and the difficulties this introduced for partition elimination; so my first thought was that this was a mechanism that would have a hugely beneficial effect on systems which (in 20:20 – or 6:6 if you’re European – hindsight) had been badly designed and would otherwise need a lot of re-engineering to use partitioning effectively.

(Side note: Imagine you have partitioned an orders table on colX, which is a column in the real (business-oriented) candidate key, but you’ve created a surrogate key which is used as the target for a foreign key from the order_lines table – how do you get partition-wise joins between orders and order_lines if you haven’t got the partitioning column in the order_lines table ?)

So ref partitioning was a good way to work around a big existing problem and, whatever overheads it introduced, the benefit was potentially so huge that you wouldn’t care (or, probably, notice) that your system was less efficient than it ought to be. But what if you’re working on a new project and still have control of the physical design – how does that change the cost/benefit analysis?

It’s actually taken me several years to get round to producing a little demonstration to highlight one of the obvious costs of reference partitioning – even though it’s a very simple demo raising the obvious question: ‘how much work does Oracle have to do to find the right partition when inserting a “child” row ?’ If you chose to implement reference partitioning without asking that simple question you may be using a lot more machine resources than you really need to, although you may not actually be heading for a disastrous performance problem.

As a demonstration of the issue I’m going to set up something that approximates an order/order_lines model in two ways, one using reference partitioning and one using a copied column, to see what differences show up when you start loading data.

rem
rem     Script:         pt_ref.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Mar 2018
rem     Purpose:
rem
rem     Last tested
rem             12.2.0.1
rem             12.1.0.2
rem

create table orders (
        id              number(10,0) not null,
        id_cust         number(10,0) not null,
        date_ordered    date not null,
        padding         varchar2(150)
)
partition by range (date_ordered)
(
        partition p201801       values less than (to_date('01-Feb-2018')),
        partition p201802       values less than (to_date('01-Mar-2018')),
        partition p201803       values less than (to_date('01-Apr-2018')),
        partition p201804       values less than (to_date('01-May-2018')),
        partition p201805       values less than (to_date('01-Jun-2018')),
        partition p201806       values less than (to_date('01-Jul-2018')),
        partition p201807       values less than (to_date('01-Aug-2018'))
);

create unique index ord_pk on orders (id);
alter table orders add constraint ord_pk primary key(id);

create table order_lines (
        id_ord          number(10,0) not null,
        line_number     number(4,0)  not null,
        id_product      number(6,0)  not null,
        qty             number(6,0)  not null,
        value           number(10,2) not null,
        padding         varchar2(150),
        constraint orl_fk_ord foreign key (id_ord) references orders
                on delete cascade
)
partition by reference (orl_fk_ord)
;

create unique index orl_pk on order_lines (id_ord, line_number);
alter table order_lines add constraint orl_pk primary key (id_ord, line_number);

create table order_lines_2 (
        date_ordered    date,
        id_ord          number(10,0) not null,
        line_number     number(4,0)  not null,
        id_product      number(6,0)  not null,
        qty             number(6,0)  not null,
        value           number(10,2) not null,
        padding         varchar2(150),
        constraint orl2_fk_ord foreign key (id_ord) references orders
                on delete cascade
)
partition by range (date_ordered)
(
        partition p201801       values less than (to_date('01-Feb-2018')),
        partition p201802       values less than (to_date('01-Mar-2018')),
        partition p201803       values less than (to_date('01-Apr-2018')),
        partition p201804       values less than (to_date('01-May-2018')),
        partition p201805       values less than (to_date('01-Jun-2018')),
        partition p201806       values less than (to_date('01-Jul-2018')),
        partition p201807       values less than (to_date('01-Aug-2018'))
)
;

create unique index orl2_pk on order_lines_2 (id_ord, line_number);
alter table order_lines_2 add constraint orl2_pk primary key (id_ord, line_number);

It’s a bit of a bodge job as far as modelling goes, but that’s to keep workload comparisons easy and make a point without writing too much code. All I’ve got is an orders table partitioned by date and an order_lines table that I want partitioned the same way. I’ve handled the requirement for partitioning order_lines in two ways, one is partition by reference and the other is to copy down the partitioning column from the orders table. (In my view the “real” key for an orders table should be (customer identifier, order date, counter) and if I thought efficient partitioning was going to be a necessary feature for scalability I would copy down all three columns. Depending on the nature of the business I would compress the primary key index on orders on one or two of the columns, and the foreign key index on order_lines on one, two, or three of its columns)
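
As a sketch of what that “real key” variant might look like – these statements are purely hypothetical (the index names are invented, and they assume a version of order_lines that carries the copied-down columns), they are not part of the test that follows:

create unique index ord_pk_real on orders (id_cust, date_ordered, id) compress 2;
create index orl_fk_real on order_lines (id_cust, date_ordered, id_ord) compress 3;

The compress N clause deduplicates the leading N columns of the index entries, which can claw back much of the space cost of carrying the longer key.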

Now all I have to do is load some data into the tables. First the orders table:

insert into orders(
        id, id_cust, date_ordered, padding
)
with g as (
        select rownum id from dual
        connect by level <= 1e4
)
select
        rownum                                  id,
        trunc(dbms_random.value(10000,20000))   id_cust,
        to_date('01-Jan-2018') +
                trunc((rownum-1)/100)           date_ordered,
        rpad('x',40)                            padding
from
        g,g
where
        rownum <= 2e4
;

commit;
execute dbms_stats.gather_table_stats(user,'orders')

This produces 100 orders per day for 200 days, which fits within the seven months of pre-declared partitions. I’ve gathered table stats on the table because that’s probably the best way to deal with any requirements for block cleanout after the insert. (Note: I’m avoiding interval partitioning in this example because that’s just another complication to add to the comparison and, as I reported a few days ago, introduces another massive inefficiency on data loading.)

Now I’ll insert some order_lines rows at 5 lines per order into the two versions of the order_lines tables. One of them, of course, has to have a date generated using the same algorithm that I used for the orders table. Note that I’ve made a call to dbms_random.seed(0) before each insert to guarantee that the same “random” values will be inserted in both tables.

execute dbms_random.seed(0)

insert into order_lines_2(
        date_ordered, id_ord, line_number, id_product, qty, value, padding
)
with g as (
        select rownum id from dual
        connect by level <= 1e4
)
select
        to_date('01-Jan-2018') +
                trunc((rownum-1)/500)           date_ordered,
        1 + trunc((rownum-1)/5)                 id_ord,
        1 + mod(rownum,5)                       line_number,
        trunc(dbms_random.value(10000,20000))   id_product,
        1 qty,
        1 value,
        rpad('x',80)                            padding
from
        g,g
where
        rownum <= 10e4
;

commit;


execute dbms_random.seed(0)

insert into order_lines(
        id_ord, line_number, id_product, qty, value, padding
)
with g as (
        select rownum id from dual
        connect by level <= 1e4
)
select
        1 + trunc((rownum-1)/5)                 id_ord,
        1 + mod(rownum,5)                       line_number,
        trunc(dbms_random.value(10000,20000))   id_product,
        1 qty,
        1 value,
        rpad('x',80)                            padding
from
        g,g
where
        rownum <= 10e4
;

commit;

What I haven’t shown in the code is the snapshot wrapping I used to check the session stats, system latch activity and system rowcache activity – which I thought would give me the best indication of any variation in workload. In fact, of course, the first and simplest variation was the elapsed time: 4.5 seconds for the ref partitioned table, 2.5 seconds for the explicitly created table (regardless of which insert I did first), and it was nearly all pure CPU time.
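
For reference, the core of that snapshot wrapping is nothing more sophisticated than capturing the relevant session statistics before and after each insert and differencing the results – a minimal sketch of the capture query:

select
        sn.name, ms.value
from
        v$mystat        ms,
        v$statname      sn
where
        sn.statistic# = ms.statistic#
and     sn.name in (
                'CPU used by this session',
                'db block gets',
                'db block changes',
                'redo entries',
                'redo size',
                'HSC Heap Segment Block Changes'
        )
;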

It turned out that the rowcache stats showed virtually no variation, and the latch stats only showed significant variation in the latches that I could have predicted from the session stats, and here are the most significant session stats that highlight and explain the difference in times:

Explicitly Created
------------------
CPU used by this session                                                   231
DB time                                                                    242
db block gets                                                          219,471
db block changes                                                        27,190
redo entries                                                            15,483
redo size                                                           24,790,224
HSC Heap Segment Block Changes                                           2,944

Ref partitioned
---------------
CPU used by this session                                                   515
DB time                                                                    532
db block gets                                                          615,979
db block changes                                                       418,025
redo entries                                                           209,918
redo size                                                           70,043,676
HSC Heap Segment Block Changes                                         100,048

These results were from 12.1.0.2, but the figures from 12.2.0.1 and 11.2.0.4 were similar, though the CPU time dropped as the version number went up: what you’re seeing is the effect of turning an array insert (for the precreated table) into single row processing for the ref partitioned table. Basically it seems that for every row inserted Oracle has to do something to work out which partition the row should go into, and while it does that work it releases any pins on buffers it would have been holding from the previous row’s insert; in other words, various optimisations relating to array inserts do not take place.

Looking in more detail at the figures for the ref partitioned insert:

  • The 100,000 “HSC Heap Segment Block Changes” equate to the 100,000 rows inserted into the table.
  • Add the single row index updates to the primary key and you get 200,000 redo entries.
  • For every individual row inserted Oracle has to do a current mode (db block gets) check against the primary key of the orders table – but when array processing the root block can be pinned.

We can get a closer look at the differences by taking snapshots of v$segstat (or v$segment_statistics), to see the following (pre-created table on the left):


ORD_PK                                |   ORD_PK
  logical reads          199,440      |     logical reads          300,432
                                      |
ORDER_LINES_2 - P201801               |   ORDER_LINES - P201801
  logical reads            2,112      |     logical reads           16,960
  db block changes         1,280      |     db block changes        16,944
                                      |
ORDER_LINES_2 - P201802               |   ORDER_LINES - P201802
  logical reads            2,256      |     logical reads           16,144
  db block changes         1,248      |     db block changes        15,088
                                      |
ORDER_LINES_2 - P201803               |   ORDER_LINES - P201803
  logical reads            2,288      |     logical reads           17,264
  db block changes         1,376      |     db block changes        16,560
                                      |
ORDER_LINES_2 - P201804               |   ORDER_LINES - P201804
  logical reads            2,672      |     logical reads           16,768
  db block changes         1,280      |     db block changes        16,144
                                      |
ORDER_LINES_2 - P201805               |   ORDER_LINES - P201805
  logical reads            2,224      |     logical reads           17,472
  db block changes         1,264      |     db block changes        16,528
                                      |
ORDER_LINES_2 - P201806               |   ORDER_LINES - P201806
  logical reads            2,624      |     logical reads           16,800
  db block changes         1,328      |     db block changes        16,160
                                      |
ORDER_LINES_2 - P201807               |   ORDER_LINES - P201807
  logical reads            1,376      |     logical reads           10,368
  db block changes           864      |     db block changes        10,752
                                      |
ORL2_PK                               |   ORL_PK
  logical reads           10,640      |     logical reads          206,352
  db block changes         7,024      |     db block changes       104,656

The right hand data set does an extra 100,000 logical reads on the ORD_PK index (top set of lines), which I think are 100,000 gets on the index root block – the block that stayed pinned throughout the insert for the table on the left. The numbers don’t quite add up, so there’s some extra complexity that I haven’t guessed correctly.

The insert into the ORL[2]_PK index (lines) is single row processed for the right hand table – with, I think, the logical reads recording two current gets per insert.

Every partition of the table, except the last, shows roughly 15,000 extra db block changes, totalling a difference of about 100,000 db block changes corresponding to the single rows being inserted. Then ORL[2]_PK shows another 100,000 db block changes, giving us the 200,000 we saw as redo entries and the 400,000 (when doubled up to allow for the undo) db block changes that we saw in total.

Finally we need to explain the difference of 400,000 db block gets between the two sets of session stats – and I think this is the extra 100,000 for ORD_PK, the 100,000 for the table inserts, and 200,000 for the ORL[2]_PK index, which I think might be explained as 100,000 as a current get that checks for “duplicate key” and 100,000 gets to do the actual insert.

Bottom line, though – if you use reference partitioning every array insert seems to turn into single row processing, with the attendant increase in buffer gets, undo and redo generated, latch activity, and CPU used as Oracle checks, for every single row, which partition it should go into: there doesn’t seem to be any optimisation that caters for “this row belongs in the same partition as the previous row”. You may decide that this extra cost due to reference partitioning is worth it for the benefits that reference partitioning supplies – it’s all down to what your application does, especially in terms of aging data perhaps – but it’s nice to know that this cost is there so that you can do a better cost/benefit analysis.

Footnote:

Interested readers might like to extend this test to a multi-layered set of ref-partitioned tables to see if the increase in overheads is linear or geometric.

 

Keeping Intervals

Thu, 2018-03-15 03:03

I’ve recently been reminded of a blog post I wrote a couple of years ago that discussed the issue of running into the hard limit of 2^20 -1 as the number of segments for a (composite) partitioned table – a problem that could arise in a relatively short time if you used a large number of hash subpartitions in an interval/hash composite partitioned table (you get about 2 years and 10 months of daily partitions at 1,024 subpartitions per day, for example).

A natural follow-on from that article is to think through a strategy for dropping old partitions sufficiently early that you don’t hit the limit as new partitions are created. This, of course, pretty much defeats the point of interval partitioning – instead of planning to add partitions “just in time” you now have to eliminate them “just in time”. Amongst other issues, we’re going to find that interval partitioning manages to re-introduce a problem with range partitioning that Oracle got rid of in Oracle 10g.

So let’s test the obvious option: drop the oldest partition(s) in time to keep head-room for new partitions; for convenience we’ll start with a simple interval partitioned table with a few pre-declared range partitions and a few automatically generated interval partitions. All the examples here were run under 12.1.0.2:


rem
rem     Script:         pt_merge.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Feb 2018
rem

create table t1(id, v1, padding)
partition by range (id) interval (1e4)
(
        partition p10000 values less than (1e4),
        partition p20000 values less than (2e4),
        partition p30000 values less than (3e4),
        partition p40000 values less than (4e4),
        partition p50000 values less than (5e4)
)
nologging
as
with generator as (
        select
                rownum id
        from dual
        connect by
                level <= 1e4 -- > comment to avoid WordPress format issue
)
select
        rownum                          id,
        lpad(rownum,10,'0')             v1,
        lpad('x',100,'x')               padding
from
        generator       v1,
        generator       v2
where
        rownum <= 1e5 -- > comment to avoid WordPress format issue
;


begin
        dbms_stats.gather_table_stats(
                ownname     => user,
                tabname     => 'T1',
                method_opt  => 'for all columns size 1'
        );
end;
/



I’ve created 100,000 rows and, since the partitions I’ve pre-declared have an (unreachable) upper bound of only 50,000, Oracle will have added a further 6 partitions to the table to hold the data for values up to 110,000 (with just one row in the last partition). For testing purposes I’ve created the table in an otherwise empty tablespace so that when I check the block address of each segment I can see the location (and size) of the segments so far. So here’s the list of names and locations:

SEGMENT_NAME              PARTITION_NAME         HEADER_BLOCK     BLOCKS
------------------------- ---------------------- ------------ ----------
T1                        P10000                          128        256
T1                        P20000                          384        256
T1                        P30000                          640        256
T1                        P40000                          896        256
T1                        P50000                         1152        256
T1                        SYS_P69838                     1408        256
T1                        SYS_P69839                     1664        256
T1                        SYS_P69840                     1920        256
T1                        SYS_P69841                     2176        256
T1                        SYS_P69842                     2432        256
T1                        SYS_P69843                     2688        128

11 rows selected.

No surprises so far. So let’s pretend we know the dreaded ORA-14299 or ORA-14300 will be arriving soon and try to drop the first 5 partitions to keep the partition count below the limit. Here’s a cut-n-paste from an SQL*Plus session that tries to do that one partition at a time:

SQL> alter table t1 drop partition p10000;

Table altered.

SQL> alter table t1 drop partition p20000;

Table altered.

SQL> alter table t1 drop partition p30000;

Table altered.

SQL> alter table t1 drop partition p40000;

Table altered.

SQL> alter table t1 drop partition p50000;
alter table t1 drop partition p50000
                              *
ERROR at line 1:
ORA-14758: Last partition in the range section cannot be dropped

We can’t drop partition p50000 – it’s the highest partition that wasn’t created automatically, and we have to leave an “anchor” partition in place for interval partitioning to work from. By querying user_tab_partitions we can even see that this partition is flagged a little differently from the others:


select
        partition_name, interval, high_value 
from
        user_tab_partitions
where
        table_name = 'T1'
order by
        partition_position
;


PARTITION_NAME         INT HIGH_VALUE
---------------------- --- --------------------------
P50000                 NO  5e4
SYS_P69844             YES 60000
SYS_P69845             YES 70000
SYS_P69846             YES 80000
SYS_P69847             YES 90000
SYS_P69848             YES 100000
SYS_P69849             YES 110000

7 rows selected.

So, at first sight, we’re stuck. If we’re dropping old partitions we will eventually get to a point where there’s only one “real” range partition at the bottom and then we can’t drop any more historic partitions. There are two solutions to this problem, explained a long time ago here and here by Harald van Breederode.

Option 1

Convert the interval partitioned table to a range partitioned table and back again, and if you know the interval (and you can always look it up in the data dictionary) there’s a quick and dirty way of doing that. Here’s a cut-n-paste demonstrating the method and effect:


SQL> alter table t1 set interval (10000);

Table altered.

SQL> select partition_name, interval, high_value from user_tab_partitions where table_name = 'T1' order by partition_position ; 

PARTITION_NAME         INT HIGH_VALUE
---------------------- --- --------------------------
P10000                 NO  1e4
P20000                 NO  2e4
P30000                 NO  3e4
P40000                 NO  4e4
P50000                 NO  5e4
SYS_P69850             NO  60000
SYS_P69851             NO  70000
SYS_P69852             NO  80000
SYS_P69853             NO  90000
SYS_P69854             NO  100000
SYS_P69855             NO  110000

11 rows selected.

SQL> select table_name, partitioning_type, interval from user_part_tables;

TABLE_NAME           PARTITION INTERVAL
-------------------- --------- --------------------
T1                   RANGE     1E4

1 row selected.

Every single partition has just become a range-based partition, but the table is still interval partitioned. This is a tidy solution, but there’s one obvious, generic, drawback to the method.  The “theory” of interval partitioning is that you don’t have to pre-create partitions in anticipation of the data arriving – so what will happen if a (possibly bad) row arrives weeks ahead of schedule and you find that Oracle has created (say) partition 85,001 with a gap of 12,000 partitions between the current high partition and the new one. If you use this “convert to range and back” trick then you’ll have a single partition covering the entire range where you were expecting (eventually) to have 12,000 partitions. Every time you convert from interval to range and back you’d better have code that checks if there are any gaps first, and then does loads of “split partition” –  or comes up with some other strategy – to address the side effects.
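
The gap check itself doesn’t have to be sophisticated – here’s a rough sketch that assumes a numeric partitioning key and the fixed 1e4 interval of the demo, reading the (LONG) high_value column through PL/SQL:

declare
        prev_hv         number := null;
        curr_hv         number;
begin
        for r in (
                select  high_value
                from    user_tab_partitions
                where   table_name = 'T1'
                order by
                        partition_position
        ) loop
                curr_hv := to_number(r.high_value);     -- high_value holds a numeric literal here
                if prev_hv is not null and curr_hv - prev_hv > 1e4 then
                        dbms_output.put_line('Gap below high_value ' || curr_hv);
                end if;
                prev_hv := curr_hv;
        end loop;
end;
/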

Option 2

When you’ve got just one range partition left, merge the bottom two partitions – this makes the next partition up a range partition without affecting any other partitions. After recreating the original table and dropping the first 4 partitions this is how things go:


SQL> alter table t1 drop partition p50000;
alter table t1 drop partition p50000
                              *
ERROR at line 1:
ORA-14758: Last partition in the range section cannot be dropped


SQL> alter table t1 merge partitions for (45000), for (55000) into partition p_low;

Table altered.

SQL> select partition_name, interval, high_value from user_tab_partitions where table_name = 'T1' order by partition_position;

PARTITION_NAME         INTERVAL             HIGH_VALUE
---------------------- -------------------- --------------------------
P_LOW                  NO                   60000
SYS_P69863             YES                  70000
SYS_P69864             YES                  80000
SYS_P69865             YES                  90000
SYS_P69866             YES                  100000
SYS_P69867             YES                  110000

6 rows selected.

Is this too good to be true ? Of course it is, but you may have to pause for a moment to think why. When you merge two partitions Oracle copies the contents of the two segments into a new segment – always; even if one of the two segments is empty. When you do a “split partition” Oracle runs a check to see if the split would leave all the data in a single segment and if it would then Oracle doesn’t do any copying but simply plays clever games in the data dictionary – unfortunately Oracle doesn’t use the same sort of trick to optimise a merge.

So the merge partition mechanism carries less risk than the “interval/range/interval”, but you either pay the cost of the merge or you carefully code the mechanism so that the bottom two partitions are always empty when you merge: for example you might always leave the bottom (range) partition empty and use your scheduled code to truncate (or exchange out) the lowest interval partition, then do the merge.
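
In outline the scheduled code might do something like this – a sketch reusing the “for (value)” syntax from the example above, and assuming the bottom two partitions cover the 45,000 and 55,000 ranges:

alter table t1 truncate partition for (55000);
alter table t1 merge partitions for (45000), for (55000) into partition p_low;

With the bottom range partition kept empty by design and the lowest interval partition truncated first, the merge copies no data and costs very little.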

The good news

When you upgrade to 12.2.0.1 you can drop the lowest partition – and Oracle will simply turn the lowest interval partition currently in existence into a range partition. (That may be a bit of a nuisance if there’s a gap between the range partition and the current lowest interval partition.)

The Bad News

It doesn’t really matter which strategy you use to deal with this problem (even if you’ve upgraded to 12.2) – you still pay one other penalty for both mechanisms. And that’s the bit which re-introduces a problem that last existed in 9i.

Ask yourself “How does Oracle know which interval a partition is for and what the limit is on the partitioning key ?” Then look at the data dictionary, or maybe build a very simple model and trace what happens when you use either of the methods above – but in your model create a significant number of partitions first. I’m going to take the data dictionary method – starting from the point where I’ve created and populated the table. Again this is cut-n-paste, and do note that I switch to the sys account after creating the table:


SQL> select object_id, object_name, subobject_name from user_objects;

 OBJECT_ID OBJECT_NAME          SUBOBJECT_NAME
---------- -------------------- ----------------------
    185164 T1
    185165 T1                   P10000
    185166 T1                   P20000
    185167 T1                   P30000
    185168 T1                   P40000
    185169 T1                   P50000
    185170 T1                   SYS_P69868
    185171 T1                   SYS_P69869
    185172 T1                   SYS_P69870
    185173 T1                   SYS_P69871
    185174 T1                   SYS_P69872
    185175 T1                   SYS_P69873

12 rows selected.

SQL> connect / as sysdba
Connected.

SQL> select obj#, dataobj#, part# from tabpart$ where bo# = 185164 order by part#;

      OBJ#   DATAOBJ#      PART#
---------- ---------- ----------
    185165     185165         10
    185166     185166         20
    185167     185167         30
    185168     185168         40
    185169     185169         50
    185170     185170 2147483648
    185171     185171 2147483649
    185172     185172 2147483650
    185173     185173 2147483651
    185174     185174 2147483652
    185175     185175 2147483653

11 rows selected.

I’ve queried user_objects to find the object_id of the table then used that as the “base object number” (bo#) to query tabpart$, which holds the table partition definitions. Note how there are 5 partitions where the partition number goes up 10 at a time, and 6 where it goes up one at a time. Prior to 10g (and interval partitions, of course) the stored partition number would increase in steps of 1, but if you wanted to do a split, merge or drop partition (and the last of the three was the most significant one) every single partition above the split/merge/drop point would have to be renumbered – a single row update to the data dictionary for each partition to keep the numbering intact. The steps of 10 were introduced in 10g to deal with the inherent performance problems – particularly the shared pool catastrophe that this could cause.

The steps of 1 for interval partitions allow Oracle to keep track (easily) of what high_value each partition represents, and the highest legal partition. Try inserting the value 1,000,000 into the table and re-run the query against tabpart$ and you’ll see Oracle adding part# = 2147483743. So what do you think is going to happen if you try to apply the two mechanisms ?
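
As a concrete sketch of that experiment (the column list matches the demo table t1):

insert into t1 values (1000000, lpad(1000000,10,'0'), lpad('x',100,'x'));
commit;

Re-run the tabpart$ query and you should see the new part# (2147483743) appear well above the existing interval entries, reflecting the gap of skipped intervals.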

If you do the interval/range/interval switch every interval part# will be renumbered to follow the “increment by 10” pattern. If you drop partitions p10000 to p40000 nothing happens to the existing part# values until you get to the command to merge p50000 with the next partition up, and then you see this:


SQL> alter table test_user.t1 merge partitions for (45000), for (55000) into partition p_low;

Table altered.

SQL> select obj#, dataobj#, part# from tabpart$ where bo# = 185164 order by part#;

      OBJ#   DATAOBJ#      PART#
---------- ---------- ----------
    185177     185177         10
    185171     185171 2147483648
    185172     185172 2147483649
    185173     185173 2147483650
    185174     185174 2147483651
    185175     185175 2147483652
    185176     185176 2147483742

7 rows selected.


The newly merged partition is a new object, of course, so has a completely new obj# and dataobj#, and it’s been given the part# of 10 (the lowest value for a clean range-partitioned object). Every single interval partition has had its part# decreased by one. The lowest possible interval partition is always given the part# of 2147483648 (0x80000000) and the partition numbering increments by 1 from there onwards. (The numbering gets a little more subtle when you have composite partitioning but a similar approach takes place in tabcompart$).
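(A cross-check that doesn’t need the sys account: recent versions expose an interval column in the documented view user_tab_partitions, flagging which partitions are interval partitions – though note that partition_position in the view is the logical position counting from 1, not the raw part# stored in tabpart$. A quick sanity query, run as the table owner:)

select partition_name, partition_position, interval
from   user_tab_partitions
where  table_name = 'T1'
order by partition_position;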

Pause for thought – if you’re thinking of creating an interval partitioned table that could reach a running level of 1 million partitions, and you start to get rid of old partitions in any version of Oracle, then each partition “drop/merge” will update about 1 million rows in the data dictionary – and that’s assuming you don’t have any local indexes that will need to be renumbered in the same way!
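To put a rough number on it (an illustration, not a measurement): with 1,000,000 interval partitions and, say, 3 local indexes, a single merge at the bottom of the table means about 1,000,000 single-row updates to tabpart$ plus 3 x 1,000,000 to indpart$ – roughly 4 million dictionary updates for one piece of housekeeping.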

Here’s a critical part of the output from tkprof from a larger test: I recreated the table with 1,000,000 rows – which means 101 partitions – and created a local index on it, then dropped the first 4 partitions and enabled tracing just before merging the bottom interval partition with the anchor range partition.
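The tracing commands themselves aren’t shown in the post; something like the following would do it – a sketch only, assuming extended SQL trace via event 10046 and the same key values as the earlier example:

alter session set tracefile_identifier = 'merge_test';
alter session set events '10046 trace name context forever, level 8';

-- merge the anchor range partition with the bottom interval partition
alter table test_user.t1 merge partitions for (45000), for (55000) into partition p_low;

alter session set events '10046 trace name context off';

-- then run tkprof against the resulting trace file to get the figures below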


update indpart$ set dataobj# = :1, part# = :2, flags = :3, ts# = :4, file# =
  :5, block# = :6, pctfree$ = :7, initrans = :8, maxtrans = :9, analyzetime =
  :10, samplesize = :11, rowcnt = :12, blevel = :13, leafcnt = :14, distkey =
  :15, lblkkey = :16, dblkkey = :17, clufac = :18, pctthres$ = :19
where
 obj# = :20


call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse       94      0.00       0.00          0          0          0           0
Execute     94      0.00       0.01          0         94        480          94
Fetch        0      0.00       0.00          0          0          0           0
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total      188      0.01       0.01          0         94        480          94


update tabpart$ set dataobj# = :1, part# = :2, ts# = :3, file# = :4, block# =
  :5, pctfree$ = :6, pctused$ = :7, initrans = :8, maxtrans = :9, flags = :10,
   analyzetime = :11, samplesize = :12, rowcnt = :13, blkcnt = :14, empcnt =
  :15, avgspc = :16, chncnt = :17, avgrln = :18
where
 obj# = :19


call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse       94      0.00       0.00          0          0          0           0
Execute     94      0.00       0.00          0        188        489          94
Fetch        0      0.00       0.00          0          0          0           0
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total      188      0.00       0.00          0        188        489          94

That’s not a lot of work for my little example with fewer than 100 surviving partitions – the 94 single-row updates to each of tabpart$ and indpart$ correspond to the interval partitions that had to be renumbered after the merge – but when you’ve got a million of them, with a handful of indexes, and the partitions have been created over time, that’s going to turn into a lot of work that will disrupt the shared pool for a long time, generate a lot of redo, and do a lot of disk reads and writes.

So be cautious with interval partitioning – even in 12.2 (and 18.1, possibly) the ease of use may disappear if you realise too late that you’re going to get into a cycle of partition maintenance.

Footnote for composite partitioning – the limit of 2^20-1 segments (hence subpartitions) still applies, but the necessary update is relevant only at the partition level, not at the subpartition level. The objects updated are tabcompart$ and indcompart$.
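(If you want to check the composite level yourself, a sketch of the equivalent dictionary query – run as sys:)

select obj#, dataobj#, part#
from   tabcompart$
where  bo# = 185164        -- placeholder: use your composite table's object_id
order by part#;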

Update (included for ironic effect)

The day I posted this note my “Oracle Support Hot Topics” email arrived with a report of the following bug:

Bug 19294302 : DBMS_REDEFINITION DOES NOT WORK WITH INTERVAL PARTITIONS

This was reported for 11.2.0.4, fixed in 12.2. The rediscovery information is:

ORA-14024 during copy_table_dep when the interim table is interval partitioned.

The problem arises if you change a table from simple range partitioning to range with interval – so it might be relevant if you have a strategy of doing the interval/range/interval trick.
