Jonathan Lewis – Oracle Scratchpad

12c Temporary

Fri, 2014-03-14 12:41

Just one of those little snippets to cover something new and remind of something old. A single session can now have three (or more) temporary tablespaces in use at the same time for different reasons.

  • In 12c you can set parameter temp_undo_enabled to true, at which point the undo relating to global temporary tables (GTTs) will be written into the database default temporary tablespace, which means it won’t generate redo. As an interesting side effect this also means that you can do DML on temporary objects in a physical standby database. Currently the undo segment shows as type “UNDEFINED” in v$sort_usage. A detail to watch out for, though – it looks as if each session may get its own temporary undo segment – so be careful about specifying the extent size and tempfile size for the database default temporary tablespace.
  • In 11g you can specify a tablespace (though not a tablespace group) in the ‘create global temporary table’ statement, which means you can keep the I/O generated by GTTs separated from the I/O resulting from sorts and hash joins etc. spilling to disc; in fact you could specify a different temporary tablespace for every GTT if you wanted to – and I could imagine a case for having a couple of different temporary tablespaces to accommodate GTTs with very different usage characteristics. (Unfortunately you still can’t specify a tablespace in the dbms_lob.createtemporary() function.) If you don’t specify a tablespace for a GTT it will go into the default temporary tablespace of the user who is using it (not the database default, and not the default for the definer of the GTT). If you create indexes on a GTT they will automatically go into the same tablespace as the table – there’s a small sketch of the first two features after this list.
  • Finally, of course, there’s the default temporary tablespace for the user and this is where GTTs will go if they don’t have a tablespace specified, and where all the scratch data (sorts, hash tables, factored subqueries et al.) will go.
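As a quick sketch of the first two features – temp_undo_enabled is the 12c parameter described above, while the tablespace name gtt_temp is an assumption (it would have to be a temporary tablespace you’d already created):

alter session set temp_undo_enabled = true;

create global temporary table gtt1 (
	id		number,
	small_vc	varchar2(10)
)
on commit preserve rows
tablespace gtt_temp	-- keeps GTT blocks away from sort/hash spill
;

create index gtt1_i1 on gtt1(id);	-- follows the table into gtt_temp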

This combination means, of course, that you could manage to do a single “insert as select” writing a GTT to one temporary tablespace, with its undo going to a second temporary tablespace, and the spill from a sort or hash join in the select going to a third. The flexibility probably won’t make much difference to performance (for most people), but it’s possible that it will make it easier to monitor where the work is coming from if you’re ever in the position where your single temporary tablespace is subject to a lot of I/O.

Footnote:

In the past I’ve advised DBAs to set up a small number of tablespaces (or tablespace groups) so that they can allocate different classes of users – typically grouped by business function – to different temporary tablespaces. The ability to allocate GTTs to temporary tablespaces allows a further degree of refinement to this strategy.


Shrink Space

Thu, 2014-03-13 01:08

Here’s a lovely effect looking at v$lock (on 11.2.0.4)

select
        sid, type, id1, id2, lmode, request, ctime, block
from
        V$lock
where   sid in (
                select  sid
                from    V$session
                where   username = 'TEST_USER'
        )
order by
        sid, type desc
;

       SID TY        ID1        ID2      LMODE    REQUEST      CTIME      BLOCK
---------- -- ---------- ---------- ---------- ---------- ---------- ----------
       145 TX     589833       7518          6          0        489          0
           TM      92275          0          2          6        489          1
           SK          7   25165955          6          0        489          0
           AE        100          0          4          0        582          0

       148 TX     524308       7383          6          0        490          0
           TM      92275          0          2          6        490          1
           SK          7   25173379          6          0        490          0
           AE        100          0          4          0        998          0

You’ll notice I’ve got two sessions holding a TM lock on the same table (object 92275) in mode 2 (sub-share) and waiting for an exclusive lock on the same table. Both sessions are recording the fact that they are blocking something else. You’ll have to trust me when I say there are no other user sessions on the system at this point, and none of the background sessions is doing anything with that table.

The clue to what’s happening is the SK lock – it’s the “segment shrink” lock. I had two sessions start an “alter index I_n shrink space” (two different indexes on the same table) at the same time. The problem is that “shrink space” without the “compact” tries to drop the highwater mark on the index’s allocated space after completing the shrink phase – and there’s a defect in the internal code that tries to get the required exclusive lock on the underlying table: it doesn’t seem to allow for all the possible ways you can fail to get the lock. If you look at v$session_wait_history for either of these sessions, you’ll see something like the following:


  SID  SEQ#     EVENT# EVENT                            P1         P2         P3  WAIT_TIME WAIT_TIME_MICRO TIME_SINCE_LAST_WAIT_MICRO
----- ----- ---------- ------------------------ ---------- ---------- ---------- ---------- --------------- --------------------------
  145     1        798 Wait for shrink lock              0          0          0         10           99954                      70137
          2        235 enq: TM - contention     1414332422      92275          0        600         6002022                        101
          3        798 Wait for shrink lock              0          0          0         10          100723                      69335
          4        235 enq: TM - contention     1414332422      92275          0        600         6001589                         81
          5        798 Wait for shrink lock              0          0          0         10          100500                      69542
          6        235 enq: TM - contention     1414332422      92275          0        600         6002352                         86
          7        798 Wait for shrink lock              0          0          0         10          100618                      69145
          8        235 enq: TM - contention     1414332422      92275          0        600         6001545                        144
          9        798 Wait for shrink lock              0          0          0         10          100996                      69445
         10        235 enq: TM - contention     1414332422      92275          0        600         6002744                        310

The attempt to acquire the TM enqueue (lock) times out every three seconds – and I think the session then releases and re-acquires the SK lock before trying to re-acquire the TM lock – and it’s never going to get it.

The alert log, by the way, looked like this:


Wed Mar 12 12:53:27 2014
ORA-00060: Deadlock detected. More info in file /u01/app/oracle/diag/rdbms/test/test/trace/test_ora_2242.trc.
Wed Mar 12 12:53:30 2014
ORA-00060: Deadlock detected. More info in file /u01/app/oracle/diag/rdbms/test/test/trace/test_ora_1832.trc.
ORA-00060: Deadlock detected. More info in file /u01/app/oracle/diag/rdbms/test/test/trace/test_ora_2242.trc.
ORA-00060: Deadlock detected. More info in file /u01/app/oracle/diag/rdbms/test/test/trace/test_ora_1832.trc.
ORA-00060: Deadlock detected. More info in file /u01/app/oracle/diag/rdbms/test/test/trace/test_ora_2242.trc.
ORA-00060: Deadlock detected. More info in file /u01/app/oracle/diag/rdbms/test/test/trace/test_ora_1832.trc.
Wed Mar 12 12:53:40 2014
ORA-00060: Deadlock detected. More info in file /u01/app/oracle/diag/rdbms/test/test/trace/test_ora_2242.trc.
ORA-00060: Deadlock detected. More info in file /u01/app/oracle/diag/rdbms/test/test/trace/test_ora_1832.trc.
ORA-00060: Deadlock detected. More info in file /u01/app/oracle/diag/rdbms/test/test/trace/test_ora_2242.trc.
Wed Mar 12 12:53:43 2014
ORA-00060: Deadlock detected. More info in file /u01/app/oracle/diag/rdbms/test/test/trace/test_ora_1832.trc.
ORA-00060: Deadlock detected. More info in file /u01/app/oracle/diag/rdbms/test/test/trace/test_ora_2242.trc.

I’ve said it before, and I keep repeating it when people say “Oracle resolves deadlocks automatically”: Oracle does NOT resolve deadlocks automatically – one of the sessions will roll back its last DML statement to clear the deadlock, but the other session will (almost invariably) still be waiting. It’s up to the application to do something sensible to resolve the deadlock after it receives the ORA-00060 error.
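In application terms “something sensible” usually means catching the error, rolling back the whole transaction, and then retrying or reporting. Here’s a minimal PL/SQL sketch (the tables t1 and t2 are hypothetical stand-ins for your application DML):

declare
	deadlock exception;
	pragma exception_init(deadlock, -60);
begin
	update t1 set n1 = n1 + 1 where id = 1;
	update t2 set n1 = n1 + 1 where id = 1;
	commit;
exception
	when deadlock then
		rollback;	-- clear the whole transaction, then retry or report
end;
/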

Don’t ask about the trace files – but they had both reached 400MB by the time I finished this note.

Strategy Note

  • If you are going to shrink objects, it’s probably best to do it in two steps: “shrink space compact”, followed by “shrink space”.
  • If you’re going to try to use multiple sessions to shrink several indexes as quickly as possible, make sure there’s no way that two sessions can try to “shrink space” on the indexes on the same table at the same time.
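In SQL terms the two-step strategy is simply this (a sketch for a hypothetical index i1 – the “compact” step does the long-running work of repacking the index, so the second step should need its exclusive table lock only for the brief moment it takes to drop the highwater mark):

alter index i1 shrink space compact;
alter index i1 shrink space;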

Footnote:

This article was prompted by the recent appearance of bug 18388128 on MoS.


Parallel Plans

Tue, 2014-03-11 01:39

This is the directory for a short series I wrote discussing how to interpret parallel execution plans in newer versions of Oracle.

Temporary footnote: I’ll be repeating November’s 90 minute webinar on parallel execution plans on 15th April. I’ll also be doing a full-day webinar on trouble-shooting (spread over two days) on 16th/17th April.

For other aspects of parallel execution, there are several articles by Randolf Geist, published on his own blog or on Oracle’s Technet.


Duplicate database

Mon, 2014-03-10 12:19

One of the people attending my seminar in Munich last week has emailed me some details about a nasty little surprise you may get if you’re thinking about TSPITR (tablespace point in time recovery), and happen to have a few materialized views in your database.

You might have wanted to use the “duplicate target database” command from rman with the “skip tablespace” option to create a duplicate database, excluding tablespaces that you’re not interested in, if you needed to recover a particular tablespace (or set of tablespaces) to a specific point in time. (Once you’ve done that you can extract the data that you want and then discard the duplicate database.) But there’s a requirement for recovery: the set of tablespaces should be “self-contained” – and what does “self-contained” mean ?

Amongst other things it means that none of the tablespaces you skip should contain materialized views. Note carefully, that’s not saying the schema you’re interested in shouldn’t have created any materialized views, or that the tablespaces you want to duplicate shouldn’t contain a base table for a materialized view in another tablespace; it really does mean – if you’ve got ANY materialized view ANYWHERE in the database, you have to duplicate the tablespaces holding them as part of the process.

Here’s the restriction note from MoS (note the exclamation mark – maybe the Oracle analyst was surprised too):

Restriction Note: 
You MUST NOT exclude 
- SYS-owned objects 
- or tablespaces with rollback segments, 
- nor tablespaces containing “MATERIALIZED VIEWS”! 

Implementation suggestion – always put materialized views (and materialized view logs, and indexes on materialized views) in their own tablespace(s), just in case one day you want to do a tablespace point in time recovery and find you’ve got a few bits of materialized views scattered all around your database.
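Here’s a sketch of the suggestion – the tablespace name mv_data and the source table t1 are my inventions, not part of the original note:

create materialized view mv1
	tablespace mv_data	-- tablespace reserved for materialized views
	build immediate
	refresh complete on demand
as
select
	n1, count(*) ct
from
	t1
group by
	n1
;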

Footnote:

When I first heard this comment I didn’t believe it (but take a look at MoS document ID: 1287276.1 if you’re finding it hard to believe). Naturally my engine of doom went into overdrive immediately after I was convinced and made me wonder what would happen in a 12c container database with several pluggable databases; obviously the existence of a materialized view in one pluggable database shouldn’t have any impact on TSPITR for another pluggable database – but I wouldn’t mind if someone tested the hypothesis and reported back what they found.

Update:

It’s just occurred to me that this type of TSPITR problem simply won’t exist if you’re using Delphix as part of your working environment.


Parallel Execution – 5

Mon, 2014-03-10 07:30

In the last article (I hope) of this series I want to look at what happens when I change the parallel distribution method on the query that I’ve been using in my previous demonstrations. This was a query first introduced in a note on Bloom Filters (opens in a separate window) where I show two versions of a four-table parallel hash join, one using the broadcast distribution mechanism throughout, the other using the hash distribution method. For reference you can review the table definitions and plan (with execution stats) for the serial join in this posting (also opens in a separate window).

To change distribution methods from the broadcast example to the hash example I’ve simply changed a few hints in my code. Here are two sets of hints showing what I’ve done; the first is a repeat from the third article showing the broadcast example, the second shows the small change needed to get the hash example:


/*+
    leading(t4 t1 t2 t3)
    full(t4) parallel(t4, 2)
    use_hash(t1) swap_join_inputs(t1) pq_distribute(t1 none broadcast)
    full(t1) parallel(t1, 2)
    use_hash(t2) swap_join_inputs(t2) pq_distribute(t2 none broadcast)
    full(t2) parallel(t2, 2)
    use_hash(t3) swap_join_inputs(t3) pq_distribute(t3 none broadcast)
    full(t3) parallel(t3, 2)
    monitor
*/

/*+
    leading(t4 t1 t2 t3)
    full(t4) parallel(t4, 2)
    use_hash(t1) swap_join_inputs(t1) pq_distribute(t1 hash hash)
    full(t1) parallel(t1, 2)
    use_hash(t2) swap_join_inputs(t2) pq_distribute(t2 hash hash)
    full(t2) parallel(t2, 2)
    use_hash(t3) swap_join_inputs(t3) pq_distribute(t3 hash hash)
    full(t3) parallel(t3, 2)
    monitor
*/

Because of the combination of the leading() hint with the use_hash() and swap_join_inputs() hints the plan WILL still build in-memory hash tables from t1, t2, and t3 and it WILL still probe each hash table in turn with the rows (that survive) from t4; but the order of activity in the hash distribution plan will be dramatically different from the order in the serial and parallel broadcast plans, where Oracle actually built the in-memory hash tables in the order t3, t2, t1.

Here – with a little cosmetic adjustment – is the parallel execution plan using hash distribution on 11.2.0.4, captured from memory with rowsource execution stats enabled (the 12c plan would report “PX SEND HYBRID HASH” operators with an associated “STATISTICS COLLECTOR” operator showing that adaptive execution was a possibility – with three points at which the plan might switch from hash distribution to broadcast):


--------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                          | Name     | Starts | Cost (%CPU)| E-Time   |    TQ  |IN-OUT| PQ Distrib | A-Rows |   A-Time   | Buffers | Reads  |
--------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                   |          |      1 |   437 (100)|          |        |      |            |      1 |00:00:00.08 |      16 |      5 |
|   1 |  SORT AGGREGATE                    |          |      1 |            |          |        |      |            |      1 |00:00:00.08 |      16 |      5 |
|   2 |   PX COORDINATOR                   |          |      1 |            |          |        |      |            |      2 |00:00:00.08 |      16 |      5 |
|   3 |    PX SEND QC (RANDOM)             | :TQ10006 |      0 |            |          |  Q1,06 | P->S | QC (RAND)  |      0 |00:00:00.01 |       0 |      0 |
|   4 |     SORT AGGREGATE                 |          |      2 |            |          |  Q1,06 | PCWP |            |      2 |00:00:00.01 |       0 |      0 |
|*  5 |      HASH JOIN                     |          |      2 |   437   (3)| 00:00:03 |  Q1,06 | PCWP |            |     27 |00:00:00.01 |       0 |      0 |
|   6 |       JOIN FILTER CREATE           | :BF0000  |      2 |     2   (0)| 00:00:01 |  Q1,06 | PCWP |            |      3 |00:00:00.01 |       0 |      0 |
|   7 |        PX RECEIVE                  |          |      2 |     2   (0)| 00:00:01 |  Q1,06 | PCWP |            |      3 |00:00:00.01 |       0 |      0 |
|   8 |         PX SEND HASH               | :TQ10004 |      0 |     2   (0)| 00:00:01 |  Q1,04 | P->P | HASH       |      0 |00:00:00.01 |       0 |      0 |
|   9 |          PX BLOCK ITERATOR         |          |      2 |     2   (0)| 00:00:01 |  Q1,04 | PCWC |            |      3 |00:00:00.01 |       4 |      2 |
|* 10 |           TABLE ACCESS FULL        | T3       |      2 |     2   (0)| 00:00:01 |  Q1,04 | PCWP |            |      3 |00:00:00.01 |       4 |      2 |
|  11 |       PX RECEIVE                   |          |      2 |   435   (3)| 00:00:03 |  Q1,06 | PCWP |            |     27 |00:00:00.01 |       0 |      0 |
|  12 |        PX SEND HASH                | :TQ10005 |      0 |   435   (3)| 00:00:03 |  Q1,05 | P->P | HASH       |      0 |00:00:00.01 |       0 |      0 |
|  13 |         JOIN FILTER USE            | :BF0000  |      2 |   435   (3)| 00:00:03 |  Q1,05 | PCWP |            |     27 |00:00:00.01 |       0 |      0 |
|* 14 |          HASH JOIN BUFFERED        |          |      2 |   435   (3)| 00:00:03 |  Q1,05 | PCWP |            |    630 |00:00:00.01 |       0 |      0 |
|  15 |           JOIN FILTER CREATE       | :BF0001  |      2 |     2   (0)| 00:00:01 |  Q1,05 | PCWP |            |      3 |00:00:00.01 |       0 |      0 |
|  16 |            PX RECEIVE              |          |      2 |     2   (0)| 00:00:01 |  Q1,05 | PCWP |            |      3 |00:00:00.01 |       0 |      0 |
|  17 |             PX SEND HASH           | :TQ10002 |      0 |     2   (0)| 00:00:01 |  Q1,02 | P->P | HASH       |      0 |00:00:00.01 |       0 |      0 |
|  18 |              PX BLOCK ITERATOR     |          |      2 |     2   (0)| 00:00:01 |  Q1,02 | PCWC |            |      3 |00:00:00.01 |       4 |      2 |
|* 19 |               TABLE ACCESS FULL    | T2       |      2 |     2   (0)| 00:00:01 |  Q1,02 | PCWP |            |      3 |00:00:00.01 |       4 |      2 |
|  20 |           PX RECEIVE               |          |      2 |   432   (3)| 00:00:03 |  Q1,05 | PCWP |            |    632 |00:00:00.01 |       0 |      0 |
|  21 |            PX SEND HASH            | :TQ10003 |      0 |   432   (3)| 00:00:03 |  Q1,03 | P->P | HASH       |      0 |00:00:00.01 |       0 |      0 |
|  22 |             JOIN FILTER USE        | :BF0001  |      2 |   432   (3)| 00:00:03 |  Q1,03 | PCWP |            |    632 |00:00:00.09 |       0 |      0 |
|* 23 |              HASH JOIN BUFFERED    |          |      2 |   432   (3)| 00:00:03 |  Q1,03 | PCWP |            |  14700 |00:00:00.09 |       0 |      0 |
|  24 |               JOIN FILTER CREATE   | :BF0002  |      2 |     2   (0)| 00:00:01 |  Q1,03 | PCWP |            |      3 |00:00:00.01 |       0 |      0 |
|  25 |                PX RECEIVE          |          |      2 |     2   (0)| 00:00:01 |  Q1,03 | PCWP |            |      3 |00:00:00.01 |       0 |      0 |
|  26 |                 PX SEND HASH       | :TQ10000 |      0 |     2   (0)| 00:00:01 |  Q1,00 | P->P | HASH       |      0 |00:00:00.01 |       0 |      0 |
|  27 |                  PX BLOCK ITERATOR |          |      2 |     2   (0)| 00:00:01 |  Q1,00 | PCWC |            |      3 |00:00:00.01 |       4 |      2 |
|* 28 |                   TABLE ACCESS FULL| T1       |      2 |     2   (0)| 00:00:01 |  Q1,00 | PCWP |            |      3 |00:00:00.01 |       4 |      2 |
|  29 |               PX RECEIVE           |          |      2 |   427   (2)| 00:00:03 |  Q1,03 | PCWP |            |  14700 |00:00:00.08 |       0 |      0 |
|  30 |                PX SEND HASH        | :TQ10001 |      0 |   427   (2)| 00:00:03 |  Q1,01 | P->P | HASH       |      0 |00:00:00.01 |       0 |      0 |
|  31 |                 JOIN FILTER USE    | :BF0002  |      2 |   427   (2)| 00:00:03 |  Q1,01 | PCWP |            |  14700 |00:00:00.05 |    6044 |   6018 |
|  32 |                  PX BLOCK ITERATOR |          |      2 |   427   (2)| 00:00:03 |  Q1,01 | PCWC |            |  14700 |00:00:00.04 |    6044 |   6018 |
|* 33 |                   TABLE ACCESS FULL| T4       |     26 |   427   (2)| 00:00:03 |  Q1,01 | PCWP |            |  14700 |00:00:00.04 |    6044 |   6018 |
--------------------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   5 - access("T3"."ID"="T4"."ID3")
  10 - access(:Z>=:Z AND :Z<=:Z)
       filter((TO_NUMBER("T3"."SMALL_VC")=1 OR TO_NUMBER("T3"."SMALL_VC")=2 OR
               TO_NUMBER("T3"."SMALL_VC")=3))
  14 - access("T2"."ID"="T4"."ID2")
  19 - access(:Z>=:Z AND :Z<=:Z)
       filter((TO_NUMBER("T2"."SMALL_VC")=1 OR TO_NUMBER("T2"."SMALL_VC")=2 OR
               TO_NUMBER("T2"."SMALL_VC")=3))
  23 - access("T1"."ID"="T4"."ID1")
  28 - access(:Z>=:Z AND :Z<=:Z)
       filter((TO_NUMBER("T1"."SMALL_VC")=1 OR TO_NUMBER("T1"."SMALL_VC")=2 OR
               TO_NUMBER("T1"."SMALL_VC")=3))
  33 - access(:Z>=:Z AND :Z<=:Z)
       filter(SYS_OP_BLOOM_FILTER(:BF0000,"T4"."ID1"))

There are a couple of significant points that are very easy to pick out in this plan. First, we have a number of “JOIN FILTER CREATE/USE” lines that did not appear in the broadcast plan; second, we can only see one sys_op_bloom_filter() in the predicate section rather than three (don’t worry, it’s – partly – a reporting defect); finally we have seven virtual tables (table queues :TQnnnnn) in this plan rather than four, and those virtual tables seem to be scattered rather more randomly around the plan.

To make it easier to understand what’s happened with a parallel execution plan, I usually also dump out the contents of v$pq_tqstat after running the query. The dump needs nothing more sophisticated than a simple select against the view – here’s a sketch (column formatting omitted; sorting server_type in descending order lists Producers before Consumers):
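
select
	dfo_number, tq_id, server_type, instance, process,
	num_rows, bytes, waits, timeouts, avg_latency
from
	v$pq_tqstat
order by
	dfo_number, tq_id, server_type desc, instance, process
;

And here’s the result after running the query above: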

DFO_NUMBER      TQ_ID SERVER_TYPE     INSTANCE PROCESS           NUM_ROWS      BYTES      WAITS   TIMEOUTS AVG_LATENCY
---------- ---------- --------------- -------- --------------- ---------- ---------- ---------- ---------- -----------
         1          0 Producer               1 P002                     3         69          1          0           0
                                             1 P003                     0         48          0          0           0
                      Consumer               1 P000                     2         62         30         16           0
                                             1 P001                     1         55         26         14           0

                    1 Producer               1 P002                  1476      35520          2          1           0
                                             1 P003                 13224     317880          1          0           0
                      Consumer               1 P000                  9800     235584         20         14           0
                                             1 P001                  4900     117816         20         14           0

                    2 Producer               1 P000                     3         69          0          0           0
                                             1 P001                     0         48          0          0           0
                      Consumer               1 P002                     2         62         33         19           0
                                             1 P003                     1         55         32         19           0

                    3 Producer               1 P000                   422       9754          0          0           0
                                             1 P001                   210       4878          0          0           0
                      Consumer               1 P002                   420       9708         33         19           0
                                             1 P003                   212       4924         32         18           0

                    4 Producer               1 P002                     3         69          1          0           0
                                             1 P003                     0         48          0          0           0
                      Consumer               1 P000                     2         62         42         20           0
                                             1 P001                     1         55         39         15           0

                    5 Producer               1 P002                    18        444          0          0           0
                                             1 P003                     9        246          0          0           0
                      Consumer               1 P000                    18        444         41         20           0
                                             1 P001                     9        246         39         16           0

                    6 Producer               1 P000                     1         60          0          0           0
                                             1 P001                     1         60          0          0           0
                      Consumer               1 QC                       2        120          1          0           0

So let’s work our way through the execution plan – if you want to put the plan and my comments side by side, this link will re-open this article in a second window.

Given the set of hints, and the intent I expressed at the start of the series, we hope to see Oracle building an in-memory hash table from each of t1, t2 and t3 in that order, following which it will scan t4 and probe t1, t2, and t3 in that order, and then aggregate the result. Let’s check that using the parallel plan rule of “follow the table queues”.

Table queue 0 covers lines 26 – 28, we scan t1 and distribute it by hash. We can see from the A-Rows column we found 3 rows and distributed them, and if we look at the output from v$pq_tqstat we find it matches – slaves 2 and 3 produced 3 rows, slaves 0 and 1 consumed 3 rows. Table queue 1 covers lines 30 – 33, we scan t4 and distribute it by hash. We can see from the A-Rows column we found 14,700 rows and distributed them, and again we can see the match in v$pq_tqstat – slaves 2 and 3 produced 14,700 rows and distributed them to slaves 0 and 1. But there’s an oddity here, and things start to get messy: from the predicate section we can see that we applied a Bloom filter on the ID1 column on the data we got from the tablescan, and the plan itself shows a Bloom filter (:BF0002) being used at line 31, but that Bloom filter is created at line 24 of the plan and line 24 has been associated with table queue 3. Now I know (because I constructed the data) that a perfect filter has been created and used at that point because 14,700 rows is exactly the volume of data that should eventually join between tables t1 and t4. It’s reasonable, I think, to say that the boundary between table queues 0 and 3 is a little blurred at lines 24/25 – the slaves that are going to populate table queue 3 are the ones that created the Bloom filter, but they’re not going to populate table queue 3 just yet.

So let’s move on to table queue 2. This covers lines 17-19 (looking at the TQ column) except I’m going to assume the same blurring of boundaries I claimed for table queue 0 – I’m going to say that table queue 2 expands into lines 15-19 (bringing in the PX RECEIVE and JOIN FILTER CREATE (:BF0001)). So our next step is to scan and distribute table t2, and build a Bloom filter from it. Again we look at v$pq_tqstat and see that in this case it’s slaves 0 and 1 which scan the table and distribute 3 rows to slaves 2 and 3, and we assume that slaves 2 and 3 will send a Bloom filter back to slaves 0 and 1.

Now we can move on to table queue 3: line 21 writes to table queue 3 by using lines 22, 23, 24, 25, and 29 according to the TQ column (but thanks to the blurring of the boundaries lines 24 and 25 were used “prematurely” to create the Bloom filter :BF0002 describing the results from table t1). So lines 24/25 read table queue 0 and built an in-memory hash table, simultaneously creating a Bloom filter and sending it back to slaves 2 and 3; then line 23 did a HASH JOIN BUFFERED, which means it copied the incoming data from table queue 1 (slaves 2 and 3, table t4) into a buffer and then used that buffer to probe its in-memory hash table and do the join; then line 22 applied a Bloom filter (:BF0001) to the result of the hash join although the filter won’t appear in the predicate section until version 12.1.0.1. Notice that line 23 (the join) produced 14,700 rows, demonstrating that our previous filter was a perfect filter, and then line 22 filtered out all but 632 rows. (Again, because I constructed the data I can tell you that the second Bloom filter has also worked with 100% accuracy – although v$pq_tqstat seems to show an extra 2 rows which I can’t account for and which don’t appear in the trace file).

So here’s another problem – we’re using another Bloom filter that we haven’t yet (apparently) created unless we accept my assumption of the blurring of the boundary at lines 15 and 16, where the plan shows two lines associated with table queue 5 even though I need them to be associated with table queue 2 so that they can produce the Bloom filter needed by table queue 3. Again, by the way, we can do the cross-check with TQ_ID 3 of v$pq_tqstat and see that slaves 0 and 1 produced 632 rows and sent them to slaves 2 and 3.

Before continuing, let’s rewrite the action so far as a series of bullet points:

  • Slaves 2,3 scan t1 and distribute to slaves 0,1
  • Slaves 0,1 build an in-memory hash table and a Bloom filter (:BF0002) for t1, and send the filter to slaves 2,3
  • Slaves 2,3 scan t4, use the Bloom filter (:BF0002) to eliminate data (luckily 100% perfectly) and distribute the remaining rows to slaves 0,1
  • Slaves 0,1 buffer the incoming data
  • Slaves 0,1 scan t2 and distribute to slaves 2,3
  • Slaves 2,3 build an in-memory hash table for the results from t2 and a Bloom filter (:BF0001) for t2, and send the filter to slaves 0,1
  • Slaves 0,1 use the buffered t4 to probe the in-memory hash table of t1 to do the join, testing join results against the Bloom filter (:BF0001) for t2, and distributing the surviving rows to slaves 2,3

The pattern of the last four steps will then repeat for the next hash join – and for longer joins the pattern will repeat up to, but excluding, the last join.

  • Slaves 2,3 buffer the incoming data (the result of joining t4, t1 and t2) – the buffering is implied by line 14 (which is labelled as an input for table queue 5)
  • Slaves 2,3 scan t3 and distribute to slaves 0,1 (reading lines 8,9,10 of the plan), cross-checking with TQ_ID 4 of v$pq_tqstat
  • Slaves 0,1 build an in-memory hash table for the results from t3 and a Bloom filter (:BF0000) for t3, and send the filter to slaves 2,3 (“sharing” lines 6 and 7 from table queue 6)
  • Slaves 2,3 use the buffered results from (t4/t1) to probe the in-memory hash table of t2 to do the join, testing join results against the Bloom filter (:BF0000) for t3, and distributing the surviving rows to slaves 0,1.

Again, we can check row counts – the hash join buffered at line 14 shows 630 rows coming from the hash join (i.e. the previous Bloom filter was perfect), and line 13 shows 27 rows surviving the final Bloom filter. Again my knowledge of the data tells me that the Bloom filter was a perfect filter. Cross-checking to TQ_ID 5 of v$pq_tqstat we see slaves 2 and 3 producing 27 rows and slaves 0 and 1 consuming them.

So at this point slaves 0,1 have an in-memory hash table for t3, and are receiving the filtered results of the join between t4, t1, and t2; the slaves have to join and aggregate the two data sets before forwarding a result to the query co-ordinator. Since the aggregation is a blocking operation (i.e. slaves 0,1 can’t send data to the co-ordinator until they’ve emptied virtual table 5 and aggregated all the incoming data) they don’t have to use the “hash join buffered” mechanism, so the pattern for the final part of the plan changes.

Lines 5, 6, 7, 11 show us the hash join (not buffered) with its two inputs (although lines 6 and 7 have, of course, been mentioned once already as the source of the Bloom filter used at line 13). Then line 4 shows slaves 0 and 1 aggregating their results; line 3 shows them forwarding the results to the query co-ordinator, line 2 shows the query co-ordinator receiving the results and line 1 shows it aggregating across the slave results ready to send to the end-user.

It’s a bit complicated, and the constant jumping back and forth through the execution plan lines (especially for the “shared” usage of the Bloom filter creation lines) makes it quite hard to follow, so I’ve drawn up a Powerpoint slide to capture the overall picture:
px_plan

I’ve put slaves 0 and 1 at the top of the picture, slaves 2 and 3 at the bottom, with the query co-ordinator in the middle at the right hand side. Time reads across the page from left to right, and that gives you the order in which data moves through table queues (and back, for Bloom filters). The annotations give you some idea of what data is moving. Note that I’ve used B1 to refer to the Bloom filter on table T1 (and ignored the numbering on Oracle’s :BFnnn entries). I’ve used red to highlight the data sets that are buffered, and put in curved arrows to show where the buffered data is subsequently brought back into play. I did try to add the various plan line numbers to the picture, but the volume of text made the whole thing incomprehensible – so I’ve left it with what I think is the best compromise of textual information and graphical flow.

I’ll just leave one final warning – if you want to reproduce my results, you’ll have to be careful about versions. I stuck with 11.2.0.4 as that’s the latest version of the most popular general release. There are differences in 12.1.0.1, and there are differences again if you try to emulate 11.2.0.4 by setting the optimizer_features_enable in 12.1.0.1 back to the earlier version.


Subquery Anomaly

Fri, 2014-03-07 02:57

Here’s an oddity that appeared on the OTN database forum last night:

We have this query in our application which works fine in 9i but fails in 11gR2 (on Exadata) giving an “ORA-00937: not a single-group group function” error….

… The subquery is selecting a column and it doesn’t have a group by clause at all. I am not sure how is this even working in 9i. I always thought that on a simple query using an aggregate function (without any analytic functions / clause), we cannot select a column without having that column in the group by clause. So, how 11g behaves was not a surprise but surprised to see how 9i behaves. Can someone explain this behaviour?

The poster supplied the suspect query, and it certainly looked as if it should never have worked – but I took a guess that the optimizer was doing some sort of transformation that concealed the problem before the optimizer managed to see the error. The subquery was a little odd because it was doing something it didn’t need to do, and my guess was that the optimizer had recognised the option to simplify the query and the simplification had “accidentally” removed the error. This turned out to be correct, but my guess about exactly what had happened to hide the error was wrong.

Having created a hypothesis I couldn’t resist checking it this morning, so here’s the test case (don’t pay any attention to the actual data I’ve generated, it was a cut-n-paste from a script that I had previously used for something completely different):

create table t1
as
select
	trunc((rownum-1)/15)	n1,
	trunc((rownum-1)/15)	n2,
	rpad(rownum,180)	v1
from all_objects
where rownum <= 3000
;

create table t2
as
select
	mod(rownum,200)		n1,
	mod(rownum,200)		n2,
	rpad(rownum,180)	v1
from all_objects
where rownum <= 3000
;

begin
 	dbms_stats.gather_table_stats(
 		user,
 		't1',
 		method_opt => 'for all columns size 1'
	);

	dbms_stats.gather_table_stats(
		user,
		't2',
		method_opt => 'for all columns size 1'
	);
end;
/

explain plan for
select
	/*+ qb_name(main) */
	*
from t1
where (n2,n1) in (
	select /*+
			qb_name(subq)
			unnest
		*/
		max(t2.n2), t2.n1
	from t2
	where t2.n1 = t1.n1
)
;

You’ll notice, of course, that I don’t have a group by clause at all, so the presence of the t2.n1 in the select list should lead to Oracle error: “ORA-00937: not a single-group group function”.

In versions from 8i to 11.1.0.7, this query could run, and its execution plan looked like this:


----------------------------------------------------------------
| Id  | Operation            | Name    | Rows  | Bytes | Cost  |
----------------------------------------------------------------
|   0 | SELECT STATEMENT     |         |   200 | 45200 |    46 |
|*  1 |  HASH JOIN           |         |   200 | 45200 |    46 |
|   2 |   VIEW               | VW_SQ_1 |   200 |  7800 |    31 |
|   3 |    HASH GROUP BY     |         |   200 |  2400 |    31 |
|   4 |     TABLE ACCESS FULL| T2      |  3000 | 36000 |    14 |
|   5 |   TABLE ACCESS FULL  | T1      |  3000 |   547K|    14 |
----------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - access("N2"="MAX(T2.N2)" AND "N1"="N1" AND "ITEM_1"="T1"."N1")

Notice how the optimizer has produced an inline view (VW_SQ_1) from the subquery, using it to drive a hash join; notice how that inline view has an aggregation operation (HASH GROUP BY) in it. In effect the optimizer has rewritten my query like this:

select
	t1.*
from	(
		select
			distinct max(t2.n2) max_n2, t2.n1 item_1, t2.n1
		from	t2
		group by
			t2.n1
	)	vw_sq_1,
	t1
where
	t1.n2 = vw_sq_1.max_n2
and	t1.n1 = vw_sq_1.n1
and	t1.n1 = vw_sq_1.item_1
;

There’s a clue about why this succeeded in the 10053 trace file, which includes the lines:

Subquery Unnesting on query block SEL$1 (#1)SU: Performing unnesting that does not require costing.
SU: Considering subquery unnest on query block SEL$1 (#1).
SU:   Checking validity of unnesting subquery SEL$2 (#2)
SU:   Passed validity checks.

Compared to the 11.2 lines:

Subquery Unnesting on query block MAIN (#1)SU: Performing unnesting that does not require costing.
SU: Considering subquery unnest on query block MAIN (#1).
SU:   Checking validity of unnesting subquery SUBQ (#2)
SU:     SU bypassed: Failed basic validity checks.
SU:   Validity checks failed.

Whatever check it was that Oracle introduced in 11.2 (maybe a check that the query block was inherently legal), unnesting failed – and if I add an /*+ no_unnest */ hint to the original subquery in the earlier versions of Oracle I get the expected ORA-00937.
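For reference, here’s the variant that raises the expected ORA-00937 in the earlier versions – it’s just the original query with the unnest hint swapped for no_unnest:

explain plan for
select
	/*+ qb_name(main) */
	*
from t1
where (n2,n1) in (
	select /*+
			qb_name(subq)
			no_unnest
		*/
		max(t2.n2), t2.n1
	from t2
	where t2.n1 = t1.n1
)
;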

The philosophical argument is left to the reader: was the original behaviour a bug, or is the new behaviour the bug ?



12c pq_replicate

Wed, 2014-03-05 11:39

Another day, another airport lounge – another quick note: one of the changes that appeared in 12c was a tweak to the “broadcast” distribution option of parallel queries. I mentioned this in a footnote to a longer article a couple of months ago; this note simply expands on that brief comment with an example. We’ll start with a simple two-table hash join – which I’ll first construct and demonstrate in 11.2.0.4:


create table t1
as
with generator as (
	select	--+ materialize
		rownum 	id
	from	all_objects
	where	rownum <= 3000
)
select
	rownum				n1,
	lpad(rownum,6,'0')		small_vc,
	lpad(rownum,200,'0')		padding
from
	generator	v1,
	generator	v2
where
	rownum <= 1000
;

create table t2
as
with generator as (
	select	--+ materialize
		rownum 	id
	from	all_objects
	where	rownum <= 3000
)
select
	1 + mod(rownum,10000)			n1,
	lpad(1 + mod(rownum,10000),6,'0')	small_vc,
	lpad(rownum,500,'0')			padding
from
	generator	v1,
	generator	v2
where
	rownum <= 20000
;

-- collect stats, no histograms.

select
  	/*+
  		leading(t1 t2)
 		parallel(t1 2)
 		parallel(t2 2)
 		use_hash(t2)
 	*/
 	t1.padding,
 	t2.padding
from 	t1, t2
where	t2.n1 = t1.n1
and	t2.small_vc = t1.small_vc
;

-------------------------------------------------------------------------------------------------
| Id  | Operation               | Name     | Rows  | Bytes | Cost  |    TQ  |IN-OUT| PQ Distrib |
-------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT        |          |  1000 |   707K|   135 |        |      |            |
|   1 |  PX COORDINATOR         |          |       |       |       |        |      |            |
|   2 |   PX SEND QC (RANDOM)   | :TQ10001 |  1000 |   707K|   135 |  Q1,01 | P->S | QC (RAND)  |
|*  3 |    HASH JOIN            |          |  1000 |   707K|   135 |  Q1,01 | PCWP |            |
|   4 |     PX RECEIVE          |          |  1000 |   207K|     4 |  Q1,01 | PCWP |            |
|   5 |      PX SEND BROADCAST  | :TQ10000 |  1000 |   207K|     4 |  Q1,00 | P->P | BROADCAST  |
|   6 |       PX BLOCK ITERATOR |          |  1000 |   207K|     4 |  Q1,00 | PCWC |            |
|   7 |        TABLE ACCESS FULL| T1       |  1000 |   207K|     4 |  Q1,00 | PCWP |            |
|   8 |     PX BLOCK ITERATOR   |          | 20000 |     9M|   131 |  Q1,01 | PCWC |            |
|   9 |      TABLE ACCESS FULL  | T2       | 20000 |     9M|   131 |  Q1,01 | PCWP |            |
-------------------------------------------------------------------------------------------------

In this plan slave set 2 scans table t1 in parallel and broadcasts the result set to slave set 1 (lines 5 – 7). The significance of the broadcast option is that each slave in slave set 2 sends all the rows it has read to every slave in slave set 1. For a fairly large table with a high degree of parallelism this could be a lot of inter-process communication; the total number of rows passing through the PX message pool is “DOP x number of rows filtered from t1”.

After a slave in slave set 1 has received the whole of the t1 result set it builds an in-memory hash table and starts scanning rowid ranges (PX BLOCK ITERATOR) from table t2, probing the in-memory hash table to effect the join (lines 3,4, 8,9). Since each slave has a copy of the whole result set from t1 it can scan any chunk of t2 and handle the contents locally. Moreover, because slave set 1 isn’t reading its second input from a virtual table it is able to write its output immediately to the virtual table (:TQ10001) that feeds the query coordinator with the result (lines 1,2) – we don’t have to do a “hash join buffered” operation and buffer the entire second input before starting to execute the join.

So how does 12c change things? With the same starting data and query, here’s the execution plan:

-----------------------------------------------------------------------------------------------
| Id  | Operation             | Name     | Rows  | Bytes | Cost  |    TQ  |IN-OUT| PQ Distrib |
-----------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |          |  1000 |   707K|   135 |        |      |            |
|   1 |  PX COORDINATOR       |          |       |       |       |        |      |            |
|   2 |   PX SEND QC (RANDOM) | :TQ10000 |  1000 |   707K|   135 |  Q1,00 | P->S | QC (RAND)  |
|*  3 |    HASH JOIN          |          |  1000 |   707K|   135 |  Q1,00 | PCWP |            |
|   4 |     TABLE ACCESS FULL | T1       |  1000 |   207K|     4 |  Q1,00 | PCWP |            |
|   5 |     PX BLOCK ITERATOR |          | 20000 |     9M|   131 |  Q1,00 | PCWC |            |
|   6 |      TABLE ACCESS FULL| T2       | 20000 |     9M|   131 |  Q1,00 | PCWP |            |
-----------------------------------------------------------------------------------------------

Notice, in particular, that we only have one virtual table (or table queue :TQ10000) rather than two – and that’s from a parallel query slave set to the query co-ordinator, parallel to serial; the query only uses one set of parallel query slaves. Until you run the query with rowsource execution statistics enabled and look at the output from v$pq_tqstat it’s not going to be immediately obvious what has happened, but we should see that somehow Oracle is no longer broadcasting the first table even though it’s still doing something in parallel with both tables.

The run-time statistics confirm that we’ve only used one set of slaves, and each slave in the slave set has scanned the whole of table t1. This means each slave can build the full hash table and then go on to read rowid ranges from table t2. We’ve managed to get the benefit of broadcasting t1 (every slave has the whole of t1 so we don’t have to scan and distribute the big table t2 through the PX message pool) but we haven’t had to clone it multiple times through the PX message pool.

Clearly there’s a trade-off here that Oracle Corp. has decided is worth considering. I’m guessing it’s biased towards Exadata where you might run queries with a very high degree of parallelism. In that case the overhead of task switching as large numbers of messages are passed around may (and this is pure supposition) be greater than the added cost of loading the table into the buffer cache (of each instance) and having each slave scan it from there. (Reminder – 11g introduced two “opposite” changes to tablescans: “serial direct reads” and “parallel in-memory scans”.)

There’s one little oddity in this replication – there’s a pair of hints: pq_replicate and no_pq_replicate to control the effect if you think the optimizer is making the wrong choice. I would have guessed that in my example the hint would read: /*+ pq_replicate(t1) */ as it’s table t1 that is read by every single slave. Strangely, though, this is what the outline section of the execution plan showed:


  /*+
      BEGIN_OUTLINE_DATA
      PQ_REPLICATE(@"SEL$1" "T2"@"SEL$1")
      PQ_DISTRIBUTE(@"SEL$1" "T2"@"SEL$1" BROADCAST NONE)
      USE_HASH(@"SEL$1" "T2"@"SEL$1")
      LEADING(@"SEL$1" "T1"@"SEL$1" "T2"@"SEL$1")
      FULL(@"SEL$1" "T2"@"SEL$1")
      FULL(@"SEL$1" "T1"@"SEL$1")
      OUTLINE_LEAF(@"SEL$1")
      ALL_ROWS
      OPT_PARAM('_optimizer_cost_model' 'io')
      DB_VERSION('12.1.0.1')
      OPTIMIZER_FEATURES_ENABLE('12.1.0.1')
      IGNORE_OPTIM_EMBEDDED_HINTS
      END_OUTLINE_DATA
  */

Notice how the hint specifies table t2, not table t1 !
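Presumably, then, if you wanted to block the replication for this query you’d have to aim the negative hint at t2 as well. Here’s a sketch – an untested inference from the outline above, assuming the default query block name sel$1:

select
	/*+
		leading(t1 t2)
		parallel(t1 2)
		parallel(t2 2)
		use_hash(t2)
		no_pq_replicate(@sel$1 t2@sel$1)
	*/
	t1.padding,
	t2.padding
from	t1, t2
where	t2.n1 = t1.n1
and	t2.small_vc = t1.small_vc
;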

Footnote

Here’s a little anomaly, and a generic warning about “optimizer_features_enable”: I found that if I used the hint /*+ optimizer_features_enable('11.2.0.4') */ in 12c I could still get the pq_replicate() hint to work. Unfortunately there are a few places where the hint (or parameter) isn’t guaranteed to take the optimizer code backwards the full 100%.


Flashback Fail ?

Mon, 2014-03-03 10:19

Sitting in an airport, waiting for a plane, I decided to read a note (pdf) about Flashback data archive written by Beat Ramseier from Trivadis.  I’d got about three quarters of the way through it when I paused for thought and figured out that on the typical database implementation something nasty is going to happen after approximately 3 years and 9 months.  Can you guess why ?

It’s all about smon_scn_time – which normally records one row every five minutes (created by smon) with a continuous cycle of about five days – typically giving you about 1,440 rows in the table. The table is in a cluster, and the cluster key is the instance (thread) number. Clearly this was originally a clever idea from someone who realised that a cluster key of thread number would be beneficial if you had a RAC system with multiple instances – each instance gets its own blocks and the data for any one instance is as well clustered as possible.

The trouble is, when you enable flashback data archive smon no longer sticks to that limited cycle, it just keeps adding rows. Now on my 8KB block tablespace I see 6 rows per block in the table/cluster – which means I get through 48 blocks per day, 17,520 blocks per year, and in 3 years and 9 months I’ll get to roughly 65,700 blocks – and that’s the problem. An index entry in a cluster index points to a chain of cluster blocks, and the last two bytes of the “rowid” in the index entry identify which block within the chain the cluster key scan should start at – and two bytes means you can only have 65,536 blocks for a single cluster key.

I don’t know what’s going to happen when smon tries to insert a row into the 65,535th (-ish) block for the current thread – but it ought to raise an Oracle error, and then you’ll probably have to take emergency action to make sure that the flashback mechanisms carry on running.

Although oraus.msg indicates that it’s an error message about hash clusters, it’s possible that the first sign of trouble will be Oracle error “ORA-02475 maximum cluster chain block count of %s has been exceeded”. If you’re using a 16KB block size then you’ve got roughly 7.5 years, and 32KB block sizes give you about 15 years (not that that’s a good argument for selecting larger block sizes, of course.)
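If you want an early warning, a rough check of how many cluster blocks each thread has used so far is easy enough. Here’s a sketch (run as a suitably privileged user; remember the two-byte limit applies separately to each cluster key, i.e. to each thread):

select
	thread,
	count(distinct dbms_rowid.rowid_block_number(rowid))	blocks
from
	sys.smon_scn_time
group by
	thread
;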

Footnote:

Searching MoS for related topics (smon_scn_time flashback) I found doc ID: 1100993.1 from which we could possibly infer that the 1,440 rows was a fixed limit in 10g but that the number of rows allowed in smon_scn_time could increase in 11g if you enable automatic undo management. I also found a couple of bugs relating to index or cluster corruption – fixed by 11.2.0.4, though.



Auto Sample Size

Sun, 2014-03-02 12:38

In the past I have enthused mightily about the benefits of the approximate NDV mechanism and the benefit of using auto_sample_size to collect statistics in 11g; however, as so often happens with Oracle features, there’s a down-side or boundary condition, or edge case. I’ve already picked this up once as an addendum to an earlier blog note on virtual stats, which linked to an article on OTN describing how the time taken to collect stats on a table increased dramatically after the addition of an index – where the index had this definition:


create bitmap index i_s_rmp_eval_csc_msg_actions on
    s_rmp_evaluation_csc_message (
        decode(instr(xml_message_text,' '),0,0,1)
    )
;

As you might guess from the column name, this is an index based on an XML column, which is stored as a CLOB.

In a similar vein, I showed you a few days ago an old example I had of indexing a CLOB column with a call to dbms_lob.getlength(). Both index examples suffer from the same problem – to support the index Oracle creates a hidden (virtual) column on the table that can be used to hold statistics about the values of the function; actual calculated values for the function call are stored in the index but not on the table itself – but it’s important that the optimizer has the statistics about the non-existent column values.

So what happens when Oracle collects table statistics? If you’ve enabled the approximate NDV feature Oracle does a 100% sample, which means it has to call the function for every single row in the table. You will appreciate that the decode(instr()) function on the LOB column is going to read every single LOB in turn from the table – it’s not surprising that the time taken to calculate stats on the table jumped from a few minutes to a couple of hours. What did surprise me was that my call to dbms_lob.getlength() also seemed to read every LOB in my example rather than reading the “LOB Locator” data that’s stored in the row – one day I’ll take a look into why that happened.

Looking at these examples it’s probably safe to conclude that if you really need to index some very small piece of “flag” information from a LOB it’s probably best to store it as a real column on the table – perhaps populated through a trigger so you don’t have to trust every single piece of front-end code to keep it up to date. (It would be quite nice if Oracle gave us the option for a “derived” column – i.e. one that could be defined in the same sort of way as a virtual column, with the difference that it should be stored in the table.)
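Here’s a sketch of the trigger-maintained “real column” approach, based on the bitmap index example above – the column name msg_flag and the trigger name are my inventions, and you’d drop the original function-based index first:

alter table s_rmp_evaluation_csc_message
	add msg_flag number(1,0);

create or replace trigger scm_flag_trg
before insert or update of xml_message_text
on s_rmp_evaluation_csc_message
for each row
begin
	-- same expression as the original function-based index
	:new.msg_flag := case instr(:new.xml_message_text,' ') when 0 then 0 else 1 end;
end;
/

-- the index can then reference the real column
create bitmap index i_s_rmp_eval_csc_msg_actions on
	s_rmp_evaluation_csc_message(msg_flag)
;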

So virtual columns based on LOBs can create a performance problem for the approximate NDV mechanism;  but the story doesn’t stop there because there’s another “less commonly used” feature of Oracle that introduces a different threat – with no workaround – it’s the index organized table (IOT). Here’s a basic example:

create table iot1 (
        id1	number(7,0),
	id2	number(7,0),
	v1	varchar2(10),
	v2	varchar2(10),
	padding	varchar2(500),
        constraint iot1_pk primary key(id1, id2)
)
organization index
including id2
overflow
;

insert into iot1
with generator as (
	select	--+ materialize
		rownum id
	from dual
	connect by
		level <= 1e4
)
select
        mod(rownum,20)                  id1,
        trunc(rownum,100)               id2,
        to_char(mod(rownum,20))         v1,
        to_char(trunc(rownum,100))      v2,
        rpad('x',500,'x')               padding
from
	generator	v1,
	generator	v2
where
	rownum <= 1e5
;

commit;

alter system flush buffer_cache;

alter session set events '10046 trace name context forever';

begin
	dbms_stats.gather_table_stats(
		ownname		 => user,
		tabname		 =>'IOT1',
		method_opt	 => 'for all columns size 1'
	);
end;
/

alter session set events '10046 trace name context off';

You’ll notice I’ve created the table and then inserted the data – if I had done a “create table as select” Oracle would have sorted the data before inserting it, and that would have helped to hide the problem I’m trying to demonstrate. As it is, my overflow segment is very badly ordered relative to the “top” (i.e. index) segment – in fact I can see after I’ve collected stats on the table that the clustering_factor on the index is 100,000 – an exact match for the rows in the table.

Running 11.2.0.4, with a 1MB uniform extent, freelist management, and 8KB block size the index segment held 279 leaf blocks, the overflow segment (reported in view user_tables as SYS_IOT_OVER_81594) held 7,144 data blocks.
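
For reference, the figures came straight from the data dictionary after the stats collection – a quick sketch of the queries, assuming everything is in the current schema:

select	index_name, leaf_blocks, clustering_factor
from	user_indexes
where	table_name = 'IOT1';

select	table_name, iot_type, blocks
from	user_tables
where	iot_type = 'IOT_OVERFLOW';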

So what interesting things do we find in a 10046 trace file after gathering stats? Here are the key details from the tkprof results:

SQL ID: 7ak95sy9m1s4f Plan Hash: 1508788224

select /*+  full(t)    no_parallel(t) no_parallel_index(t) dbms_stats
  cursor_sharing_exact use_weak_name_resl dynamic_sampling(0) no_monitoring
  no_substrb_pad  */to_char(count("ID1")),to_char(substrb(dump(min("ID1"),16,
  0,32),1,120)),to_char(substrb(dump(max("ID1"),16,0,32),1,120)),
  to_char(count("ID2")),to_char(substrb(dump(min("ID2"),16,0,32),1,120)),
  to_char(substrb(dump(max("ID2"),16,0,32),1,120)),to_char(count("V1")),
  to_char(substrb(dump(min("V1"),16,0,32),1,120)),
  to_char(substrb(dump(max("V1"),16,0,32),1,120)),to_char(count("V2")),
  to_char(substrb(dump(min("V2"),16,0,32),1,120)),
  to_char(substrb(dump(max("V2"),16,0,32),1,120)),to_char(count("PADDING")),
  to_char(substrb(dump(min("PADDING"),16,0,32),1,120)),
  to_char(substrb(dump(max("PADDING"),16,0,32),1,120))
from
 "TEST_USER"."IOT1" t  /* NDV,NIL,NIL,NDV,NIL,NIL,NDV,NIL,NIL,NDV,NIL,NIL,NDV,
  NIL,NIL*/

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.00          0          0          0           0
Execute      1      0.00       0.00          0          0          0           0
Fetch        1      0.37       0.37       7423     107705          0           1
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total        3      0.37       0.37       7423     107705          0           1

Misses in library cache during parse: 1
Optimizer mode: ALL_ROWS
Parsing user id: 62     (recursive depth: 1)
Number of plan statistics captured: 1

Rows (1st) Rows (avg) Rows (max)  Row Source Operation
---------- ---------- ----------  ---------------------------------------------------
         1          1          1  SORT AGGREGATE (cr=107705 pr=7423 pw=0 time=377008 us)
    100000     100000     100000   APPROXIMATE NDV AGGREGATE (cr=107705 pr=7423 pw=0 time=426437 us cost=10 size=23944 card=82)
    100000     100000     100000    INDEX FAST FULL SCAN IOT1_PK (cr=107705 pr=7423 pw=0 time=298380 us cost=10 size=23944 card=82)(object id 85913)

********************************************************************************

SQL ID: 1ca2ug8s3mm5z Plan Hash: 2571749554

select /*+  no_parallel_index(t, "IOT1_PK")  dbms_stats cursor_sharing_exact
  use_weak_name_resl dynamic_sampling(0) no_monitoring no_substrb_pad
  no_expand index(t,"IOT1_PK") */ count(*) as nrw,count(distinct
  sys_op_lbid(85913,'L',t.rowid)) as nlb,null as ndk,
  sys_op_countchg(sys_op_lbid(85913,'O',"V1"),1) as clf
from
 "TEST_USER"."IOT1" t where "ID1" is not null or "ID2" is not null

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.00          0          0          0           0
Execute      1      0.00       0.00          0          0          0           0
Fetch        1      0.16       0.16          0     100280          0           1
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total        3      0.16       0.16          0     100280          0           1

Misses in library cache during parse: 1
Optimizer mode: ALL_ROWS
Parsing user id: 62     (recursive depth: 1)
Number of plan statistics captured: 1

Rows (1st) Rows (avg) Rows (max)  Row Source Operation
---------- ---------- ----------  ---------------------------------------------------
         1          1          1  SORT GROUP BY (cr=100280 pr=0 pw=0 time=162739 us)
    100000     100000     100000   INDEX FULL SCAN IOT1_PK (cr=100280 pr=0 pw=0 time=164597 us cost=6 size=5900000 card=100000)(object id 85913)

The first query collects table and column stats, and we can see that the approximate NDV method has been used because of the trailing text: /* NDV,NIL,NIL,NDV,NIL,NIL,NDV,NIL,NIL,NDV,NIL,NIL,NDV,NIL,NIL*/. In this statement the hint /*+ full(t) */ has been interpreted to mean an index fast full scan, which is what we see in the execution plan. Although there are only 279 blocks in the index and 7,144 blocks in the overflow we’ve done a little over 100,000 buffer visits because for every index entry in the IOT top we’ve done a “fetch by rowid” into the overflow segment (the session stats records these as “table fetch continued row”). Luckily I had a small table so all those visits were buffer gets; on a very large table it’s quite possible that a significant fraction of those buffer gets will turn into single block physical reads.

Not only have we done one buffer visit per row to allow us to calculate the approximate NDV for the table columns, we’ve done the same all over again so that we can calculate the clustering_factor of the index. This is a little surprising since the “rowid” for an item in the overflow section is stored in the index segment; but (as you can see in the second query in the tkprof output) Oracle has used column v1 – the first column in the overflow segment – in the call to the sys_op_countchg() function, where the equivalent call for an ordinary index would use t.rowid, so presumably the code HAS to access the overflow segment. The really strange thing about this is that the same SQL statement has a call to sys_op_lbid() which uses the (not supposed to exist in IOTs) rowid – so it looks as if it ought to be possible for sys_op_countchg() to do the same.

So – big warning on upgrading to 11g: if you’ve got IOTs with overflows and you switch to auto_sample_size and enable approximate NDV then the time taken to gather stats on those IOTs may (depending to a large extent on the data clustering) take much longer than it used to.
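
If that time becomes unacceptable, the nearest thing to damage limitation I can suggest is to override the sample size for the critical IOTs so that they revert to the old sampling mechanism – a sketch (you lose the approximate NDV quality for those tables, of course):

begin
	dbms_stats.set_table_prefs(
		ownname		 => user,
		tabname		 => 'IOT1',
		pname		 => 'ESTIMATE_PERCENT',
		pvalue		 => '1'
	);
end;
/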


Empty Hash

Fri, 2014-02-28 12:45

A little while ago I highlighted a special case with the MINUS operator (that one of the commentators extended to include the INTERSECT operator) relating to the way the second subquery would take place even if the first subquery produced no rows. I’ve since had an email from an Oracle employee letting me know that the developers looked at this case and decided that it wasn’t feasible to address it because – taking a wider view point – if the query were to run parallel they would need a mechanism that allowed some synchronisation between slaves so that every slave could find out that none of the slaves had received any rows from the first subquery, and this was going to lead to hanging problems.

The email reminded me that there’s another issue of the same kind that I discovered several years ago (I thought I’d written it up, but maybe it was on a newsgroup or forum somewhere – I can’t find it on my blog or old website). The problem can be demonstrated by this example:


create table t1
as
select
	rownum			id,
	mod(rownum,25)		rep_col,
	rpad('x',50)		padding
from
	all_objects
where
	rownum <= 3000
;

delete from t1 where rep_col = 12;

create table t2
as
select
	rownum			id,
	mod(rownum,25)		rep_col,
	rpad('x',50)		padding
from
	all_objects
where
	rownum <= 10000
;

-- collect stats, no histograms
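
-- a minimal sketch of the gather calls implied by the comment above
begin
	dbms_stats.gather_table_stats(user, 'T1', method_opt => 'for all columns size 1');
	dbms_stats.gather_table_stats(user, 'T2', method_opt => 'for all columns size 1');
end;
/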

select
	/*+
		leading(t1 t2)
		use_hash(t2)
		no_swap_join_inputs(t2)
		pq_distribute(t2, hash, hash)
	*/
	count(*)
from
	t1,t2
where
	t1.rep_col = 12
and	t2.id = t1.id
;

You’ll notice that I’ve created a data set for table t1 where the value 12 does not appear in column rep_col, so my query will return no rows. However, for the purposes of demonstration, I’ve hinted an execution path that will scan t1 and distribute it by hash, then scan t2 to distribute that by hash before doing the join. Here’s the plan – which I’ve generated with rowsource execution statistics enabled and pulled from memory after executing the query:


-------------------------------------------------------------------------------------------------
| Id  | Operation                  | Name     | Starts | A-Rows |   A-Time   | Buffers | Reads  |
-------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT           |          |      1 |      1 |00:00:00.01 |       6 |      2 |
|   1 |  SORT AGGREGATE            |          |      1 |      1 |00:00:00.01 |       6 |      2 |
|   2 |   PX COORDINATOR           |          |      1 |      2 |00:00:00.01 |       6 |      2 |
|   3 |    PX SEND QC (RANDOM)     | :TQ10002 |      0 |      0 |00:00:00.01 |       0 |      0 |
|   4 |     SORT AGGREGATE         |          |      2 |      2 |00:00:00.01 |       0 |      0 |
|*  5 |      HASH JOIN             |          |      2 |      0 |00:00:00.01 |       0 |      0 |
|   6 |       JOIN FILTER CREATE   | :BF0000  |      2 |      0 |00:00:00.01 |       0 |      0 |
|   7 |        PX RECEIVE          |          |      2 |      0 |00:00:00.01 |       0 |      0 |
|   8 |         PX SEND HASH       | :TQ10000 |      0 |      0 |00:00:00.01 |       0 |      0 |
|   9 |          PX BLOCK ITERATOR |          |      2 |      0 |00:00:00.01 |      54 |     27 |
|* 10 |           TABLE ACCESS FULL| T1       |     27 |      0 |00:00:00.01 |      54 |     27 |
|  11 |       PX RECEIVE           |          |      0 |      0 |00:00:00.01 |       0 |      0 |
|  12 |        PX SEND HASH        | :TQ10001 |      0 |      0 |00:00:00.01 |       0 |      0 |
|  13 |         JOIN FILTER USE    | :BF0000  |      2 |      2 |00:00:00.01 |     116 |     87 |
|  14 |          PX BLOCK ITERATOR |          |      2 |      2 |00:00:00.01 |     116 |     87 |
|* 15 |           TABLE ACCESS FULL| T2       |     29 |      2 |00:00:00.01 |     116 |     87 |
-------------------------------------------------------------------------------------------------

Since this was 11.2.0.4 Oracle has used Bloom filtering to reduce the traffic between slave sets – but you can see that despite returning no rows from t1 (lines 7 – 10), Oracle still performed the parallel tablescan of t2 (lines 12 – 15). Thanks to the Bloom filter we don’t transfer 10,000 rows between slave sets, but we can see from the Buffers and Reads columns that we really did do the tablescan – and if we take a snapshot of instance activity we would have seen 10,000 rows fetched by tablescan at that point.

If I ran this query serially Oracle would stop after discovering that the first tablescan returned no rows – why bother scanning the probe table when the hash table is empty ? But because the query is running in parallel, a single slave that receives no data from the first tablescan cannot assume that every other slave in the same set has also received no data – there’s no cross-chat that allows the slaves to discover that every slave has no data – so the second scan goes ahead.

I was a little surprised by this when I first found it since I thought (from looking at some of the 1039x trace information) that the parallel slaves were telling the query coordinator how many rows they had acquired on each granule – which would allow the coordinator to spot the zero total. But it looks as if I was misinterpreting the trace.

On the plus side for this example – if zero rows is a possible result from the first tablescan then the optimizer will probably have decided that it was going to get a “small” data set for the build table and therefore do a broadcast distribution – and if that happens the second tablescan won’t occur – as we see below (note the zero Reads, zero Buffers):


------------------------------------------------------------------------------------------------
| Id  | Operation                 | Name     | Starts | A-Rows |   A-Time   | Buffers | Reads  |
------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT          |          |      1 |      1 |00:00:00.01 |       6 |      2 |
|   1 |  SORT AGGREGATE           |          |      1 |      1 |00:00:00.01 |       6 |      2 |
|   2 |   PX COORDINATOR          |          |      1 |      2 |00:00:00.01 |       6 |      2 |
|   3 |    PX SEND QC (RANDOM)    | :TQ10001 |      0 |      0 |00:00:00.01 |       0 |      0 |
|   4 |     SORT AGGREGATE        |          |      2 |      2 |00:00:00.01 |       0 |      0 |
|*  5 |      HASH JOIN            |          |      2 |      0 |00:00:00.01 |       0 |      0 |
|   6 |       PX RECEIVE          |          |      2 |      0 |00:00:00.01 |       0 |      0 |
|   7 |        PX SEND BROADCAST  | :TQ10000 |      0 |      0 |00:00:00.01 |       0 |      0 |
|   8 |         PX BLOCK ITERATOR |          |      2 |      0 |00:00:00.01 |      54 |     27 |
|*  9 |          TABLE ACCESS FULL| T1       |     27 |      0 |00:00:00.01 |      54 |     27 |
|  10 |       PX BLOCK ITERATOR   |          |      0 |      0 |00:00:00.01 |       0 |      0 |
|* 11 |        TABLE ACCESS FULL  | T2       |      0 |      0 |00:00:00.01 |       0 |      0 |
------------------------------------------------------------------------------------------------

Unfortunately the same type of argument can’t be used to dismiss the minus/intersect cases.

 


Parallel Execution – 4

Wed, 2014-02-26 07:58

I’m aware that in the previous article in this series I said I’d continue “in a few days” and it has now been more like 11 weeks – but finally I’ve got the time. In this article I’m going to talk primarily about Bloom filters and their impact on performance, but I’ll need to say something about the “virtual tables” and “parallel execution message size” before I begin. Take a look at this fragment of a parallel execution plan:

-----------------------------------------------------------------------------------------------------
| Id  | Operation                   | Name     | Rows  | Bytes | Cost  |    TQ  |IN-OUT| PQ Distrib |
-----------------------------------------------------------------------------------------------------

|   7 |        PX SEND BROADCAST    | :TQ10000 |     3 |    18 |     2 |  Q1,00 | P->P | BROADCAST  |
|   8 |         PX BLOCK ITERATOR   |          |     3 |    18 |     2 |  Q1,00 | PCWC |            |
|*  9 |          TABLE ACCESS FULL  | T3       |     3 |    18 |     2 |  Q1,00 | PCWP |            |

-----------------------------------------------------------------------------------------------------

Examining the Name and TQ columns, we can say (from the TQ column) that one set of parallel execution slaves has operated steps 7, 8, and 9 to populate a “virtual table” called :TQ10000 which is broadcast (every slave receives every row) to another set of parallel execution slaves. Each slave in the other slave set will read from that virtual table (at some point) and do something with the data.

I am fairly certain (and it is an important point in understanding the mechanics and rationale of parallel execution plans) that a parallel slave (set) cannot write one virtual table while reading from another virtual table; it has to read the whole of the “incoming” virtual table before it can start writing to an “outgoing” virtual table. This restriction probably keeps the code as simple as possible, partly because it limits the need for parallel execution slave sets to a maximum of 2 for a data flow operation. Bear in mind that there is also (probably) a “side-channel” that the query co-ordinator uses to pass instructions to the parallel query slaves, and this side-channel probably falls outside any “can’t read and write simultaneously” restriction.

So how does Oracle implement virtual tables ? A virtual table is simply a small number of “pages” of memory, whose size is dictated by the parameter _parallel_execution_message_size. The default value for this parameter is either 2,148 bytes, 4,096 bytes, or 16,384 bytes, depending on the Oracle version and the settings of various other parameters. When I say “small number”, though, this is per communication channel and every slave in one slave set has to be able to pass data to every slave in the other slave set, so if you’re running a query with degree N you need to allocate a total of N-squared times that small number of pages.  (In fact, since both sets of slaves may also have to send data to the query co-ordinator at some stage (e.g. while “ranging”) you may need a further 2N sets of pages – N for each slave set – for those channels).

I recall reading a document many years ago (in the era of 7.2 or 7.3 – when we had OCIS rather than MoS) which said the number of pages required was 3, though I’ve also seen a blog note recently that says it’s 4 in RAC systems.  So, assume you have a query that is running parallel 5 in single instance, and your _parallel_execution_message_size is 16,384 bytes, then the amount of memory you need for virtual tables for that query is something like: 5 * 5 * 3 * 16KB = 1,200KB (or maybe 5 * 7 * 3 * 16KB = 1,680KB if my comment about the query co-ordinator is correct).

Again depending on version and parameter settings, this memory will be located either in the shared pool or large pool with the name “PX msg pool”; if you run a series of tests that continuously increases the degree of parallelism on a fixed query you’ll see this pool size growing at roughly the right rate to match the formulae above (although the approximation was somewhat obscured when I checked as Oracle was allocating the pool in chunks of 120KB + 24 bytes).
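
You can watch the pool grow as you repeat the test at increasing degrees of parallelism – a sketch of the sort of query that shows it:

select	pool, name, bytes
from	v$sgastat
where	name = 'PX msg pool';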

Given the small number of pages available on any channel for reading and writing, you can imagine that a writer will sometimes find that all the available pages for a given channel are full and it has to wait for the reader to empty them before it can write the next set of data. This leads to the two waits: “PX Deq Credit: send blkd” and “PX Deq Credit: need buffer”.  I’m not sure exactly why there are two different waits,  but it’s possible that one is an explicit block (e.g.: “send blkd” = the reader is now reading the buffers and has made them unavailable, and “need buffer” = all the buffers are full but the reader is currently doing something else and hasn’t noticed).
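
If you want a quick check of how much time an instance has lost to these two waits, a sketch:

select	event, total_waits, time_waited
from	v$system_event
where	event in (
		'PX Deq Credit: send blkd',
		'PX Deq Credit: need buffer'
	);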

There is an important performance consideration that goes along with this blocking – if a reader starts spilling its data to disc that’s a (relatively) slow process – which means that ALL the writers trying to send data to that reader are likely to get blocked waiting each time it spills a few more pages to disc, so one (unlucky) reader could produce far more delay to the query than just the time it spends writing (if I spill my pages to disc none of the slaves in the other slave set can send data to me until I’ve finished, which means they don’t send data to anyone else either). Moreover if a single writer tends to send most of its data to a single reader then that writer will effectively be causing a block on all the other writers, because they may have to wait for that one reader to spill the data from that one writer every time they try to send data to that one reader.

Clearly, spilling to disc is a bad thing – it’s going to slow the query down whatever the circumstances, but an imbalance in the sharing of data between readers and writers could have an impact far greater than we might at first expect. Any strategy that minimises the risk of spilling to disk is likely to be a good idea – and that’s where Bloom filters come into play; if we can find a cheap way of telling one set of slaves about the data that the other set doesn’t need then we can reduce traffic through the virtual tables and reduce the time lost through “need buffer” and “send blkd” waits.

The time when it’s really important to find a way to eliminate data before it goes through a table queue is when we do a hash join. Remember a hash join takes place in two pieces – we extract data from one table and build an (in-memory) hash table, then we extract data from the other table and probe the hash table for each row extracted; but when we are actually extracting the data we don’t consider the join condition. When a hash join goes parallel we may be in a position where one slave set holds the in-memory hash table and the other slave set does the work of scanning the second table – in this case we would like the second slave set to discard as much data as possible before sending it through the table queue … but it doesn’t have any information about the join condition.

A Bloom filter is simply a small dataset created by the first slave set that is a “summary” of the values that are likely to be useful in the join. As it builds the in-memory hash table, the first slave set can construct a Bloom filter for the join values; when the hash table is complete the first slave set can pass the Bloom filter to the second slave set (possibly by way of the query co-ordinator, I haven’t checked the mechanism yet) allowing the second slave set to discard the data that definitely won’t be needed and passing only data that will probably be needed through the table queue.  (Note: the Bloom filter is not a perfect filter, it may allow some data that isn’t needed. For more detailed information about Bloom filters, see this article by Christian Antognini.)

Let’s take a look at a simple example to demonstrate the principle – I’ve created tables t1 and t2 as “select * from all_objects” (some 70,000 rows in 11.2.0.4) and gathered stats with no histograms. Now I’m going to run three queries: first an unfiltered join between the two tables, then a filtered join showing a Bloom filter magically appearing, then the filtered query again after disabling Bloom filtering so that we can see the benefit of the filter. In all three cases I’ll enable rowsource execution statistics so that you can see the volume of data that went through the table queues.
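
The setup, for reference, was nothing more than this (a sketch of what I ran – the exact row count from all_objects will vary with version and installed options):

create table t1 as select * from all_objects;
create table t2 as select * from all_objects;

begin
	dbms_stats.gather_table_stats(user, 'T1', method_opt => 'for all columns size 1');
	dbms_stats.gather_table_stats(user, 'T2', method_opt => 'for all columns size 1');
end;
/

Here’s the first, unfiltered, query: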


select
	/*+
		parallel(t1 2)
		parallel(t2 2)
		leading(t1 t2)
		pq_distribute(t2 hash hash)
	*/
	t1.object_name, t2.object_name, rpad('x',100,'x') padding
from	t1, t2
where
	t2.object_id = t1.object_id
;

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation               | Name     | Starts | E-Time   |    TQ  |IN-OUT| PQ Distrib | A-Rows |   A-Time   | Buffers | Reads  | Writes |  OMem |  1Mem |  O/1/M   | Max-Tmp |
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT        |          |      1 |          |        |      |            |  70101 |00:00:00.17 |      12 |      0 |      0 |       |       |          |         |
|   1 |  PX COORDINATOR         |          |      1 |          |        |      |            |  70101 |00:00:00.17 |      12 |      0 |      0 |       |       |          |         |
|   2 |   PX SEND QC (RANDOM)   | :TQ10002 |      0 | 00:00:01 |  Q1,02 | P->S | QC (RAND)  |      0 |00:00:00.01 |       0 |      0 |      0 |       |       |          |         |
|*  3 |    HASH JOIN BUFFERED   |          |      2 | 00:00:01 |  Q1,02 | PCWP |            |  70101 |00:00:00.28 |       0 |    310 |    310 |  5952K|  1953K|     2/0/0|    2048 |
|   4 |     PX RECEIVE          |          |      2 | 00:00:01 |  Q1,02 | PCWP |            |  70101 |00:00:00.08 |       0 |      0 |      0 |       |       |          |         |
|   5 |      PX SEND HASH       | :TQ10000 |      0 | 00:00:01 |  Q1,00 | P->P | HASH       |      0 |00:00:00.01 |       0 |      0 |      0 |       |       |          |         |
|   6 |       PX BLOCK ITERATOR |          |      2 | 00:00:01 |  Q1,00 | PCWC |            |  70101 |00:00:00.05 |    1031 |   1005 |      0 |       |       |          |         |
|*  7 |        TABLE ACCESS FULL| T1       |     26 | 00:00:01 |  Q1,00 | PCWP |            |  70101 |00:00:00.02 |    1031 |   1005 |      0 |       |       |          |         |
|   8 |     PX RECEIVE          |          |      2 | 00:00:01 |  Q1,02 | PCWP |            |  70102 |00:00:00.06 |       0 |      0 |      0 |       |       |          |         |
|   9 |      PX SEND HASH       | :TQ10001 |      0 | 00:00:01 |  Q1,01 | P->P | HASH       |      0 |00:00:00.01 |       0 |      0 |      0 |       |       |          |         |
|  10 |       PX BLOCK ITERATOR |          |      2 | 00:00:01 |  Q1,01 | PCWC |            |  70102 |00:00:00.04 |    1031 |   1005 |      0 |       |       |          |         |
|* 11 |        TABLE ACCESS FULL| T2       |     26 | 00:00:01 |  Q1,01 | PCWP |            |  70102 |00:00:00.01 |    1031 |   1005 |      0 |       |       |          |         |
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

So, following the table queues:

  • slave set 2 scans t1 and distributes it to slave set 1 through virtual table 0, and slave set 1 builds an in-memory hash table: (implied by lines 3 – 7).
  • slave set 2 scans t2 and distributes it to slave set 1 through virtual table 1, and slave set 1  buffers it until the scan has completed and all relevant data has arrived: (implied by lines 3, 8 – 11)
  • slave set 1 does the hash join (line 3) reading back the second data set from its buffer (which may have spilled to disk) and passes the results on to the query coordinator through virtual table 2.

Now let’s restrict the data we select from table t1.

select
	/*+
		parallel(t1 2)
		parallel(t2 2)
		leading(t1 t2)
		pq_distribute(t2 hash hash)
	 */
	t1.object_name, t2.object_name, rpad('x',100,'x') padding
from	t1, t2
where
	t2.object_id = t1.object_id
and	t1.owner = 'SYS'
;

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                | Name     | Starts | E-Time   |    TQ  |IN-OUT| PQ Distrib | A-Rows |   A-Time   | Buffers | Reads  | Writes |  OMem |  1Mem |  O/1/M   | Max-Tmp |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT         |          |      1 |          |        |      |            |  34463 |00:00:00.10 |      12 |      0 |      0 |       |       |          |         |
|   1 |  PX COORDINATOR          |          |      1 |          |        |      |            |  34463 |00:00:00.10 |      12 |      0 |      0 |       |       |          |         |
|   2 |   PX SEND QC (RANDOM)    | :TQ10002 |      0 | 00:00:01 |  Q1,02 | P->S | QC (RAND)  |      0 |00:00:00.01 |       0 |      0 |      0 |       |       |          |         |
|*  3 |    HASH JOIN BUFFERED    |          |      2 | 00:00:01 |  Q1,02 | PCWP |            |  34463 |00:00:00.16 |       0 |    124 |    124 |  3611K|  1953K|     2/0/0|    1024 |
|   4 |     JOIN FILTER CREATE   | :BF0000  |      2 | 00:00:01 |  Q1,02 | PCWP |            |  34463 |00:00:00.05 |       0 |      0 |      0 |       |       |          |         |
|   5 |      PX RECEIVE          |          |      2 | 00:00:01 |  Q1,02 | PCWP |            |  34463 |00:00:00.04 |       0 |      0 |      0 |       |       |          |         |
|   6 |       PX SEND HASH       | :TQ10000 |      0 | 00:00:01 |  Q1,00 | P->P | HASH       |      0 |00:00:00.01 |       0 |      0 |      0 |       |       |          |         |
|   7 |        PX BLOCK ITERATOR |          |      2 | 00:00:01 |  Q1,00 | PCWC |            |  34463 |00:00:00.03 |    1031 |   1005 |      0 |       |       |          |         |
|*  8 |         TABLE ACCESS FULL| T1       |     26 | 00:00:01 |  Q1,00 | PCWP |            |  34463 |00:00:00.01 |    1031 |   1005 |      0 |       |       |          |         |
|   9 |     PX RECEIVE           |          |      2 | 00:00:01 |  Q1,02 | PCWP |            |  38848 |00:00:00.05 |       0 |      0 |      0 |       |       |          |         |
|  10 |      PX SEND HASH        | :TQ10001 |      0 | 00:00:01 |  Q1,01 | P->P | HASH       |      0 |00:00:00.01 |       0 |      0 |      0 |       |       |          |         |
|  11 |       JOIN FILTER USE    | :BF0000  |      2 | 00:00:01 |  Q1,01 | PCWP |            |  38848 |00:00:00.04 |    1031 |   1005 |      0 |       |       |          |         |
|  12 |        PX BLOCK ITERATOR |          |      2 | 00:00:01 |  Q1,01 | PCWC |            |  38848 |00:00:00.03 |    1031 |   1005 |      0 |       |       |          |         |
|* 13 |         TABLE ACCESS FULL| T2       |     26 | 00:00:01 |  Q1,01 | PCWP |            |  38848 |00:00:00.01 |    1031 |   1005 |      0 |       |       |          |         |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - access("T2"."OBJECT_ID"="T1"."OBJECT_ID")
   8 - access(:Z>=:Z AND :Z<=:Z)
        filter("T1"."OWNER"='SYS')
   13 - access(:Z>=:Z AND :Z<=:Z)
        filter(SYS_OP_BLOOM_FILTER(:BF0000,"T2"."OBJECT_ID"))

In this case the basic plan is the same although we’ve introduced two extra lines: a “JOIN FILTER CREATE” at line 4, which is slave set 1 creating a Bloom filter as it constructs the in-memory hash table, and a “JOIN FILTER USE” at line 11, which is slave set 2 discarding data from the tablescan of t2 before sending data through virtual table 1.

Looking at predicate information for line 13 we can see the filter predicate using a function called sys_op_bloom_filter() to eliminate data. Oracle has decided that the number of distinct values of object_id returned by the predicate OWNER = ‘SYS’ is sufficiently small that it’s worth building a Bloom filter describing it and passing it to the other slave set. (In fact the optimizer has been a little over-optimistic in this case because I didn’t gather any histograms on the table – there’s a lot more SYS-owned data than the optimizer is expecting).

We can see that the Bloom filter has been pretty effective. Since the two tables have the same data (plus or minus one row) the volume that would be returned by a perfect filter at line 13 ought to match the volume produced from the first table at line 8 – in fact producing about 12.5% more than the ideal, but that’s still a big improvement over sending the whole table through the virtual table.

Just to finish off, here are the results when I disable Bloom filtering using the /*+ no_px_join_filter(t2) */ hint. (This hint appeared, along with the opposite hint /*+ px_join_filter(alias) */ and Bloom filters, in 10gR2 but didn’t get into outlines and SQL Plan baselines until 11.1.0.6.)
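
For reference, the query was exactly the same as the previous one, with just the one extra hint added:

select
	/*+
		parallel(t1 2)
		parallel(t2 2)
		leading(t1 t2)
		pq_distribute(t2 hash hash)
		no_px_join_filter(t2)
	 */
	t1.object_name, t2.object_name, rpad('x',100,'x') padding
from	t1, t2
where
	t2.object_id = t1.object_id
and	t1.owner = 'SYS'
;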

Plan hash value: 3683239666

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation               | Name     | Starts | E-Time   |    TQ  |IN-OUT| PQ Distrib | A-Rows |   A-Time   | Buffers | Reads  | Writes |  OMem |  1Mem |  O/1/M   | Max-Tmp |
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT        |          |      1 |          |        |      |            |  34463 |00:00:00.22 |      10 |      0 |      0 |       |       |          |         |
|   1 |  PX COORDINATOR         |          |      1 |          |        |      |            |  34463 |00:00:00.22 |      10 |      0 |      0 |       |       |          |         |
|   2 |   PX SEND QC (RANDOM)   | :TQ10002 |      0 | 00:00:01 |  Q1,02 | P->S | QC (RAND)  |      0 |00:00:00.01 |       0 |      0 |      0 |       |       |          |         |
|*  3 |    HASH JOIN BUFFERED   |          |      2 | 00:00:01 |  Q1,02 | PCWP |            |  34463 |00:00:00.21 |       0 |     93 |     93 |  3611K|  1953K|     2/0/0|    1024 |
|   4 |     PX RECEIVE          |          |      2 | 00:00:01 |  Q1,02 | PCWP |            |  34463 |00:00:00.04 |       0 |      0 |      0 |       |       |          |         |
|   5 |      PX SEND HASH       | :TQ10000 |      0 | 00:00:01 |  Q1,00 | P->P | HASH       |      0 |00:00:00.01 |       0 |      0 |      0 |       |       |          |         |
|   6 |       PX BLOCK ITERATOR |          |      2 | 00:00:01 |  Q1,00 | PCWC |            |  34463 |00:00:00.04 |    1031 |   1005 |      0 |       |       |          |         |
|*  7 |        TABLE ACCESS FULL| T1       |     26 | 00:00:01 |  Q1,00 | PCWP |            |  34463 |00:00:00.02 |    1031 |   1005 |      0 |       |       |          |         |
|   8 |     PX RECEIVE          |          |      2 | 00:00:01 |  Q1,02 | PCWP |            |  70102 |00:00:00.07 |       0 |      0 |      0 |       |       |          |         |
|   9 |      PX SEND HASH       | :TQ10001 |      0 | 00:00:01 |  Q1,01 | P->P | HASH       |      0 |00:00:00.01 |       0 |      0 |      0 |       |       |          |         |
|  10 |       PX BLOCK ITERATOR |          |      2 | 00:00:01 |  Q1,01 | PCWC |            |  70102 |00:00:00.06 |    1031 |   1005 |      0 |       |       |          |         |
|* 11 |        TABLE ACCESS FULL| T2       |     26 | 00:00:01 |  Q1,01 | PCWP |            |  70102 |00:00:00.02 |    1031 |   1005 |      0 |       |       |          |         |
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - access("T2"."OBJECT_ID"="T1"."OBJECT_ID")
   7 - access(:Z>=:Z AND :Z<=:Z)
        filter("T1"."OWNER"='SYS')
   11 - access(:Z>=:Z AND :Z<=:Z)

You’ll notice we’re back to 11 lines: there’s no reference to JOIN FILTER, and the predicate section no longer has a call to sys_op_bloom_filter().
More significantly, the actual rows reported in line 11 is the full 70,000 for the whole table – the entire data set will be passing through virtual table 1, increasing the CPU load, the probability of “send blkd” and “need buffer” waits, and the chance of the BUFFERED data spilling to disc.

That’s nearly half a day gone writing up the preamble to how to interpret the execution plan for my original query after I hinted it into a hash/hash distribution – but it’s a lovely day outside so I’m heading out to the garden with my pruning secateurs and it’s going to be a few more days (I hope) before the last part of the series comes out.


Predicate Order

Wed, 2014-02-26 02:14

Common internet question: does the order of predicates in the where clause make a difference?
General answer: it shouldn’t, but sometimes it will, thanks to defects in the optimizer.

There’s a nicely presented example on the OTN database forum where predicate order does matter (between 10.1.x.x and 11.1.0.7). Note particularly that there’s a script to recreate the issue; note, also, the significance of the predicate section of the execution plan.
It’s bug 6782665, fixed in 11.2.0.1.


FBI Skip Scan

Tue, 2014-02-25 12:45

A recent posting on the OTN database forum highlighted a bug (or defect, or limitation) in the way that the optimizer handles index skip scans with “function-based” indexes – it doesn’t do them. The defect has probably been around for a long time and demonstrates a common problem with testing Oracle – it’s very easy for errors in the slightly unusual cases to be missed; it also demonstrates a general principle that it can take some time for a (small) new feature to be applied consistently across the board.

The index definitions in the original posting included expressions like substr(nls_lower(colX), 1, 25), and it’s possible for all sorts of unexpected effects to appear when your code starts running into NLS  settings, so I’ve created a much simpler example. Here’s my table definition, with three index definitions:

create table t1
as
with generator as (
	select	--+ materialize
		rownum id
	from dual
	connect by
		level <= 1e4
)
select
	mod(rownum - 1,10)		mod1,
	rownum				id1,
	1 + mod(rownum - 1,10)		mod2,
	1 + rownum 			id2,
	lpad(rownum,7)			small_vc,
	rpad('x',100)			padding
from
	generator	v1,
	generator	v2
where
	rownum <= 1e5
;

create index t1_normal on t1(mod2, id2);
create index t1_split  on t1(mod2, id1 + 1);
create index t1_fbi    on t1(mod1 + 1, id1 + 1);

begin
	dbms_stats.gather_table_stats(
		ownname		 => user,
		tabname		 =>'T1',
		method_opt	 => 'for all columns size 1'
	);
end;
/

Note that I gathered stats after creating the indexes (generally you would be safe creating the indexes after gathering the stats from 10g onwards) because I needed to gather stats on the virtual columns that support the function-based indexes. As you can see, mod2 = 1 + mod1, and id2 = 1 + id1, and I’ve created three indexes which (internally) are the same although they are defined in three different ways.
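
You can see the hidden columns that support the two function-based indexes (and check that they have stats) in the data dictionary – a quick sketch:

select	column_name, hidden_column, virtual_column, num_distinct
from	user_tab_cols
where	table_name = 'T1';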

So here are three queries which select the same data although, again, the queries are not all exactly the same. Since the queries are doing the same things with structures that are the same we might hope to see the optimizer using the same strategy for all three. In fact, to make things easier for the optimizer, I’ve even told it exactly what to do in two of the cases:


select	*
from	t1
where
	id2 = 50000
;

select	/*+ index_ss(t1(mod2)) */
	*
from	t1
where
	id1 + 1 = 50000
;

select	/*+ index_ss(t1 t1_fbi) */
	*
from	t1
where
	id1 + 1 = 50000
;

In the first query I’ve referenced only the second column of the “normal” index with a high precision query – expecting the optimizer to find the index skip scan path. In the second and third queries I’ve referenced the (id1 + 1) index expression which appears as the second column in two different indexes; to help the optimizer along I’ve hinted the index skip scan for each of the indexes in turn.

If the first query can do an index skip scan then the second and third queries should be able to do the same because the physical structures and the statistics (and the actual stored and requested values) are all the same. Here are the three plans (pulled from memory in an 11.2.0.4 instance).


SQL_ID  bvjstz9xa3n7a, child number 0
-------------------------------------
select * from t1 where  id2 = 50000

Plan hash value: 2078113469

-----------------------------------------------------------------------------------------
| Id  | Operation                   | Name      | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |           |       |       |    12 (100)|          |
|   1 |  TABLE ACCESS BY INDEX ROWID| T1        |     1 |   125 |    12   (0)| 00:00:01 |
|*  2 |   INDEX SKIP SCAN           | T1_NORMAL |     1 |       |    11   (0)| 00:00:01 |
-----------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("ID2"=50000)
       filter("ID2"=50000)

SQL_ID  95573qx3w62q2, child number 0
-------------------------------------
select /*+ index_ss(t1(mod2)) */  * from t1 where  id1 + 1 = 50000

Plan hash value: 1422030023

----------------------------------------------------------------------------------------
| Id  | Operation                   | Name     | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |          |       |       |   271 (100)|          |
|   1 |  TABLE ACCESS BY INDEX ROWID| T1       |     1 |   130 |   271   (2)| 00:00:02 |
|*  2 |   INDEX FULL SCAN           | T1_SPLIT |     1 |       |   270   (2)| 00:00:02 |
----------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("T1"."SYS_NC00007$"=50000)
       filter("T1"."SYS_NC00007$"=50000)

SQL_ID  d4t48va4zrk1q, child number 0
-------------------------------------
select /*+ index_ss(t1 t1_fbi) */  * from t1 where  id1 + 1 = 50000

Plan hash value: 3151334857

--------------------------------------------------------------------------------------
| Id  | Operation                   | Name   | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |        |       |       |   271 (100)|          |
|   1 |  TABLE ACCESS BY INDEX ROWID| T1     |     1 |   130 |   271   (2)| 00:00:02 |
|*  2 |   INDEX FULL SCAN           | T1_FBI |     1 |       |   270   (2)| 00:00:02 |
--------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("T1"."SYS_NC00007$"=50000)
       filter("T1"."SYS_NC00007$"=50000)

The first query automatically does an index skip scan on the normal index.

The second and third queries calculate the cardinality correctly, reference the right “virtual” column correctly, and come close to obeying the hint by using the required index – and then spoil things by doing an index full scan instead of an index skip scan. Checking the 10053 trace file I found that the optimizer hadn’t even considered the possibility of a skip scan for the last two queries – it had jumped straight to the full scan. (If the optimizer had ignored the hint completely I would have said that this behaviour was simply a limitation of FBIs – but since the optimizer has got partway there I think it’s probably a bug.)

Once you’ve got this far, of course, you might wonder if you could work around the problem in 11g by using virtual columns – so that’s the next test.

drop index t1_split;
drop index t1_fbi;

alter table t1 add (mod_virtual generated always as (mod1 + 1) virtual);
alter table t1 add (id_virtual generated always as (id1 + 1) virtual);

begin
	dbms_stats.gather_table_stats(
		ownname		 => user,
		tabname		 =>'T1',
		method_opt	 => 'for all columns size 1'
	);
end;
/

create index t1_virtual on t1(mod_virtual, id_virtual);

select
	/*+ index_ss(t1) */
	*
from	t1
where
	id_virtual = 50000
;

I started by dropping the two function-based indexes (because I’m about to create “proper” virtual columns which are identical to the hidden “virtual columns” that were supporting the function-based indexes – if I didn’t do this I’d hit Oracle error “ORA-54015: Duplicate column expression was specified”). Then I declared my virtual columns, collected stats on the whole table (I could have used method_opt=>’for all hidden columns size 1′) so that there would be stats on those columns, then created an index using the virtual columns and tried to get a skip scan on that index. Any bets ?

select  /*+ index_ss(t1) */  * from t1 where  id_virtual = 50000

Plan hash value: 4250443541

------------------------------------------------------------------------------------------
| Id  | Operation                   | Name       | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |            |       |       |   271 (100)|          |
|   1 |  TABLE ACCESS BY INDEX ROWID| T1         |     1 |   133 |   271   (2)| 00:00:02 |
|*  2 |   INDEX FULL SCAN           | T1_VIRTUAL |     1 |       |   270   (2)| 00:00:02 |
------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("ID_VIRTUAL"=50000)
       filter("ID_VIRTUAL"=50000)

Bad luck – even though “proper” virtual columns can do wonders, they don’t get around this problem. Anyone who’s used multi-column function-based indexes or added virtual columns to indexes might want to look at their code and execution paths to see if there are opportunities for skip scans that the optimizer is failing to take. If there are you might like to raise an SR with Oracle – and point them to this blog as a repeatable test case.

I tried several searches for a relevant bug on MoS – unfortunately it was one of those occasions where the search result was either very small and didn’t have what I wanted, or was so large that I wasn’t going to check all the hits. If anyone does find a bug number, please tell us about it in the comments.

 


Indexing LOBs

Fri, 2014-02-21 12:42

Many years ago, possibly when most sites were still using Oracle 8i, a possible solution to a particular customer problem was to create a function-based index on a CLOB column using the dbms_lob.getlength() function call. I can’t find the notes explaining why this was necessary (I usually have some sort of clue – such as the client name – in the script, but in this case all I had was a comment that “the manuals say you can’t do this, but it works provided you wrap the dbms_lob call inside a deterministic function”).

I never worked out why the dbms_lob.getlength() function wasn’t declared as deterministic – especially since it came complete with a most restrictive restricts_references pragma – so I had just assumed there was probably some good reason based on strange side effects when national language charactersets came into play. But here’s a little detail I noticed recently about the dbms_lob.getlength() function: it became deterministic in 11g, so if the client decided to implement my suggestion (which included the usual sorts of warnings) it’s now legal !
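
For those who haven’t seen the trick, the wrapper looked something like this – a sketch from memory, with the function, table and column names invented for the purpose:

create or replace function clob_len(i_lob in clob)
return number
deterministic
as
begin
	-- the deterministic declaration is what makes the FBI legal
	return dbms_lob.getlength(i_lob);
end;
/

create index t1_lob_len on t1(clob_len(clob_col));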

Footnote – the length() function has been deterministic and usable with LOBs for a long time, certainly since late 9i, but in 8i length(lob_col) will produce Oracle error “ORA-00932: inconsistent datatypes”


Index Compression – aargh

Fri, 2014-02-21 01:57

The problem with telling people that some feature of Oracle is a “good thing” is that some of those people will go ahead and use it; and if enough people use it some of them will discover a hitherto undiscovered defect. Almost inevitably the bug will turn out to be one of those “combinations” bugs that leaves you thinking: “Why the {insert preferred expression of disbelief here} should {feature X} have anything to do with {feature Y}”.

Here – based on index compression, as you may have guessed from the title – is one such bug. I got it first on 11.1.0.7, but it’s still there on 11.2.0.4 and 12.1.0.1.


create table t1 (
	id1	number not null,
	id2	number not null,
	n1	number,
	v1	varchar2(10)
)
rowdependencies
;

alter table t1 add constraint t1_pk primary key(id1, id2) using index (
	create index t1_pk on t1(id1, id2) compress 1
);

create table t2(
	id1	number not null,
	id2	number not null,
	id3	number not null,
	n1	number,
	v1	varchar2(10)
)
rowdependencies
;

alter table t2 add constraint t2_fk_t1 foreign key(id1, id2) references t1;

It’s quite simple – I’ve got a multi-column primary key, and it’s worth compressing on the first column because that column is fairly repetitive. Then I’ve created another table that has a foreign key constraint referencing my first table. Because I’ve got some replication going on and want to enable parallelism I’ve enabled rowdependencies all over the place. So let’s insert a couple of rows and see what happens – the next bit of text is cut-n-pasted from an 11.2.0.4 SQL*Plus session running a script after a call to set echo on:


SQL> insert into t1 values(1,1,1,'x');

1 row created.

SQL> commit;

Commit complete.

SQL>
SQL> insert into t2 values(1,1,1,1,'x');
insert into t2 values(1,1,1,1,'x')
            *
ERROR at line 1:
ORA-00600: internal error code, arguments: [25027], [5], [394178], [], [], [], [], [], [], [], [], []

For further details, and before you get completely thrilled at the possibility of compressing lots of indexes, keep an eye on:

Bug 18125878 : ORA-600 [25027] ON INSERT IF TABLE HAS ROWDEPENDENCIES AND COMPRESS ON PK

“Inserting into a table with a foreign key where the base table has a primary key using index key compression and the table also has row dependencies enabled. Stack will include kdsgrds and kdiexi0 (in 12) / kdiexi (in 11, 10)”

As implied by that note from the bug, it also affects 10g.  The bug note reports it as fixed in 12.2.
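
Pending an upgrade, the obvious damage limitation is to give up the compression on the affected primary key index – a sketch (it removes one of the two conditions named in the bug, at the cost of the space saving):

alter index t1_pk rebuild nocompress;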


Recursive subquery factoring

Sun, 2014-02-16 12:11

This is possibly my longest title to date – I try to keep them short enough to fit the right hand column of the blog without wrapping – but I couldn’t think of a good way to shorten it (Personally I prefer to use the expression CTE – common table expression – over “factored subquery” or “subquery factoring” or “with subquery”, and that would have achieved my goal, but might not have meant anything to most people.)

If you haven’t come across them before, recursive CTEs appeared in 11.2, are in the ANSI standard, and are (probably) viewed by Oracle as the strategic replacement for “connect by” queries. Here’s a simple (and silly) example:


with data(p) as (
	select 1 p from dual
	union all
	select p + 1 from data where p < 100
)
select	p
from	data
where	rownum <= 10
;

         P
----------
         1
         2
         3
         4
         5
         6
         7
         8
         9
        10

10 rows selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 37253879

---------------------------------------------------------------------------------------------------
| Id  | Operation                                  | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                           |      |     2 |    26 |     4   (0)| 00:00:01 |
|*  1 |  COUNT STOPKEY                             |      |       |       |            |          |
|   2 |   VIEW                                     |      |     2 |    26 |     4   (0)| 00:00:01 |
|   3 |    UNION ALL (RECURSIVE WITH) BREADTH FIRST|      |       |       |            |          |
|   4 |     FAST DUAL                              |      |     1 |       |     2   (0)| 00:00:01 |
|*  5 |     RECURSIVE WITH PUMP                    |      |       |       |            |          |
---------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter(ROWNUM<=10)
   5 - filter("P"<100)

A recursive CTE has three features that identify it. First, the query alias (“data” in this case) is followed by a list of column aliases; secondly the query includes a UNION ALL; and thirdly the second subquery in the UNION ALL references the query alias – i.e. it’s the recursive bit. There are other optional bits but I’m not planning to go into those – all I want to talk about is how to control the materialization (or not) of a recursive CTE through hinting.

The reason I wrote this note was because Jeff Jacobs, in his presentation on “Performance Anti-patterns” at RMOUG last week, raised the question of whether or not the /*+ materialize */ and /*+ inline */ hints worked with recursive CTEs and gave an example of a UNION ALL query where the CTE always materialized, no matter how you applied the /*+ inline */ hint. The CTE seemed to be following the basic guideline for CTEs – if you use it once in the main query it goes inline, if you use it more than once it will (almost invariably) materialize.

I’m always interested in examples where “the hint is ignored”, so I exchanged a couple of email messages with Jeff and he sent me an example (which I’ve simplified for this blog) of a query that demonstrated the issue; and I spent a little while thinking about it and decided that it simply wasn’t possible to hint the code the way we wanted to and it was just one of those cases where it takes a bit of time for new features to catch up and fit in to the standard framework. Here’s a simplified version of the query, with its execution plan:

with data(p) as (
	select 1 p from dual
	union all
	select p + 1 from data where p < 100
)
select	p
from	data
where	rownum <= 10
union all
select	p
from	data
where	rownum <= 10
;

-------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                  | Name                       | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                           |                            |     4 |    52 |     4   (0)| 00:00:01 |
|   1 |  TEMP TABLE TRANSFORMATION                 |                            |       |       |            |          |
|   2 |   LOAD AS SELECT                           | SYS_TEMP_0FD9D6608_7391CD7 |       |       |            |          |
|   3 |    UNION ALL (RECURSIVE WITH) BREADTH FIRST|                            |       |       |            |          |
|   4 |     FAST DUAL                              |                            |     1 |       |     2   (0)| 00:00:01 |
|*  5 |     RECURSIVE WITH PUMP                    |                            |       |       |            |          |
|   6 |   UNION-ALL                                |                            |       |       |            |          |
|*  7 |    COUNT STOPKEY                           |                            |       |       |            |          |
|   8 |     VIEW                                   |                            |     2 |    26 |     2   (0)| 00:00:01 |
|   9 |      TABLE ACCESS FULL                     | SYS_TEMP_0FD9D6608_7391CD7 |     2 |    12 |     2   (0)| 00:00:01 |
|* 10 |    COUNT STOPKEY                           |                            |       |       |            |          |
|  11 |     VIEW                                   |                            |     2 |    26 |     2   (0)| 00:00:01 |
|  12 |      TABLE ACCESS FULL                     | SYS_TEMP_0FD9D6608_7391CD7 |     2 |    12 |     2   (0)| 00:00:01 |
-------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   5 - filter("P"<100)
   7 - filter(ROWNUM<=10)
  10 - filter(ROWNUM<=10)

The following morning I woke up with one of those “overnight insights” where you seem to have worked out the answer in your sleep. To make a hint work you have to put it in the right query block, or you have to name the right query block in the main query block: in this case the right query block doesn’t exist in the text, and it’s not possible to figure out what the name of the right query block would be if it came into existence.

If you try putting the /*+ inline */ hint into the query after the select at line 2 above, you’ve put the hint into the first query block of a union all, NOT into the query block of the recursive CTE.

Having identified the problem, the solution (or at least, a possible solution) was obvious – create the query block you need. This (with its execution plan from 11.2.0.4) is what worked:

with data(p) as (
	select 1 p from dual
	union all
	select p + 1 from data where p < 100
),
data1 as (
	select /*+ inline */ * from data
)
select	p
from	(
	select * from data1 where rownum <= 10
	union all
	select * from data1 where rownum <= 10
	)
;

-----------------------------------------------------------------------------------------------------
| Id  | Operation                                    | Name | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                             |      |     4 |    52 |     8   (0)| 00:00:01 |
|   1 |  VIEW                                        |      |     4 |    52 |     8   (0)| 00:00:01 |
|   2 |   UNION-ALL                                  |      |       |       |            |          |
|*  3 |    COUNT STOPKEY                             |      |       |       |            |          |
|   4 |     VIEW                                     |      |     2 |    26 |     4   (0)| 00:00:01 |
|   5 |      UNION ALL (RECURSIVE WITH) BREADTH FIRST|      |       |       |            |          |
|   6 |       FAST DUAL                              |      |     1 |       |     2   (0)| 00:00:01 |
|*  7 |       RECURSIVE WITH PUMP                    |      |       |       |            |          |
|*  8 |    COUNT STOPKEY                             |      |       |       |            |          |
|   9 |     VIEW                                     |      |     2 |    26 |     4   (0)| 00:00:01 |
|  10 |      UNION ALL (RECURSIVE WITH) BREADTH FIRST|      |       |       |            |          |
|  11 |       FAST DUAL                              |      |     1 |       |     2   (0)| 00:00:01 |
|* 12 |       RECURSIVE WITH PUMP                    |      |       |       |            |          |
-----------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - filter(ROWNUM<=10)
   7 - filter("P"<100)
   8 - filter(ROWNUM<=10)
  12 - filter("P"<100)

All I’ve done is create a second CTE that selects from the first CTE. This is now a simple select, so I can add a perfectly placed hint to it – in the hope that this would effectively require the dependent recursive CTE to be inlined inside it. It seems to be sufficient.

I haven’t tested the strategy exhaustively, so I can give you no guarantee that it has to work. Unfortunately I did have another example that I applied the method to, and after several seconds of no response it crashed with an ORA-00600 error :( – but that might have been a side effect of the nature of the query (it included a couple of the optional extras) rather than a specific feature of inlining.

[Further reading on "ignoring hints"]

[Further reading on "subquery factoring"]


12c Subquery Factoring

Fri, 2014-02-14 05:44

From time to time I’ve posted a reminder that subquery factoring (“with subquery”) can give you changes in execution plans even if the subquery that you’ve taken out of line is written back inline by Oracle rather than being materialized. This can still happen in 12c – here’s a sample query in the two forms with the result sets and execution plans.  First, the “factored” version:


with e as (
	select
		deptno
	from	emp
	order by
		deptno
)
select
	deptno, count(*)
from
	e
group by deptno
;

    DEPTNO   COUNT(*)
---------- ----------
        10          3
        20          5
        30          6

3 rows selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 1536940522

-------------------------------------------------------------
| Id  | Operation            | Name | Rows  | Bytes | Cost  |
-------------------------------------------------------------
|   0 | SELECT STATEMENT     |      |     3 |     9 |     6 |
|   1 |  SORT GROUP BY NOSORT|      |     3 |     9 |     6 |
|   2 |   VIEW               |      |    14 |    42 |     6 |
|   3 |    SORT ORDER BY     |      |    14 |    42 |     6 |
|   4 |     TABLE ACCESS FULL| EMP  |    14 |    42 |     2 |
-------------------------------------------------------------

Note that the view has not been materialized (there’s no “TEMP TABLE TRANSFORMATION” operation). The aggregation is a “sort group by” with the NOSORT option, and the data has been reported in sorted order as a side effect of the “sort order by” at operation 3.

Now with the inline view manually written in place:


select   deptno, count(*)
from (
	select
		/*+ gather_plan_statistics */
		deptno
	from	emp
	order by
		deptno
) e
group by deptno
;

    DEPTNO   COUNT(*)
---------- ----------
        30          6
        20          5
        10          3

3 rows selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 4067220884

-----------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost  |
-----------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     3 |     9 |     6 |
|   1 |  HASH GROUP BY     |      |     3 |     9 |     6 |
|   2 |   TABLE ACCESS FULL| EMP  |    14 |    42 |     2 |
-----------------------------------------------------------

In principle the optimizer should have rewritten the first query into the second form and optimized it – but there’s clearly something out of order in the optimisation. As you can see, the result is no longer in the same order (not that we’re entitled to expect any specific order, of course, since the query doesn’t end with an “order by” clause) and the aggregation step is a hash group by rather than a sort group by.
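
Of course, if the final ordering actually matters you shouldn’t depend on the side effect in the first place – a minimal sketch of the safe version, with an explicit final “order by”:

select	deptno, count(*)
from	emp
group by
	deptno
order by
	deptno
;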

Footnote:

I know that I got this example off someone else, but all I’ve written into my script is a URL, and it’s the wrong URL – my clipboard buffer seems to lag a little sometimes when I’m copying from the host to a virtual machine. The URL I’ve got is to an article by Steven Pinker in the New Republic, and it had nothing to do with Oracle (though it was comparing science and the humanities).

If you’re the original source of the example, please let me know so that I can add an acknowledgement.

[Further reading on "subquery factoring"]

Test case

If you would like to reproduce the example, here’s the script – it’s a simple hack of the Scott emp/dept schema.

CREATE TABLE EMP (
	EMPNO		NUMBER(4) NOT NULL,
        ENAME		VARCHAR2(10),
        JOB		VARCHAR2(9),
        MGR		NUMBER(4),
        HIREDATE	DATE,
        SAL		NUMBER(7, 2),
        COMM		NUMBER(7, 2),
        DEPTNO		NUMBER(2)
);

insert into emp values
        (7369, 'SMITH',  'CLERK',     7902,
        to_date('17-DEC-1980', 'DD-MON-YYYY'),  800, NULL, 20);

insert into emp values
        (7499, 'ALLEN',  'SALESMAN',  7698,
        to_date('20-FEB-1981', 'DD-MON-YYYY'), 1600,  300, 30);

insert into emp values
        (7521, 'WARD',   'SALESMAN',  7698,
        to_date('22-FEB-1981', 'DD-MON-YYYY'), 1250,  500, 30);

insert into emp values
        (7566, 'JONES',  'MANAGER',   7839,
        to_date('2-APR-1981', 'DD-MON-YYYY'),  2975, NULL, 20);

insert into emp values
        (7654, 'MARTIN', 'SALESMAN',  7698,
        to_date('28-SEP-1981', 'DD-MON-YYYY'), 1250, 1400, 30);

insert into emp values
        (7698, 'BLAKE',  'MANAGER',   7839,
        to_date('1-MAY-1981', 'DD-MON-YYYY'),  2850, NULL, 30);

insert into emp values
        (7782, 'CLARK',  'MANAGER',   7839,
        to_date('9-JUN-1981', 'DD-MON-YYYY'),  2450, NULL, 10);

insert into emp values
        (7788, 'SCOTT',  'ANALYST',   7566,
        to_date('09-DEC-1982', 'DD-MON-YYYY'), 3000, NULL, 20);

insert into emp values
        (7839, 'KING',   'PRESIDENT', NULL,
        to_date('17-NOV-1981', 'DD-MON-YYYY'), 5000, NULL, 10);

insert into emp values
        (7844, 'TURNER', 'SALESMAN',  7698,
        to_date('8-SEP-1981', 'DD-MON-YYYY'),  1500,    0, 30);

insert into emp values
        (7876, 'ADAMS',  'CLERK',     7788,
        to_date('12-JAN-1983', 'DD-MON-YYYY'), 1100, NULL, 20);

insert into emp values
        (7900, 'JAMES',  'CLERK',     7698,
        to_date('3-DEC-1981', 'DD-MON-YYYY'),   950, NULL, 30);

insert into emp values
        (7902, 'FORD',   'ANALYST',   7566,
        to_date('3-DEC-1981', 'DD-MON-YYYY'),  3000, NULL, 20);

insert into emp values
        (7934, 'MILLER', 'CLERK',     7782,
        to_date('23-JAN-1982', 'DD-MON-YYYY'), 1300, NULL, 10);

CREATE TABLE DEPT(
	DEPTNO	NUMBER(2),
        DNAME	VARCHAR2(14),
        LOC	VARCHAR2(13)
);

insert into dept values (10, 'ACCOUNTING', 'NEW YORK');
insert into dept values (20, 'RESEARCH',   'DALLAS');
insert into dept values (30, 'SALES',      'CHICAGO');
insert into dept values (40, 'OPERATIONS', 'BOSTON');

commit;

alter table emp add constraint emp_pk primary key(empno);
alter table dept add constraint dept_pk primary key(deptno);
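
The script doesn’t gather statistics – if you want to reproduce the cardinality estimates shown in the plans above you may need a call like the following (a minimal sketch; your preferred options may differ):

begin
	dbms_stats.gather_table_stats(user, 'EMP');
	dbms_stats.gather_table_stats(user, 'DEPT');
end;
/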


Caution – hints

Wed, 2014-02-12 12:57

Here’s a little example of why you should be very cautious about implementing undocumented discoveries. If you take a look at the view v$sql_hints in 11.2.0.4 you’ll discover a hint (no_)cluster_by_rowid; and if you look in v$parameter you’ll discover two new parameters _optimizer_cluster_by_rowid and _optimizer_cluster_by_rowid_control.
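
If you want to check for the hint yourself, a query like the following should find it (a sketch – the exact column list of v$sql_hints varies a little across versions):

select	name, inverse, version
from	v$sql_hints
where	name like '%CLUSTER_BY_ROWID%'
;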

It doesn’t take much imagination to guess that the parameters and hint have something to do with the costs of accessing compressed data by rowid on an Exadata system (see, for example, this posting), and it’s very easy to check what the hint does:


create table t1
as
select
	rownum					id,
	trunc(dbms_random.value(0,1000))	n1,
	trunc(dbms_random.value(0,1000))	n2,
	rpad('x',100,'x')			padding
from
	all_objects
where
	rownum <= 10000
;

create index t1_i1 on t1(n1,n2);

-- collect stats, e.g. with a minimal call:
execute dbms_stats.gather_table_stats(user, 't1')

set autotrace on explain

select
	/*+
		index(t1)
		cluster_by_rowid(t1)
	*/
	id
from
	t1
where
	n1 = 50
;

I’ve created an index on the two randomly generated columns (n1, n2) and then forced Oracle to walk the index searching for n1 = 50. Because of the presence of n2 in the index, the order in which Oracle visits the rows will be randomised relative to the order in which they are stored in the table – but if I can get Oracle to sort the rowids into order before visiting the table I could (in theory) get a performance benefit by minimising the number of consistent gets (some will turn into “buffer is pinned count”), and in the Exadata case I could get a big improvement by ensuring that each compression unit is decompressed no more than once. That’s what the hint does. To prove the point I can select the id column: if the rows are being visited in order of rowid then the output will appear in order of id:


        ID
----------
       198
       681
      1257
      1318
      5593
      7094
      7471
      8048
      8798
      8855
      9670

11 rows selected.

---------------------------------------------------------------------
| Id  | Operation                   | Name  | Rows  | Bytes | Cost  |
---------------------------------------------------------------------
|   0 | SELECT STATEMENT            |       |    10 |    80 |    12 |
|   1 |  TABLE ACCESS BY INDEX ROWID| T1    |    10 |    80 |    12 |
|   2 |   SORT CLUSTER BY ROWID     |       |    10 |       |     2 |
|*  3 |    INDEX RANGE SCAN         | T1_I1 |    10 |       |     2 |
---------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - access("N1"=50)

Sure enough – with the hint the data arrives in sorted order; without it the results are not sorted. And, of course, we can check the execution plan and see that the optimizer has inserted a SORT operation between the index range scan and the table access by rowid.
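
If you want to see the effect on buffer visits rather than just the output order, a quick (sketched) comparison with SQL*Plus autotrace statistics should show fewer consistent gets with the hint in place:

set autotrace traceonly statistics

select /*+ index(t1) cluster_by_rowid(t1) */ id from t1 where n1 = 50;
select /*+ index(t1) */ id from t1 where n1 = 50;

set autotrace off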

So here’s the fun bit – the parameters are still there in 12.1.0.1 (and _optimizer_cluster_by_rowid is now true by default where it used to be false) but the hint is no longer available. So if you want the sort to take place you have to depend (presumably) on the statistics – i.e. on whether Oracle thinks it’s worth doing automatically – or you have to write the query as a two-part job, with an inline non-mergeable view that acquires and sorts the rowids before visiting the table, as sketched below.
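
A sketch of that manual rewrite (untested against the data above – the no_merge hint keeps the view intact, and the “order by rowid” delivers the rowids in table order before the join back by rowid):

select	t1.id
from	(
	select	/*+ no_merge index(t1) */
		rowid	rid
	from	t1
	where	n1 = 50
	order by
		rowid
	)	v,
	t1
where
	t1.rowid = v.rid
;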


11.2.0.4 Upgrade

Wed, 2014-02-12 12:01

A warning on Oracle-L from Chris Dunscombe: if you’ve got a large stats history – with lots of histogram data – then the upgrade to 11.2.0.4 could take an unexpectedly long time. Presumably the same is true if you upgrade from 11.2.0.3 (or earlier) to 12c.
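
If you want a rough idea of your exposure before you start, something like the following should give you a clue (a sketch – the WRI$ history tables belong to SYS, so run it as a suitably privileged user):

select	dbms_stats.get_stats_history_retention	retention_days
from	dual
;

select	count(*)
from	sys.wri$_optstat_histgrm_history
;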