Jonathan Lewis

Subscribe to Jonathan Lewis feed Jonathan Lewis
Just another Oracle weblog
Updated: 3 min 39 sec ago

Securefile space

Tue, 2016-09-13 01:29

A script hacked together a couple of years ago from a clone of a script I’d been using for checking space usage in the older types of segments. Oracle Corp. eventually put together a routine to peer inside securefile LOBs:

rem
rem     Script:         dbms_space_use_sf.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Dec 2013
rem     Purpose:
rem
rem     Last tested
rem             12.1.0.1
rem             11.2.0.4
rem     Not tested
rem             11.1.0.7
rem     Not relevant
rem             10.2.0.5
rem              9.2.0.8
rem              8.1.7.4
rem
rem     Notes:
rem     See also dbms_space_use.sql
rem
rem     11g introduced securefiles lobs and two overloads of
rem     dbms_space_usage to report space used by their segments
rem
rem     Valid values for suoption are:
rem             SPACEUSAGE_EXACT (16): Computes space usage exhaustively
rem             SPACEUSAGE_FAST  (17): Retrieves values from in-memory statistics
rem


define m_seg_owner      = &1
define m_seg_name       = &2
define m_seg_type       = &3

define m_segment_owner  = &m_seg_owner
define m_segment_name   = &m_seg_name
define m_segment_type   = &m_seg_type

spool dbms_space_use_sf

prompt  ============
prompt  Secure files
prompt  ============

declare
        wrong_ssm       exception;
        pragma exception_init(wrong_ssm, -10614);

        m_segment_size_blocks   number(12,0);
        m_segment_size_bytes    number(12,0);
        m_used_blocks           number(12,0);
        m_used_bytes            number(12,0);
        m_expired_blocks        number(12,0);
        m_expired_bytes         number(12,0);
        m_unexpired_blocks      number(12,0);
        m_unexpired_bytes       number(12,0);

begin
        dbms_space.space_usage(
                upper('&m_segment_owner'),
                upper('&m_segment_name'),
                upper('&m_segment_type'),
--              PARTITION_NAME          => null,
                suoption                => dbms_space.spaceusage_exact,
--              suoption                => dbms_space.spaceusage_fast,
                segment_size_blocks     => m_segment_size_blocks,
                segment_size_bytes      => m_segment_size_bytes,
                used_blocks             => m_used_blocks,
                used_bytes              => m_used_bytes,
                expired_blocks          => m_expired_blocks,
                expired_bytes           => m_expired_bytes,
                unexpired_blocks        => m_unexpired_blocks,
                unexpired_bytes         => m_unexpired_bytes
        );

        dbms_output.new_line;
        dbms_output.put_line(' Segment Blocks:    '|| m_segment_size_blocks || ' Bytes: '|| m_segment_size_bytes);
        dbms_output.put_line(' Used Blocks:       '|| m_used_blocks         || ' Bytes: '|| m_used_bytes);
        dbms_output.put_line(' Expired Blocks     '|| m_expired_blocks      || ' Bytes: '|| m_expired_bytes);
        dbms_output.put_line(' Unexpired Blocks   '|| m_unexpired_blocks    || ' Bytes: '|| m_unexpired_bytes);

exception
        when wrong_ssm then
                dbms_output.put_line('Segment not ASSM');
end;
/

prompt  ===============
prompt  Generic details
prompt  ===============

declare
        m_TOTAL_BLOCKS                  number;
        m_TOTAL_BYTES                   number;
        m_UNUSED_BLOCKS                 number;
        m_UNUSED_BYTES                  number;
        m_LAST_USED_EXTENT_FILE_ID      number;
        m_LAST_USED_EXTENT_BLOCK_ID     number;
        m_LAST_USED_BLOCK               number;
begin
        dbms_space.UNUSED_SPACE(
                upper('&m_segment_owner'),
                upper('&m_segment_name'),
                upper('&m_segment_type'),
--              PARTITION_NAME                  => null,
                m_TOTAL_BLOCKS,
                m_TOTAL_BYTES,
                m_UNUSED_BLOCKS,
                m_UNUSED_BYTES,
                m_LAST_USED_EXTENT_FILE_ID,
                m_LAST_USED_EXTENT_BLOCK_ID,
                m_LAST_USED_BLOCK
        );

        dbms_output.put_line('Segment Total blocks: '  || m_total_blocks);
        dbms_output.put_line('Object Unused blocks: '  || m_unused_blocks);

end;
/

spool off

Sample of output:


============
Secure files
============

 Segment Blocks:    168960 Bytes: 1384120320
 Used Blocks:       151165 Bytes: 1238343680
 Expired Blocks     17795 Bytes: 145776640
 Unexpired Blocks   0 Bytes: 0

PL/SQL procedure successfully completed.

===============
Generic details
===============
Segment Total blocks: 168960
Object Unused blocks: 0

PL/SQL procedure successfully completed.



Basicfile LOBs 6

Sun, 2016-09-11 13:48

One of the nice things about declaring your (basicfile) LOBs as “enable storage in row” is that the block addresses of the first 12 chunks will be listed in the row and won’t use the LOB index, so if your LOBs are larger than 3960 bytes but otherwise rather small the LOB index will hold only the timestamp entries for deleted LOBs. This makes it just a little easier to pick out the information you need when things behave strangely, so in this installment of my series I’m going to take about an example with with storage enabled in row.

I’m going to demonstrate an issue that is causing a major problem. First I’m going to build a table with a LOB column with multiple (4) freepools – because that’s what you do to handle concurrency – then I’m going to start 4 sessions (carefully checking that I have one associated with each free pool) and do a few thousand inserts with commits from each session. The size of the LOB value I insert will be 20KB so it will be “indexed” in the row but stored out of the row taking 3 LOB blocks.

Once I’ve got the data in place I’m going to use three of the sessions to delete three quarters of the rows from the table then use a call to the dbms_space package to show you that the segment contains virtually no free space. I’ve engineered the code so that it will take just three more rows in the table to fill the available free space and force Oracle either to allocate a new extent or to start reclaiming some of the delete reusable LOB space – and I’m going to run that insert from the session that DIDN’T delete any rows.

I’ve been running these tests on 11.2.0.4, but get similar behaviour on 12c.

create table t1(
        id      number constraint t1_pk primary key,
        c1      clob
)
lob (c1)
store as 
    basicfile 
    text_lob(
            enable storage in row
            chunk 8k
            retention
            nocache
            logging
            freepools 4
            tablespace test_8k_assm
)
;

declare
        m_v1 varchar2(32767) := rpad('X',20000,'X');
begin
        for i in 0..0 loop
                insert into t1 values (i, m_v1);
                commit;
        end loop;
end;
/

truncate table t1
;

You’ll notice I’ve used the retention keyword.  Before I built the LOB I set my undo_retention to 10 seconds so that the space from deleted LOBs should become available for reuse very quickly. The name of the tablespace I’ve used for the LOB is a clue that I’m using an 8KB block size and ASSM (the latter is a requirement of the retention option).

Here’s the code to check which freepool (0 to 3) a session will be associated with (this isn’t documented, but seems to be correct);


select mod(pid,4) from v$process where addr = (
        select paddr from v$session where sid = (
                select sid from v$mystat where rownum = 1
        )
)
;

So I can keep starting sessions and running that query until I’ve got a session covering each freepool. (The first time I tried this I had to start 7 sessions before I got all 4 freepools covered). Now I can run the following from all 4 sessions concurrently:

define m_loop_counter = 12027

lock table t1 in row share mode;
commit;

declare
        m_v1 varchar2(32767) := rpad('x',20000,'x');
begin
        for i in 1..&m_loop_counter loop
                insert into t1 values (s1.nextval, m_v1);
                commit;
        end loop;
end;
/

The purpose of the lock table command is to ensure that all 4 processes start running simultaneously. From a fifth session I execute a “lock table t1 in exclusive mode” before starting the other four sessions running, so they all queue on the exclusive lock; then I commit from the fifth session and away we go. The whole thing took about 30 seconds to run. The rather random-looking value 12,027 was a careful choice to ensure that the last extent in the segment had just a few blocks left – and I used my “dbms_space_use.sql” script to check this, getting the following output:

====
ASSM
====

Unformatted                   :        7 /       57,344
Freespace 1 (  0 -  25% free) :        0 /            0
Freespace 2 ( 25 -  50% free) :        0 /            0
Freespace 3 ( 50 -  75% free) :        0 /            0
Freespace 4 ( 75 - 100% free) :        0 /            0
Full                          :  144,324 / ############

PL/SQL procedure successfully completed.

=======
Generic
=======
Segment Total blocks: 145536
Object Unused blocks: 0

I’ve got 7 “unformatted” blocks in the segment – though in fact these might be “formatted but free” from the perspective of the LOB code.

After going to sessions 0, 1, and 3 and deleting 12,000 rows from each in turn (and committing, leaving a total of 12,108 rows in the table) the report doesn’t change: I haven’t made any space free I’ve simply flagged it in the LOB index as “reusable”. So now we go to session 2 and run the following code 3 times – with “set timing on”:


SQL> l
  1  declare
  2     m_v1 varchar2(32767) := rpad('x',20000,'x');
  3  begin
  4     for i in 1..1 loop
  5             insert into t1 values (s1.nextval, m_v1);
  6             commit;
  7     end loop;
  8* end;

The first run took 0.02 seconds – and the unformatted count dropped to 4

The second run took 0.01 seconds – and the unformatted count dropped to 1

The third run took 10.74 seconds, of which 9 seconds was CPU. The session generated 500,000 redo entries totalling 100MB of redo from 1 million db block changes after doing 8.4 million logical I/Os, issuing 108,000 enqueue (lock) requests and running 108,000 index range scans. The report of space usage ended up looking like this:


Unformatted                   :  108,125 /  885,760,000
Freespace 1 (  0 -  25% free) :        0 /            0
Freespace 2 ( 25 -  50% free) :        0 /            0
Freespace 3 ( 50 -  75% free) :        0 /            0
Freespace 4 ( 75 - 100% free) :        0 /            0
Full                          :   36,333 /  297,639,936

PL/SQL procedure successfully completed.

=======
Generic
=======
Segment Total blocks: 145664
Object Unused blocks: 0

My session has cleared every single piece of re-usable space from the LOB and made it free (unformatted) before allocating space for its one LOB. (That’s going to hurt when the client has 2 million LOBs on the reusable list and isn’t running everything on SSDs – which is why I’m working on this problem).

If you’re wondering why it takes so much redo and so many buffer visits to free 36,000 LOBs this (roughly) is what Oracle does to free up one reusable LOB of 3 blocks – which corresponds to a single index entry carrying three block ids:

  • Find the lowest index entry in the freepool, pin the index leaf block
  • Identify the last block in the list of 3
  • Lock the relevant L1 space management block for the segment and set relevant “bit” to “unformatted”
  • Delete the index entry
  • Re-insert the index entry with one block id removed
  • Commit and unlock the L1 bitmap block
  • Repeat delete/insert the cycle for 2nd block id
  • Repeat the cycle for 3rd (or 1st since we’re going backwards) block id – but don’t re-insert the index entry

Oracle reclaims one block (chunk) at a time. And that’s a bit of a clue to a possible workaround because event 44951 gets mentioned a couple of times in MoS and on the internet as a workaround to a particular problem of HW enqueue waits for LOBS. MoS note 740075.1 tells us:

When using Automatic Segment Space Management (ASSM), and the fix for Bug 6376915 has been applied in your database (Included in 10.2.0.4 +) it is possible to adjust the number of chunks that are cleaned up when the chunk cleanup operation is required.

This can be enabled by setting event 44951 to a value between 1 and 1024 (default is 1). With the value between 1 and 1024 setting the number of chunks to be cleaned up each time a chunk reclaimation operation occurs. This can therefore reduce the number of requests for the High Watermark Enqueue.

Other notes explain that by default only one chunk is cleaned up at a time – which is exactly the behaviour I’m seeing. So what happens when I bounce the database with this event set at level 5 (an arbitrary choice, but larger than the LOBs I’ve been inserting) in the parameter file and repeat the experiment ? On the first attempt it made no difference, but then I changed the experiment slightly and started again. Initially I had done my first re-insert from the one session that hadn’t deleted any rows – which made it an extreme boundary condition; on the second attempt I deleted two rows from the session that had not yet deleted any data (and waited for the retention time to elapse) before doing the inserts from that session.

Deleting two rows would put 6 blocks (in two index entries) onto my re-usable list, so I was starting the inserts with 7 free blocks, 6 reusable blocks and the event set to level 5. Here’s what I saw as I inserted rows one by one.

  • Insert one row: “Unformatted” blocks went up to 9:  I had freed 5 of the reusable blocks then used 3 of them for my lob (7 + 5 – 3 = 9)
  • Insert one row: “Unformatted” blocks went down to 7: I had freed the last reusable block then used 3 blocks for my lob (9 + 1 – 3 = 7)
  • Insert one row: “Unformatted” blocks went down to 4
  • Insert one row: “Unformatted” blocks went down to 1
  • Insert one row: Oracle cleared all the reusable space (11 seconds, 500MB redo), then added an extent (!) to the segment and used 2 of its blocks for part of the new LOB.

So the event isn’t really connected with my problem – though it adds some efficiency to the processing – and my  “boundary condition” is one that’s likely to occur fairly frequently if you’ve got a basicfile LOB defined with multiple freepools. Fortunately it’s probably going to require two pre-conditions before it’s a big problem: first that you’re handling a large number of LOBs and second that your pattern of inserting and deleting is not symmetric – it’s when you use a large number of concurrent sessions for small batches of inserts but a single session for large bulk deletes that all hell can break loose shortly after a delete.

tl;dr

As with many other features of Oracle, skew plays a part in making things break. If you’re doing lots of inserts and deletes of basicfile lobs make sure the mechanisms you use for inserting and deleting look similar: in particular similar numbers of processes to do similar amounts of work for both operations.

 

P.S. It gets worse.

P.P.S. Don’t even start to think that you can work around this by using securefiles.

P.P.P.S. I got an hint from one test that if a reusable LOB is exactly the same size as the LOB being inserted then Oracle very cleverly takes the entry index entry and rewrites it to be the LOB index entry rather than freeing (and then potentially using) the space it identifies.

 


Basicfile LOBS 5

Sun, 2016-09-11 12:49

At the end of the last installment we had seen a test case that caused Oracle to add a couple of redundant new extents to a LOB segment after one process deleted 3,000 LOBs and another four concurrent processes inserted 750 LOBs each a few minutes later (after the undo retention period had elapsed). To add confusion the LOBINDEX seemed to show that all the “reusable” chunks had been removed from the index which suggests that they should have been re-used. Our LOB segment started at 8,192 blocks, is currently at 8,576 blocks and is only using 8,000 of them.

How will things look if I now connect a new session (which might be associated with a different freepool), delete the oldest 3,000 LOBs, wait a little while, then get my original four sessions to do their concurrent inserts again ? And what will things look like after I’ve repeated this cycle several times ?

I had to drop the tables from my original test since writing the previous article, so the following results start from recreating the whole test from scratch and won’t align perfectly with the previous sets of results. Here’s what the index treedump looked like after going through the serial delete / concurrent insert cycle 12 times:

----- begin tree dump
branch: 0x1800204 25166340 (0: nrow: 71, level: 1)
   leaf: 0x1800223 25166371 (-1: nrow: 0 rrow: 0)
   leaf: 0x1800227 25166375 (0: nrow: 0 rrow: 0)
   leaf: 0x1800236 25166390 (1: nrow: 0 rrow: 0)
   leaf: 0x180023d 25166397 (2: nrow: 63 rrow: 63)
   leaf: 0x1800206 25166342 (3: nrow: 81 rrow: 81)
   leaf: 0x1800225 25166373 (4: nrow: 81 rrow: 81)
   leaf: 0x1800229 25166377 (5: nrow: 81 rrow: 81)
   leaf: 0x180020a 25166346 (6: nrow: 81 rrow: 81)
   leaf: 0x180020e 25166350 (7: nrow: 81 rrow: 81)
   leaf: 0x1800212 25166354 (8: nrow: 76 rrow: 76)
   leaf: 0x1800216 25166358 (9: nrow: 81 rrow: 81)
   leaf: 0x180021a 25166362 (10: nrow: 81 rrow: 81)
   leaf: 0x180021e 25166366 (11: nrow: 81 rrow: 81)
   leaf: 0x1800222 25166370 (12: nrow: 126 rrow: 126)

   leaf: 0x1800266 25166438 (13: nrow: 0 rrow: 0)
   leaf: 0x180025e 25166430 (14: nrow: 39 rrow: 39)
   leaf: 0x1800262 25166434 (15: nrow: 81 rrow: 81)
   leaf: 0x1800243 25166403 (16: nrow: 81 rrow: 81)
   leaf: 0x1800261 25166433 (17: nrow: 76 rrow: 76)
   leaf: 0x1800269 25166441 (18: nrow: 81 rrow: 81)
   leaf: 0x180026d 25166445 (19: nrow: 81 rrow: 81)
   leaf: 0x1800271 25166449 (20: nrow: 81 rrow: 81)
   leaf: 0x1800275 25166453 (21: nrow: 81 rrow: 81)
   leaf: 0x1800279 25166457 (22: nrow: 81 rrow: 81)
   leaf: 0x180027d 25166461 (23: nrow: 81 rrow: 81)
   leaf: 0x180024a 25166410 (24: nrow: 118 rrow: 118)

   leaf: 0x1800263 25166435 (25: nrow: 0 rrow: 0)
   leaf: 0x180024c 25166412 (26: nrow: 0 rrow: 0)
   leaf: 0x1800254 25166420 (27: nrow: 0 rrow: 0)
   leaf: 0x1800264 25166436 (28: nrow: 1 rrow: 0)
   leaf: 0x1800274 25166452 (29: nrow: 2 rrow: 0)
   leaf: 0x180027c 25166460 (30: nrow: 2 rrow: 0)
   leaf: 0x180025d 25166429 (31: nrow: 2 rrow: 0)
   leaf: 0x1800241 25166401 (32: nrow: 2 rrow: 0)
   leaf: 0x1800245 25166405 (33: nrow: 2 rrow: 0)
   leaf: 0x1800265 25166437 (34: nrow: 1 rrow: 0)
   leaf: 0x1800251 25166417 (35: nrow: 3 rrow: 0)
   leaf: 0x1800249 25166409 (36: nrow: 4 rrow: 0)
   leaf: 0x1800242 25166402 (37: nrow: 1 rrow: 0)
   leaf: 0x1800255 25166421 (38: nrow: 2 rrow: 0)
   leaf: 0x1800259 25166425 (39: nrow: 3 rrow: 0)
   leaf: 0x1800246 25166406 (40: nrow: 1 rrow: 0)

   leaf: 0x1800214 25166356 (41: nrow: 38 rrow: 0)
   leaf: 0x1800218 25166360 (42: nrow: 81 rrow: 0)
   leaf: 0x180021c 25166364 (43: nrow: 81 rrow: 0)
   leaf: 0x1800220 25166368 (44: nrow: 0 rrow: 0)
   leaf: 0x180022d 25166381 (45: nrow: 26 rrow: 26)
   leaf: 0x1800231 25166385 (46: nrow: 81 rrow: 81)
   leaf: 0x1800219 25166361 (47: nrow: 81 rrow: 81)
   leaf: 0x1800235 25166389 (48: nrow: 81 rrow: 81)
   leaf: 0x1800239 25166393 (49: nrow: 81 rrow: 81)
   leaf: 0x180022c 25166380 (50: nrow: 81 rrow: 81)
   leaf: 0x180023c 25166396 (51: nrow: 81 rrow: 81)
   leaf: 0x180022b 25166379 (52: nrow: 81 rrow: 81)
   leaf: 0x180022f 25166383 (53: nrow: 81 rrow: 81)
   leaf: 0x1800233 25166387 (54: nrow: 81 rrow: 81)
   leaf: 0x1800237 25166391 (55: nrow: 81 rrow: 81)
   leaf: 0x180023b 25166395 (56: nrow: 79 rrow: 79)
   leaf: 0x180023f 25166399 (57: nrow: 81 rrow: 81)
   leaf: 0x1800208 25166344 (58: nrow: 81 rrow: 81)
   leaf: 0x180020c 25166348 (59: nrow: 81 rrow: 81)
   leaf: 0x1800210 25166352 (60: nrow: 120 rrow: 120)

   leaf: 0x180021d 25166365 (61: nrow: 0 rrow: 0)

   leaf: 0x1800248 25166408 (62: nrow: 21 rrow: 21)
   leaf: 0x1800268 25166440 (63: nrow: 81 rrow: 81)
   leaf: 0x180026c 25166444 (64: nrow: 152 rrow: 152)
   leaf: 0x180026b 25166443 (65: nrow: 152 rrow: 152)
   leaf: 0x180026f 25166447 (66: nrow: 152 rrow: 152)
   leaf: 0x1800273 25166451 (67: nrow: 152 rrow: 152)
   leaf: 0x1800277 25166455 (68: nrow: 152 rrow: 152)
   leaf: 0x180027b 25166459 (69: nrow: 66 rrow: 66)
----- end tree dump

As usual I’ve split the treedump into the sections that reflect the freepools, each of which could consist of two parts the LOBs (key values starting with even numbers) and the “reusable chunks” (key values starting with odd numbers). The dump suggests that things have worked well: as you can see it’s grown a few blocks after my 12 cycles but there are only 6 sections (not the full 8 that might be there), and only a few leaf blocks showing “empty” (rrows = 0). As “reusable” sections have appeared the index has grown a little, then the reusable entries have been taken off the index and the index has shrunk a bit; you can even see that freepool 3 (the highest numbered one) is still showing a pattern of 152 LOBs indexed per block – this is despite the fact that at one point a reusable section for freepool 3 (00 07) appeared above this section and then disappeared as those reusable chunks were reclaimed.

All in all the index seems to be behaving extremely well, with only a little growth and (probably temporarily) a couple of little glitches of empty leaf blocks.

Here’s the dump of the (slightly edited) “col 0” values to confirm where the freepool breaks were –

 0:     col 0; len 10; (10):  00 00 00 01 00 00 09 df 2c cc
 1:     col 0; len 10; (10):  00 00 00 01 00 00 09 df 31 61
 2:     col 0; len 10; (10):  00 00 00 01 00 00 09 df 33 61
 3:     col 0; len  9; ( 9):  00 00 00 01 00 00 09 df 36
 4:     col 0; len 10; (10):  00 00 00 01 00 00 09 df 37 1e
 5:     col 0; len 10; (10):  00 00 00 01 00 00 09 df 38 37
 6:     col 0; len 10; (10):  00 00 00 01 00 00 09 df 39 1e
 7:     col 0; len 10; (10):  00 00 00 01 00 00 09 df 3a 37
 8:     col 0; len 10; (10):  00 00 00 01 00 00 09 df 3b 50
 9:     col 0; len  9; ( 9):  00 00 00 01 00 00 09 df 3c
10:     col 0; len 10; (10):  00 00 00 01 00 00 09 df 3d 4b
11:     col 0; len  9; ( 9):  00 00 00 01 00 00 09 df 3e
12:     col 0; len 10; (10):  00 00 00 01 00 00 09 df 3e e7

13:     col 0; len 10; (10):  00 02 00 01 00 00 09 df 25 9b
14:     col 0; len 10; (10):  00 02 00 01 00 00 09 df 32 a0
15:     col 0; len 10; (10):  00 02 00 01 00 00 09 df 34 1d
16:     col 0; len 10; (10):  00 02 00 01 00 00 09 df 36 c6
17:     col 0; len 10; (10):  00 02 00 01 00 00 09 df 39 3d
18:     col 0; len  9; ( 9):  00 02 00 01 00 00 09 df 3d
19:     col 0; len 10; (10):  00 02 00 01 00 00 09 df 3f 52
20:     col 0; len 10; (10):  00 02 00 01 00 00 09 df 40 cf
21:     col 0; len 10; (10):  00 02 00 01 00 00 09 df 41 20
22:     col 0; len 10; (10):  00 02 00 01 00 00 09 df 41 71
23:     col 0; len 10; (10):  00 02 00 01 00 00 09 df 41 c2
24:     col 0; len 10; (10):  00 02 00 01 00 00 09 df 42 13

25:     col 0; len 10; (10):  00 03 57 bc ba 2f 00 00 00 00
26:     col 0; len 10; (10):  00 03 57 bc ba 2f 00 00 00 00
27:     col 0; len 10; (10):  00 03 57 bc ba 2f 00 00 00 00
28:     col 0; len 10; (10):  00 03 57 bc ba 2f 00 00 00 00
29:     col 0; len 10; (10):  00 03 57 bc ba 2f 00 00 00 00
30:     col 0; len 10; (10):  00 03 57 bc ba 2f 00 00 00 00
31:     col 0; len 10; (10):  00 03 57 bc ba 2f 00 00 00 00
32:     col 0; len 10; (10):  00 03 57 bc ba 2f 00 00 00 00
33:     col 0; len 10; (10):  00 03 57 bc ba 2f 00 00 00 00
34:     col 0; len 10; (10):  00 03 57 bc ba 2f 00 00 00 00
35:     col 0; len 10; (10):  00 03 57 bc ba 2f 00 00 00 00
36:     col 0; len 10; (10):  00 03 57 bc ba 2f 00 00 00 00
37:     col 0; len 10; (10):  00 03 57 bc ba 2f 00 00 00 00
38:     col 0; len 10; (10):  00 03 57 bc ba 2f 00 00 00 00
39:     col 0; len 10; (10):  00 03 57 bc ba 2f 00 00 00 00
40:     col 0; len 10; (10):  00 03 57 bc ba 2f 00 00 00 00

41:     col 0; len 10; (10):  00 04 00 01 00 00 09 df 26 52
42:     col 0; len 10; (10):  00 04 00 01 00 00 09 df 2a 27
43:     col 0; len 10; (10):  00 04 00 01 00 00 09 df 2a dc
44:     col 0; len 10; (10):  00 04 00 01 00 00 09 df 2b 2d
45:     col 0; len 10; (10):  00 04 00 01 00 00 09 df 31 34
46:     col 0; len 10; (10):  00 04 00 01 00 00 09 df 33 15
47:     col 0; len 10; (10):  00 04 00 01 00 00 09 df 34 92
48:     col 0; len 10; (10):  00 04 00 01 00 00 09 df 34 e3
49:     col 0; len 10; (10):  00 04 00 01 00 00 09 df 35 34
50:     col 0; len 10; (10):  00 04 00 01 00 00 09 df 35 85
51:     col 0; len 10; (10):  00 04 00 01 00 00 09 df 35 d6
52:     col 0; len 10; (10):  00 04 00 01 00 00 09 df 36 27
53:     col 0; len 10; (10):  00 04 00 01 00 00 09 df 38 6c
54:     col 0; len 10; (10):  00 04 00 01 00 00 09 df 38 ef
55:     col 0; len 10; (10):  00 04 00 01 00 00 09 df 3a d0
56:     col 0; len 10; (10):  00 04 00 01 00 00 09 df 3c 4d
57:     col 0; len  9; ( 9):  00 04 00 01 00 00 09 df 3d
58:     col 0; len 10; (10):  00 04 00 01 00 00 09 df 3e 4b
59:     col 0; len 10; (10):  00 04 00 01 00 00 09 df 3f 96
60:     col 0; len 10; (10):  00 04 00 01 00 00 09 df 40 7d

61:     col 0; len 10; (10):  00 05 57 bc b9 db 00 00 00 00

62:     col 0; len 10; (10):  00 06 00 01 00 00 09 df 32 5b
63:     col 0; len 10; (10):  00 06 00 01 00 00 09 df 33 a6
64:     col 0; len 10; (10):  00 06 00 01 00 00 09 df 36 4f
65:     col 0; len 10; (10):  00 06 00 01 00 00 09 df 38 13
66:     col 0; len 10; (10):  00 06 00 01 00 00 09 df 3a 09
67:     col 0; len 10; (10):  00 06 00 01 00 00 09 df 3b cd
68:     col 0; len 10; (10):  00 06 00 01 00 00 09 df 3d 91
69:     col 0; len 10; (10):  00 06 00 01 00 00 09 df 3f 23

As you can see at leaf block 61 we didn’t quite empty the reusable list from freepool 2 (00 05 = 2 * 2 + 1), and leaf blocks 25 to 40 tell us that freepool 1 (00 03 = 2 * 1 + 1) was the freepool used on the last big delete. Despite the odd little glitches it looks as if this strategy of “deleted LOBs go to my process’ freepool” seems to do a good job of reusing index space.

But there IS a problem. Here’s the output from a script I wrote using the dbms_space package to show how space in the LOB segment has been used:


Unformatted                   :    4,508 /   36,929,536
Freespace 1 (  0 -  25% free) :        0 /            0
Freespace 2 ( 25 -  50% free) :        0 /            0
Freespace 3 ( 50 -  75% free) :        0 /            0
Freespace 4 ( 75 - 100% free) :        0 /            0
Full                          :    8,000 /   65,536,000

Segment Total blocks: 12672
Object Unused blocks: 0

The LOB segment has grown from an initial 8,192 blocks with a few unformatted blocks and 8,000 used blocks (2 blocks per LOB, 4,000 LOBs) to 12,672 blocks with 4,508 blocks unformatted. (The difference between Full + Unformatted and Segment Total blocks is the set of bitmap space management blocks for the segment). After only 12 cycles we have “leaked” an overhead of 50% of our real data space – maybe this helps to explain why the client that started me down this set of blogs has seen Oracle allocate 700GB to hold just 200GB of LOBs.

The tablespace is declared as locally managed with 1MB uniform extents and automatic segment space management. By writing a simple script I can get Oracle to write a script to dump the first block of each extent – and they will all be Level 1 bitmap space management blocks. Running grep against the trace file I can pick out the lines that tell me how many data blocks are mapped by the bitmap and how many of them have been formatted or not. This is the result (I have 99 extents in the segment – 99 * 128 = 12,672):

   unformatted: 0       total: 64        first useful block: 4
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 62      total: 64        first useful block: 2
   unformatted: 62      total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 24      total: 64        first useful block: 2
   unformatted: 52      total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 26      total: 64        first useful block: 2
   unformatted: 42      total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 42      total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 26      total: 64        first useful block: 2
   unformatted: 62      total: 64        first useful block: 2
   unformatted: 62      total: 64        first useful block: 2
   unformatted: 62      total: 64        first useful block: 2
   unformatted: 62      total: 64        first useful block: 2
   unformatted: 62      total: 64        first useful block: 2
   unformatted: 62      total: 64        first useful block: 2
   unformatted: 62      total: 64        first useful block: 2
   unformatted: 62      total: 64        first useful block: 2
   unformatted: 62      total: 64        first useful block: 2
   unformatted: 62      total: 64        first useful block: 2
   unformatted: 62      total: 64        first useful block: 2
   unformatted: 62      total: 64        first useful block: 2
   unformatted: 62      total: 64        first useful block: 2
   unformatted: 62      total: 64        first useful block: 2
   unformatted: 127     total: 128       first useful block: 1
   unformatted: 2       total: 128       first useful block: 1
   unformatted: 2       total: 128       first useful block: 1
   unformatted: 1       total: 128       first useful block: 1
   unformatted: 125     total: 128       first useful block: 1
   unformatted: 125     total: 128       first useful block: 1
   unformatted: 125     total: 128       first useful block: 1
   unformatted: 1       total: 128       first useful block: 1
   unformatted: 2       total: 128       first useful block: 1
   unformatted: 2       total: 128       first useful block: 1
   unformatted: 125     total: 128       first useful block: 1
   unformatted: 125     total: 128       first useful block: 1
   unformatted: 108     total: 128       first useful block: 1
   unformatted: 1       total: 128       first useful block: 1
   unformatted: 2       total: 128       first useful block: 1
   unformatted: 2       total: 128       first useful block: 1
   unformatted: 105     total: 128       first useful block: 1
   unformatted: 96      total: 128       first useful block: 1
   unformatted: 125     total: 128       first useful block: 1
   unformatted: 2       total: 128       first useful block: 1
   unformatted: 2       total: 128       first useful block: 1
   unformatted: 1       total: 128       first useful block: 1
   unformatted: 125     total: 128       first useful block: 1
   unformatted: 125     total: 128       first useful block: 1
   unformatted: 38      total: 128       first useful block: 1
   unformatted: 2       total: 128       first useful block: 1
   unformatted: 1       total: 128       first useful block: 1
   unformatted: 2       total: 128       first useful block: 1
   unformatted: 65      total: 128       first useful block: 1
   unformatted: 98      total: 128       first useful block: 1
   unformatted: 125     total: 128       first useful block: 1
   unformatted: 0       total: 128       first useful block: 1
   unformatted: 0       total: 128       first useful block: 1
   unformatted: 0       total: 128       first useful block: 1
   unformatted: 125     total: 128       first useful block: 1
   unformatted: 125     total: 128       first useful block: 1

You’ll notice that for the first 63 extents Oracle says there are 64 blocks mapped by the bitmap with the first useful block at block 2, thereafter it says there are 128 blocks and the first useful block is block 1 (Oracle is counting from zero). While Oracle thinks the segment is “quite small” it allocates two level 1 bitmap blocks per 1MB extent and I’ve only dumped the first block from each extent; but when the segment reaches 64MB Oracle decides that it’s getting pretty big and there’s no point in wasting space so changes to using a single level 1 bitmap block per 1MB extent. It’s just one of those tiny details you discover when you happen to look a little closely and how things work. (On a much larger test with 8MB uniform extents Oracle got to the point where it was using one L1 bitmap block for the whole 8MB.)

There’s a fascinating pattern in the later extents of 3 full extents followed by 3 empty extents – my first guess had been that Oracle was allocating new extents but not using them, but clearly that’s not right, it’s removing “reusable chunks” from the index and then not re-using them but using the new extents instead (some of the time). Something is seriously wrong with the way Oracle is handling the “reusable chunks” part of the index. With a little luck it’s some nasty side effect of the “one process delete / multiple process insert” strategy we have adopted, so: (a) we need to repeat the entire experiment with a concurrent delete mechanism and (b) we need to think about how we might re-engineer a REALLY BIG system that has followed this unfortunate strategy for a long time. Of course if (a) turns out to be a disaster as well we don’t really need to think too hard about (b) until we have discovered a good way of dealing with our rolling pattern of inserts and deletes.

Some (minimum effort, we hope) ideas we will have to look at for (b):

  • Oracle has an option to change the freepools count on a LOB segment – do we need to use it, how much work would it entail, would it require downtime
  • Oracle has an option to “rebuild” the freepools on a LOB segment
  • We can always try the “shrink space compact” option on the LOB
  • Should we just rebuild (move) the LOB segment – and use a larger extent size while we’re at it
  • Should we recreate the table and change the LOB to Securefiles as we do so – and do all the testing all over again
  • If we’re deleting old data on such a regular pattern should we try to bypass the deletes by partitioning the table in some clever way

TO BE CONTINUED.

 


Space Usage

Sun, 2016-09-11 12:09

Here’s a simple script that I’ve used for many years to check space usage inside segments.  The comment about freelist groups may be out of date  – I’ve not had to worry about that for a very long time. There is a separate script for securefile lobs.


rem
rem     Script:         dbms_space_use.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Nov 2002
rem     Purpose:
rem
rem     Last tested 
rem             12.1.0.1
rem             11.2.0.4
rem             11.1.0.7
rem             10.2.0.3
rem              9.2.0.8
rem     Not tested
rem             10.2.0.5
rem     Not relevant
rem              8.1.7.4
rem
rem     Notes:
rem     For accuracy in free space you (once) needed to set the
rem     scan limit; and for those rare objects cases where you 
rem     had defined multiple freelist groups you still have to
rem     work through each free list group in turn
rem
rem     For the ASSM calls:
rem             FS1     => 0% - 25% free space
rem             FS2     => 25% - 50% free space
rem             FS3     => 50% - 75% free space
rem             FS4     => 75% - 100% free space
rem             Bytes = blocks * block size
rem
rem     Expected errors:
rem             ORA-10614: Operation not allowed on this segment
rem                     (MSSM segment, ASSM call)
rem             ORA-10618: Operation not allowed on this segment
rem                     (ASSM segment, MSSM call)
rem             ORA-03200: the segment type specification is invalid
rem                     (e.g. for LOBINDEX or LOBSEGMENT)
rem                     11g - "LOB" is legal for LOB segments
rem                         - use "INDEX" for the LOBINDEX
rem
rem     For indexes
rem             Blocks are FULL or FS2 (re-usable)
rem
rem     Special case: LOB segments.
rem     The number of blocks reported by FS1 etc. is actually the
rem     number of CHUNKS in use (and they're full or empty). So 
rem     if your CHUNK size is not the same as your block size the
rem     total "blocks" used doesn't match the number of blocks 
rem     below the HWM.
rem
rem     The package dbms_space is created by dbmsspu.sql
rem     and the body is in prvtspcu.plb
rem
rem     11.2 overloads dbms_space.space_usage for securefile lobs
rem     See dbms_space_use_sf.sql
rem


define m_seg_owner      = &1
define m_seg_name       = &2
define m_seg_type       = &3

define m_segment_owner  = &m_seg_owner
define m_segment_name   = &m_seg_name
define m_segment_type   = &m_seg_type

@@setenv

spool dbms_space_use

prompt  ===================
prompt  Freelist management
prompt  ===================

declare
        wrong_ssm       exception;
        pragma exception_init(wrong_ssm, -10618);

        m_free  number(10);
begin
        dbms_space.free_blocks(
                segment_owner           => upper('&m_segment_owner'),
                segment_name            => upper('&m_segment_name'),
                segment_type            => upper('&m_segment_type'),
--              partition_name          => null,
--              scan_limit              => 50,
                freelist_group_id       => 0,
                free_blks               => m_free
        );
        dbms_output.put_line('Free blocks below HWM: ' || m_free);
exception
        when wrong_ssm then
                dbms_output.put_line('Segment not freelist managed');
end;
/


prompt  ====
prompt  ASSM
prompt  ====

declare
        wrong_ssm       exception;
        pragma exception_init(wrong_ssm, -10614);

        m_UNFORMATTED_BLOCKS    number;
        m_UNFORMATTED_BYTES     number;
        m_FS1_BLOCKS            number;
        m_FS1_BYTES             number;
        m_FS2_BLOCKS            number;  
        m_FS2_BYTES             number;

        m_FS3_BLOCKS            number;
        m_FS3_BYTES             number;
        m_FS4_BLOCKS            number; 
        m_FS4_BYTES             number;
        m_FULL_BLOCKS           number;
        m_FULL_BYTES            number;

begin
        dbms_space.SPACE_USAGE(
                upper('&m_segment_owner'),
                upper('&m_segment_name'),
                upper('&m_segment_type'),
--              PARTITION_NAME                  => null,
                m_UNFORMATTED_BLOCKS,
                m_UNFORMATTED_BYTES, 
                m_FS1_BLOCKS , 
                m_FS1_BYTES,
                m_FS2_BLOCKS,  
                m_FS2_BYTES,
                m_FS3_BLOCKS,  
                m_FS3_BYTES,
                m_FS4_BLOCKS,  
                m_FS4_BYTES,
                m_FULL_BLOCKS, 
                m_FULL_BYTES
        );

        dbms_output.new_line;
        dbms_output.put_line('Unformatted                   : ' || to_char(m_unformatted_blocks,'999,990') || ' / ' || to_char(m_unformatted_bytes,'999,999,990'));
        dbms_output.put_line('Freespace 1 (  0 -  25% free) : ' || to_char(m_fs1_blocks,'999,990') || ' / ' || to_char(m_fs1_bytes,'999,999,990'));
        dbms_output.put_line('Freespace 2 ( 25 -  50% free) : ' || to_char(m_fs2_blocks,'999,990') || ' / ' || to_char(m_fs2_bytes,'999,999,990'));
        dbms_output.put_line('Freespace 3 ( 50 -  75% free) : ' || to_char(m_fs3_blocks,'999,990') || ' / ' || to_char(m_fs3_bytes,'999,999,990'));
        dbms_output.put_line('Freespace 4 ( 75 - 100% free) : ' || to_char(m_fs4_blocks,'999,990') || ' / ' || to_char(m_fs4_bytes,'999,999,990'));
        dbms_output.put_line('Full                          : ' || to_char(m_full_blocks,'999,990') || ' / ' || to_char(m_full_bytes,'999,999,990'));

exception
        when wrong_ssm then
                dbms_output.put_line('Segment not ASSM');
end;
.
/


prompt  =======
prompt  Generic
prompt  =======

declare
        m_TOTAL_BLOCKS                  number;
        m_TOTAL_BYTES                   number;
        m_UNUSED_BLOCKS                 number;
        m_UNUSED_BYTES                  number;
        m_LAST_USED_EXTENT_FILE_ID      number;
        m_LAST_USED_EXTENT_BLOCK_ID     number;
        m_LAST_USED_BLOCK               number;
begin
        dbms_space.UNUSED_SPACE(
                upper('&m_segment_owner'),
                upper('&m_segment_name'),
                upper('&m_segment_type'),
--              PARTITION_NAME                  => null,
                m_TOTAL_BLOCKS,
                m_TOTAL_BYTES, 
                m_UNUSED_BLOCKS,  
                m_UNUSED_BYTES,
                m_LAST_USED_EXTENT_FILE_ID, 
                m_LAST_USED_EXTENT_BLOCK_ID,
                m_LAST_USED_BLOCK
        );

        dbms_output.put_line('Segment Total blocks: '  || m_total_blocks);
        dbms_output.put_line('Object Unused blocks: '  || m_unused_blocks);

end;
.
/


spool off


Here’s a sample of output (from a segment using ASSM):


===================
Freelist management
===================
Segment not freelist managed

PL/SQL procedure successfully completed.

====
ASSM
====

Unformatted                   :  132,385 / ############
Freespace 1 (  0 -  25% free) :        0 /            0
Freespace 2 ( 25 -  50% free) :        0 /            0
Freespace 3 ( 50 -  75% free) :        0 /            0
Freespace 4 ( 75 - 100% free) :        0 /            0
Full                          :   12,327 /  100,982,784

PL/SQL procedure successfully completed.

=======
Generic
=======
Segment Total blocks: 145920
Object Unused blocks: 0

PL/SQL procedure successfully completed.

(Looks like I should have increased the length of the byte-count output over the years ;)


Parallel_index hint

Wed, 2016-08-31 07:28

Prompted by a recent OTN posting I’ve dug out from my library the following demonstration of an anomalty with the parallel_index() hint. This note is a warning about  how little we understand hints and what they’re supposed to mean, and how we can be caught out by an upgrade. We’ll start with a data set which, to match a comment made in the origina posting rather than being a necessity for the demonstration, has an index that I’ve manipulated to be larger than the underlying table:


rem
rem     Script:         parallel_index_hint_bug.sql
rem     Author:         Jonathan Lewis
rem     Dated:          December 2005
rem

create table t1
nologging
as
with generator as (
        select  --+ materialize
                rownum id
        from dual
        connect by
                level <= 1e4
)
select
        rownum                  id,
        mod(rownum,1e4)         modded,
        lpad(rownum,10,'0')     v1,
        lpad('x',30,'x')        padding
from
        generator       v1,
        generator       v2
where
        rownum <= 1e6
;

begin
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          =>'T1',
                method_opt       => 'for all columns size 1'
        );
end;
/

create index t1_i1 on t1(modded) pctfree 75;
alter table t1 modify modded not null;

Your figures may vary slightly if you try to recreate this model, but according to my stats blocks=7876 for the table and leaf_blocks=8054 for the index. Now I’m going to generate the execution plans for a simple query – count(modded) from the table – with a varying selection of hints.


select /*+ index_ffs(t1 t1_i1) */ count(modded) from t1;

select /*+ parallel_index(t1 t1_i1 20) */ count(modded) from t1;

select /*+ index_ffs(t1 t1_i1) parallel_index(t1 t1_i1 20) */ count(modded) from t1;

select /*+ parallel_index(t1 t1_i1 20) parallel(t1 10) */ count(modded) from t1;

Pause for thought

  • which of the four statements will have a plan that uses an index fast full scan ?
  • which of the four statements will have a plan that indicates parallel execution ?
  • which of the four statements will indicate a parallel index fast full scan ?
  • why are two of the plans going to be identical but with different costs ?

The most interesting bit of this note is in the last question because it’s also the answer to a more subtle “why didn’t Oracle do what I thought it should” question. Here are the four plans I got from an instance of 11.2.0.4 (with a little labelling to remind us about the hinting):


==================================================
index_ffs() on it's own - get index fast full scan
==================================================

-----------------------------------------------------------------------
| Id  | Operation             | Name  | Rows  | Cost (%CPU)| Time     |
-----------------------------------------------------------------------
|   0 | SELECT STATEMENT      |       |     1 |  1043   (4)| 00:00:06 |
|   1 |  SORT AGGREGATE       |       |     1 |            |          |
|   2 |   INDEX FAST FULL SCAN| T1_I1 |  1000K|  1043   (4)| 00:00:06 |
-----------------------------------------------------------------------

=====================================================
parallel_index() on it's own
Get serial tablescan which is cheaper than serial FFS
=====================================================

-------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Cost (%CPU)| Time     |
-------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |  1027   (4)| 00:00:06 |
|   1 |  SORT AGGREGATE    |      |     1 |            |          |
|   2 |   TABLE ACCESS FULL| T1   |  1000K|  1027   (4)| 00:00:06 |
-------------------------------------------------------------------

========================================================
parallel_index() with index_ffs()
Get parallel fast full scan - at same cost as SERIAL FFS
========================================================

-----------------------------------------------------------------------------------------------------------
| Id  | Operation                 | Name     | Rows  | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
-----------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT          |          |     1 |  1043   (4)| 00:00:06 |        |      |            |
|   1 |  SORT AGGREGATE           |          |     1 |            |          |        |      |            |
|   2 |   PX COORDINATOR          |          |       |            |          |        |      |            |
|   3 |    PX SEND QC (RANDOM)    | :TQ10000 |     1 |            |          |  Q1,00 | P->S | QC (RAND)  |
|   4 |     SORT AGGREGATE        |          |     1 |            |          |  Q1,00 | PCWP |            |
|   5 |      PX BLOCK ITERATOR    |          |  1000K|  1043   (4)| 00:00:06 |  Q1,00 | PCWC |            |
|   6 |       INDEX FAST FULL SCAN| T1_I1    |  1000K|  1043   (4)| 00:00:06 |  Q1,00 | PCWP |            |
-----------------------------------------------------------------------------------------------------------

===============================================
parallel_index() with parallel()
Get parallel fast full scan - costed correctly.
(Not costed at the degree given for table).
===============================================

-----------------------------------------------------------------------------------------------------------
| Id  | Operation                 | Name     | Rows  | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
-----------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT          |          |     1 |    58   (4)| 00:00:01 |        |      |            |
|   1 |  SORT AGGREGATE           |          |     1 |            |          |        |      |            |
|   2 |   PX COORDINATOR          |          |       |            |          |        |      |            |
|   3 |    PX SEND QC (RANDOM)    | :TQ10000 |     1 |            |          |  Q1,00 | P->S | QC (RAND)  |
|   4 |     SORT AGGREGATE        |          |     1 |            |          |  Q1,00 | PCWP |            |
|   5 |      PX BLOCK ITERATOR    |          |  1000K|    58   (4)| 00:00:01 |  Q1,00 | PCWC |            |
|   6 |       INDEX FAST FULL SCAN| T1_I1    |  1000K|    58   (4)| 00:00:01 |  Q1,00 | PCWP |            |
-----------------------------------------------------------------------------------------------------------

First plan – we supplied the index_ffs() hint, it’s valid: so Oracle did an index fast full scan. The serial cost was 1,043.

Second plan – we supplied the parallel_index() hint: but the optimizer apparently didn’t consider the cost of a parallel index fast full scan (which ought to have cost something like (1043/20)/0.9 = 58; instead it chose the serial tablescan at a cost of 1027 (cheaper than a serial index fast full scan because of the way I had defined a very large index).

Third plan – we supplied the parallel_index() hint with an explicit index_ffs() hint: both were legal so the optimizer obeyed the hints and produced a parallel index fast full scan (which is what we might have hoped would happen automatically for the second query). But the cost of the query is 1,043 – the cost of the serial index fast full scan.

Final plan – we didn’t hint an index_ffs() we hinted parallel() and parallel_index(): “strangely” Oracle has selected the parallel index fast full scan – and this time the cost is 58 (which, as I pointed out above, is (1043/20)/0.9, the value it’s “supposed” to be). It looks as if Oracle uses the parallel scaling factor on the index only if the table is ALSO hinted (or defined) to be treated to parallel execution.

Note, by the way, that I’ve hinted the table parallel 10 so that the cost of the parallel tablescan (ca. (1027/10)/0.9 = 114) will be larger than the cost of the parallel index fast full scan. Normally you’re likely to have the same degree hinted for table and index, and the index is likely to be smaller than the table which means that if you see this anomaly in production it’s more likely to be a case of Oracle using a parallel tablescan when you’re expecting a parallel index fast full scan. Oracle will have compared the serial cost of an index fast full scan against the parallel cost of the table.

Footnote:

When I ran the script on 12c there was one important change. The cost of the “correctly costed” index fast full scan came out at 103 rather than 58. Oracle has used the degree from the parallel() hint for the tablescan to calculate the cost of the parallel index fast full scan. If you are unlucky enough to have some code that has conflicting degrees hinted (or defined) across tables and indexes then you may see some plans change because parallel index fast full scans suddenly change their cost.

Addendum (prompted by Randolf Geist’s comment): The plan in 12c also had a Note: “Degree of Parallelism is 4 because of table property”. This was a little odd because the degree of the table was 1 and the hint said 10, but the arithmetic of the tablescan had ( as indicated) used a DOP of 4 – despite reporting Degree=0 in the 10053 trace file. The only way I could get the number 4 was to look at the CPU parameters – bearing in mind the critical parallel parameters:

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
cpu_count                            integer     2
parallel_threads_per_cpu             integer     2

parallel_adaptive_multi_user         boolean     TRUE
parallel_automatic_tuning            boolean     FALSE
parallel_degree_limit                string      CPU
parallel_degree_policy               string      MANUAL
parallel_io_cap_enabled              boolean     FALSE

The anomaly of “degree 10” appearing in the costing for the index fast full scan was echoed in the 10053 trace: the file reported “Degree: 20”, then showed the cost being divided by 10.

At run-time, though, the query DID run parallel 20 when hinted with both the parallel() and parallel_index() hints, and ran parallel 10 (while costing and repeating the note about parallel 4) when modified to have only the /*+ parallel(t1 10) */ hint.


Basicfile LOBs 4

Wed, 2016-08-24 13:02

At the end of the previous installment we saw that a single big batch delete would (apparently) attach all the “reusable” chunks into a single freepool, and asked the questions:

  • Why would the Oracle developer think that this use of one freepool is a good idea ?
  • Why might it be a bad idea ?
  • What happens when we start inserting more data ?

(Okay, I’ll admit it, the third question is a clue about the answer to the second question.)

I find that this process of asking “what’s good, what’s bad, what could possibly go wrong” is an excellent way to prompt thoughts about why Oracle Corp. might have chosen a particular strategy and what that means in terms of the best (or expected) use of the feature and worst threats from misuse of the feature. So lets’s see what thoughts we can come up with.

  • Good idea: The only alternative to using a single freepool when you make chunks reusable is to spread the chunks uniformly across all the freepools – either putting the chunks onto the same free pool that the LOB was previously attached to or doing some sort of round-robin. If you go for either of these fair-share strategies you increase the amount of contention on LOB deletes if many users are deleting at the same time – which sounds like someething you might want to avoid, but LOBs are supposed to be fairly static (somewhere on MoS there’s a note that says the expected behaviour is pretty much: “we thought you’d write once, read many, and not update”) so surely a small amount of contention shouldn’t be a big problem
  • Bad idea: As mentioned in a previous post, it looks like the freepool picked by a process is dependent on the process id – so if you happen to have just a couple of processes doing large deletes they might, coincidentally, pick the same freepool and end up constantly contending with each other rather than drifting in and out of collisions. If, as often happens with archive-like processes, you use one or two processes to delete a large fraction of the data you end up with one or two freepools holding lots of reusable space and all the other freepools holding no freespace – which brings us to the third question.
  • What happens next: Let’s say 3% of your LOB (one day out of a month) is currently “reusable chunks” and the chunks are all attached to the same freepool; your process connects to insert some new LOBs and its process id identifies the wrong freepool. There are no free blocks below the highwater mark and the retention limit is long gone. Does your process (a) add an extent to create some more free space (this is the type of thing that used to happen with manual segment space management, freelist groups and freelists for non-LOB tables and indexes) or (b) start stealing from another freepool that has reusable chunks. In either case what’s going to happen in the longer term ?
  • What happens even later: Imagine you have 28 days of data and use a single process to delete data on the 29th day. For reasons of concurrency you have been running with freepools 20. If option (a) applies then (assuming everything works perfectly) at steady state you will end up with roughly 20 days worth of reusable chunks spread across your 20 freepools before the system stabilises and stops adding unnecessary extents; if option (b) applies then (assuming everything works perfectly) every night you put a load of reusable chunks on one freepool and all through the day your 20 processes are fighting (at the oldest end of the index) to reuse those chunks. I said in an earlier installment that multiple freepools got rid of “the two hot spots” – this single thread deletion strategy has just brought one of them back.

So what really happens ? By the end of the last installment I had deleted the oldest 3,000 LOBs and found them attached as reusable chunks in freepool 2 with several consecutive “empty”  (nrows=81, rrows=0) leaf blocks at the low end of all the other pools.  After running my 4 concurrent processes to insert 750 rows each (i.e. insert the replacements for the 3,000 rows I’ve deleted) this is what the index treedump looks like (with a little editing to show the main breaks between freepools):


----- begin tree dump
branch: 0x1800204 25166340 (0: nrow: 60, level: 1)
   leaf: 0x180020e 25166350 (-1: nrow: 22 rrow: 22)
   leaf: 0x1800212 25166354 (0: nrow: 76 rrow: 76)
   leaf: 0x1800216 25166358 (1: nrow: 81 rrow: 81)
   leaf: 0x180021a 25166362 (2: nrow: 74 rrow: 74)
   leaf: 0x1800239 25166393 (3: nrow: 81 rrow: 81)
   leaf: 0x180023d 25166397 (4: nrow: 81 rrow: 81)
   leaf: 0x1800206 25166342 (5: nrow: 81 rrow: 81)
   leaf: 0x180020a 25166346 (6: nrow: 81 rrow: 81)
   leaf: 0x180021e 25166366 (7: nrow: 81 rrow: 81)
   leaf: 0x1800222 25166370 (8: nrow: 81 rrow: 81)
   leaf: 0x180022a 25166378 (9: nrow: 81 rrow: 81)
   leaf: 0x180022e 25166382 (10: nrow: 78 rrow: 78)
   leaf: 0x1800232 25166386 (11: nrow: 151 rrow: 151)
---
   leaf: 0x1800226 25166374 (12: nrow: 0 rrow: 0)
   leaf: 0x180020f 25166351 (13: nrow: 64 rrow: 64)
   leaf: 0x1800213 25166355 (14: nrow: 77 rrow: 77)
   leaf: 0x1800217 25166359 (15: nrow: 81 rrow: 81)
   leaf: 0x1800261 25166433 (16: nrow: 81 rrow: 81)
   leaf: 0x1800265 25166437 (17: nrow: 81 rrow: 81)
   leaf: 0x1800269 25166441 (18: nrow: 81 rrow: 81)
   leaf: 0x180026d 25166445 (19: nrow: 81 rrow: 81)
   leaf: 0x1800271 25166449 (20: nrow: 81 rrow: 81)
   leaf: 0x1800275 25166453 (21: nrow: 81 rrow: 81)
   leaf: 0x1800279 25166457 (22: nrow: 81 rrow: 81)
   leaf: 0x180027d 25166461 (23: nrow: 81 rrow: 81)
   leaf: 0x1800242 25166402 (24: nrow: 122 rrow: 122)
---
   leaf: 0x1800229 25166377 (25: nrow: 0 rrow: 0)
   leaf: 0x1800214 25166356 (26: nrow: 36 rrow: 36)
   leaf: 0x1800230 25166384 (27: nrow: 81 rrow: 81)
   leaf: 0x1800238 25166392 (28: nrow: 81 rrow: 81)
   leaf: 0x180023c 25166396 (29: nrow: 81 rrow: 81)
   leaf: 0x1800225 25166373 (30: nrow: 81 rrow: 81)
   leaf: 0x180022d 25166381 (31: nrow: 75 rrow: 75)
   leaf: 0x1800231 25166385 (32: nrow: 81 rrow: 81)
   leaf: 0x1800235 25166389 (33: nrow: 81 rrow: 81)
   leaf: 0x180022b 25166379 (34: nrow: 81 rrow: 81)
   leaf: 0x180022f 25166383 (35: nrow: 81 rrow: 81)
   leaf: 0x1800233 25166387 (36: nrow: 81 rrow: 81)
   leaf: 0x1800237 25166391 (37: nrow: 134 rrow: 134)
---
   leaf: 0x1800215 25166357 (38: nrow: 1 rrow: 0)
   leaf: 0x180026e 25166446 (39: nrow: 4 rrow: 0)
   leaf: 0x180021b 25166363 (40: nrow: 1 rrow: 0)
   leaf: 0x180024b 25166411 (41: nrow: 2 rrow: 0)
   leaf: 0x1800276 25166454 (42: nrow: 2 rrow: 0)
   leaf: 0x180024f 25166415 (43: nrow: 0 rrow: 0)
   leaf: 0x180027e 25166462 (44: nrow: 4 rrow: 0)
   leaf: 0x1800221 25166369 (45: nrow: 0 rrow: 0)
   leaf: 0x180027a 25166458 (46: nrow: 0 rrow: 0)
---
   leaf: 0x1800218 25166360 (47: nrow: 0 rrow: 0)
   leaf: 0x180021c 25166364 (48: nrow: 152 rrow: 0)
   leaf: 0x1800220 25166368 (49: nrow: 152 rrow: 0)
   leaf: 0x1800224 25166372 (50: nrow: 152 rrow: 0)
   leaf: 0x1800228 25166376 (51: nrow: 152 rrow: 72)
   leaf: 0x180022c 25166380 (52: nrow: 152 rrow: 152)
   leaf: 0x1800234 25166388 (53: nrow: 152 rrow: 152)
   leaf: 0x1800253 25166419 (54: nrow: 152 rrow: 152)
   leaf: 0x1800257 25166423 (55: nrow: 152 rrow: 152)
   leaf: 0x180025b 25166427 (56: nrow: 152 rrow: 152)
   leaf: 0x180025f 25166431 (57: nrow: 152 rrow: 152)
   leaf: 0x1800263 25166435 (58: nrow: 1 rrow: 1)
----- end tree dump

Highlights

The number of leaf blocks has dropped from 72 to 60 – I didn’t think that this could happen without an index coalesce or rebuild, but maybe it’s a special feature of LOBINDEXes or maybe it’s a new feature of B-trees in general that I hadn’t noticed. Some of the “known empty” leaf blocks seem to have been taken out of the structure.

We still see the half full / full split between the leaf blocks for the first 3 freepools when compared to the top freepool.

There are still some empty leaf blocks (rrow = 0), but apart from the top freepool no more than one per freepool for the other sections that are indexing LOBs.

The section of index that is the freepool 2 section for “reusable” chunks shows an interesting anomaly. There are some leafblocks that are now empty (rrow=0) but were only holding a few index entries (nrow=1-4 rather than the 75 – 140 entries that we saw in the previous installment) at the moment they were last updated; this suggests a certain level of contention with problems of read-consistency, cleanout, and locking between processes trying to reclaim reusable blocks.

It’s just slightly surprising the the top freepool shows several empty leaf blocks – is this just a temporary coincidence, or a boundary case that means the blocks will never be cleaned and re-used; if it’s a fluke will a similar fluke also reappear (eventually) on the other freepools. Is it something to do with the fact that freepool 2 happened to be the freepool that got the first lot of reusable chunks ? Clearly we need to run a few more cycles of deletes and inserts to see what happens.

We have one important conclusion to make but before we make it let’s look at the partial key “col 0” values in the row directory of the root block just to confirm that the breaks I’ve listed above do correspond to each of the separate freepool sections:

 0:     col 0; len 10; (10):  00 00 00 01 00 00 09 db 09 8f
 1:     col 0; len ..; (..):  00 00 00 01 00 00 09 db 0b
 2:     col 0; len 10; (10):  00 00 00 01 00 00 09 db 0b bc
 3:     col 0; len ..; (..):  00 00 00 01 00 00 09 db 0d
 4:     col 0; len 10; (10):  00 00 00 01 00 00 09 db 0d 51
 5:     col 0; len 10; (10):  00 00 00 01 00 00 09 db bf f4
 6:     col 0; len 10; (10):  00 00 00 01 00 00 09 db c0 77
 7:     col 0; len 10; (10):  00 00 00 01 00 00 09 db c1 90
 8:     col 0; len 10; (10):  00 00 00 01 00 00 09 db c2 77
 9:     col 0; len 10; (10):  00 00 00 01 00 00 09 db c2 fa
10:     col 0; len 10; (10):  00 00 00 01 00 00 09 db c4 45
11:     col 0; len ..; (..):  00 00 00 01 00 00 09 db c5

12:     col 0; len 10; (10):  00 02 00 01 00 00 09 da fb 74
13:     col 0; len 10; (10):  00 02 00 01 00 00 09 db 08 d9
14:     col 0; len 10; (10):  00 02 00 01 00 00 09 db 09 c0
15:     col 0; len ..; (..):  00 02 00 01 00 00 09 db 0b
16:     col 0; len 10; (10):  00 02 00 01 00 00 09 db 0b ee
17:     col 0; len 10; (10):  00 02 00 01 00 00 09 db bf 8b
18:     col 0; len 10; (10):  00 02 00 01 00 00 09 db c0 a4
19:     col 0; len 10; (10):  00 02 00 01 00 00 09 db c2 21
20:     col 0; len 10; (10):  00 02 00 01 00 00 09 db c3 6c
21:     col 0; len 10; (10):  00 02 00 01 00 00 09 db c4 21
22:     col 0; len 10; (10):  00 02 00 01 00 00 09 db c5 9e
23:     col 0; len 10; (10):  00 02 00 01 00 00 09 db c6 53
24:     col 0; len 10; (10):  00 02 00 01 00 00 09 db c6 d6

25:     col 0; len 10; (10):  00 04 00 01 00 00 09 da fd fb
26:     col 0; len 10; (10):  00 04 00 01 00 00 09 db 08 38
27:     col 0; len 10; (10):  00 04 00 01 00 00 09 db 0a 19
28:     col 0; len ..; (..):  00 04 00 01 00 00 09 db 0b
29:     col 0; len 10; (10):  00 04 00 01 00 00 09 db 0c 7d
30:     col 0; len 10; (10):  00 04 00 01 00 00 09 db bc 64
31:     col 0; len 10; (10):  00 04 00 01 00 00 09 db bc b5
32:     col 0; len ..; (..):  00 04 00 01 00 00 09 db bd
33:     col 0; len 10; (10):  00 04 00 01 00 00 09 db bd 51
34:     col 0; len 10; (10):  00 04 00 01 00 00 09 db bd a2
35:     col 0; len 10; (10):  00 04 00 01 00 00 09 db bd f3
36:     col 0; len 10; (10):  00 04 00 01 00 00 09 db be 44
37:     col 0; len 10; (10):  00 04 00 01 00 00 09 db be 95

38:     col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
39:     col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
40:     col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
41:     col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
42:     col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
43:     col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
44:     col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
45:     col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
46:     col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00

47:     col 0; len 10; (10):  00 06 00 01 00 00 09 da fe d4
48:     col 0; len 10; (10):  00 06 00 01 00 00 09 db 00 ca
49:     col 0; len 10; (10):  00 06 00 01 00 00 09 db 03 24
50:     col 0; len 10; (10):  00 06 00 01 00 00 09 db 05 4c
51:     col 0; len 10; (10):  00 06 00 01 00 00 09 db 07 a6
52:     col 0; len ..; (..):  00 06 00 01 00 00 09 db 0a
53:     col 0; len 10; (10):  00 06 00 01 00 00 09 db 0c 5a
54:     col 0; len 10; (10):  00 06 00 01 00 00 09 db bf da
55:     col 0; len 10; (10):  00 06 00 01 00 00 09 db c1 6c
56:     col 0; len 10; (10):  00 06 00 01 00 00 09 db c2 cc
57:     col 0; len 10; (10):  00 06 00 01 00 00 09 db c4 90
58:     col 0; len 10; (10):  00 06 00 01 00 00 09 db c6 22

I’ve broken the list and numbered the entries to match the treedump above, so it’s each to check that leaf blocks 38 to 46 are the now empty blocks for the reusable chunks. We started the reload with 3,001 entries for reusable chunks all in one freepool; we’ve ended it with none. Something has “stolen” the reusable chunks from freepool 2 so that they could be used for creating new LOBs that were distributed across all the freepools.

Oracle has been very efficient about re-using the index space, with a little bit of wastage creeping in, perhaps caused by coincidences in timing, perhaps by some code that avoids waiting too long when trying to lock index entries. We have a contention point because of the single threaded delete – but it doesn’t appear to be a disaster for space utilisation. Of course we need to look at the level of contention, and repeat the cycle a few times, changing the freepool used for deletion fairly randomly to see if we just got lucky or if the first few deletes are special cases. We can also ask questions about how the “stealing” takes place – does a process steal one index entry at a time, or does it take several consecutive index entries from the same block while it’s got the leaf block locked – but perhaps we don’t really need to know the fine details, the amount of time spent in contention (typically TX waits of some sort) could tell use whether or not we had a significant problem.

Contention and Resources

For each of the processes running the inserts I took a couple of snapshots – session stats and wait events – to see if anything interesting showed up. Naturally, the closer you look the more strange things you find. Here are a few sets of numbers from v$session_event and v$sesstat (in my snapshot format – with the four sessions always reported in the same order);

Event                                             Waits   Time_outs           Csec    Avg Csec    Max Csec
-----                                             -----   ---------           ----    --------    --------
enq: HW - contention                                985           0          93.15        .095           1
enq: HW - contention                                 10           0           5.46        .546           1
enq: HW - contention                              1,001           0         102.27        .102           1
enq: HW - contention                              1,010           0         106.12        .105           1

db file sequential read                           1,038           0          40.75        .039           2
db file sequential read                              39           0           3.21        .082           1
db file sequential read                           1,038           0          28.33        .027           1
db file sequential read                           1,046           0          34.89        .033           1

Name                                                                     Value
----                                                                     -----
physical reads                                                           1,038
physical reads direct                                                      979

physical reads                                                              39
physical reads direct                                                       19

physical reads                                                           1,038
physical reads direct                                                      998

physical reads                                                           1,046
physical reads direct                                                    1,005

session logical reads                                                  114,060
session logical reads                                                   22,950
session logical reads                                                  104,555
session logical reads                                                   93,173

data blocks consistent reads - undo records applied                      2,165
data blocks consistent reads - undo records applied                        119
data blocks consistent reads - undo records applied                      1,222
data blocks consistent reads - undo records applied                        193

My first thought when looking at the wait events was to get an idea of where most of the time went, and I had expected the HW enqueue to be the most likely contender: this enqueue is held not only when the high water mark for a segment is moved, it’s also held when a process is doing any space management for inserting a LOB. So my first suprise was that one session was hardly waiting at all compared to the other sessions.

Then I noticed that this one session was also suffering a very small number of “db file sequential read” waits compared to every other session – but why were ANY sessions doing lots of db file sequential reads: the LOB was declared as nocache so any reads ought to be direct path reads and although Oracle doesn’t always have to wait for EVERY direct path read we should have read (and rewritten) 1,500 “reusable” LOB chunks by direct path reads in each session – I refuse to believe we never waited for ANY of them. So take a look at the session stats: which show us the that the “db file sequential read” waits match exactly with the “physical reads” count but most of the “physical reads” are recorded “physical reads direct” – Oracle is recording the wrong wait event while reading the “reusable” chunks.

Okay, so our direct path read waits are being recorded incorrectly: but one session does hardly any physical reads anyway – so what does that mean ? It means the process ISN’T reusing the chunks – you can’t be reusing chunks if you haven’t t read them. But the dumps from the index tell us that all the reusable chunks have been reused – so how do we resolve that contradiction ? Something is reading the index to identify some reusable chunks, wiping the reference from the index, then not using the chunks so (a) we’ve got some reusable chunks “going missing” and (b) we must be allocating some new chunks from somewhere – maybe bumping the high water mark of the segment, maybe allocating new extents.

Fortunately I had used the dbms_space package to check what the lob segment looked like after I had loaded it. It was 8192 blocks long, with 66 blocks shown as unused and 8,000 (that’s exactly 2 blocks/chunks per LOB) marked as full. After the delete/insert cycle is was 8,576 blocks long, with 8,000 blocks marked as full and 444 marked as unused. We had added three extents of 1MB each that we didn’t need, and one session seems to have avoided some contention by using the new extents for (most of) its LOBs rather than competing for the reusable space with the other LOBs.

Was this a one-off, or a repeatable event. How bad could it get ?

TO BE CONTINUED.

Post-script

Is there a way of discovering from SQL (perhaps with a low-cost PL/SQL function) the freepool for a LOB when it’s defined as Basicfile. You can get the LOBid for a Securefiles LOB using the dbms_lobutil package and the LOBid includes the critical first two bytes – but the package is only relevant to Securefiles. I rather fancy the idea of a process knowing which freepool it is associated with and only deleting LOBs that come out of that freepool.

 


Basicfile LOBS

Mon, 2016-08-22 11:56

In the previous article in this mini-series I described how the option for setting freepools N when defining Basicfile LOBs was a feature aimed at giving you improved concurrency for inserts and deletes that worked by splitting the LOBINDEX into 2N sections: N sections to index the current LOB chunks by LOB id, alternating with N sections to map the reusable LOB chunks by deletion time.

In this article we’ll look a little further into the lifecycle of the LOB segment but before getting into the details I’ll just throw out a couple of consequences of the basic behaviour of LOBs that might let you pick the best match for the workload you have to deal with.

  • If you have enabled storage in row the first 12 chunks of a lob will be identified by the LOB Locator stored in the row, so if all your LOBs are sized between 4KB and 96KB (approximately) the LOB Index will consist only of entries for the reusable LOB space due to deleted LOBs even though the LOBs themselves will be stored out of line. This makes it look like a good idea to enable storage in row even when you expect all of your (smallish) LOBs to be stored out of row.
  • It’s quite nice to cache LOBs (at least for a little while) if your pattern of use means you access a specific LOB for a little while before it ceases to be interesting; but LOBs can swamp a large fraction of the buffer cache if you’re not careful. If you expect to follow this pattern of behaviour you might define a RECYCLE cache and then assign the LOB to that cache so that you get the benefits of caching while still protecting the main volume of your buffer cache.
  • Depending on the expected size of your LOBs you may have a good justification for creating a tablespace of a non-standard size for the LOB segment so that it takes fewer block reads to read the entire LOB. If (for example) you have a LOB which is always in the range of 62KB then a tablespace with a blocksize of 32KB would be a good choice because the LOB could be read with just two block reads. A fringe benefit of the non-standard block size, of course, is that you have to define a non-standard cache, which separates the LOB activity from the rest of the buffer cache. (Note: Oracle reads LOBs one chunk at a time, so the number of LOB reads – as opposed to block reads – for a 32KB chunk is the same whether the block size is 8KB or 32KB)
  • If you’re going to be logging your LOBs then remember that nocache LOBs will write entire chunks into the redo logs – think about how much extra redo this might generate: it might be better to have a small recycle cache and cache your LOBS as cached LOBs are logged at the byte level. (You don’t want a 32KB block size, nocache, logging if your LOBs are all 33KB).
The LOB lifetime

Before deciding on the suitability of a feature the first thing to do is define what you’re trying to achieve so that you can think realistically about where the threats may be and what tests are going to be important – so I’m going to describe a scenario, then talk about what threats might appear based on the current details I’ve given about Basicfile LOBs and freepools.

  • We have many processes inserting “small” (16KB to 24KB) LOBs concurrently in bursts during the day.
  • Typically we peak at about 20 processes inserting at the same moment, and we end up with about 100K new LOBs per day though this varies between 50K and 200K.
  • The inserts are all “insert one row; commit”.
  • The LOBs have to be kept for 28 days, after which they (the rows that hold them) are deleted by an overnight batch job.
  • The LOBs have to be logged and the database is running in archivelog mode

As soon as you see the “aged 28 days” you might immediately think “partitioning” (though perhaps your first thought might be that restaurant in Cincinnati airport that hangs its beef to air-dry for 28 days before cooking). Unfortunately not everyone has licensed the partitioning option, so what do you have to worry about when you start to design for this requirement. (We’re also going to assume that securefiles are going to be saved for another blog mini-series).

Clearly we should make use of multple freepools to avoid the insert contention on the LOBINDEX. With about 20 concurrent processes we might immediate go for freepools 20, but we might decide that a smaller number like 4 or 8 is sufficient. We probably ought to do some tests to see if we can discover any penalties for larger numbers of freepools, and to see what sort of contention we get with a smaller number of freepools.

We got a hint from the previous article that when a process deletes a LOB it indexes the reusable chunks in the same freepool as it inserts LOBs – at least, that’s what seemed to happen in our little test case in the previous article. Does Oracle always follow this pattern, or will a multi-row delete, or a large number of single “delete;commt;” cycles spread the reusable chunks evenly across all the available freepools ? If you do a single large delete do you end up with all the reusable space in one freepool – if so, does it matter or should we have multiple processes do our “big batch delete” ?

On second thoughts, my little demo showed that when you insert a LOB into freepool X and then delete it the reusable space goes into freepool X. Maybe I’ve misinterpreted the test and need to do a better test; maybe the reusable space goes into the freepool that the LOB was originally attached to, not into the freepool dictated by the process id. That would mean that a bulk delete would tend to spread the LOBs across all the freepools – which means if you used multiple processes to delete data they might cause contention on the “reusable” segments of the LOBINDEX.

If we do a single large delete and all the reusable chunks go into the same freepool what happens when we start inserting new LOBs ? If the LOB segment is “full” is it only the processes associated with that one freepool that can use the reusable space, or will EVERY process start to raid the freepool that has the only reusable space If the latter then all we’ve done by using multiple freepools is postpone (by roughly 28 days) the moment when we start to get contention on our LOBINDEX ?

Fortunately if we’ve made some poor choices in the orginal design Oracle does allow us to “rebuild freepools”, and even change the number of freepools:


alter table t1 modify lob (c1) (rebuild freepools);
alter table t1 modify lob (c1) (freepools (3));

Mind you, there is a little note on MoS that rebuilding freepools “may take some time” and locks the table in exclusive mode while it’s going on. So perhaps we should check to see how the rebuild works, and try to figure out how long it might take. A maxim for dealing with very large objects is that you really want to get it right first time because it’s hard to test the effects of change especially since you probably end up wanting to do your final tests on a backup copy of the production system.

Getting Started

I’ve specified 100K LOBs per day, sized between 16KB and 24KB, kept for 28 days – that’s about 50 GB, and I don’t really want to sit waiting for Oracle to build that much data while running 20 concurrent processes that are logging and generating archived redo log. (Especially since I may want to repeat the exercise two or three times with different numbers of freepools.) I’m going to start small and grow the scale when it’s necessary.

I’ll start with 4 concurrent processes inserting 1,000 LOBs each, sized at 12KB, with freepools 4, and I’ll rig the system very carefully so that each process uses a different freepool. After that I’ll run a single batch delete to delete the first 3,000 LOBs – I’ll pick a process that ought to use freepool 1 or 2 (i.e. not 0 or 3, the “end” freepools); then I’ll repeat the insert cycle but insert just 750 LOBs per process. At various points in this sequence of events I’ll stop and dump some index blocks and look at some stats to see if I can spot any important patterns emerging.

Once I’ve got through that cycle I’ll decide what to do next – the first set of results may produce some important new questions – but I’m guessing that I’ll probably end up repeating the “delete / insert” cycle at least one more time.

Here’s a little code to create a suitable table,


create sequence s1 cache 10000;

create table t1(
        id      number constraint t1_pk primary key,
        c1      clob
)
lob (c1)
store as basicfile
    text_lob(
            disable storage in row
            chunk 8k
            retention
            nocache
            freepools 4
            tablespace test_8k_assm
)
;

declare
        m_v1 varchar2(32767) := rpad('x',12000,'x');
begin
        for i in 0..0 loop
                insert into t1 values (i, m_v1);
                commit;
        end loop;
end;
;

I’ve inserted a row to make sure that all the objects appear in all the right places. The code I’ve used to do this insert is a version of the code that I’m going to use for the concurrency testing but restricted to insert one row with an id of zero. In the concurrency test I’ll make use of the sequence I’ve created to act as the primary key to the table.

Having created the table I then start four more sessions, carefully ensuring that they will each pick a different freepool. To make sure I had one session per freepool I just kept connecting sessions and running a silly little check for each session’s process id (pid) until I had four that returned each of the values from 0 to 3:

select mod(pid,4) from v$process where addr = (
        select paddr from v$session where sid = (
                select sid from v$mystat where rownum = 1
        )
)
;

Once I had the four extra sessions set up, I issued a simple “lock table t1 in exclusive mode” from my original session then started the following script in each of the other four:


spool temp&1

declare
        m_v1 varchar2(32767) := rpad('x',12000,'x');
begin
        for i in 1..1000 loop
                insert into t1 values (s1.nextval, m_v1);
                commit;
        end loop;
end;
/

spool off

(I supplied A, B, C, and D as the first parameter to the script so that I got four sets of output, but I haven’t included the code I used to get a snapshot of the session stats, session waits, and system enqueues recorded by each session.)

First check – did I get all four freepools evenly used (which is what I had assumed would happen when I chose the 4 process ids so carefully. I can check this by doing a block dump of the LOBINDEX root block because with 4,001 entries I’m (almost certainly) going to get a root block, no further branch levels, and a few dozen leaf blocks.

As with all B-tree indexes the “row directory” of the root block will contain a list of “truncated” key values that allow Oracle to search down to the correct block in the next layer of the index so I’m going to extract just the key values, and only the first column of those keys in the same way that I did with the previous article. This means every line in the following output shows you, in order, the first LOB id (with a few of them truncated) in each leaf block:

col 0; len 10; (10):  00 00 00 01 00 00 09 da fe a7
col 0; len 10; (10):  00 00 00 01 00 00 09 db 00 24
col 0; len 10; (10):  00 00 00 01 00 00 09 db 01 6f
col 0; len 10; (10):  00 00 00 01 00 00 09 db 02 ec
col 0; len  9; ( 9):  00 00 00 01 00 00 09 db 04
col 0; len 10; (10):  00 00 00 01 00 00 09 db 05 7c
col 0; len 10; (10):  00 00 00 01 00 00 09 db 07 2b
col 0; len 10; (10):  00 00 00 01 00 00 09 db 07 e0
col 0; len 10; (10):  00 00 00 01 00 00 09 db 09 8f
col 0; len  9; ( 9):  00 00 00 01 00 00 09 db 0b
col 0; len 10; (10):  00 00 00 01 00 00 09 db 0b bc

col 0; len 10; (10):  00 02 00 01 00 00 09 da fb 74
col 0; len 10; (10):  00 02 00 01 00 00 09 da fe 81
col 0; len 10; (10):  00 02 00 01 00 00 09 db 00 62
col 0; len 10; (10):  00 02 00 01 00 00 09 db 01 ad
col 0; len 10; (10):  00 02 00 01 00 00 09 db 02 94
col 0; len 10; (10):  00 02 00 01 00 00 09 db 04 11
col 0; len 10; (10):  00 02 00 01 00 00 09 db 04 f8
col 0; len 10; (10):  00 02 00 01 00 00 09 db 06 11
col 0; len 10; (10):  00 02 00 01 00 00 09 db 07 f2
col 0; len 10; (10):  00 02 00 01 00 00 09 db 08 d9
col 0; len 10; (10):  00 02 00 01 00 00 09 db 09 c0
col 0; len  9; ( 9):  00 02 00 01 00 00 09 db 0b

col 0; len 10; (10):  00 04 00 01 00 00 09 da fd fb
col 0; len 10; (10):  00 04 00 01 00 00 09 da fe 4c
col 0; len 10; (10):  00 04 00 01 00 00 09 da ff c9
col 0; len  9; ( 9):  00 04 00 01 00 00 09 db 01
col 0; len 10; (10):  00 04 00 01 00 00 09 db 01 f8
col 0; len 10; (10):  00 04 00 01 00 00 09 db 03 75
col 0; len 10; (10):  00 04 00 01 00 00 09 db 04 5c
col 0; len 10; (10):  00 04 00 01 00 00 09 db 06 3d
col 0; len  9; ( 9):  00 04 00 01 00 00 09 db 07
col 0; len 10; (10):  00 04 00 01 00 00 09 db 08 38
col 0; len 10; (10):  00 04 00 01 00 00 09 db 0a 19
col 0; len  9; ( 9):  00 04 00 01 00 00 09 db 0b

col 0; len  2; ( 2):  00 06
col 0; len 10; (10):  00 06 00 01 00 00 09 da fe d4
col 0; len 10; (10):  00 06 00 01 00 00 09 db 00 ca
col 0; len 10; (10):  00 06 00 01 00 00 09 db 03 24
col 0; len 10; (10):  00 06 00 01 00 00 09 db 05 4c
col 0; len 10; (10):  00 06 00 01 00 00 09 db 07 a6
col 0; len  9; ( 9):  00 06 00 01 00 00 09 db 0a
col 0; len 10; (10):  00 06 00 01 00 00 09 db 0c 5a

As you can see, we have the expected pattern (for 4 freepools) of entries starting with (00 00), (00 02), (00 04), and (00 06); but you might wonder why there are 11 leaf blocks for 00, 12 leaf blocks for 02 and 04, and only 8 leaf blocks for 06. We can answer the 11/12 anomaly by remembering that any branch blocks will have a “leftmost child” entry that won’t appear in the row directory – so the 12th leaf (or rather the 1st leaf) block for 00 is being pointed to by the “LMC”. But what about the missing blocks for 06 ? A treedump shows the answer:


branch: 0x1800204 25166340 (0: nrow: 44, level: 1)
   leaf: 0x1800225 25166373 (-1: nrow: 81 rrow: 81)
   leaf: 0x180022d 25166381 (0: nrow: 81 rrow: 81)
   leaf: 0x1800231 25166385 (1: nrow: 81 rrow: 81)
   leaf: 0x1800235 25166389 (2: nrow: 81 rrow: 81)
   leaf: 0x1800239 25166393 (3: nrow: 75 rrow: 75)
   leaf: 0x180023d 25166397 (4: nrow: 81 rrow: 81)
   leaf: 0x1800206 25166342 (5: nrow: 81 rrow: 81)
   leaf: 0x180020a 25166346 (6: nrow: 81 rrow: 81)
   leaf: 0x180020e 25166350 (7: nrow: 81 rrow: 81)
   leaf: 0x1800212 25166354 (8: nrow: 76 rrow: 76)
   leaf: 0x1800216 25166358 (9: nrow: 81 rrow: 81)
   leaf: 0x180021a 25166362 (10: nrow: 132 rrow: 132)

   leaf: 0x1800226 25166374 (11: nrow: 81 rrow: 81)
   leaf: 0x180022a 25166378 (12: nrow: 81 rrow: 81)
   leaf: 0x180022e 25166382 (13: nrow: 81 rrow: 81)
   leaf: 0x1800232 25166386 (14: nrow: 81 rrow: 81)
   leaf: 0x1800236 25166390 (15: nrow: 81 rrow: 81)
   leaf: 0x180023a 25166394 (16: nrow: 81 rrow: 81)
   leaf: 0x180023e 25166398 (17: nrow: 81 rrow: 81)
   leaf: 0x1800207 25166343 (18: nrow: 81 rrow: 81)
   leaf: 0x180020b 25166347 (19: nrow: 81 rrow: 81)
   leaf: 0x180020f 25166351 (20: nrow: 81 rrow: 81)
   leaf: 0x1800213 25166355 (21: nrow: 77 rrow: 77)
   leaf: 0x1800217 25166359 (22: nrow: 111 rrow: 111)

   leaf: 0x1800229 25166377 (23: nrow: 81 rrow: 81)
   leaf: 0x180022f 25166383 (24: nrow: 81 rrow: 81)
   leaf: 0x1800233 25166387 (25: nrow: 78 rrow: 78)
   leaf: 0x1800237 25166391 (26: nrow: 81 rrow: 81)
   leaf: 0x180023b 25166395 (27: nrow: 81 rrow: 81)
   leaf: 0x180023f 25166399 (28: nrow: 81 rrow: 81)
   leaf: 0x1800208 25166344 (29: nrow: 81 rrow: 81)
   leaf: 0x180020c 25166348 (30: nrow: 76 rrow: 76)
   leaf: 0x1800210 25166352 (31: nrow: 81 rrow: 81)
   leaf: 0x1800214 25166356 (32: nrow: 81 rrow: 81)
   leaf: 0x1800230 25166384 (33: nrow: 81 rrow: 81)
   leaf: 0x1800238 25166392 (34: nrow: 107 rrow: 107)

   leaf: 0x180022b 25166379 (35: nrow: 78 rrow: 78)
   leaf: 0x1800218 25166360 (36: nrow: 152 rrow: 152)
   leaf: 0x180021c 25166364 (37: nrow: 152 rrow: 152)
   leaf: 0x1800220 25166368 (38: nrow: 152 rrow: 152)
   leaf: 0x1800224 25166372 (39: nrow: 152 rrow: 152)
   leaf: 0x1800228 25166376 (40: nrow: 152 rrow: 152)
   leaf: 0x180022c 25166380 (41: nrow: 152 rrow: 152)
   leaf: 0x1800234 25166388 (42: nrow: 11 rrow: 11)

Although there are little glitches along the way the leaf blocks for freepools 0, 1, and 2 have 81 entries per leaf block, the leaf blocks for freepool 3 have 152 – it’s the difference between inserting rows at the end of an index and getting a “90/10” split compared with inserting somewhere in the middle and getting a “50/50” split. In fact, although Oracle uses the idgen1$ sequence to drive the generation of LOB ids, the way it “batches” IDs (the sequence cache size is 50) means that LOB ids can appear out of order when generated by multiple sessions – even in the same freepool: it is possible for the top freepool to suffer 50/50 splits though these are likely to appear relatively rarely. (As a side effect, the LOB ids in my table and the id column on the underlying table are out of order with respect to each other.)

You’ll notice that I’ve left blank links in the treedump list corresponding to the breaks between the free pools (remember there is a “-1” entry in the treedump for the “leftmost child” that doesn’t appear in the row directory). This helps to explain the few leaf blocks with unusual row counts – they’re the ones that at (or very near) the boundaries between freepools.

Bulk Delete

After loading some data in with concurrent inserts and multiple freepools I’m going to do a single big delete from the table to get rid of a lot of “old” data. In fact all I’ll do is delete the rows where id <= 3000. The big question is this – will a single delete put all the reusable blocks into a single freepool, or will it put the reusable space for each LOB into the freepool that the LOB was originally in, or will it find some other way to spread the reusable space evenly across all the free pools ? One freepool or many – both options have good points, both options have bad points.

Here’s what I got as the treedump after the delete:


branch: 0x1800204 25166340 (0: nrow: 72, level: 1)
   leaf: 0x1800225 25166373 (-1: nrow: 81 rrow: 0)
   leaf: 0x180022d 25166381 (0: nrow: 81 rrow: 0)
   leaf: 0x1800231 25166385 (1: nrow: 81 rrow: 0)
   leaf: 0x1800235 25166389 (2: nrow: 81 rrow: 0)
   leaf: 0x1800239 25166393 (3: nrow: 75 rrow: 0)
   leaf: 0x180023d 25166397 (4: nrow: 81 rrow: 0)
   leaf: 0x1800206 25166342 (5: nrow: 81 rrow: 0)
   leaf: 0x180020a 25166346 (6: nrow: 81 rrow: 0)
   leaf: 0x180020e 25166350 (7: nrow: 81 rrow: 22)
   leaf: 0x1800212 25166354 (8: nrow: 76 rrow: 76)
   leaf: 0x1800216 25166358 (9: nrow: 81 rrow: 81)
   leaf: 0x180021a 25166362 (10: nrow: 132 rrow: 120)

   leaf: 0x1800226 25166374 (11: nrow: 81 rrow: 0)
   leaf: 0x180022a 25166378 (12: nrow: 81 rrow: 0)
   leaf: 0x180022e 25166382 (13: nrow: 81 rrow: 0)
   leaf: 0x1800232 25166386 (14: nrow: 81 rrow: 0)
   leaf: 0x1800236 25166390 (15: nrow: 81 rrow: 0)
   leaf: 0x180023a 25166394 (16: nrow: 81 rrow: 0)
   leaf: 0x180023e 25166398 (17: nrow: 81 rrow: 0)
   leaf: 0x1800207 25166343 (18: nrow: 81 rrow: 0)
   leaf: 0x180020b 25166347 (19: nrow: 81 rrow: 0)
   leaf: 0x180020f 25166351 (20: nrow: 81 rrow: 64)
   leaf: 0x1800213 25166355 (21: nrow: 77 rrow: 77)
   leaf: 0x1800217 25166359 (22: nrow: 111 rrow: 101)

   leaf: 0x1800229 25166377 (23: nrow: 81 rrow: 0)
   leaf: 0x180022f 25166383 (24: nrow: 81 rrow: 0)
   leaf: 0x1800233 25166387 (25: nrow: 78 rrow: 0)
   leaf: 0x1800237 25166391 (26: nrow: 81 rrow: 0)
   leaf: 0x180023b 25166395 (27: nrow: 81 rrow: 0)
   leaf: 0x180023f 25166399 (28: nrow: 81 rrow: 0)
   leaf: 0x1800208 25166344 (29: nrow: 81 rrow: 0)
   leaf: 0x180020c 25166348 (30: nrow: 76 rrow: 0)
   leaf: 0x1800210 25166352 (31: nrow: 81 rrow: 0)
   leaf: 0x1800214 25166356 (32: nrow: 81 rrow: 36)
   leaf: 0x1800230 25166384 (33: nrow: 81 rrow: 81)
   leaf: 0x1800238 25166392 (34: nrow: 81 rrow: 81)
   leaf: 0x180023c 25166396 (35: nrow: 139 rrow: 139)

   leaf: 0x1800227 25166375 (36: nrow: 138 rrow: 138)
   leaf: 0x1800205 25166341 (37: nrow: 126 rrow: 126)
   leaf: 0x1800219 25166361 (38: nrow: 82 rrow: 82)
   leaf: 0x1800272 25166450 (39: nrow: 95 rrow: 95)
   leaf: 0x1800209 25166345 (40: nrow: 118 rrow: 118)
   leaf: 0x180021f 25166367 (41: nrow: 143 rrow: 143)
   leaf: 0x180020d 25166349 (42: nrow: 81 rrow: 81)
   leaf: 0x1800243 25166403 (43: nrow: 90 rrow: 90)
   leaf: 0x1800222 25166370 (44: nrow: 147 rrow: 147)
   leaf: 0x1800211 25166353 (45: nrow: 81 rrow: 81)
   leaf: 0x1800247 25166407 (46: nrow: 73 rrow: 73)
   leaf: 0x1800223 25166371 (47: nrow: 98 rrow: 98)
   leaf: 0x180026a 25166442 (48: nrow: 98 rrow: 98)
   leaf: 0x180021d 25166365 (49: nrow: 127 rrow: 127)
   leaf: 0x1800266 25166438 (50: nrow: 131 rrow: 131)
   leaf: 0x1800215 25166357 (51: nrow: 133 rrow: 133)
   leaf: 0x180026e 25166446 (52: nrow: 141 rrow: 141)
   leaf: 0x180021b 25166363 (53: nrow: 82 rrow: 82)
   leaf: 0x180024b 25166411 (54: nrow: 93 rrow: 93)
   leaf: 0x1800276 25166454 (55: nrow: 109 rrow: 109)
   leaf: 0x180024f 25166415 (56: nrow: 77 rrow: 77)
   leaf: 0x180021e 25166366 (57: nrow: 143 rrow: 143)
   leaf: 0x180027e 25166462 (58: nrow: 126 rrow: 126)
   leaf: 0x1800221 25166369 (59: nrow: 93 rrow: 93)
   leaf: 0x1800253 25166419 (60: nrow: 82 rrow: 82)
   leaf: 0x180027a 25166458 (61: nrow: 97 rrow: 97)
   leaf: 0x1800257 25166423 (62: nrow: 84 rrow: 84)

   leaf: 0x180022b 25166379 (63: nrow: 78 rrow: 0)
   leaf: 0x1800218 25166360 (64: nrow: 152 rrow: 0)
   leaf: 0x180021c 25166364 (65: nrow: 152 rrow: 0)
   leaf: 0x1800220 25166368 (66: nrow: 152 rrow: 0)
   leaf: 0x1800224 25166372 (67: nrow: 152 rrow: 0)
   leaf: 0x1800228 25166376 (68: nrow: 152 rrow: 72)
   leaf: 0x180022c 25166380 (69: nrow: 152 rrow: 152)
   leaf: 0x1800234 25166388 (70: nrow: 11 rrow: 11)


The number of leaf blocks has gone up from 44 to 70 (but that shouldn’t be too much of a surprise – index leaf block space can’t be reused until after the commit, so we were bound to grow the index to insert the entries for reusable chunks).

As before I’ve inserted a few blank lines to break the list into the separate index sections, and you can see that the first few blocks in each of the first three freepools has nrow = 81 and (typically) rrow = 0. These are the leaf blocks where all the LOB entries have been marked as deleted. There are a couple of variations – leaf block 10, for example, shows nrow = 132, rrow = 120: this is the leaf block where freepool 0 (LOB section) overlapped with freepool 1 (LOB section), and the first 10 LOBs in freepool 1 have been marked as deleted. The LOB section for freepool 4 follows the same sort of pattern, though nrow = 152 in most of the blocks.

The important detail is in leaf blocks 36 to 62 – which show nrow = rrow throughout, but with a degree of randomness as to the actual number of index entries. These are the leaf blocks that record the “reusable chunks”, and they’ve all been associated with freepool 2 (counting from zero). There are several details that combine to explain why the numbers of entries per leaf block vary so much, but I don’t want to get too distracted by them now; remember, though, that I pointed out that the LOB ids and table id column weren’t in synch with each other so part of what you’re seeing here is 50/50 leaf node splits followed by a little back-filling.

Again I’ve extracted the “col 0” values from the block dump of the root block – I won’t show all of them, I’ll just show you the entries from entries 35 to 63 so that you can see the leaf block pointers for the “reusable” section of freepool 2, and the LOB section of freepools 2 and 3:


col 0; len 10; (10):  00 04 00 01 00 00 09 db 0c 7d

col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00

col 0; len 02; (02):  00 06

The starting “00 05” tells us that this is the reusable chunks associated with freepool 2 (2 * 2 + 1 = 5), and the fact that the next four bytes are identical across the entries tells you that I managed to delete my 3,000 LOBs in less than one second.

After seeing the effect of this monolithic delete you should now be asking yourself a few questions, such as:

  • Why would the Oracle developer think that this use of one freepool is a good idea ?
  • Why might it be a bad idea ?
  • What happens when we start inserting more data ?

TO BE CONTINUED …

 


Basicfile LOBs

Thu, 2016-08-18 07:21

There are probably quite a lot of people still using Basicfile LOBs, although Oracle Corp. would like everyone to migrate to the (now default) Securefile LOBs. If you’re on Basicfile, though, and don’t want (or aren’t allowed) to change just yet here are a few notes that may help you understand some of the odd performance and storage effects.

Of course, there are a lot of variations in how you declare the LOB – pctversion vs. retention, cache vs. nocache, logging vs. nologging, enable vs. disable storage in row, and I can’t cover all the combinations – so what I’ll be looking at is a general strategy for handling a large number of small LOBs that are being inserted into the database at a fairly high degree of concurrency, and then being deleted a few weeks later so, in theory, the LOB segment should end up at a steady state with “N” days worth of data stored. The driver behind this modelling is a problem I was asked to examine a little while ago.

Some background details on Basicfile LOBs

If the LOB column is defined as “enable storage in row” then a very small LOB (up to 3,960 bytes) will be stored almost as if it were an ordinary column in the row; if the size of a LOB is a little larger than this limit then the LOB will be stored in “chunks” in the LOB segment and pointers to the chunks will be stored in the row – for the first 12 chunks. The chunk size defined for a LOB column can be up to 32KB – though the default chunk size is the block size for the containing tablespace –  so it’s possible to store a LOB of up to roughly 384KB before Oracle needs to create index entries in the LOBINDEX segment if you’re using 32KB blocks though most people use 8KB blocks and will hit the change at roughly 96KB.

If the LOB column is defined as “disable storage in row” then no matter how small it really is it will always take up at least one chunk in the LOB segment and will have a corresponding index entry in the LOBINDEX.  For reasons of efficiency an entry in a LOB index always “carries” 32 bytes of pointer data, allowing it to list up to 8 chunks.

When a LOB is deleted (replaced by a null, an empty_lob(), or a new LOB value) the previous state of the base table row and the LOB index will be preserved in the undo segment in the ordinary manner but the previous version of the LOB data itself is simply left in the segment (and a new version of the LOB created if the operation is an “update”). The chunks making up the old version are added to the LOB index with a key based on the time (seconds since 1st Jan 1970) the delete took place – this means that when Oracle wants to re-use space in the LOB segment it can walk the LOB index in order to find the chunks that were marked as available for reuse the greatest time into the past. (It also means that the LOB index is one of the strangest in the Oracle pantheon – part of it indexes “available chunks keyed by time” part of it indexes “current chunks keyed by LOB id”.

There are two options for how long old versions of LOBs will be kept: PCTVERSION specifies the percentage of space below the segment’s highwater mark that may be used to keep old versions, and (until 12c, where things change) RETENTION specifies that Oracle should try to keep old versions for the length of time given by the system parameter undo_retention. If enough versions of LOBs have been kept Oracle can create a read-consistent version of a given LOB by using the normal undo mechanisms to take the base table row and LOB index back to the correct point in time which will then ensure that the LOB pointers will be pointing to the correct chunks.  (If the LOB chunks have been over-written this is the point where you will get an Oracle error: “ORA-22924 Snapshot too old”, followed by a misleading “ORA-01555 Snapshot too old ….”)

One final interesting point from a performance perspective is that you define the LOB to be “nocache”, which means that typical reads and writes of the lob will use direct path; and if you’ve specified nologging then reads and writes of the LOB will generate tiny amounts of redo log.  Two special points to go with this, though: if you specify “nocache logging” the direct path writes will be logged, but the log content will be by chunk – so if you store 4,000 bytes of data in a LOB with a 32KB chunk size you will write 32KB of redo log; secondly if you are testing the effects of logging and nologging, make sure your test database is running in archivelog mode of your production database is going to be archiving – otherwise Oracle will fool you by taking a short cut and NOT logging a nocache LOB even if you specify logging! The LOB index is always cached and logged, by the way, and even if the LOB is defined as nocache there are circumstances where LOB blocks are read into the buffer cache (remember my previous note explaining that we saw 6 billion buffer gets on a nocache LOB).

The last detail I want to mention is the FREEPOOLS parameter. The description in the developers guide for 11.2 describes this as:  “Specifies the number of FREELIST groups for BasicFiles LOBs, if the database is in automatic undo mode.” Unfortunately freelists and freelist groups are things that happen in manual segment space management, so this definition requires an alternative meaning for the expression “FREELIST groups”. The purpose of FREEPOOLS is to help deal with concurrency problems but there’s not much information around to help you understand the mechanisms and pitfalls and the available documents on MoS don’t really do anything to clarify the position – and that’s what this article is (finally) going to talk about.

Basicfile FREEPOOLs – the truth is out there

When you specify FREEPOOLs you affect the way Oracle uses the LOB Index – not the space management information about the segment holding the index, the actual content of (in fact the KEY values held by) the index.

You can do a treedump of a LOB index by object_id in the standard way you can do a treedump of any B-tree (or bitmap) index, and you can dump blocks from a LOB index in the same way you dump any other data block in the database, by file number and block number (or block range), so it’s easy to see what happens in a LOB index when you start using multiple freepools. I’ve created a table holding a LOB defined as “disable storage in row” so that I always use the LOB index, inserted three rows, then deleted one of them and dumped the one index block (which happens to be both the root and a leaf). Here’s the SQL to create the table and do the data handling:


create table t1(
        id      number constraint t1_pk primary key,
        c1      clob
)
lob (c1)
store as basicfile text_lob(
        disable storage in row
        chunk 8k
        retention
        nocache
        tablespace test_8k_assm
)
;


declare
        m_v1 varchar2(32767) := rpad('x',12000,'x');
begin
        for i in 1..3 loop
                insert into t1 values (i, m_v1);
                commit;
        end loop;
end;
/


delete from t1 where id = 1;
commit;

alter system flush buffer_cache;

I’ve ended by flushing the buffer cache so that I don’t get a huge trace file when I try to dump the index to disc. Here’s the next bit of processing:


SQL> select object_id from user_objects where object_type = 'INDEX' and object_name like 'SYS_IL%';

 OBJECT_ID
----------
    241599

SQL> alter session set events 'immediate trace name treedump level 241599';

----- begin tree dump
leaf: 0x1800204 25166340 (0: nrow: 4 rrow: 3)
----- end tree dump

SQL> alter session dump datafile 6 block 516;

I’ve included in the above the treedump that I extracted from the tracefile, and this shows that the index consists of a single leaf block (0x1800204 = file 6 block 516) with 4 row directory entries of which one has been deleted. Here’s the row dump from that leaf block – the first three entries are the index entries identifying the three LOBs I created (and, as shown by the flag value “—D–“, the first has been marked as deleted) the fourth entry points to a set of free chunks (corresponding to the chunks that will becomem available for re-use after a delay corresponding to the undo retention time).


row#0[7982] flag: ---D--, lock: 2, len=50, data:(32):
 00 20 03 00 00 00 00 01 0f 1c 00 00 00 00 00 01 01 80 01 a6 01 80 01 aa 00
 00 00 00 00 00 00 00
col 0; len 10; (10):  00 00 00 01 00 00 09 d6 64 85
col 1; len 4; (4):  00 00 00 00

row#1[7932] flag: ------, lock: 0, len=50, data:(32):
 00 20 03 00 00 00 00 01 0f 1c 00 00 00 00 00 01 01 80 01 ae 01 80 01 b2 00
 00 00 00 00 00 00 00
col 0; len 10; (10):  00 00 00 01 00 00 09 d6 64 86
col 1; len 4; (4):  00 00 00 00

row#2[7882] flag: ------, lock: 0, len=50, data:(32):
 00 20 03 00 00 00 00 01 0f 1c 00 00 00 00 00 01 01 80 01 b6 01 80 01 ba 00
 00 00 00 00 00 00 00
col 0; len 10; (10):  00 00 00 01 00 00 09 d6 64 87
col 1; len 4; (4):  00 00 00 00

row#3[7832] flag: ------, lock: 2, len=50, data:(32):
 01 80 01 a6 01 80 01 aa 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 00 00 00 00 00 00 00
col 0; len 10; (10):  00 01 57 b3 32 9b 00 00 00 00
col 1; len 4; (4):  01 80 01 a6

If you examine the entries closely you will see that despite the common structure of all four of them there are two patterns. Ignoring the “data (32):” portion and looking at just “col 0” the last few bytes of the first three entries hold consecutive numbers which are actually the LOB Ids for the three LOBs. The fourth entry breaks that pattern and if you examine bytes 3 to 6 you will find that that is (approximately, by the time I publish this article) the number of seconds since 1st Jan 1970.

To a large degree you need only examine “col 0” to get a good idea of how Oracle handles the LOB index, but I will say just a few things about the rest of the entry.  For the “reusable space” index entries “col 1” is the first of a list of up to 8 chunks that were released from the same LOB at that moment, and the “data(32)” is the full list of those 8 chunks – each chunk is identified by the block address of the first block of the chunk. If I had created and deleted a LOB of roughly 128KB I would have used 16 chunks to create it and made 16 chunks available on deletion, so there would have been two index entries with the same “col 0” value, each identifying 8 of the chunks – hence the re-appearance of the first chunk as “col 1”.  (You can’t help wondering why Oracle doesn’t squeeze 9 chunks per index entry rather than repeating the first of the list – maybe there’s a peripheral effect that makes 8 easier, maybe it’s simply a good sanity check mechanism.)

For index entries about current LOBs “col 1” is a counter for the index entries that identify the entire LOBs. Our LOBs were all very small so we only needed one index entry (which Oracle starts counting from zero).  The “data (32)” entry for the “zeroth” entry starts with 16 bytes of metadata then holds up to 4 pointers to chunks; subsequent entries don’t need the metadata and can hold up to 8 pointers each and “col 1” stores the chunk number that the index entry starts with, so “col 1” in consecutive index entries for a given LOB id will have values 0, 4, 12, 20, etc.

You might note, by the way, that my LOBs are not made up of consecutive blocks even though my chunk size is exactly one block. This is a side effect of ASSM (automatic segment space management) and nothing specifically to do with LOBs.

With this sketch in place you now have some idea of how a LOB index works. Apart from the convenience of knowing roughly what information is stored in the index, and how it has this strange dual purpose, you can now view it just like any other B-tree index in Oracle. When you insert a LOB you insert some index entries into the middle of the index (the high-value point of the LOB Id bit), when you delete a LOB you mark some (consecutive) index entries as deleted and insert some index entries at the high end of the index (because each delete is the most recent delete).

As soon as you’ve got that far you realise that if you have some degree of concurrency of inserts and deletes then you have two hot spots in the index – the mid point where you’re going to get lots of 50/50 splits as LOBs are inserted and the end point where all the re-usable chunks are indexed. So how do you configure Oracle, and what does Oracle do, to reduce the contention ?

Take another look at the “col 0” values – which I’ve cut out and listed in isolation below:


col 0; len 10; (10):  00 00 00 01 00 00 09 d6 64 85
col 0; len 10; (10):  00 00 00 01 00 00 09 d6 64 86
col 0; len 10; (10):  00 00 00 01 00 00 09 d6 64 87

col 0; len 10; (10):  00 01 57 b3 32 9b 00 00 00 00

Apart from making it easy to see the sequencing in the 3 LOB ids it’s now easy to note that the first three (LOB) entries start with “00 00” while the last (reusable space) entry starts with “00 01”. It’s really this starting two bytes that makes it easy for Oracle to separate the current LOBs section of the index from the reusable space section. The two bytes are the freepool identifier – it’s the first (and only) free pool – but Oracle is counting from zero, doubling the counter for the current LOBs, and doubling and adding one for the reusable space.

Here are some results when I drop and recreate the table with freepools 4 and repeat the experiment. (I’ve removed the “data(32)” content to make the output a little cleaner, and then extracted the “col 0” values).


row#0[7982] flag: ---D--, lock: 2, len=50, data:(32):
col 0; len 10; (10):  00 06 00 01 00 00 09 da 36 55
col 1; len 4; (4):  00 00 00 00

row#1[7932] flag: ------, lock: 0, len=50, data:(32):
col 0; len 10; (10):  00 06 00 01 00 00 09 da 36 56
col 1; len 4; (4):  00 00 00 00

row#2[7882] flag: ------, lock: 0, len=50, data:(32):
col 0; len 10; (10):  00 06 00 01 00 00 09 da 36 57
col 1; len 4; (4):  00 00 00 00

row#3[7832] flag: ------, lock: 2, len=50, data:(32):
col 0; len 10; (10):  00 07 57 b3 3b a5 00 00 00 00
col 1; len 4; (4):  01 80 01 df

===

col 0; len 10; (10): 00 06 00 01 00 00 09 da 36 55
col 0; len 10; (10): 00 06 00 01 00 00 09 da 36 56
col 0; len 10; (10): 00 06 00 01 00 00 09 da 36 57

col 0; len 10; (10): 00 07 57 b3 3b a5 00 00 00 00

It just happened that with 4 freepools available my session picked freepool 3 so its LOB index entries are preceded with 00 06 (2 * 3), and it’s reusable space index entries are preceded with 00 07 (2 * 3 + 1). At present I think freepool chosen by a session (counting from zero) is derived from the session’s process id (pid) by a simple mod(pid , freepools).

So what happens if I start a second session, and adjust my little PL/SQL procedure to insert rows 4, 5, and 6.

I expect to see two things. First, the “ordinary” B-tree event – the index entry that’s marked for deletion will be cleared out of the index; secondly I should see four new index entries (one marked as deleted) which, with a little luck (one chance in four), will show that they are associated with a different freepool.

Here’s the dump (again with the “data(32)” deleted, and the “col 0” extracted at the end):


row#0[7782] flag: ---D--, lock: 2, len=50, data:(32):
col 0; len 10; (10):  00 00 00 01 00 00 09 da 36 87
col 1; len 4; (4):  00 00 00 00

row#1[7732] flag: ------, lock: 0, len=50, data:(32):
col 0; len 10; (10):  00 00 00 01 00 00 09 da 36 88
col 1; len 4; (4):  00 00 00 00

row#2[7682] flag: ------, lock: 0, len=50, data:(32):
col 0; len 10; (10):  00 00 00 01 00 00 09 da 36 89
col 1; len 4; (4):  00 00 00 00

row#3[7632] flag: ------, lock: 2, len=50, data:(32):
col 0; len 10; (10):  00 01 57 b3 3b ad 00 00 00 00
col 1; len 4; (4):  01 80 01 a4

row#4[7932] flag: ------, lock: 0, len=50, data:(32):
col 0; len 10; (10):  00 06 00 01 00 00 09 da 36 56
col 1; len 4; (4):  00 00 00 00

row#5[7882] flag: ------, lock: 0, len=50, data:(32):
col 0; len 10; (10):  00 06 00 01 00 00 09 da 36 57
col 1; len 4; (4):  00 00 00 00

row#6[7832] flag: ------, lock: 0, len=50, data:(32):
col 0; len 10; (10):  00 07 57 b3 3b a5 00 00 00 00
col 1; len 4; (4):  01 80 01 df

===

col 0; len 10; (10): 00 00 00 01 00 00 09 da 36 87
col 0; len 10; (10): 00 00 00 01 00 00 09 da 36 88
col 0; len 10; (10): 00 00 00 01 00 00 09 da 36 89

col 0; len 10; (10): 00 01 57 b3 3b ad 00 00 00 00

col 0; len 10; (10): 00 06 00 01 00 00 09 da 36 56
col 0; len 10; (10): 00 06 00 01 00 00 09 da 36 57

col 0; len 10; (10): 00 07 57 b3 3b a5 00 00 00 00

The index entry previously marked as deleted has disappeared (it was LOB id “09 da 36 56”).

We have four new index entries – the first 4 in the list above – and we can see that our second session has been allocated to freepool 0, the LOB index entries are preceded by “00 00”, and the reusable space index entry is preceded by “00 01”.

So by declaring freepools N, we effectively break the index up into 2N nearly discrete sections. Half the sections get inserts at the high end as we insert new LOBs (with ever increasing LOB ids) and the other half (apart, sometimes, from the very top section) get inserts at the high end as time passes and we make LOB space available for reuse by deleting existing LOBs. (Note – if two LOBs of more than 8 chunks each are deleted in the same hundredth of a second then there index entries may end up interleaving as the full key is (timestamp, first chunk address) and the chunks may be scattered widely across the tablespace). Freepools allow Oracle to remove the two killer hot spots in the index.

There are side effects, of course: apart from the section for reusable space in the top freepool each section of the index will be subject to 50/50 block splits so the index will always be roughly twice the optimum size – plus a lot more due to other side effects of how the index is used if you’re constantly deleting and inserting LOBs. But that’s not really a big problem; I’ll be examining further side effects of the LOB index, and the mechanism that Oracle has for using the index, and the performance threats this introduces, in the next installment.

I got a bit rushed writing this so there may be some typos and grammar errors which I’ll correct over the next couple of days. Feel free to point out any that you notice.


Month End

Tue, 2016-08-16 07:04

A question about parallel query and cardinality estimates appeared on OTN a little while ago that prompted me to write this note about helping the optimizer do the best job with the least effort.  (A critical point in the correct answer to the original question is that parallel query may lead to “unexpected” dynamic sampling, which can make a huge difference to the choice of execution plans, but that’s another matter.)

The initial cardinality error in the plan came from the following predicate on a “Date dimension” table:


      AR_DATE.CALENDAR_DT   = AR_DATE.MONTH_END_DT 
AND   AR_DATE.CALENDAR_DT  >= ADD_MONTHS(TRUNC(SYSDATE,'fmyy'), -36) 
AND   AR_DATE.MONTH_END_DT >= ADD_MONTHS(TRUNC(SYSDATE,'fmyy'), -36)

In the parallel plan the estimated number of rows on a full tablescan of the table was 742, while on the serial plan the same tablescan produced a cardinality of 1. You will appreciate that having an estimate of 1 (or less) that is nearly three orders of magnitude wrong is likely to lead to a very bad execution plan.

My first thought when I saw this was (based on a purely intuitive interpretation): “there’s one day every month that’s the last day of the month and we’re looking at roughly that last 36 months so we might predict a cardinality of about 36”. That’s still a long way off the 742 estimate and 1,044 actual for the parallel query, but it’s a warning flag that the serial estimate is probably an important error – it’s also an example of the very simple “sanity checking” mental exercises that can accompany almost any execution plan analysis.

My second thought (which happened to be wrong, and would only have been right some time well before version 10.2.0.5) was that the optimizer would treat the add_months() expressions as unknown values and assign a selectivity of 5% to each of the predicates, reducing the combined selectivity to 1/400th of the selectivity it gave to the first predicate. In fact the optimizer evaluates the expressions and would have used the normal (required range / total range) calculation for those two predicates.

It’s the first predicate that I want to examine, though – how does the optimizer calculate a selectivity for it ? Here’s some code to generate sample data to use for testing.


rem
rem     Script:         month_end.sql
rem     Author:         Jonathan Lewis
rem     Dated:          June 2016
rem

create table t1
nologging
as
select
        rownum                                                   id,
        to_date('01-Jan-2010','dd-mon-yyyy') + rownum - 1       calendar_date,
        add_months(
                trunc(to_date('01-Jan-2010','dd-mon-yyyy') + rownum - 1 ,'MM' ),
                1
        ) - 1                                                   month_end_date
from
        dual
connect by
        level <= trunc(sysdate) - to_date('01-jan_2010','dd-mon-yyyy') + 1
;

execute dbms_stats.gather_table_stats(user,'t1',method_opt=>'for all columns size 1')

This clunky bit of code gives me consecutive dates from 1st Jan 2010 up to “today” with the month_end_date column holding the month end date corresponding to the row’s calendar_date. So now we can check what the optimizer makes of the predciate calendar_date = month_end_date:


set autotrace on explain

select count(*) from t1 where calendar_date = month_end_date;

  COUNT(*)
----------
        79

Execution Plan
----------------------------------------------------------
Plan hash value: 3724264953

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |    16 |     3   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE    |      |     1 |    16 |            |          |
|*  2 |   TABLE ACCESS FULL| T1   |     1 |    16 |     3   (0)| 00:00:01 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("CALENDAR_DATE"="MONTH_END_DATE")

Looking at operation 2 we can see that, in effect, the optimizer has considered two independent predicates “calendar_date = {unknown}” and “month_end_date = {unknown}” and taken the lower of the two selectivities – which means the cardinality estimate is 1 because the calendar_date column is unique across this table.

There are various ways to work around problems like this. One of the simplest would be to tell Oracle to sample this table with the (table-level) hint /*+ dynamic_sampling(t1 1) */; in fact, since this predicate is effectively treated as an examination of two predicates the (cursor-level) hint /*+ dynamic_sampling(4) */ would also cause sampling to take place – note that level 4 or higher is required to trigger sampling for “multiple” predicates on a single table. As a general guideline we always try to minimise the risk of side effects so if this problem were embedded in a much larger query I would prefer the table-level hint over the cursor-level hint.

There are other options, though, that would allow you to bypass sampling – provided you can modify the SQL. The script I used to create this table also included the following statement:


alter table t1 add (
        date_offset1 generated always as (calendar_date - month_end_date) virtual,
        date_flag generated always as (case when calendar_date - month_end_date = 0 then 'Y' end) virtual
);

In 12c I would declare these virtual columns to be invisible to avoid problems with any SQL that didn’t use explicit column lists. For demonstration purposes I’ve set up two options – I can find the rows I want with one of two obvious predicates:

    date_offset1 = 0
    date_flag = 'Y'

In fact there’s a third predicate I could use that doesn’t need to know about the virtual columns:

    calendar_date - month_end_date = 0

Unfortunately I can’t arbitrarily swap the order of the two dates in the last predicate, and the optimizer won’t spot that it is also equivalent to “calendar_date = month_end_date”. Here are a few execution plans – for which the only significant bit is the cardinality estimate of the full tablescans:


select count(*) from t1 where date_flag = 'Y';

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |     2 |     4  (25)| 00:00:01 |
|   1 |  SORT AGGREGATE    |      |     1 |     2 |            |          |
|*  2 |   TABLE ACCESS FULL| T1   |    79 |   158 |     4  (25)| 00:00:01 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("DATE_FLAG"='Y')



select count(*) from t1 where date_offset1 = 0;

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |     4 |     3   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE    |      |     1 |     4 |            |          |
|*  2 |   TABLE ACCESS FULL| T1   |    78 |   312 |     3   (0)| 00:00:01 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("DATE_OFFSET1"=0)



select count(*) from t1 where calendar_date - month_end_date = 0;

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |     4 |     3   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE    |      |     1 |     4 |            |          |
|*  2 |   TABLE ACCESS FULL| T1   |    78 |   312 |     3   (0)| 00:00:01 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("T1"."DATE_OFFSET1"=0)

It’s interesting to note that the optimizer has transformed the last predicate into the equivalent virtual column expression to do the arithmetic. You might also note that the date_flag option is slightly more accurate, but that’s because it’s based on an expression which is null for the rows we don’t want while the date_offset1 column has a value for every row and a little bit of arithmetical rounding comes into play. You might also note that there’s a small cost difference – which I’d ascribe to the CPU cost that the optimizer has added for the CASE expression being applied on top of the simple date arithmetic.

Of course, whatever else you might play around with when working around a cardinality problem like this, I think the strategic aim for a data warehouse system would be to get a REAL flag column on the table and populate it at data loading time if month-end dates played an important part in the proceedings – though I have to say that the virtual flag column is one I do rather fancy.

 


Basicfile LOBs

Fri, 2016-08-05 10:12

I got a call to a look at a performance problem involving LOBs a little while ago. The problem was with an overnight batch that had about 40 sessions inserting small LOBs (12KB to 22KB) concurrently, for a total of anything between 100,000 and 1,000,000 LOBs per night. You can appreciate that this would eventually become a very large LOB segment – so before the batch started all LOBs older than one month were deleted.

The LOB column had the following (camouflaged) declaration:

 LOB (little_lob) STORE AS BASICFILE (
        TABLESPACE lob_ts 
        ENABLE STORAGE IN ROW 
        RETENTION
        NOCACHE 
        LOGGING
)

The database was 11gR2, the tablespace was defined with ASSM with uniform 1MB extents and a blocksize of 8KB (so the LOBs were all 2 or 3 chunks) and the undo retention time was 900 seconds. The effect of the “enable storage in row” is that the LOBINDEX didn’t have to hold any details of current LOB chunks (for in-row, the first 12 chunks are listed in the LOB Locator in the base table).

So, examining an AWR report covering the critical interval, reviewing the captured ASH data, and checking the database, a few questions came to mind:

  • With 200 GB of current LOB data in the segment, why was the segment roughly 800GB ?
  • With no need for current LOBs to be indexed, how had the LOB Index reached 500,000 blocks in size ?
  • There had been 500,000 inserts that night – so why had Oracle done 6 Billion (cached) buffer gets on the (nocache) LOB segment ?
  • Given that the LOB Segment had not changed size during the night, why had there been millions of HW enqueue wait on the inserts ?

Knowing the stuff that I did know about basicfile LOBs it seemed likely that the most significant problem was that the segment hadn’t been created with multiple freepools which, according to the very sparse (and not entirely self-consistent) documentation, exist to allow improved concurrency. So I thought I’d search the Internet for any useful information about freepools, how they worked, what impact they might have on this problem, why their absence might produce the symptoms I’d seen, and what the best course of action would be to address the problem.

Of course the “correct” solution according to MoS would be to convert from basicfile to securefile – with a strange insistence on using online redefinition, but no explanation of why a simple CTAS or alter table move is undesirable or dangerous. Unfortunately there are a couple of notes on MoS describing performance issues with “high” levels of concurrent inserts that need to be addressed by setting hidden parameters so I’m not (yet) keen on rebuilding 700GB of a production system to produce a change that might still not work quickly enough; especially since I couldn’t find anything on MoS that could quantify the time needed to do the conversion.

To my surprise I couldn’t find a single useful piece of information about the problem. The only articles I could find seemed to be little bits of cut-n-paste from the Oracle manual pages about using multiple freepools, and the best of those actually demonstrated rebuilding or changing the freepools settings on a LOB of a few megabytes. The most significant MoS note did say that the process “could be slow” and would lock the table. But surely someone, somewhere, must have tried it on a big system and had some idea of “how slow”.

In the end I had to start building some simple models and doing a few experiments to find out what happens and where the time goes and what causes the strange results and – most importantly – how freepools might help. Fortunately, following a call to the Oak Table for any ideas or links to useful documents, Kurt van Meerbeeck directed me to the original Oracle patents which were enormously helpful in showing why freepools could help and why, in the wrong circumstances, you could still end up with a (slightly smaller) disaster on your hands.

 

To be continued …

 

Until I find time to write up the mechanisms I’ve discovered I’m happen to respond to any quick questions about this problem and I’m more than happy to hear about experiences others have had either rebuilding freepools, changing the number of freepools, or migrating from basicfile to securefile LOBs.

 


Adaptive mayhem

Tue, 2016-08-02 10:29

So you run a query and it gives you a plan with a note that says “This is an adaptive plan”.

So you run it again and the plan changes,  with a note that says “Statistics feedback used for this statement”

So you pause to think for a bit, then run the query again and the plan changes, with a note that says “One SQL Directive used, dynamic statistics used”. (You waited too long and the internal re-optimization hints got flushed down into an SQL directive.)

So you decide to think about it the following morning when you’re feeling bright and fresh, and when you run it you get another plan because overnight the automatic stats job gathered stats on the critical table and created a column group that was indicated by the (now defunct) directive.

Happy optimisation!

 


Lost Concatenation

Tue, 2016-07-26 04:46

This note models one feature of a problem that came up at a client site recently from a system running 12.1.0.2 – a possible bug in the way the optimizer handles a multi-column in-list that can lead to extremely bad cardinality estimates.

The original query was a simple three table join which produced a bad plan with extremely bad cardinality estimates; there was, however, a type-mismatch in one of the predicates (of the form “varchar_col = numeric”), and when this design flaw was addressed the plan changed dramatically and produced good cardinality estimates. The analysis of the plan, 10053 trace, and 10046 trace files done in-house suggested that the problem might relate in some way to an error in the handling of SQL Plan Directives to estimate cardinalities.

This was one of my “solve it in a couple of hours over the internet” assignments and I’d been sent a sample of the original query with the 10046 and 10053 trace files, and a modified version of the query that bypassed the problem, again including the 10046 and 10053 trace files, with a request to explain the problem and produce a simple test case to pass to Oracle support.

The first thing I noticed was that there was something very strange about the execution plan. Here’s the query and plan in from my simplified model, showing the same anomaly:


select  /*+ no_expand */
        count(*)
from    t1, t2
where
        t2.shipment_order_id = t1.order_id
and     (t1.id, t2.v1) in ( (5000, 98), (5000, 99))
;

-------------------------------------------------------------------------------------------------------
| Id  | Operation                             | Name  | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                      |       |       |       |       |   331 (100)|          |
|   1 |  SORT AGGREGATE                       |       |     1 |    19 |       |            |          |
|*  2 |   HASH JOIN                           |       |     1 |    19 |  2056K|   331   (5)| 00:00:01 |
|   3 |    TABLE ACCESS FULL                  | T2    |   100K|   878K|       |   219   (3)| 00:00:01 |
|   4 |    TABLE ACCESS BY INDEX ROWID BATCHED| T1    |   100K|   976K|       |     2   (0)| 00:00:01 |
|   5 |     BITMAP CONVERSION TO ROWIDS       |       |       |       |       |            |          |
|   6 |      BITMAP OR                        |       |       |       |       |            |          |
|   7 |       BITMAP CONVERSION FROM ROWIDS   |       |       |       |       |            |          |
|*  8 |        INDEX RANGE SCAN               | T1_PK |       |       |       |     1   (0)| 00:00:01 |
|   9 |       BITMAP CONVERSION FROM ROWIDS   |       |       |       |       |            |          |
|* 10 |        INDEX RANGE SCAN               | T1_PK |       |       |       |     1   (0)| 00:00:01 |
-------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("T2"."SHIPMENT_ORDER_ID"="T1"."ORDER_ID")
       filter((
                  (TO_NUMBER("T2"."V1")=98 AND "T1"."ID"=5000) 
               OR (TO_NUMBER("T2"."V1")=99 AND "T1"."ID"=5000)
       ))
   8 - access("T1"."ID"=5000)
  10 - access("T1"."ID"=5000)

Before going on I meed to remind you that this is modelling a production problem. I had to use a hint to block a transformation that the optimizer wanted to do with my data set and statistics, I’ve got a deliberate type-mismatch in the data definitions, and there’s a simple rewrite of the SQL that would ensure that Oracle does something completely different).

The thing that caught my eye was the use of the bitmap transformation (operations 5,7,9) using exactly the same index range scan twice (operations 8,10). Furthermore, though not visible in the plan, the index in question was (as the name suggests) the primary key index on the table and it was a single column index – and “primary key = constant” should produce an “index unique scan” not a range scan.

Once you’ve added in the fact that operations 8 and 10 are the same “primary key = constant” predicates, you can also pick up on the fact that the cardinality calculation for the table access to table t1 can’t possibly produce more than one row – but it’s reporting a cardinality estimate of 100K rows (which happens to be the number of rows in the table.)

As a final point, you can see that there are no “Notes” about Dynamic Statistics or SQL Directives – this particular issue is not caused by anything to do with 12c sampling. In fact, having created the model, I ran it on 11.2.0.4 and got the same strange bitmap conversion and cardinality estimate. In the case of the client, the first pass the optimizer took went through exactly the same sort of process and produced a plan which was (probably) appropriate for a query where the driving table was going to produce (in their case) an estimated 4 million rows – but not appropriate for the actual 1 row that should have been identified.

In my example, if I allowed concatenation (i.e. removed the no_expand hint) I got the following plan:


------------------------------------------------------------------------------------------------
| Id  | Operation                              | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                       |       |       |       |     8 (100)|          |
|   1 |  SORT AGGREGATE                        |       |     1 |    19 |            |          |
|   2 |   CONCATENATION                        |       |       |       |            |          |
|   3 |    NESTED LOOPS                        |       |     1 |    19 |     4   (0)| 00:00:01 |
|   4 |     TABLE ACCESS BY INDEX ROWID        | T1    |     1 |    10 |     2   (0)| 00:00:01 |
|*  5 |      INDEX UNIQUE SCAN                 | T1_PK |     1 |       |     1   (0)| 00:00:01 |
|*  6 |     TABLE ACCESS BY INDEX ROWID BATCHED| T2    |     1 |     9 |     2   (0)| 00:00:01 |
|*  7 |      INDEX RANGE SCAN                  | T2_I1 |     1 |       |     1   (0)| 00:00:01 |
|   8 |    NESTED LOOPS                        |       |     1 |    19 |     4   (0)| 00:00:01 |
|   9 |     TABLE ACCESS BY INDEX ROWID        | T1    |     1 |    10 |     2   (0)| 00:00:01 |
|* 10 |      INDEX UNIQUE SCAN                 | T1_PK |     1 |       |     1   (0)| 00:00:01 |
|* 11 |     TABLE ACCESS BY INDEX ROWID BATCHED| T2    |     1 |     9 |     2   (0)| 00:00:01 |
|* 12 |      INDEX RANGE SCAN                  | T2_I1 |     1 |       |     1   (0)| 00:00:01 |
------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   5 - access("T1"."ID"=5000)
   6 - filter(TO_NUMBER("T2"."V1")=99)
   7 - access("T2"."SHIPMENT_ORDER_ID"="T1"."ORDER_ID")
  10 - access("T1"."ID"=5000)
  11 - filter((TO_NUMBER("T2"."V1")=98 AND (LNNVL(TO_NUMBER("T2"."V1")=99) OR
              LNNVL("T1"."ID"=5000))))
  12 - access("T2"."SHIPMENT_ORDER_ID"="T1"."ORDER_ID")

This is a much more appropriate plan – and similar to the type of plan the client saw when they eliminated the type-mismatch problem (I got a completely different plan when I used character values ’98’ and ’99’ in the in-list or when I used a numeric column with numeric literals).

Examining my 10053 trace file I found the following:

  • In the BASE STATISTICAL INFORMATION, the optimizer had picked up column statistics about the order_id column, but not about the id column in the in-list – this explained why the cardinality estimate was 100K, Oracle had “lost” the predicate.
  • In the “SINGLE TABLE ACCESS PATH”, the optimizer had acquired the statistics about the id column and calculated the cost of using the t1_pk index to access the table for a single key (AllEqUnique), then calculated the cost of doing a bitmap conversion twice (remember we have two entries in the in-list – it looks like the optimizer has “rediscovered” the predicate). But it had still kept the table cardinality of 4M.

After coming up with a bad plan thanks to this basic cardinality error, the 10053 trace file for the client’s query then went on to consider or-expansion (concatenation). Looking at this part of their trace file I could see that the BASE STATISTICAL INFORMATION now included the columns relevant to the in-list and the SINGLE TABLE ACCESS PATH cardinalities were suitable. Moreover when we got to the GENERAL PLANS the join to the second table in the join order showed a very sensible cost and cardinality – unfortunately, having been sensible up to that point, the optimizer then decided that an SQL Plan Directive should be used to generate a dynamic sampling query to check the join cardinality and the generated query again “lost” the in-list predicate, resulting in a “corrected” cardinality estimate of 6M instead of a correct cardinality estimate of 1. As usual, this massive over-estimate resulted in Oracle picking the wrong join method with a huge cost for the final join in the client’s query – so the optimizer discarded the or-expansion transformation and ran with the bad bitmap/hash join plan.

Bottom line for the client – we may have seen the same “lose the predicate” bug appearing in two different ways, or we may have seen two different “lose the predicate” bugs – either way a massive over-estimate due to “lost” predicates during cardinality calculations resulted in Oracle picking a very bad plan.

Footnote:

If you want to do further testing on the model, here’s the code to generate the data:


create table t1
nologging
as
with generator as (
        select  rownum id
        from    dual
        connect by
                level <= 1e4
)
select
        rownum                                  id,
        rownum                                  order_id,
        rpad('x',100)                           padding
from
        generator, generator
where
        rownum <= 1e5
;

execute dbms_stats.gather_table_stats(user,'t1')

alter table t1 modify order_id not null;
alter table t1 add constraint t1_pk primary key(id);


create table t2
nologging
as
with generator as (
        select  rownum id
        from    dual
        connect by
                level <= 1e4
)
select
        rownum                                  shipment_order_id,
        mod(rownum-1,1000)                      n1,
        cast(mod(rownum-1,1000) as varchar2(6)) v1,
        rpad('x',100)                           padding
from
        generator, generator
where
        rownum <= 1e5
;

execute dbms_stats.gather_table_stats(user,'t2')

alter table t2 modify shipment_order_id not null;
create index t2_i1 on t2(shipment_order_id);

The interesting question now is WHY does Oracle lose the predicate – unfortunately my model may be too simplistic to allow us to work that out, but it might be sufficient to make it easy for an Oracle developer to see what’s going on and how best to address it. There is one bug on MoS (23343961) that might be related in some way, but I wasn’t convinced that the description was really close enough.

Update

This issue is now recorded on MoS as: Bug 24350407 : WRONG CARDINALITY ESTIMATION IN PRESENCE OF BITMAP OR

 


Union All MV

Tue, 2016-07-12 04:10

In an article I wrote last week about Bloom filters disappearing as you changed a SELECT to a (conventional) INSERT/SELECT I suggested using the subquery_pruning() hint to make the optimizer fall back to an older strategy of partition pruning. My example showed this working with a range partitioned table but one of the readers reported a problem when trying to apply the strategy to a composite range/hash partitioned table and followed this up with an execution plan of a select statement with a Bloom filter where the subquery_pruning() hint didn’t introduced subquery pruning when the select was used for an insert.

A couple of standard ways to work around this probelm are to embed the select statement in a pipeline function so that we can “insert into table select from table(pipeline_function)”, or to write a pl/sql block that opens a cursor to do a select with bulk collect and loops through an array insert. The overhead in both cases is likely to be relatively small (especially when compared with the overhead of failing to filter). In this case, however, the reader suggested that maybe the problem appeared because the driving table (i.e. the one that would have been query to derive the pruning values) was actually an inline view with a union all.

After modifying my working model to try a couple of different tests I was inclined to agree. Since the two tables in the view looked as if they were likely to be relatively tiny and static I suggested that it would be safe to create a materialized view defined to “refresh on commit” and then use the materialized view explicitly in the query. This, finally, brings me to the point of today’s article – how do you create such a materialized view ?

I’m going to start by creating a couple of small base tables from a familiar object:


create table tt as select * from all_objects where object_type = 'TABLE';
create table tv as select * from all_objects where object_type = 'VIEW';

alter table tt add constraint tt_pk primary key (object_id);
alter table tv add constraint tv_pk primary key (object_id);

begin
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          =>'TT',
                method_opt       => 'for all columns size 1'
        );
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          =>'TV',
                method_opt       => 'for all columns size 1'
        );
end;
/

Assume, now, that I need an inline view that is interested in the things you will recognise from the above as the tables owned by OUTLN (which will apper in tt) and the views owned by SYSTEM (which will appear in tv) – in the 11.2.0.4 system I’m playing on at the moment that’s three rows from each of the two tables). Here’s the SQL I’d put into the inline view:


select
        object_id, object_type, object_name
from    tt
where   owner = 'OUTLN'
union all
select
        object_id, object_type, object_name
from    tv
where   owner = 'SYSTEM'
;

Since this view won’t give me partition pruning I have to replace it with a table and because I want to ensure that the table is always up to date I have to generate it as the container for a materialized view with refresh on commit. First I need some materialized view logs so that I can do a fast refresh:


create materialized view log on tt
with
        rowid, primary key
        (object_type, object_name, owner)
including new values
;

create materialized view log on tv
with
        rowid, primary key
        (object_type, object_name, owner)
including new values
;

I’ve included the primary key in the definition because I happen to want the object_id column in the log – but I could just have included it as a column in the filter list. I’ve included the rowid in the definition because Oracle needs the rowid if it’s going to be able to do a fast refresh. I can now create a materialized view:


create materialized view mv_t
        build immediate
        refresh fast on commit
as
select
        'T' mv_marker,
        rowid rid,
        object_id, object_type, object_name
from    tt
where   owner = 'OUTLN'
union all
select
        'V' mv_marker,
        rowid rid,
        object_id, object_type, object_name
from    tv
where   owner = 'SYSTEM'
;

I’ve taken the option to “build immediate” and specified – most importantly for my needs – “refresh on commit”. You’ll notice I haven’t chosen to “enable query rewrite”; for the purposes of this demo I don’t need that particular feature.

There are two key features to the materialized view that are a little special – first I’ve included the rowid of each source table as a named column in the materialized view; as I mentioned above Oracle will not allow the view to be fast refreshable without the rowid. The second feature is that I’ve introduced a literal value into the view which I’ve named mv_marker; this makes it easy to see which table a row comes from when you query the materialized view … and Oracle needs to see this.

That’s the job done. Just to demonstrate that my materialized view is working as required here’s a little more SQL (following by the output):


select * from mv_t;

delete from tt where object_name = 'OL$';
update tv set object_name = 'PRODUCT_PRIVILEGES' where object_name = 'PRODUCT_PRIVS';

commit;

select * from mv_t;

=======================================

M RID                 OBJECT_ID OBJECT_TYPE         OBJECT_NAME
- ------------------ ---------- ------------------- --------------------------------
T AAA6tXAAFAAAAEBAAI        471 TABLE               OL$
T AAA6tXAAFAAAAEBAAJ        474 TABLE               OL$HINTS
T AAA6tXAAFAAAAEBAAK        478 TABLE               OL$NODES
V AAA6tWAAFAAAACgABI       8260 VIEW                SCHEDULER_PROGRAM_ARGS
V AAA6tWAAFAAAACgABJ       8261 VIEW                SCHEDULER_JOB_ARGS
V AAA6tWAAFAAAACuAA7      14233 VIEW                PRODUCT_PRIVS

6 rows selected.

2 rows deleted.


1 row updated.


Commit complete.


M RID                 OBJECT_ID OBJECT_TYPE         OBJECT_NAME
- ------------------ ---------- ------------------- --------------------------------
T AAA6tXAAFAAAAEBAAJ        474 TABLE               OL$HINTS
T AAA6tXAAFAAAAEBAAK        478 TABLE               OL$NODES
V AAA6tWAAFAAAACgABI       8260 VIEW                SCHEDULER_PROGRAM_ARGS
V AAA6tWAAFAAAACgABJ       8261 VIEW                SCHEDULER_JOB_ARGS
V AAA6tWAAFAAAACuAA7      14233 VIEW                PRODUCT_PRIVILEGES

5 rows selected.

If you’re wondering why you see “2 rows deleted” but a reduction by just one row in the final output, remember that we’re deleting from table tt but the materialized view holds information about just the subset of tables owned by OUTLN – I happen to have a row in tt that says SYSTEM also owns a table called OL$.

Assistance

If you have trouble working out why your attempts to create a particular materialized view aren’t working the dbms_mview package has a procedure called explain_mview that may give you enough ideas to work out what you’re doing wrong. For example, here’s how I could find out that I needed a literal column to tag the two parts of my union all view:


@$ORACLE_HOME/rdbms/admin/utlxmv.sql

begin
        dbms_mview.explain_mview (
                q'{
                create materialized view mv_t
                        build immediate
                        refresh fast
                        enable query rewrite
                as
                select  -- 'T' mv_marker,
                        rowid rid,
                        object_id, object_type, object_name from tt
                union all
                select  -- 'V' mv_marker,
                        rowid rid,
                        object_id, object_type, object_name from tv
                }'
        );
end;
/

column cap_class noprint
column related_text format a7
column short_msg format a72
break on cap_class skip 1

select
        substr(capability_name,1,3) cap_class,
        capability_name, possible, related_text, substr(msgtxt,1,70) short_msg
from
        mv_capabilities_table
where
        mvname = 'MV_T'
order by
        substr(capability_name,1,3), related_num, seq
;

The first line calls a supplied script to create a table called mv_capabilities_table in the current schema. The call to dbms_mview.explain_mview passes the text of a “create materialized view” statement to the procedure (there are a couple of variations possible) then, after a couple of SQL*Plus formatting commands I’ve queried the table to see Oracle’s analysis for the statement. (You can tag each call to this procedure using a second parameter that I haven’t bothered to use.)

Here’s the output for the failed attempt above, which has commented out the literals that tag the two parts of the UNION ALL:

CAPABILITY_NAME                POS RELATED SHORT_MSG
------------------------------ --- ------- ------------------------------------------------------------------------
PCT_TABLE                      N   TT      relation is not a partitioned table
PCT_TABLE_REWRITE              N   TT      relation is not a partitioned table
PCT_TABLE                      N   TV      relation is not a partitioned table
PCT_TABLE_REWRITE              N   TV      relation is not a partitioned table
PCT                            N

REFRESH_COMPLETE               Y
REFRESH_FAST                   N
REFRESH_FAST_AFTER_INSERT      N           the materialized view does not have a UNION ALL marker column
REFRESH_FAST_AFTER_INSERT      N           set operator in a context not supported for fast refresh
REFRESH_FAST_AFTER_ONETAB_DML  N           see the reason why REFRESH_FAST_AFTER_INSERT is disabled
REFRESH_FAST_AFTER_ANY_DML     N           see the reason why REFRESH_FAST_AFTER_ONETAB_DML is disabled
REFRESH_FAST_PCT               N           PCT FAST REFRESH is not possible if query has set operand query blocks

REWRITE                        Y
REWRITE_FULL_TEXT_MATCH        Y
REWRITE_PARTIAL_TEXT_MATCH     N           set operator encountered in mv
REWRITE_GENERAL                N           set operator encountered in mv
REWRITE_PCT                    N           general rewrite is not possible or PCT is not possible on any of the d


17 rows selected.

The query manages to split the output into three sections (but that depends on a side-effect in a way that I would normally call bad design): elements relating to “Partition Change Tracking”, elements relating to “Materialized View Refresh” and elements relating to “Query Rewrite”. You’ll notice that the rewrite section tells me that (even though I haven’t chosen to enable it) my view could be enabled to do query rewrite.

Critically, though, this version of the materialized view can’t be fast refreshed, and we see the key reason in the first “Refresh fast after insert” line: “the materialized view does not have a UNION ALL marker column”. That’s how I know I have to include a literal column that has a different value in each of the two parts of the UNION ALL.


DML and Bloom

Fri, 2016-07-08 07:01

One of the comments on my recent posting about “Why use pl/sql bulk strategies over simple SQL” pointed out that it’s not just distributed queries that can change plans dramatically when you change from a simple select to “insert into … select …”; there’s a similar problem with queries that use Bloom filters – the filter disappears when you change from the query to the DML.

This seemed a little bizarre, so I did a quick search on MoS (using the terms “insert select Bloom Filter”) to check for known bugs and then tried to run up a quick demo. Here’s a summary of the related bugs that I found through my first simple search:

Bug 20112932 : BLOOM FILTER IS NOT CHOOSED WHEN EXECUTING INSERT SELECT STATEMENT
11.2.0.4 - Nov 2014: "not a bug".
This cites three more documents:
  Bug 13801198  BLOOM PRUNING/FILTER NOT USED IN DML STATEMENTS 
    11.2.0.3 - March 2012 - Duplicate, base bug: 18949550
    18949550: MERGE QUERY NOT USING BLOOM FILTER
    11.2.0.3 - June 2014: fixed in 12.2 (no patches yet)

  Bug 14325392  BLOOM FILTER NOT SUPPORTED IN PARTITION WISE JOIN 
    11.2.0.3 - July 2012 - fixed in 12.1 
    (but I think this is a PX filter, not a partition filter)

  Bug 17716301  BLOOM FILTER NOT USED FOR OUTER JOIN</pre>

    Aug 2014 - hidden bug note. Patches available for 11.2.0.3
    Note 1919508.1 describes the problem.

And here’s some code to demonstrate the problem – running 11.2.0.4 (with similar results on 12.1.0.2).


rem
rem     Script:         dml_bloom_bug.sql
rem     Author:         Jonathan Lewis
rem     Dated:          June 2016
rem

create table pt_hash (
        id,
        grp,
        small_vc,
        padding
)
nologging
pctfree 90 pctused 10
partition by hash(grp)
(
        partition p1,
        partition p2,
        partition p3,
        partition p4
)
as
with generator as (
        select  --+ materialize
                rownum id
        from dual
        connect by
                level <= 1e4
)
select
        rownum                          id,
        trunc(rownum/50)                grp,
        to_char(trunc(rownum/20))       small_vc,
        rpad('x',100)                   padding
from
        generator, generator
where
        rownum <= 1e5
;

create table t1
as
select
        rownum  id,
        rownum  n1,
        lpad(rownum,10) v1,
        rpad('x',100) padding
from
        dual
connect by
        rownum <= 4
;

create table target (
        n1      number,
        id      number
)
;

-- gather stats on t1 and pt_hash (for all columns size 1, otherwise default)

There’s nothing fancy in this code, just a hash partitioned table with 100,000 rows (and a lot of wasted space), and a “driving” table with 4 rows that I can use in a very simple join to the partitioned table. I’ve also got an empty table ready for an insert statement.

So here’s a query with its execution plan (with rowsource execution statistics) that (based on the content of the t1 table) should select from at most two of the four partitions. I’ve hinted a hash join which is where Bloom filtering can be very effective with partitioned table joins:


set serveroutput off
alter session set statistics_level = all;

select
        /*+
                leading(t1 h1)
                use_hash(h1)    no_swap_join_inputs(h1)
        */
        t1.n1,
        h1.id
from
        t1,
        pt_hash h1
where
        t1.id between 2 and 3
and     h1.grp = t1.n1
and     h1.id <= 100
;

select * from table(dbms_xplan.display_cursor(null,null,'outline allstats last cost partition'));

------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                   | Name    | Starts | E-Rows | Cost (%CPU)| Pstart| Pstop | A-Rows |   A-Time   | Buffers | Reads  |  OMem |  1Mem | Used-Mem |
------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |         |      1 |        |  2115 (100)|       |       |      1 |00:00:00.16 |    8374 |   8367 |       |       |          |
|*  1 |  HASH JOIN                  |         |      1 |      3 |  2115   (2)|       |       |      1 |00:00:00.16 |    8374 |   8367 |  2440K|  2440K|  777K (0)|
|   2 |   PART JOIN FILTER CREATE   | :BF0000 |      1 |      3 |     2   (0)|       |       |      2 |00:00:00.01 |       2 |      0 |       |       |          |
|*  3 |    TABLE ACCESS FULL        | T1      |      1 |      3 |     2   (0)|       |       |      2 |00:00:00.01 |       2 |      0 |       |       |          |
|   4 |   PARTITION HASH JOIN-FILTER|         |      1 |    100 |  2113   (2)|:BF0000|:BF0000|     51 |00:00:00.16 |    8372 |   8367 |       |       |          |
|*  5 |    TABLE ACCESS FULL        | PT_HASH |      2 |    100 |  2113   (2)|:BF0000|:BF0000|     51 |00:00:00.16 |    8372 |   8367 |       |       |          |
------------------------------------------------------------------------------------------------------------------------------------------------------------------

-- Predicates and Outline redacted

The most significant points to notice are:

  • Operation 2 – Part Join Filter Created: we’ve created a Bloom filter
  • Operation 4 – Partition Hash Join – Filter: we’ve used the Bloom filter, Pstart/Pstop = :BF0000
  • Operations 5 – Table Access Full: starts twice, the two partitions we need

And now we use the same SQL statement but insert the result set into the holding table:

insert into target(n1, id)
select
        /*+
                ordered
                use_hash(h1)    no_swap_join_inputs(h1)
        */
        t1.id,
        h1.id
from
        t1,
        pt_hash h1
where
        t1.id between 2 and 3
and     h1.grp = t1.n1
and     h1.id <= 100
;

select * from table(dbms_xplan.display_cursor(null,null,'outline allstats last cost partition'));

------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                | Name    | Starts | E-Rows | Cost (%CPU)| Pstart| Pstop | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | INSERT STATEMENT         |         |      1 |        |  2115 (100)|       |       |      0 |00:00:00.02 |   16698 |       |       |          |
|   1 |  LOAD TABLE CONVENTIONAL |         |      1 |        |            |       |       |      0 |00:00:00.02 |   16698 |       |       |          |
|*  2 |   HASH JOIN              |         |      1 |      3 |  2115   (2)|       |       |      1 |00:00:00.02 |   16689 |  1969K|  1969K|  723K (0)|
|*  3 |    TABLE ACCESS FULL     | T1      |      1 |      3 |     2   (0)|       |       |      2 |00:00:00.01 |       2 |       |       |          |
|   4 |    PARTITION HASH ALL    |         |      1 |    100 |  2113   (2)|     1 |     4 |    100 |00:00:00.02 |   16687 |       |       |          |
|*  5 |     TABLE ACCESS FULL    | PT_HASH |      4 |    100 |  2113   (2)|     1 |     4 |    100 |00:00:00.02 |   16687 |       |       |          |
------------------------------------------------------------------------------------------------------------------------------------------------------

-- Predicate and Outline sections redacted

In this case we don’t see any operation to create a Bloom filter, and operation 5 reports 4 starts, one for each partition, with the Pstart/Pstop showing 1 – 4. As a cross-check you can also see that the buffers visited has doubled and the time (though don’t trust it too much on such a small test) has increased.

To my surprise the outline information for the first plan didn’t report a Bloom filter hint (px_join_filter), but I tried adding one to the insert statement anyway – and nothing changed; however there is a “subquery pruning” mechanism that’s been available to the optimizer for a very long time now so I decided to try hinting that for both queries /*+ subquery_pruning(@sel$1 h1@sel$1 partition) */. This had no effect on the select statement, but produced the following plan for the insert:


-------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                 | Name    | Starts | E-Rows | Cost (%CPU)| Pstart| Pstop | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
-------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | INSERT STATEMENT          |         |      1 |        |  2115 (100)|       |       |      0 |00:00:00.01 |    8390 |       |       |          |
|   1 |  LOAD TABLE CONVENTIONAL  |         |      1 |        |            |       |       |      0 |00:00:00.01 |    8390 |       |       |          |
|*  2 |   HASH JOIN               |         |      1 |      3 |  2115   (2)|       |       |      1 |00:00:00.01 |    8381 |  1969K|  1969K|  829K (0)|
|*  3 |    TABLE ACCESS FULL      | T1      |      1 |      3 |     2   (0)|       |       |      2 |00:00:00.01 |       2 |       |       |          |
|   4 |    PARTITION HASH SUBQUERY|         |      1 |    100 |  2113   (2)|KEY(SQ)|KEY(SQ)|     51 |00:00:00.01 |    8379 |       |       |          |
|*  5 |     TABLE ACCESS FULL     | PT_HASH |      2 |    100 |  2113   (2)|KEY(SQ)|KEY(SQ)|     51 |00:00:00.01 |    8377 |       |       |          |
-------------------------------------------------------------------------------------------------------------------------------------------------------

As you can see operation 4 now reports Partition Hash Subquery – Oracle has “pre-queried” the driving table (you can see the statement in the 10046 trace file if you enable SQL trace) to discover which partitions it will have to visit – and operation 5 starts twice with a Pstart/Pstop of KEY(SQ).

Interestingly the hint didn’t end up in the Outline section of the execution plan, so this is an example of a query which you can’t fix by capturing the SQL Plan Baseline for one statement against the text for another – though you can hack the hint into an SQL Profile or (my preference in this case, I think) an SQL Patch.

Footnote:

After publishing this note I started to browse through the comments on the article I’d linked to reference subquery pruning, and discovered that an example of this insert/select issue appeared there with a surprising workaround, which was to use the “multi-table insert” (with just a single table). At the moment, though I haven’t managed to get this working with this example.

Footnote 2:

I’ve just been reminded by email of another detail mentioned in the comments of the earlier post – this problem isn’t present for direct path (i.e. /*+ append */) inserts – and I have run a quick test on 12.1.0.2 to check that this is indeed the case (though the fix isn’t present in 11.2.0.4).  Don’t forget to check the listed Bug notes on MoS, in case there’s a back-port available for your version of Oracle.

 


Invisible Bug

Thu, 2016-07-07 11:27

At this Wednesday’s Oracle Midlands event someone asked me if Oracle would use the statistics on invisible indexes for the index sanity check. I answered that there had been a bug in the very early days of invisible indexes when the distinct_key statistic on the index could be used even though the index itself would not be considered as a candidate in the plan (and the invisible index is still used to avoid foreign key locking – even in 12c – it’s only supposed to be invisible to the optimizer).

The bug was fixed quite a long time ago – but a comment on the “Index Sanity” article has introduced me to a related bug that is still present in 11.2.0.4 where the presence of an invisible index can affect an execution plan. Here’s a little model (run under 11.2.0.4) to demonstrate:

rem
rem     Script:         invisible_index_bug.sql
rem     Author:         Jonathan Lewis
rem

execute dbms_random.seed(0)

drop table t2;
drop table t1;

create table t1
nologging
as
with generator as (
        select  --+ materialize
                rownum id
        from dual
        connect by
                level <= 1e4
)
select
        cast(rownum as number(8,0))                     id,
        cast(mod(rownum,1000) as number(8,0))           n1,
        cast(lpad(rownum,10,'0') as varchar2(10))       v1,
        cast(lpad('x',100,'x') as varchar2(100))        padding
from
        generator       v1,
        generator       v2
where
        rownum <= 1e6
;

create table t2
as
select
        rownum id,
        trunc(dbms_random.value(0,10000)) n1
from
        dual
connect by
        level <= 100
;
begin 
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          =>'T1',
                method_opt       => 'for all columns size 1'
        );
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          =>'T2',
                method_opt       => 'for all columns size 1'
        );
end;
/

column n1 new_value m_n1
select n1 from t2 where id = 50;
clear columns

set autotrace traceonly explain

select
        t1.*
from
        t1, t2
where
        t2.n1 = &m_n1
;

create unique index t2_i1 on t2(n1)
-- invisible
;

select
        t1.*
from
        t1, t2
where
        t2.n1 = &m_n1
;

set autotrace off

All I’ve done is create a couple of tables then do a join that we might expect to see executed as a cartesian merge join; at one point I was going to make the data more complicated and include a join condition, but decided to keep things small and simple so it’s a silly example but it is sufficient to make the point. The funny little bit about selecting an n1 value from t2 was also in anticipation of a more complex example but it does, at least, ensure I query for a value that is in range.

Here are the two execution plans from 11.2.0.4 – the key feature is that the plan changes after the invisible index is created:


-----------------------------------------------------------------------------
| Id  | Operation            | Name | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |      |  1000K|   119M|  2263   (3)| 00:00:12 |
|   1 |  MERGE JOIN CARTESIAN|      |  1000K|   119M|  2263   (3)| 00:00:12 |
|*  2 |   TABLE ACCESS FULL  | T2   |     1 |     4 |     2   (0)| 00:00:01 |
|   3 |   BUFFER SORT        |      |  1000K|   115M|  2261   (3)| 00:00:12 |
|   4 |    TABLE ACCESS FULL | T1   |  1000K|   115M|  2261   (3)| 00:00:12 |
-----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("T2"."N1"=5308)


---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |  1000K|   119M|  2263   (3)| 00:00:12 |
|   1 |  NESTED LOOPS      |      |  1000K|   119M|  2263   (3)| 00:00:12 |
|*  2 |   TABLE ACCESS FULL| T2   |     1 |     4 |     2   (0)| 00:00:01 |
|   3 |   TABLE ACCESS FULL| T1   |  1000K|   115M|  2261   (3)| 00:00:12 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("T2"."N1"=5308)

Notice how the plan without the invisible index does a “sort” (actually a “buffer sort” so it’s just copying the data into local memory), while the plan with the not quite invisible enough index in place gets away with just a full tablescan. This is bug 16564891, associated with base bug 16544878.

The bug notes say “fixed in 12.2”, but in Oracle 12.1.0.2 the first plan appears in both cases, and we have to make the index visible to get the second plan. (Take note of the need for the “negative” test to prove the point; the fact that the same plan appears for both cases doesn’t, by itself, prove that the bug was fixed, we have to show that the plan would have changed if the bug had still been present).

I believe the problem isn’t the problem of Oracle using the statistics when it shouldn’t; the change appears because in 11g Oracle incorrectly allows itself to see the uniqueness of the index and infer that table t2 is a “single row” table. In 12c the optimizer calculates that there will probably be only one row but that doesn’t stop it choosing the merge join cartesian as the “insurance bet” against having to do more than one tablescan of the t1 table. We can see this difference in the 10053 trace files, the 11g file has an entry for the “Single Table Access Path” for t2 that reads:

1-ROW TABLES:  T2[T2]#0

If you read the bug note for bug 16564891 you’ll see that it has a more realistic example of the problem – and it may give you some idea of where you might run into the bug. In general I don’t think many people are likely to come across the problem since it revolves around uniqueness, which is rather an important property, and there can’t be many occasions when someone decides to add (or test dropping) a unique index. Given that the example in the bug looks like “add a unique index to a dimension table that’s joining to a fact table” that may be a good pointer to where you’re most likely to run into the problem — when you’re trying to enforce data correctness in a data warehouse.

 


Centenary

Wed, 2016-07-06 11:02

I rarely blog about anything non-technical but after the events last Friday (1st July) I wanted to say something about the pride that I shared with several hundred parents around the country as they saw the effect their offspring created through a living memorial of the terrible waste of life that happened  a hundred years ago on 1st July 1916 when some 70,000 soldiers (a very large fraction of them British) were killed or injured on the first day of the battle of the Somme.

While a memorial service was being held at Thiepval – a monument to 72,000 British (Empire) soldiers who died in the battle of the Somme but have no known grave – 1,500 “ghosts of the Somme” were silently wending their way in small groups through the streets, shopping centres, and train stations of cities across the UK, pausing to rest from time to time and occasionally bursting into the song of the trenches: “We’re here because we’re here”.

Each “ghost” represented a specific solder killed on the first day of the Somme and if you approached one of them to ask what was going on their only response was to look you in the eye and hand you a card stating their name, rank, regiment and, where known, the age at which they died.

Although many of the posts and tweets about the event mention the collaboration and assistance of various theatre groups around the country almost all of the soldiers were simply people who had responded to an advertisement for Project Octagon and had spent time over the previous 5 weekends rehearsing for the event. My son Simon was one of the volunteers who was on the London beat, starting with the morning commuters at Kings Cross then trekking around London all day – in hobnailed leather boots – to end at Waterloo station for the evening commuters.

After hours of walking this was how he appeared at Waterloo at the end of the day:

IMG_0877

Like me he normally sports a beard and moustache but he’d shaved the beard off and trimmed the moustache to the style of an older era. The absent, dazed, look is in character for the part but also, I think, an indication of exhaustion, both physical and emotional. I wasn’t even sure he’d realised I was crouching six feet away to take this photo until I asked him about it the following day. When I showed the picture to my wife it brought tears to her eyes to think that 100 years ago that might have been the last sight of her son she’d see before he went off to die – it’s a sentiment that appeared more than once on Twitter through the day.

Shortly before 6:00 pm several more groups converged on Waterloo for a final tableau giving us a rendition of “We’re here because we’re here” that ended in an agonised scream:

IMG_0872

It’s a gut-wrenching thought that a group that size would have been killed roughly every 6 minutes, on average, on the first day of the Somme though, realistically, the entire 1,500 that volunteered for the day would probably have died in the first few minutes of the first wave.

Behind the Scenes

There was no announcement of this living memorial so throughout the day people were asking where the soldiers came from and who had organised the event. Finally, at 7:00 in the evening 1418-Now identified themselves as the commissioning body, with Jeremy Deller as the artist in collaboration with Rufus Norris of the National Theatre.

Like any military operation, though, between the generals at the top and the troops at the bottom there was a pyramid of personnel connecting the big picture to the final detail. Under Jeremey Deller and Rufus Norris there was a handful of key characters without whom the day would have been very different. I can’t tell you who they all were but I’m proud to say that one of them was my daughter Anna who, along with a colleague, spent a large fraction of the last 16 months in the role of “Lead Costume Supervisor ” preparing for the day. Under the pair there were several regional costume supervisors, and each costume supervisor was responsible for several dressers who would have to help the volunteers put on the unfamiliar battledress.

Despite working on the project for 16 months Anna told me very little about what was going on until the day was over, and this is a thumbnail sketch (errors and omissions are my fault) of what she’s told me so far.

Amongst other things she selected a list of names from the soldiers who had died on the first day of battle, recording their rank, regiment, battalion and, where known, age. She then had to find out exactly what kit each battalion would have been wearing on the day, allowing for some of the variation that would have appeared within each battalion and catering for the various ranks; then she had to find a supplier who could make the necessary uniforms in a range of sizes that would allow for the variation in the build of the (as yet unknown, unmeasured) volunteers.

As batches of uniforms arrived each one had to be associated with its (historic) owner and supplied with 200 cards with the owner’s details – and it was really important to ensure that the right name was attached to a uniform before the uniforms could be dispatched around the country. Ideally a uniform would arrive at a location and find a volunteer who was the right size to wear it, with the right apparent age to match the card that went with the uniform; inevitably some uniforms had to be moved around the country to match the available volunteers.

The work didn’t stop with the uniforms being in the right place at the right time, of course. There aren’t many people alive who know how to dress in a British Army uniform from 1916 – so Anna and her colleague had to create a number of videos showing the correct way to wear webbing, how to put on puttees, etc. The other problem with the uniforms was that they were brand new – so they had to be “broken down”. That’s a lot of work when you’ve got 1,500 costumes. In part this was carried out by the volunteers themselves who spent some of their rehearsal time wearing the costumes while doing energetic exercises to wear them in and get them a little grubby and sweaty; but it also meant a lot of work for the dressers who were supplied with videos showing them how to rub (the right sort of) dirt into clothes and how to rough them up and wear them down in the right places with wire brushes etc.

One of the bits of the uniform you probably won’t have seen – or even if you saw it you might not have noticed it – was the T-shirt: the army uniform of the day would have been rather sweaty, itchy and uncomfortable on a hot summer’s day, so the soldiers weren’t wearing the real thing. Anna and her colleague designed a T-shirt that looked like the front of the shirt the troops should have worn under their battledress made of a material that was thinner, softer and much more comfortable than the real thing. In the end the day wasn’t as hot as expected so very few volunteers seemed to unbutton their tops – but if they had done so the T-shirts would have appeared to be the real thing.

Walking the Walk.

Apart from the authenticity of the uniforms another major feature of the day was the way that the ghosts made their way around from place to place silently, in single file, with no apparent references to maps (or satnav). Every group had a carefully planned route and timetable and two stage managers wearing brightly coloured backpacks so that they could be seen easily by the soldiers but, since one who walked 50 metres ahead and one 50m behind, were unlikely to be noticed by anyone who wasn’t looking. The stage managers were following carefully planned and timetabled routes – allowing the soldiers to stay in character all the time.

You may have seen pictures of the troops on the various underground trains – that’s just one demonstration of the level of detailed planning that went into the day. With a tight timetable of action and previous communications to station masters and other public officials to ensure that there would be no hold-ups at critical points the lead stage manager could (for example) get to a station guard to warn them of the imminent arrival of a squad, show them the necessary travel cards, and get the gate held open for them. No need for WW1 ghosts to break character by fumbling for electronic travel cards, just a silent parade through an open gate.

Just as Anna was the Lead Costumer Supervisor, there was a Lead Stage Manager with a cascade of local route masters beneath her. She was based in Birmingham and was responsible for working out how to make the timetabling and routing possible, using her home town as the test-bed for the approach, then briefing the regional organizers who applied the methods to prepare routes and handle logistics for their own locations.

End Results

To the people of London and Manchester and Belfast and Swansea and Penzance and Shetland and half a dozen places around the UK, it just happened: hundreds of ghosts of the past appeared and wandered among us. The uniforms were “real”, the journeys from place to place were “spontaneous” and “unguided”, and the ghosts were haunting. To most of us “it just happened” but behind the scenes the effort involved in preparation, and the attention to detail was enormous.

Between the “headline” names at the top of the pyramid and the highly visible troops on the ground at the bottom, it took the coordinated efforts of about 500 people to turn a wonderful idea into a reality that moved millions of people in their daily lives.

 

If you want to see more images and comments about the day you can follow the hashtag #wearehere and there is a collection of instagram images at http://becausewearehere.co.uk/  and if you’re in the London area on 11th July and want to hear more about the instigation and implementation of the day there’s an evening event on at the National Theatre on Monday 11th July featuring Jenny Waldman, Jeremy Deller and Rufus Norris discussing the event with a BBC correspondent.

 

 


Index Sanity

Tue, 2016-06-28 02:43

By popular demand (well, one person emailed me to ask for it) I’m going to publish the source code for a little demo I’ve been giving since the beginning of the millenium – it concerns indexes and the potential side effects that you can get when you drop an index that you’re not using. I think I’ve mentioned the effect several times in the history of this blog, but I can’t find an explicit piece of demo code, so here it is – starting at the conclusion – as a cut and paste from an SQL*Plus session running against an 11g instance:


SQL> set autotrace traceonly explain
select
        t1.small_vc, t2.small_vc, t3.small_vc
from
        t1, t2, t3
where
        t1.n1 between 40 and 50
and     t2.id1 = t1.id1
and     t2.ind_pad = t1.ind_pad
and     t2.id2 = t1.id2
and     t3.id = t1.id1
 11  ;

Execution Plan
----------------------------------------------------------
Plan hash value: 1184213596

----------------------------------------------------------------------------
| Id  | Operation           | Name | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |      |   484 | 64856 |   227   (2)| 00:00:02 |
|*  1 |  HASH JOIN          |      |   484 | 64856 |   227   (2)| 00:00:02 |
|*  2 |   HASH JOIN         |      |   484 | 57596 |    14   (0)| 00:00:01 |
|   3 |    TABLE ACCESS FULL| T2   |    20 |  1160 |     4   (0)| 00:00:01 |
|*  4 |    TABLE ACCESS FULL| T1   |   484 | 29524 |    10   (0)| 00:00:01 |
|   5 |   TABLE ACCESS FULL | T3   |  5000 | 75000 |   213   (2)| 00:00:02 |
----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - access("T3"."ID"="T1"."ID1")
   2 - access("T2"."ID1"="T1"."ID1" AND "T2"."IND_PAD"="T1"."IND_PAD"
              AND "T2"."ID2"="T1"."ID2")
   4 - filter("T1"."N1"<=50 AND "T1"."N1">=40)

SQL> drop index t2_i1;

Index dropped.

select
        t1.small_vc, t2.small_vc, t3.small_vc
from
        t1, t2, t3
where
        t1.n1 between 40 and 50
and     t2.id1 = t1.id1
and     t2.ind_pad = t1.ind_pad
and     t2.id2 = t1.id2
and     t3.id = t1.id1
 11  ;

Execution Plan
----------------------------------------------------------
Plan hash value: 2290830436

--------------------------------------------------------------------------------------
| Id  | Operation                    | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |       |    52 |  6968 |    67   (2)| 00:00:01 |
|   1 |  NESTED LOOPS                |       |    52 |  6968 |    67   (2)| 00:00:01 |
|   2 |   NESTED LOOPS               |       |    52 |  6968 |    67   (2)| 00:00:01 |
|*  3 |    HASH JOIN                 |       |    52 |  6188 |    14   (0)| 00:00:01 |
|   4 |     TABLE ACCESS FULL        | T2    |    20 |  1160 |     4   (0)| 00:00:01 |
|*  5 |     TABLE ACCESS FULL        | T1    |   484 | 29524 |    10   (0)| 00:00:01 |
|*  6 |    INDEX UNIQUE SCAN         | T3_PK |     1 |       |     0   (0)| 00:00:01 |
|   7 |   TABLE ACCESS BY INDEX ROWID| T3    |     1 |    15 |     1   (0)| 00:00:01 |
--------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - access("T2"."ID1"="T1"."ID1" AND "T2"."IND_PAD"="T1"."IND_PAD" AND
              "T2"."ID2"="T1"."ID2")
   5 - filter("T1"."N1"<=50 AND "T1"."N1">=40)
   6 - access("T3"."ID"="T1"."ID1")

Starting from the top – I’ve enabled autotrace which, technically, could mean that the plans are not the ones I’d see at run-time, but you can take my word for it that in 11g they are the run-time plans; then I’ve supplied a query that produces a plan with 3 full tablescans, two hash joins, and no index usage at all.

You’ll notice at operation 3 of the plan that table t2 is very small – only 20 rows selected, with no predicates that could have filtered that result down from a large table (take my word for it the stats have just been collected) so, as the ancient mythology would have it, we don’t really need an index on that table (a quick check tells me that the index wasn’t there to enforce uniqueness). Immediately after the first execution plan you can see that I’ve dropped an index called t2_i1 – trust me that IS the index on table t2.

We “run” the original query again, it gets re-optimised (and there’s no question of cardinality feedback or any other feature coming into play) and we get a different plan.

Dropping, or adding, a multi-column index to a table could change execution plan – even if the index is not present in the plan.

The reason for this is the “index sanity check”. When the optimizer is doing its cardinality estimates, if it see equality conditions on the set of columns that make up an index it can use the distinct_keys statistic from the index in the calculation rather than using the standard calculation of multiplying together the num_distinct of the separate columns. In earlier versions of Oracle there were some restrictions about uniqueness, but the limitations were removed in 11.1.0.7.

In my case there were 10 distinct values for id1, just one value for ind_pad, and 20 distinct values for id2 – but a total of only 20 distinct values for the combination. With an index in place on the combination the optimizer used the value 20 in its calculation, in the absence of the index it used the value 200 – that factor of 10 led to a drop in the join cardinality estimate from 484 rows to 52 rows – at which point the optimizer calculations made the next step in the plan change from a hash join to a nested loop join.

If you want to reproduce the demo, here’s the full script – the data isn’t a realistic data set, and I’ve had to use various non-standard settings to make the script as repeatable as possible – I’ve built the data set in a tablespace using an 8KB block size, 1MB uniform extents and manual (freelist) segment space management.


rem
rem     Script:         index_sanity.sql
rem     Author:         Jonathan Lewis
rem

drop table t3;
drop table t2;
drop table t1;

execute dbms_random.seed(0);

begin   
        begin           execute immediate 'purge recyclebin';
        exception       when others then null;
        end; 

        begin
                dbms_stats.set_system_stats('MBRC',16);
                dbms_stats.set_system_stats('MREADTIM',10);
                dbms_stats.set_system_stats('SREADTIM',5);
                dbms_stats.set_system_stats('CPUSPEED',1000);
        exception
                when others then null;
        end;

end;
/

create table t1
as
select
        mod(rownum,10)          id1,
        mod(rownum,20)          id2,
        rpad('x',40,'x')        ind_pad,
        mod(rownum,100)         n1,
        lpad(rownum,10,'0')     small_vc,
        rpad('x',50)            padding
from
        all_objects
where
        rownum  <= 4000
;

create table t2 
pctfree 99
pctused 1
as
select
        mod(rownum,10)          id1,
        mod(rownum,20)          id2,
        rpad('x',40,'x')        ind_pad,
        mod(rownum,100)         n1, 
        lpad(rownum,10,'0')     small_vc,
        rpad('x',200)           padding
from
        all_objects
where
        rownum <= 20
;

create table t3
pctfree 95
pctused 1
as
select
        rownum          id,
        rpad(rownum,10) small_vc,
        rpad('x',100)   padding
from
        all_objects
where
        rownum <= 5000
;
begin
        dbms_stats.gather_table_stats(
                ownname => user,
                tabname => 'T1',
                method_opt => 'for all columns size 1'
        );

        dbms_stats.gather_table_stats(
                ownname => user,
                tabname => 'T2',
                method_opt => 'for all columns size 1'
        );

        dbms_stats.gather_table_stats(
                ownname => user,
                tabname => 'T3',
                method_opt => 'for all columns size 1'
        );

end;
/

create        index t1_i1 on t1(id1, ind_pad, id2) pctfree 91;
create        index t2_i1 on t2(id1, ind_pad, id2) pctfree 91;
alter table t3 add constraint t3_pk primary key (id);

set autotrace traceonly explain

select
        t1.small_vc, t2.small_vc, t3.small_vc
from
        t1, t2, t3
where
        t1.n1 between 40 and 50
and     t2.id1 = t1.id1
and     t2.ind_pad = t1.ind_pad
and     t2.id2 = t1.id2
and     t3.id = t1.id1
;

-- alter index t1_i1 invisible;
-- alter index t2_i1 invisible;

drop index t1_i1;
-- drop index t2_i1;

accept X prompt "Press return to coninue"

select
        t1.small_vc, t2.small_vc, t3.small_vc
from
        t1, t2, t3
where
        t1.n1 between 40 and 50
and     t2.id1 = t1.id1
and     t2.ind_pad = t1.ind_pad
and     t2.id2 = t1.id2
and     t3.id = t1.id1
;

set autotrace off

You’ll notice from the commented lines in the above that the effect appears whether you drop the index or make it invisible, also that there’s a similar index on the t1 table that matches the index on the t2 table – I could get the effect from dropping or making invisible either index.

There is a saving grace in 11g – if I do drop, or make invisible, one of these indexes I can protect myself against the statistical effect by create a column group on the same set of columns, and the num_distinct from the column group would serve the same purpose as the distinct_keys from the index.


Never …

Fri, 2016-06-24 07:15

From time to time a question comes up on OTN that results in someone responding with the mantra: “Never do in PL/SQL that which can be done in plain  SQL”. It’s a theme I’ve mentioned a couple of times before on this blog, most recently with regard to Bryn Llewellyn’s presentation on transforming one table into another and Stew Ashton’s use of Analytic functions to solve a problem that I got stuck with.

Here’s a different question that challenges that mantra. What’s the obvious reason why someone might decide to produce the following code rather than writing a simple “insert into t1 select * from t2;”:


declare

        cursor c1 is
        select * from t2
        ;

        type c1_array is table of c1%rowtype index by binary_integer;
        m_tab c1_array;

begin

        open c1;
        loop
                fetch c1
                bulk collect into m_tab limit 100;

                begin
                        forall i in 1..m_tab.count
                                insert into t1 values m_tab(i);
                exception
                        when others
                                then begin
                                        --  proper exception handling should go here
                                        dbms_output.put_line(m_tab(1).id);
                                        dbms_output.put_line(sqlerrm);
                                end;
                end;

                exit when c1%notfound;

        end loop;
        close c1;
end;
/

There is a very good argument for this approach.


Conditional SQL- 6

Wed, 2016-06-22 08:16

An odd little anomaly showed up on the OTN database forum a few days ago where a query involving a table covered by Oracle Label Security (OLS) seemed to wrap itself into a non-mergeable view when written using traditional Oracle SQL, but allowed for view-merging when accessed through ANSI standard SQL. I don’t know why there’s a difference but it did prompt a thought about non-mergeable views and what I’ve previously called “conditional SQL” – namely SQL which holds a predicate that should have been tested in the client code and not passed to the database engine.

The thought was this – could the database engine decide to do a lot of redundant work if you stuck a silly predicate inside a non-mergeable view: the answer turns out to be yes. Here’s a demonstration I’ve run on 11g and 12c:


rem
rem     Script:         conditional_fail.sql
rem     Author:         Jonathan Lewis
rem     Dated:          June 2016
rem

create table t2
nologging
as
with generator as (
        select  --+ materialize
                rownum id
        from dual
        connect by
                level &lt;= 1e4
)
select
        rownum                  id,
        mod(rownum,1e5)         n1,
        lpad('x',100,'x')       padding
from
        generator       v1,
        generator       v2
where
        rownum &lt;= 1e6
;

create table t1
nologging
as
with generator as (
        select  --+ materialize
                rownum id
        from dual
        connect by
                level &lt;= 1e4
)
select
        rownum                  id,
        mod(rownum,1e3)         n1,
        lpad('x',100,'x')       padding
from
        generator       v1,
        generator       v2
where
        rownum &lt;= 1e5
;
-- gather simple statistics (not needed in 12c) 

begin
        dbms_stats.gather_table_stats(
                ownname          =&gt; user,
                tabname          =&gt; 'T1',
                method_opt       =&gt; 'for all columns size 1'
        );
        dbms_stats.gather_table_stats(
                ownname          =&gt; user,
                tabname          =&gt; 'T2',
                method_opt       =&gt; 'for all columns size 1'
        );
end;
/

create index t2_i1 on t2(id);

variable b1 number
variable b2 number

exec :b1 := 1; :b2 := 0

There’s nothing terribly significant about the data, beyond the fact that I’ve got a “small” table and a “large” table that I can use to encourage the optimizer to do a hash join. I’ve also created a couple of bind variables and set them to values that ensure that we can see that b1 is definitely not smaller than b2. So here’s a simple query – with a mergeable inline view in the first instance which is then hinted to make the view non-mergeable.


select
        t1.n1, count(*), sum(v1.n1)
from
        t1,
        (select t2.id, t2.n1 from t2 where :b1 &lt; :b2) v1
where
        t1.n1 = 0
and     v1.id = t1.id
group by
        t1.n1
;

select
        t1.n1, count(*), sum(v1.n1)
from
        t1,
        (select /*+ no_merge */ t2.id, t2.n1 from t2 where :b1 &lt; :b2) v1
where
        t1.n1 = 0
and     v1.id = t1.id
group by
        t1.n1
;

Clearly, for our values of b1 and b2, the query will not return any data. In fact we can go further and say that the presence of the “bind variable predicate” in the inline view either has no effect on the volume of data returned or it eliminates all the data. But the presence of the no_merge hint makes a difference to how much work Oracle does for the “no data” option. Here are the two plans, pulled from the memory of an 11g instance after enabling rowsource execution statistics – first when the view is mergeable:


-----------------------------------------------------------------------------------------------------
| Id  | Operation                      | Name  | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   |
-----------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT               |       |      1 |        |   508 (100)|      0 |00:00:00.01 |
|   1 |  SORT GROUP BY NOSORT          |       |      1 |      1 |   508   (2)|      0 |00:00:00.01 |
|*  2 |   FILTER                       |       |      1 |        |            |      0 |00:00:00.01 |
|   3 |    NESTED LOOPS                |       |      0 |    100 |   508   (2)|      0 |00:00:00.01 |
|   4 |     NESTED LOOPS               |       |      0 |    100 |   508   (2)|      0 |00:00:00.01 |
|*  5 |      TABLE ACCESS FULL         | T1    |      0 |    100 |   208   (4)|      0 |00:00:00.01 |
|*  6 |      INDEX RANGE SCAN          | T2_I1 |      0 |      1 |     2   (0)|      0 |00:00:00.01 |
|   7 |     TABLE ACCESS BY INDEX ROWID| T2    |      0 |      1 |     3   (0)|      0 |00:00:00.01 |
-----------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter(:B1&lt;:B2)
   5 - filter(&quot;T1&quot;.&quot;N1&quot;=0)
   6 - access(&quot;T2&quot;.&quot;ID&quot;=&quot;T1&quot;.&quot;ID&quot;)

Notice how, despite t2 being the second table in the join, the bind variable predicate has worked its way to the top of the execution plan and execution has terminated after the run-time engine has determined that 1 is not less than zero.

Compare this with the plan when the view is non-mergeable:

------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation              | Name    | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT       |         |      1 |        |  2300 (100)|      0 |00:00:00.01 |    1599 |       |       |          |
|   1 |  SORT GROUP BY NOSORT  |         |      1 |      1 |  2300   (4)|      0 |00:00:00.01 |    1599 |       |       |          |
|*  2 |   HASH JOIN            |         |      1 |    100 |  2300   (4)|      0 |00:00:00.01 |    1599 |  2061K|  2061K| 1109K (0)|
|   3 |    JOIN FILTER CREATE  | :BF0000 |      1 |    100 |   208   (4)|    100 |00:00:00.01 |    1599 |       |       |          |
|*  4 |     TABLE ACCESS FULL  | T1      |      1 |    100 |   208   (4)|    100 |00:00:00.01 |    1599 |       |       |          |
|   5 |    VIEW                |         |      1 |   1000K|  2072   (3)|      0 |00:00:00.01 |       0 |       |       |          |
|*  6 |     FILTER             |         |      1 |        |            |      0 |00:00:00.01 |       0 |       |       |          |
|   7 |      JOIN FILTER USE   | :BF0000 |      0 |   1000K|  2072   (3)|      0 |00:00:00.01 |       0 |       |       |          |
|*  8 |       TABLE ACCESS FULL| T2      |      0 |   1000K|  2072   (3)|      0 |00:00:00.01 |       0 |       |       |          |
------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access(&quot;V1&quot;.&quot;ID&quot;=&quot;T1&quot;.&quot;ID&quot;)
   4 - filter(&quot;T1&quot;.&quot;N1&quot;=0)
   6 - filter(:B1&lt;:B2)
   8 - filter(SYS_OP_BLOOM_FILTER(:BF0000,&quot;T2&quot;.&quot;ID&quot;))

Thanks to the no_merge hint the bind variable predicate has not been promoted to the top of the plan, so the run-time engine has produced a plan that requires it to access data from table t1 before visiting table t2. In fact the optimizer has decided to do a hash join gathering all the relevant data from t1 and building an in-memory hash table before deciding that 1 is greater than zero and terminating the query.

Be careful if you write SQL that compares bind variables (or other pseudo-constants such as calls to sys_context) with bind variables (etc.); you may find that you’ve managed to produce code that forces the optimizer to do work that it could have avoided if only it had found a way of doing that comparison at the earliest possible moment.

These plans were from 11g, but 12c can behave the same way although, with my specific data set, I had to add the no_push_pred() hint to the query to demonstrate the effect of hash join appearing.

Footnote 1

An interesting side effect of this plan is that it has also allowed a Bloom filter to appear in a serial hash join – not something you’d normally expect to see, so I changed the predicate to :b1 > :b2 to see if the no_merge hint combined with the Bloom filter was faster than merging without the Bloom filter. Here are the two plans, pulled from memory. Running this on 11g I had to hint the hash join when I removed the no_merge hint:


Non-mergeable view - Bloom filter appears
------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation              | Name    | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT       |         |      1 |        |  2300 (100)|      1 |00:00:00.12 |   17725 |       |       |          |
|   1 |  SORT GROUP BY NOSORT  |         |      1 |      1 |  2300   (4)|      1 |00:00:00.12 |   17725 |       |       |          |
|*  2 |   HASH JOIN            |         |      1 |    100 |  2300   (4)|    100 |00:00:00.12 |   17725 |  2061K|  2061K| 1106K (0)|
|   3 |    JOIN FILTER CREATE  | :BF0000 |      1 |    100 |   208   (4)|    100 |00:00:00.01 |    1599 |       |       |          |
|*  4 |     TABLE ACCESS FULL  | T1      |      1 |    100 |   208   (4)|    100 |00:00:00.01 |    1599 |       |       |          |
|   5 |    VIEW                |         |      1 |   1000K|  2072   (3)|   1605 |00:00:00.10 |   16126 |       |       |          |
|*  6 |     FILTER             |         |      1 |        |            |   1605 |00:00:00.09 |   16126 |       |       |          |
|   7 |      JOIN FILTER USE   | :BF0000 |      1 |   1000K|  2072   (3)|   1605 |00:00:00.08 |   16126 |       |       |          |
|*  8 |       TABLE ACCESS FULL| T2      |      1 |   1000K|  2072   (3)|   1605 |00:00:00.07 |   16126 |       |       |          |
------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access(&quot;V1&quot;.&quot;ID&quot;=&quot;T1&quot;.&quot;ID&quot;)
   4 - filter(&quot;T1&quot;.&quot;N1&quot;=0)
   6 - filter(:B1&gt;B2)
   8 - filter(SYS_OP_BLOOM_FILTER(:BF0000,&quot;T2&quot;.&quot;ID&quot;))

View merging allowed - no Bloom filter
-------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation            | Name | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
-------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |      |      1 |        |  2300 (100)|      1 |00:00:07.56 |   17725 |       |       |          |
|   1 |  SORT GROUP BY NOSORT|      |      1 |      1 |  2300   (4)|      1 |00:00:07.56 |   17725 |       |       |          |
|*  2 |   FILTER             |      |      1 |        |            |    100 |00:00:07.56 |   17725 |       |       |          |
|*  3 |    HASH JOIN         |      |      1 |    100 |  2300   (4)|    100 |00:00:07.56 |   17725 |  2061K|  2061K| 1446K (0)|
|*  4 |     TABLE ACCESS FULL| T1   |      1 |    100 |   208   (4)|    100 |00:00:00.01 |    1599 |       |       |          |
|   5 |     TABLE ACCESS FULL| T2   |      1 |   1000K|  2072   (3)|   1000K|00:00:01.94 |   16126 |       |       |          |
-------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter(:B1&gt;:B2)
   3 - access(&quot;T2&quot;.&quot;ID&quot;=&quot;T1&quot;.&quot;ID&quot;)
   4 - filter(&quot;T1&quot;.&quot;N1&quot;=0)

Things to note – the Bloom filter eliminated all but 1,605 rows from the tablescan before passing them to the hash join operation to probe the hash table; the run time of the query without filtering was 7.56 seconds (!) compared to 0.12 seconds with the Bloom filter. Fortunately I decided that this was too good to be true BEFORE I published the results and decided to re-run the tests with statistics_level set back to typical and found that most of the difference was CPU time spent on collecting rowsource execution statistics. The query with the Bloom filter was still faster, but only just – the times were more like 0.09 seconds vs. 0.12 seconds.

Footnote 2

The source of the problem on OTN was that as well as using OLS the query in question included a user-defined function. Since you can write a user-defined function that “spies” on the data content and uses (e.g.) dbms_output to write data to the terminal this poses a security risk; if a predicate calling that function executed before the security predicate had been tested then your function could output data that your query shouldn’t be able to report. To avoid this security loophole Oracle restricts the way it merges views (unless you set optimizer_secure_view_merging to false). For a more detailed explanation and demonstration of the issues, see this item on Christian Antognini’s blog.

I still don’t know why the ANSI form of the query managed to bypass this issue, but the predicate with the user-defined function was applied as a filter at the very last step of the plan, so perhaps there was something about the transformation Oracle took to get from ANSI syntax to its internal syntax (with cascading lateral views) that made it possible for the optimizer to recognize and eliminate the security threat efficiently.


Plan Shapes

Mon, 2016-06-20 06:58

There are a number of articles, webinars, and blogs online about how to read execution plans, but many of them seem to stop after the the minimum description of the simplest type of plan, so I thought I’d throw out a brief comment on a couple the slightly more complicated things that are likely to appear fairly commonly because you sometimes find plans with very similar shapes but extremely different interpretation.

First: select with scalar subqueries in the select list (there’s no need to worry about what the table definitions look like):


rem
rem     Script:         plan_shapes.sql
rem     Author:         Jonathan Lewis
rem     Dated:          May 2016
rem

select
        id, n1,
        (select /*+  no_unnest */ max(n1) from t2 where t2.id = t1.n1)  t2_max,
        (select /*+  no_unnest */ max(n1) from t3 where t3.id = t1.n1)  t3_max
from
        t1
where
        id between 100 and 119
;


--------------------------------------------------------------------------------------------------------
| Id  | Operation                            | Name  | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
--------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                     |       |      1 |        |     20 |00:00:00.01 |       8 |
|   1 |  SORT AGGREGATE                      |       |     20 |      1 |     20 |00:00:00.01 |      63 |
|   2 |   TABLE ACCESS BY INDEX ROWID BATCHED| T2    |     20 |      2 |     40 |00:00:00.01 |      63 |
|*  3 |    INDEX RANGE SCAN                  | T2_I1 |     20 |      2 |     40 |00:00:00.01 |      23 |
|   4 |  SORT AGGREGATE                      |       |     20 |      1 |     20 |00:00:00.01 |      83 |
|   5 |   TABLE ACCESS BY INDEX ROWID BATCHED| T3    |     20 |      3 |     60 |00:00:00.01 |      83 |
|*  6 |    INDEX RANGE SCAN                  | T3_I1 |     20 |      3 |     60 |00:00:00.01 |      23 |
|   7 |  TABLE ACCESS BY INDEX ROWID BATCHED | T1    |      1 |     21 |     20 |00:00:00.01 |       8 |
|*  8 |   INDEX RANGE SCAN                   | T1_I1 |      1 |     21 |     20 |00:00:00.01 |       4 |
--------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - access("T2"."ID"=:B1)
   6 - access("T3"."ID"=:B1)
   8 - access("ID">=100 AND "ID"<=119)

We have a select statement, with two columns in the select list generated by (correlated) scalar subqueries.

The overall shape of the plan shows the driving query as the last child plan for the SELECT (operations 7-8). The first and second child plans are the plans for the two scalar subqueries in turn (and the order the sub-plans appear is the order of the scalar subqueries in the select list). In this case the main query returned 20 rows (A-Rows=20), and the scalar subqueries executed 20 times each. There are a few other details we could mention, but the key feature of the plan is that the driver is the last sub-plan.

Second: update with scalar subqueries:


update t1
set
        n1 = (select max(n1) from t2 where t2.id = t1.n1),
        v1 = (select max(v1) from t3 where t3.id = t1.n1)
where
        id between 1000 and 1999
;

---------------------------------------------------------------------------------------------------------
| Id  | Operation                             | Name  | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
---------------------------------------------------------------------------------------------------------
|   0 | UPDATE STATEMENT                      |       |      1 |        |      0 |00:00:00.13 |   10361 |
|   1 |  UPDATE                               | T1    |      1 |        |      0 |00:00:00.13 |   10361 |
|   2 |   TABLE ACCESS BY INDEX ROWID BATCHED | T1    |      1 |   1001 |   1000 |00:00:00.01 |      22 |
|*  3 |    INDEX RANGE SCAN                   | T1_I1 |      1 |   1001 |   1000 |00:00:00.01 |       4 |
|   4 |   SORT AGGREGATE                      |       |    916 |      1 |    916 |00:00:00.04 |    3672 |
|   5 |    TABLE ACCESS BY INDEX ROWID BATCHED| T2    |    916 |      2 |   1832 |00:00:00.03 |    3672 |
|*  6 |     INDEX RANGE SCAN                  | T2_I1 |    916 |      2 |   1832 |00:00:00.01 |    1840 |
|   7 |   SORT AGGREGATE                      |       |    916 |      1 |    916 |00:00:00.05 |    4588 |
|   8 |    TABLE ACCESS BY INDEX ROWID BATCHED| T3    |    916 |      3 |   2748 |00:00:00.04 |    4588 |
|*  9 |     INDEX RANGE SCAN                  | T3_I1 |    916 |      3 |   2748 |00:00:00.01 |    1840 |
---------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - access("ID">=1000 AND "ID"<=1999)
   6 - access("T2"."ID"=:B1)
   9 - access("T3"."ID"=:B1)


In this statement we update two columns by executing (correlated) scalar subqueries. The most important feature of interpreting this plan is that it is the exact opposite of the earlier select statement. In this plan the first subplan is the driving part of the statement – it’s the part of the plan that tells us how to find rows to be updated (and we find 1,000 of them); the 2nd and 3rd sub-plans correspond to the subqueries in the two columns whose value we set. In this case we happen to get some benefit from scalar subquery caching so the two subqueries each run 916 times. (The fact that both subqueries run the same number of times is not a coincidence – the caching benefit is dependent on the value(s) used for driving the subqueries, and that’s the t1.n1 column in both cases.)

Finally: because people do all sorts of complicated things, and it’s easy to get deceived if you read the plan without seeing the SQL, one more example (starting with the plan):


---------------------------------------------------------------------------------------------------------
| Id  | Operation                             | Name  | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
---------------------------------------------------------------------------------------------------------
|   0 | UPDATE STATEMENT                      |       |      1 |        |      0 |00:00:00.14 |   11257 |
|   1 |  UPDATE                               | T1    |      1 |        |      0 |00:00:00.14 |   11257 |
|   2 |   TABLE ACCESS BY INDEX ROWID BATCHED | T1    |      1 |   1001 |   1000 |00:00:00.01 |      22 |
|*  3 |    INDEX RANGE SCAN                   | T1_I1 |      1 |   1001 |   1000 |00:00:00.01 |       4 |
|   4 |   SORT AGGREGATE                      |       |   1103 |      1 |   1103 |00:00:00.06 |    5519 |
|   5 |    TABLE ACCESS BY INDEX ROWID BATCHED| T3    |   1103 |      3 |   3306 |00:00:00.04 |    5519 |
|*  6 |     INDEX RANGE SCAN                  | T3_I1 |   1103 |      3 |   3306 |00:00:00.01 |    2213 |
|   7 |   SORT AGGREGATE                      |       |    916 |      1 |    916 |00:00:00.11 |    9191 |
|   8 |    TABLE ACCESS BY INDEX ROWID BATCHED| T2    |    916 |      2 |   1832 |00:00:00.03 |    3672 |
|*  9 |     INDEX RANGE SCAN                  | T2_I1 |    916 |      2 |   1832 |00:00:00.01 |    1840 |
---------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - access("ID">=1000 AND "ID"<=1999)
   6 - access("T3"."ID"=:B1)
   9 - access("T2"."ID"=:B1)


This plan looks very like the previous plan – it’s an update so the first sub-plan will be identifying the rows to be updated, but then what ?

The easiest assumption – always one worth considering, providing you remember that there are other possibilities – is that we have two subqueries executing for each row we update – but in this example there’s a little clue in the numbers that this isn’what the statement does. The driving query (operations 2 – 3) identifies 1,000 rows, so how can operations 4 – 6 start 1,103 times each ? Something more complicated is going on.

I’ve combined the features of the first two examples. I have a single set subquery, but it contains a scalar subquery in the select list – so operations 4 – 9 are the sub-plan for a single select statement – and a select statement with a scalar subquery in the select list puts the main plan last (operations 7 – 9) and its scalar subquery sub-plan first (operations 4 – 6). Here’s the SQL:


update t1
set
        (n1,v1) = (
                select
                        max(n1),
                        max((select /*+  no_unnest */ max(v1) from t3 where t3.id = t2.n1))
                from    t2
                where   t2.id = t1.n1
        )
where
        id between 1000 and 1999
;

What happens here is that the driving body of the update statement identifies 1,000 rows so the scalar subquery against t2 should execute 1,000 times; thanks to scalar subquery caching, though, it only executes 916 times. Each time it executes it finds 2 row and for each of those rows it executes the scalar subquery against t3 which, fortunately, also benefits from its own scalar subquery caching and so runs only 1,103 times in total

I could go on, of course, with increasingly complex examples – for example scalar subqueries that contain decode() calls with scalar subqueries as their inputs; where clauses which compare the results of scalar subqueries, and so on. It can get quite difficult to see, even with the SQL in front of you, what the plan is trying to tell you so I’ll leave you with one thought: always use the qb_name() hint to name every single subquery so that, if interpretation of the plan gets a bit difficult, you can use the ‘alias’ formatting option in the call to dbms_xplan to connect each table reference in the plan with the query block it came from in the query.

 

 


Pages