Database 12c: automatic generation of extended statistics

articles: 

Why does the CBO get it wrong? Often because it has insufficient information. No matter how often you analyze your tables, if your queries use multi-column predicates, the CBO will mis-calculate the cardinalities. You have to understand your data, and create extended statistics to correlate the columns. How many people do this? Hardly any. Not a problem any more: release 12c can do this for you. If you configure it appropriately.

When discussing query performance, I always say two things: "you must understand your data" and "you must tell Oracle what you know". In this blog, I want to talk about just one aspect of this: correlation between columns.

The problem

Consider this query (working in the SCOTT demonstration schema):
select * from emp where job='SALESMAN' and deptno=30;
We know that there are six people in department 30, and that there are four salesmen. And so does Oracle, almost exactly:

orclz> set autot on exp
orclz> select count(*) from emp where deptno=30;

  COUNT(*)
----------
         6


Execution Plan
----------------------------------------------------------
Plan hash value: 2083865914

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |     3 |     3   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE    |      |     1 |     3 |            |          |
|*  2 |   TABLE ACCESS FULL| EMP  |     6 |    18 |     3   (0)| 00:00:01 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("DEPTNO"=30)

orclz> select count(*) from emp where  job='SALESMAN';

  COUNT(*)
----------
         4


Execution Plan
----------------------------------------------------------
Plan hash value: 2083865914

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |     8 |     3   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE    |      |     1 |     8 |            |          |
|*  2 |   TABLE ACCESS FULL| EMP  |     3 |    24 |     3   (0)| 00:00:01 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("JOB"='SALESMAN')

orclz>

We also now that department 30 is SALES, and that is the department to which all the sales droids belong. But Oracle doesn't know this:
orclz>
orclz> select count(*) from emp where job='SALESMAN' and  deptno=30;

  COUNT(*)
----------
         4


Execution Plan
----------------------------------------------------------
Plan hash value: 2083865914

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |    11 |     3   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE    |      |     1 |    11 |            |          |
|*  2 |   TABLE ACCESS FULL| EMP  |     1 |    11 |     3   (0)| 00:00:01 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("JOB"='SALESMAN' AND "DEPTNO"=30)

orclz>

The cardinality estimate is out by a factor of four. Why? Because Oracle has multiplied the estimated selectivity factors: deptno=30 is 6/14, job='SALESMAN' is 3/14, the result (rounded) is 1/14. Scale this up to the real world, and this is a disaster. In particular it will affect the join order of tables in queries. And we all know (I hope) that if there is any such thing as a silver bullet for SQL tuning, it is to get the join order right.
The 11g solution is to tell Oracle what we know, by creating extended statistics:
orclz> select dbms_stats.create_extended_stats(user,'emp','(deptno,job)') from dual;

DBMS_STATS.CREATE_EXTENDED_STATS(USER,'EMP','(DEPTNO,JOB)')
------------------------------------------------------------------------------------------------------------------------
SYS_STU$7LJEWQEV#_NNT_P4FXAU5K

orclz> set autot on exp
orclz> select count(*) from emp where job='SALESMAN' and  deptno=30;

  COUNT(*)
----------
         4


Execution Plan
----------------------------------------------------------
Plan hash value: 2083865914

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |    19 |     3   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE    |      |     1 |    19 |            |          |
|*  2 |   TABLE ACCESS FULL| EMP  |     4 |    76 |     3   (0)| 00:00:01 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("JOB"='SALESMAN' AND "DEPTNO"=30)

orclz> 

so now Oracle gets it right. But YOU have to understand your data, and YOU have to do the work. Look at your data - there will be any number of correlations: between the parts of an address; between job and salary; between product, season, and location; on how many of these have you created extended stats? In many databases, none.

A 12c solution

12c can automate the process. First, instruct Oracle to monitor the usage of columns in predicates for a period during which time you run your usual workload (for this example, I've re-initialized the SCOTT schema, and monitor for just one minute). Second, generate a report on how columns where used. Third, let Oracle create the extended stats it thinks necessary. Fourth, accept the praise for your phenomenal success at tuning:

orclz>
orclz> exec dbms_stats.seed_col_usage(null,null,60)

PL/SQL procedure successfully completed.

Commit complete.
orclz> select count(*) from emp where job='SALESMAN' and  deptno=30;

  COUNT(*)
----------
         4

orclz> set long 100000
orclz> select dbms_stats.report_col_usage('scott','emp') from dual;

DBMS_STATS.REPORT_COL_USAGE('SCOTT','EMP')
--------------------------------------------------------------------------------
LEGEND:
.......

EQ         : Used in single table EQuality predicate
RANGE      : Used in single table RANGE predicate
LIKE       : Used in single table LIKE predicate
NULL       : Used in single table is (not) NULL predicate
EQ_JOIN    : Used in EQuality JOIN predicate
NONEQ_JOIN : Used in NON EQuality JOIN predicate
FILTER     : Used in single table FILTER predicate
JOIN       : Used in JOIN predicate
GROUP_BY   : Used in GROUP BY expression
...............................................................................

###############################################################################

COLUMN USAGE REPORT FOR SCOTT.EMP
.................................

1. DEPTNO                              : EQ
2. JOB                                 : EQ
3. (JOB, DEPTNO)                       : FILTER
###############################################################################

orclz> select dbms_stats.create_extended_stats('scott','emp') from dual;

DBMS_STATS.CREATE_EXTENDED_STATS('SCOTT','EMP')
--------------------------------------------------------------------------------
###############################################################################

EXTENSIONS FOR SCOTT.EMP
........................

1. (JOB, DEPTNO)                       : SYS_STU3VG629OEYG6FN0EKTGV_HQ6 created
###############################################################################


orclz> set autot on exp
orclz> select count(*) from emp where job='SALESMAN' and  deptno=30;

  COUNT(*)
----------
         4


Execution Plan
----------------------------------------------------------
Plan hash value: 2083865914

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |    19 |     3   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE    |      |     1 |    19 |            |          |
|*  2 |   TABLE ACCESS FULL| EMP  |     4 |    76 |     3   (0)| 00:00:01 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("JOB"='SALESMAN' AND "DEPTNO"=30)

orclz> 

So now Uncle Oracle gets it right - and you don't have to understand the data any more. This really is SQL statement tuning made easy.

All demonstrations done using release 12.1.0.1 on Windows.
--
John Watson
Oracle Certified Master DBA
http://skillbuilders.com

Comments

If you continued testing you would discover that the automatic creation also automatically invalidates any packages that reference the table, effectively creating random invalidations. This then results in ORA-4068 errors and failing applications. Only solution is to disable extended statitics. Hope you haven't made use of this anywhere!

All I can say is that I've been using extended stats (or "column group stats" as some people refer to them) for many years and have not seen this issue. Adding the hidden column will of course invalidate any dependent objects, but there is no reason why they will not recompile on demand.
If you care to provide an example, I'm sure people will find it valuable.