Re: A question about huge difference in cardinality of a query with 3 predicates

From: Jonathan Lewis <jonathan_at_jlcomp.demon.co.uk>
Date: Sun, 29 Oct 2017 14:55:06 +0000
Message-ID: <LOXP123MB00865B2A1CBB86B6F6E979D5A5580_at_LOXP123MB0086.GBRP123.PROD.OUTLOOK.COM>


Nirav,

Thanks for posting so much relevant information. There are several possible effects visible here. The first is simply that to the optimizer the selectivity of a

combination of columns is the product of the individual selectivities. This is why you get an estimate of 142 when you drop the histograms:

        143 = 2231714 * (1/3 * 1/50 * 1/104)

Secondly (as Andrew says), if one of your predicates goes out of range, Oracle scales down its selectivity by a measure of how far out of range it is.

Third, when a column has a frequency histogram on it and you ask for a value that doesn't seem to be in the histogram then Oracle uses "half the least popular" value as the selectivity (and then scales that if you're also out of range). You can see that two of your frequency histograms have "num_buckets" less than "num_distinct", so the gather for the histogram must have missed some values (or num_buckets would equal num_distinct). Notice that the sample for the histograms is only 5,548 rows, so with a heavy skew (which I think you must have to get 10,000 rows as the final result) it's not surprising that you've missed a few values somewhere. You may do better if you gather histograms on these columns with a sample size of 100% - but you'd still run into the "combination of columns" problem.

So - you need to create a column group (dbms_stats.create_extended_stats) across all three columns or (possibly) on the two columns that are most closely correlated; and you'll need a histogram on the extended stats or the individual histograms will make the optimizer ignore the benefit of the column group. Even then, unfortunately, you may run into problems because you really need the column group histogram to be a frequency histogram and you may have too many combinations across the three columns to get one (hence my comment about picking two out of three).

Here's a search link to a few notes that may be useful: https://jonathanlewis.wordpress.com/?s=column+group+histogram.

Regards
Jonathan Lewis



From: oracle-l-bounce_at_freelists.org <oracle-l-bounce_at_freelists.org> on behalf of Nirav A Shah <shivam71_at_hotmail.com> Sent: 28 October 2017 16:52:23
To: oracle-l_at_freelists.org
Subject: Re: A question about huge difference in cardinality of a query with 3 predicates

Hi All,

I am having an issue of a query getting a bad plan -which I have narrowed down to major mismatch between estimated and actual cardinality of one of the tables involved in the query. I need

help from experts on understanding why the cardinality estimate is wrong and what can be done to rectify. For this purpose I have created a test table with required columns and tried to

create same situation (of cardinality mismatch) and following are the details on it. (Oracle version is : 11.2.0.4 on Linux.)

Table structure: CREATE TABLE processes

(

   process_id NUMBER (12) NOT NULL PRIMARY KEY,

   clientid VARCHAR2 (20) NOT NULL,

   evt_type_id NUMBER (12) NOT NULL,

   status VARCHAR2 (20) NOT NULL

)

Following are the stats about the table (Thanks to the print_table utility of Tom Kyte Sir - the below is output from "select * from all_tables where table_name='PROCESSES'" from

which several default stuff is remove to keep this a little brief)

OWNER                         : TEST

TABLE_NAME                    : PROCESSES

PCT_FREE                      : 10

NUM_ROWS                      : 2231714

BLOCKS                        : 12137

AVG_ROW_LEN                   : 31

SAMPLE_SIZE                   : 2231714



Here is the query and test data :

variable v_coid varchar2(16);

variable v_status varchar2(16);

variable v_evt_type_id number;

exec :v_coid := 'G31MQQNM99ABCA9V';

exec :v_status := 'INP';

exec :v_event_type_id :=5014;

SQL> SELECT COUNT (*) FROM processes wpi WHERE WPI.CLIENTID = :v_coid AND wpi.status = :v_status AND WPI.evt_type_id = :v_event_type_id;

  COUNT(*)


     10404

1 row selected.

Following is the Execution plsn:

Plan hash value: 3304333532


| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers |


| 0 | SELECT STATEMENT | | 1 | | 1 |00:00:00.08 | 11406 | | 1 | SORT AGGREGATE | | 1 | 1 | 1 |00:00:00.08 | 11406 | |* 2 | TABLE ACCESS FULL| PROCESSES | 1 | 62 | 10404 |00:00:00.08 | 11406 |


Peeked Binds (identified by position):


   1 - (VARCHAR2(30), CSID=873): 'G31MQQNM99ABCA9V'    2 - (VARCHAR2(30), CSID=873): 'INP' PLAN_TABLE_OUTPUT


   3 - (NUMBER): 3004 Predicate Information (identified by operation id):


   2 - filter(("WPI"."CLIENTID"=:V_COID AND "WPI"."STATUS"=:V_STATUS AND

              "WPI"."EVT_TYPE_ID"=:V_EVT_TYPE_ID)) 29 rows selected.

Here we can see the Huge difference - the estimated cardinality of just 62 versus the acutal cardinality of 10404. Note that all these 3 columns involved in the query have frequency histograms

on them and following are details on it...and this is the issue - the difference between estimated cardinality of 62 versus actual value of 10404 that I need help to sort out.

exec print_table('select * from all_tab_col_statistics where owner=''TEST'' AND TABLE_NAME=''PROCESSES''');

OWNER                         : TEST -- same for all the columns below

TABLE_NAME                    : PROCESSES -- same for all the columns below

COLUMN_NAME                   : PROCESS_ID

NUM_DISTINCT                  : 2231714

LOW_VALUE                     : C50D01015C04

HIGH_VALUE                    : C5170B1A5218

DENSITY                       : .000000444209506971868

NUM_NULLS                     : 0

NUM_BUCKETS                   : 254

LAST_ANALYZED                 : 20-oct-2017 04:20:20 --for all columns below too...the stats are current.

SAMPLE_SIZE                   : 5548

AVG_COL_LEN                   : 7

HISTOGRAM                     : HEIGHT BALANCED


-----------------
OWNER : TEST TABLE_NAME : PROCESSES COLUMN_NAME : CLIENTID NUM_DISTINCT : 104 LOW_VALUE : 30324652425132433157563030303936 HIGH_VALUE : 47354D384A4D54394545484A51475859 DENSITY : .000000222104676853238 NUM_NULLS : 0 NUM_BUCKETS : 87 SAMPLE_SIZE : 5548 AVG_COL_LEN : 17 HISTOGRAM : FREQUENCY
-----------------
COLUMN_NAME : EVT_TYPE_ID NUM_DISTINCT : 50 LOW_VALUE : C102 HIGH_VALUE : C21F07 DENSITY : .000000222104676853238 NUM_NULLS : 0 NUM_BUCKETS : 38 SAMPLE_SIZE : 5548 AVG_COL_LEN : 4 HISTOGRAM : FREQUENCY
-----------------
COLUMN_NAME : STATUS NUM_DISTINCT : 3 LOW_VALUE : 434F4D HIGH_VALUE : 494E50 DENSITY : .000000222104676853238 NUM_NULLS : 0 NUM_BUCKETS : 3 SAMPLE_SIZE : 5548 AVG_COL_LEN : 4 HISTOGRAM : FREQUENCY
-----------------

Please let me know if I should provide any other details. One thing I found was that if I do dynamic sampling level 8 then the cardinality estimate improved to nearly 7000 (don't recall that exact figure),but I would not like to depend on it. The other thing is if I drop the histograms then the cardinality improved slightly - to 143 as shown in the plan below but that is still way off the actual value which is more than 10K (10404 to be exact).


| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers |


| 0 | SELECT STATEMENT | | 1 | | 1 |00:00:00.08 | 11406 | | 1 | SORT AGGREGATE | | 1 | 1 | 1 |00:00:00.08 | 11406 | |* 2 | TABLE ACCESS FULL| PROCESSES | 1 | 143 | 10404 |00:00:00.08 | 11406 |


Peeked Binds (identified by position):


   1 - (VARCHAR2(30), CSID=873): 'G31MQQNMH3TW7A9V'    2 - (VARCHAR2(30), CSID=873): 'INP' PLAN_TABLE_OUTPUT


   3 - (NUMBER): 3004 Predicate Information (identified by operation id):


   2 - filter(("WPI"."CLIENTID"=:V_COID AND "WPI"."EVT_TYPE_ID"=:V_EVT_TYPE_ID

              AND "WPI"."STATUS"=:V_STATUS)) Regards, Nirav

--
http://www.freelists.org/webpage/oracle-l
Received on Sun Oct 29 2017 - 15:55:06 CET

Original text of this message