Re: DBMS_STATS gives wrong stats for some columns/indexes

From: ddf <oratune_at_msn.com>
Date: Mon, 18 May 2009 11:50:57 -0700 (PDT)
Message-ID: <b1a01671-0d85-4707-95f9-a34e4a653061_at_h23g2000vbc.googlegroups.com>



Comments embedded.

On May 17, 7:49 pm, ca111..._at_gmail.com wrote:
> Hi all,
>
> I noticed that DBMS_STATS gives wrong statistics for some columns/
> indexes - it looks like they forgot the multiplier to compensate for
> sample size.
>
> Example: a table has approx 3.8 million rows. This is what I get for a
> certain column:
> 5% sample size - 80,000 distinct values
> 10% sample size -149,000 distinct values
> 20% sample size - 286,000 distinct values
> 50% sample size - 704,000 distinct values
> 100% sample size - 1,357,000 distinct values
>
> DBMS_STATS.GATHER_TABLE_STATS uses SELECT ... SAMPLE, it is easy to
> see by enabling tracing before running DBMS_STATS.
>

Sorry, I don't see it in the trace f9le I've generated -- could you post the entry from your trace file or tkprof output showing this action?

> Result of SELECT ... SAMPLE needs to be multiplied by 1/(sample size)
> to get
> the actual value.

Why? SAMPLE is not guaranteed to return x% of the rows of a table:

"sample_percent
For sample_percent, specify the percentage of the total row or block count to be included in the sample. The value must be in the range . 000001 to, but not including, 100. This percentage indicates the probability of each row, or each cluster of rows in the case of block sampling, being selected as part of the sample. It does not mean that the database will retrieve exactly sample_percent of the rows of table."

> For example, if a table has 1,000,000 rows and you
> run
> SELECT COUNT(*) SAMPLE 5% then it will return 50,000.

Not necessarily; consecutive executions of

select owner, object_name
from myobjects
sample(5);

produced the following record counts:

3490 rows selected.
3419 rows selected.
3336 rows selected.
3408 rows selected.
3397 rows selected.
3367 rows selected.
3423 rows selected.
3488 rows selected.
3500 rows selected.
3445 rows selected.
3403 rows selected.
3430 rows selected.
3453 rows selected.
3371 rows selected.
3504 rows selected.
3436 rows selected.
3336 rows selected.
3509 rows selected.
3394 rows selected.
3440 rows selected.

The total number of rows in the table:

SQL> select count(*)
  2 from myobjects;

  COUNT(*)


     68454

Five percent of that number is:

SQL> select (count(*))*.05
  2 from myobjects
  3 /

(COUNT(*))*.05


        3422.7

SQL> Out of 20 runs only one returned the expected result (run #7):

3423 rows selected.

The others were close, true, but not close enough to ensure a repeatable result of the multiplication you suggest:

SQL> select 3490*(1/.05) from dual;

3490*(1/.05)


       69800

SQL> select 3419 *(1/.05) from dual;

3419*(1/.05)


       68380

SQL> select 3336 *(1/.05) from dual;

3336*(1/.05)


       66720

SQL> select 3408 *(1/.05) from dual;

3408*(1/.05)


       68160

SQL> select 3397 *(1/.05) from dual;

3397*(1/.05)


       67940

SQL> select 3367 *(1/.05) from dual;

3367*(1/.05)


       67340

SQL> select 3423 *(1/.05) from dual;

3423*(1/.05)


       68460

SQL> select 3488 *(1/.05) from dual;

3488*(1/.05)


       69760

SQL> select 3500 *(1/.05) from dual;

3500*(1/.05)


       70000

SQL> select 3445 *(1/.05) from dual;

3445*(1/.05)


       68900

SQL> select 3403 *(1/.05) from dual;

3403*(1/.05)


       68060

SQL> select 3430 *(1/.05) from dual;

3430*(1/.05)


       68600

SQL> select 3453 *(1/.05) from dual;

3453*(1/.05)


       69060

SQL> select 3371 *(1/.05) from dual;

3371*(1/.05)


       67420

SQL> select 3504 *(1/.05) from dual;

3504*(1/.05)


       70080

SQL> select 3436 *(1/.05) from dual;

3436*(1/.05)


       68720

SQL> select 3336 *(1/.05) from dual;

3336*(1/.05)


       66720

SQL> select 3509 *(1/.05) from dual;

3509*(1/.05)


       70180

SQL> select 3394 *(1/.05) from dual;

3394*(1/.05)


       67880

SQL> select 3440 *(1/.05) from dual;

3440*(1/.05)


       68800

SQL> Some of these results are off by roughly 2.5 percent, admittedly a small amount, but it's small variations in statistics that can swing the optimizer toward one plan over another.

> You need to multiply by 1/(0.05) = 20 to get corret result.
>

Again, why? See above for explanation of how sample_size operates with the SELECT command.

> So it seems to me that in some cases DBMS_STATS "forgets" to
> multiply by 1/(sample size).
>
> If we do that then result will be much close:
> 5% sample size - 80,000x 20 = 1,600,000 distinct values
> 10% sample size -149,000 x 10 = 1,490,000 distinct values
> 20% sample size - 286,000 x 5 = 1,430,000 distinct values
> 50% sample size - 704,000 x 2 = 1,400,00 distinct values
> 100% sample size - 1,357,000 distinct values
>
> This was observed in Oracle 9.2.0.7 and 9.2.0.8 (Sun, AIX)

Thus, with your figures, you're over the actual count by almost 250,000 for your 5% estimate, quite a large number in excess of the actual to base optimizer actions upon.

As mentioned in a prior post please report the actual dbms_stats.gather_table_stats commands you executed to obtain these statistics. And, if possible, provide DDL for a representative table and sample data (or a method to generate such) so others can test this theory on newer releases of Oracle, and possibly on 9.2.0.8 on other platforms.

David Fitzjarrell Received on Mon May 18 2009 - 13:50:57 CDT

Original text of this message