Re: Which one is good in terms of performance

From: Tim Gorman <tim_at_evdbt.com>
Date: Fri, 24 Dec 2004 08:23:17 -0700
Message-ID: <BDF18276.2148D%tim@evdbt.com>

Seema and Wolfgang,
Here is some info I generated recently, to add to the discussion...

My apologies for the length of the post, but feel free to delete if not interested...

Accuracy when calculating table statistics with different sample sizes:
>

These are the cumulated averaged results from comparing statistics gathered using both DBMS_STATS and ANALYZE TABLE against the baseline of performing a COMPUTE using DBMS_STATS. For the first round, I concentrated on table statistics, and ignored column statistics (a.k.a. histograms) by using ³FOR ALL COLUMNS SIZE 1². I tested against four major tables in the Oracle E-Business Suite ERP system, of varying sizes (from small to humongous).

For each of the table statistics gathered for the cost-based optimizer (i.e. the columns NUM_ROWS, BLOCKS, and AVG_ROW_LEN in DBA_TABLES), displayed is the percentage deviation from the baseline of a COMPUTE using DBMS_STATS:

                                                   Nbr of
                                                   tables % Deviation %
Deviation % Deviation
LABEL                                             sampled Nbr of Blks Nbr of

Rows Avg Row Len

-------------------------------------------------- ------- -----------
----------- -----------
DBMS_STATS: COMPUTE, FOR ALL COLUMNS SIZE 1              4       0.00
0.00        0.00
DBMS_STATS: 10.000000%, FOR ALL COLUMNS SIZE 1           4       0.00
0.02        0.00
DBMS_STATS: 05.000000%, FOR ALL COLUMNS SIZE 1           4       0.00
0.03        0.00
DBMS_STATS: 01.000000%, FOR ALL COLUMNS SIZE 1           4       0.00
0.10        0.00
DBMS_STATS: 50.000000%, FOR ALL COLUMNS SIZE 1           4       0.00
0.22        0.00
DBMS_STATS: 10.000000% BLKSAMP, FOR ALL COLUMNS SI       4       0.00
0.15        0.11
DBMS_STATS: 00.010000%, FOR ALL COLUMNS SIZE 1           4       0.00
0.46        0.00
DBMS_STATS: 20.000000%, FOR ALL COLUMNS SIZE 1           4       0.00
0.58        0.00
DBMS_STATS: 05.000000% BLKSAMP, FOR ALL COLUMNS SI       4       0.00
0.91        0.11
DBMS_STATS: 00.000001%, FOR ALL COLUMNS SIZE 1           4       0.00
1.12        0.00
DBMS_STATS: 20.000000% BLKSAMP, FOR ALL COLUMNS SI       4       0.00
1.35        0.00
DBMS_STATS: 00.010000% BLKSAMP, FOR ALL COLUMNS SI       4       0.00
1.47        0.00
DBMS_STATS: 00.000001% BLKSAMP, FOR ALL COLUMNS SI       4       0.00
1.58        0.00
DBMS_STATS: 50.000000% BLKSAMP, FOR ALL COLUMNS SI       4       0.00
2.03        0.00
DBMS_STATS: 01.000000% BLKSAMP, FOR ALL COLUMNS SI       4       0.00
4.61        0.00
ANALYZE: 05.000000%, FOR ALL COLUMNS SIZE 1              4       0.00
0.42        7.56
ANALYZE: COMPUTE, FOR ALL COLUMNS SIZE 1                 4       0.00
0.00        8.34
ANALYZE: 50.000000%, FOR ALL COLUMNS SIZE 1              4       0.00
0.00        8.34
ANALYZE: 20.000000%, FOR ALL COLUMNS SIZE 1              4       0.00
0.62        7.73
ANALYZE: 10.000000%, FOR ALL COLUMNS SIZE 1              4       0.00
1.44        7.15
ANALYZE: 01.000000%, FOR ALL COLUMNS SIZE 1              4       0.00
7.24        8.62
ANALYZE: 00.010000%, FOR ALL COLUMNS SIZE 1              4       0.00
7.24        8.62
ANALYZE: 00.000001%, FOR ALL COLUMNS SIZE 1              4       0.00
7.24        8.62

The sorting in this report is by the summarization of the three percentage columns, from "most accurate" results to "least accurate" results.

A couple things to note about the results shown here:

The value in BLOCKS is always completely accurate, regardless of sample size, because the value is not calculated at all, but is obtained directly from the segment header
The very worst calculations from the DBMS_STATS package are (apparently) more accurate than the very best calculations from ANALYZE command
> * The biggest cause seems to be vastly different results on AVG_ROW_LEN
>> * However, the method of calculating AVG_ROW_LEN by DBMS_STATS is easily >> exposed by performing SQL trace >>> * Method is: select avg(vsize(col-1)) + avg(vsize(col-2)) + ... + >>> avg(vsize(col-N)) from table-name ...
> * The method of calculating AVG_ROW_LEN by the ANALYZE command cannot be
> determined using SQL trace
The worst deviation with the results from DBMS_STATS when calculating NUM_ROWS is 4.61%, while the worst deviation with the results from ANALYZE is almost twice that, at 7.24%
The results from ANALYZE TABLE ... SAMPLE 1 PERCENT, SAMPLE 0.01 PERCENT, and SAMPLE 0.000001 PERCENT are exactly the same, indicating the lowest valid sampling value for the ANALYZE command is 1 PERCENT
> * DBMS_STATS allows 0.000001% sampling
> * DBMS_STATS also permits the use of BLOCK SAMPLING, which ANALYZE does not

Accuracy when calculating index statistics with different sample sizes:

In contrast, where even the very worst results obtained by the smallest sample of table statistics is just 7.24% off from performing a COMPUTE operation, the results from using smaller samples when calculating index statistics get much worse very quickly as the sampling size decreases.

Here is the results of testing against the nine indexes of the FND_CONCURRENT_REQUESTS table, which vary widely in terms of density, data characteristics, etc. The results are being sorted by the summation of the percentage columns, so that the "most accurate" shows first and the "least accurate" shows last...

                               Nbr of                             % Dev'n  %
Dev'n
                              indexes % Dev'n   % Dev'n   % Dev'n Avg Leaf
Avg Data % Dev'n       % Dev'n
LABEL                         sampled  BLevel Leaf Blks Dist Keys Blks/Key

Blks/Key Clu Fac # Idx Entries

------------------------------ ------- ------- --------- --------- --------
-------- ------- -------------
DBMS_STATS: INDEX COMPUTE           9    0.00      0.00      0.00     0.00
0.00   0.00          0.00
DBMS_STATS: INDEX 05.000000%        9    0.00      0.00     22.13    14.85
16.40   0.44          0.00
DBMS_STATS: INDEX 20.000000%        9    0.00      0.00     25.39    24.29
24.22   0.44          0.00
DBMS_STATS: INDEX 10.000000%        9    0.00      0.00     25.56    16.61
33.89   0.44          0.00
DBMS_STATS: INDEX 00.010000%        9    0.00      0.00     27.45    20.13
37.91   0.44          0.00
DBMS_STATS: INDEX 50.000000%        9    0.00      0.00     26.42    21.53
38.65   0.44          0.00
DBMS_STATS: INDEX 01.000000%        9    0.00      0.00     31.17    27.90
28.18   0.44          0.00
DBMS_STATS: INDEX 00.000001%        9    0.00      0.00     27.81    22.46
40.37   0.44          0.00
ANALYZE: INDEX COMPUTE              9    0.00    595.30     0.00    239.88
0.00   0.00          0.00
ANALYZE: INDEX 50.000000%           9    0.00    596.27     23.55   329.31
13.85  17.29         17.60
ANALYZE: INDEX 01.000000%           9    8.33    530.13     32.28   353.86
23.04  32.87         34.08
ANALYZE: INDEX 00.010000%           9    8.33    530.13     32.28   353.86
23.04  32.87         34.08
ANALYZE: INDEX 00.000001%           9    8.33    530.13     32.28   353.86
23.04  32.87         34.08
ANALYZE: INDEX 20.000000%           9    0.00    596.27     32.97   551.43
58.40  19.22         19.51
ANALYZE: INDEX 05.000000%           9    0.00    596.27     40.77   854.46
59.40  24.67         23.23
ANALYZE: INDEX 10.000000%           9    0.00    596.27     40.86   792.72
141.73  29.25         29.07

First of all, there is the huge deviation the calculations of LEAF_BLOCKS by DBMS_STATS and by ANALYZE. Each command (DBMS_STATS or ANALYZE) is consistent, so one must be wrong and the other right. Unfortunately, in this case, it is the DBMS_STATS package that is uniformly wrong on this, because the results produced by the ANALYZE INDEX commands are corroborated by similar results from the ANALYZE INDEX ... VALIDATE STRUCTURE command, the DBMS_SPACE package, and the values stored in the DBA_SEGMENTS view. I searched MetaLink for a bug report related to DBMS_STATS and LEAF_BLOCKS, but could not find one. I intend to retest this against 9.2.0.5 and 10.1.0.3, when I get a chance. I¹ll also test it against 9.2.0.1...

A few more things to note about the results shown here:

DBMS_STATS.GATHER_INDEX_STATS and ANALYZE INDEX produce the exact same results when COMPUTEing
> * except the LEAF_BLOCKS anomaly
>> * which also apparently causes an anomaly with AVG_LEAF_BLOCKS_PER_KEY
the ANALYZE INDEX ... COMPUTE results are corroborated by other utilities, so the ANALYZE INDEX command is more accurate in v9.2.0.4 than the DBMS_STATS.GATHER_INDEX_STATS command
The results from both DBMS_STATS and ANALYZE are wildly worse when performing an estimated sample, instead of compute

That last point is the most important. Nonetheless, since the ANALYZE INDEX ... COMPUTE command is clearly the most accurate, here is the same result data, this time using the results from ANALYZE INDEX .. COMPUTE STATISTICS as the baseline instead of DBMS_STATS:
>

                               Nbr of                             % Dev'n  %
Dev'n
                              indexes % Dev'n   % Dev'n   % Dev'n Avg Leaf
Avg Data % Dev'n       % Dev'n
LABEL                         sampled  BLevel Leaf Blks Dist Keys Blks/Key

Blks/Key Clu Fac # Idx Entries

------------------------------ ------- ------- --------- --------- --------
-------- ------- -------------
ANALYZE: INDEX COMPUTE              9    0.00     0.00      0.00    0.00
0.00    0.00          0.00
ANALYZE: INDEX 50.000000%           9    0.00     0.13     23.55    15.18
13.85   17.29         17.60
DBMS_STATS: INDEX COMPUTE           9    0.00    85.04      0.00    41.30
0.00    0.00          0.00
DBMS_STATS: INDEX 05.000000%        9    0.00    85.04     22.13    38.50
16.40    0.44          0.00
DBMS_STATS: INDEX 20.000000%        9    0.00    85.04     25.39    36.70
24.22    0.44          0.00
ANALYZE: INDEX 01.000000%           9    8.33     8.42     32.28    36.11
23.04   32.87         34.08
ANALYZE: INDEX 00.010000%           9    8.33     8.42     32.28    36.11
23.04   32.87         34.08
ANALYZE: INDEX 00.000001%           9    8.33     8.42     32.28    36.11
23.04   32.87         34.08
DBMS_STATS: INDEX 01.000000%        9    0.00    85.04     31.17    36.24
28.18    0.44          0.00
DBMS_STATS: INDEX 10.000000%        9    0.00    85.04     25.56    38.07
33.89    0.44          0.00
ANALYZE: INDEX 20.000000%           9    0.00     0.13     32.97    53.44
58.40   19.22         19.51
DBMS_STATS: INDEX 50.000000%        9    0.00    85.04     26.42    37.22
38.65    0.44          0.00
DBMS_STATS: INDEX 00.010000%        9    0.00    85.04     27.45    37.53
37.91    0.44          0.00
DBMS_STATS: INDEX 00.000001%        9    0.00    85.04     27.81    37.05
40.37    0.44          0.00
ANALYZE: INDEX 05.000000%           9    0.00     0.13     40.77   107.16
59.40   24.67         23.23
ANALYZE: INDEX 10.000000%           9    0.00     0.13     40.86    95.99
141.73   29.25         29.07

Again, this perspective shows the two types of COMPUTE largely agreeing with each other (except for the LEAF_BLOCKS and the AVG_LEAF_BLOCKS_PER_KEY value derived from it), with all other results showing dramatically less accurate due to decreases in the sample size.

For indexes, clearly computing yields the best results, while a bug in DBMS_STATS appears to be skewing the LEAF_BLOCK results in v9.2.0.4. making the ANALYZE INDEX command the best alternative for that version.

So, from this data, which is admittedly sparse and limited (needs more testing on a wider range of versions and applications), I think that one can conclude:

Perform the gathering of table/column and index statistics separately from one another
> * Gather statistics on tables and columns in one operation using
> DBMS_STATS.GATHER_TABLE_STATS
>> * leave the CASCADE parameter at the default of FALSE >> * use a low ESTIMATE_PERCENT and even use BLOCK_SAMPLE=>TRUE, if necessary
> * Gather statistics on indexes in another operation to either ANALYZE INDEX or
> DBMS_STATS.GATHER_INDEX_STATS
>> * only do COMPUTE
There is a bug in DBMS_STATS.GATHER_INDEX_STATS in 9.2.0.4 regarding the calculation of LEAF_BLOCKS that needs to be explored further...

>From this, I don¹t think that one can conclude that it is better to ANALYZE INDEX ... COMPUTE instead of
DBMS_STATS.GATHER_INDEX_STATS(ESTIMATE_PERCENT=>NULL), because I don¹t know if the anomaly found exists outside of 9.2.0.4...

Hope this helps...

-Tim

on 12/23/04 1:26 PM, Wolfgang Breitling at breitliw_at_centrexcc.com wrote:

> first off, whatever you do, DO NOT use 2) dbms_utility
>
> What I do is:
> a) DO NOT rely on Oracle's "staleness" algorithm but decide myself which
> tables need to be analyzed and at what frequency
> b) DO NOT use "for all indexed columns" but decide myself which columns
> require a histogram and with how many buckets. Those columns might very
> well include non-indexed ones and will certainly not include all indexed
> columns. I have an example where the creation of histograms on indexed
> columns led to a batch job taking an estimated 18+ hours (if we had had
> the patience to let it finish instead of killing it after 6 hours)
> instead of the ~90 seconds without the histograms
> c) use the dbms_stats procedures
> d) use estimate_percent=>dbms_stats.auto_sample-size, cascade=>false,
> method_opt=>'for all columns size 1' (i.e. the defaults for the latter
> two parameters) for tables
> e) follow with gather_index_stats with estimate_percent=>null (i.e. full
> compute) for indexes
> f) follow with gather_table_stats with estimate_percent=> (i.e. full
> compute) for individual column histograms.

--
http://www.freelists.org/webpage/oracle-l

Received on Fri Dec 24 2004 - 09:18:18 CST