Re: announce: the "Join Over Histograms" paper

From: Alberto Dell'Era <alberto.dellera_at_gmail.com>
Date: Thu, 26 Apr 2007 00:04:20 +0200
Message-ID: <4ef2fbf50704251504m478b3b6btbe6fbb7da57066a9@mail.gmail.com>

On 4/25/07, Thomas Day <tomdaytwo_at_gmail.com> wrote:
> Would I be correct in saying that if you have data that is, or could be,
> foreign keyed you should not generate histograms, since the CBO default will
> produce a better map than the formula?
>
> Especially if your foreign keys are to look-up tables.

For a single-column FK, yes, since in this context the standard formula will give you the *exact* cardinality (provided the statistics are current of course) - hard to beat that.

If you compute the histograms and those histograms end up being Frequency (quite probable for a lookup table), the CBO will give you about *half* the exact cardinality (as illustrated on page 18 of the paper, "Frequency Histograms and the mystery of halving").

It's easy to check that on a scenario with PK/FK constraints as well (not considered in the paper for simplicity); in 10.2.0.3:

create table lookup as select rownum id, 'x'||rownum val from dual connect by level <= 3;
alter table lookup add constraint lookup_pk primary key (id); create table data (id number);
alter table data add constraint data_fk foreign key (id) references lookup (id);

insert into data select 1 from dual connect by level <= 100;
insert into data select 2 from dual connect by level <= 220;
insert into data select 3 from dual connect by level <= 120;

exec dbms_stats.gather_table_stats (user, 'lookup', method_opt=>'for all columns size 1', estimate_percent=>100, cascade=>true); exec dbms_stats.gather_table_stats (user, 'data' , method_opt=>'for all columns size 1', estimate_percent=>100, cascade=>true);

select count(*) from data, lookup where data.id = lookup.id; COUNT(*)

440

set autotrace traceonly explain
select /*+ use_hash (data lookup) */ * from data, lookup where data.id = lookup.id;
set autotrace off

|   0 | SELECT STATEMENT   |        |   440 |
|*  1 |  HASH JOIN         |        |   440 |
|   2 |   TABLE ACCESS FULL| LOOKUP |     3 |
|   3 |   TABLE ACCESS FULL| DATA   |   440 |
---------------------------------------------

Cardinality = 440 - exact.

exec dbms_stats.gather_table_stats (user, 'lookup', method_opt=>'for all columns size 251', estimate_percent=>100, cascade=>true); exec dbms_stats.gather_table_stats (user, 'data' , method_opt=>'for all columns size 254', estimate_percent=>100, cascade=>true);

set autotrace traceonly explain
select /*+ use_hash (data lookup) */ * from data, lookup where data.id = lookup.id;
set autotrace off

|   0 | SELECT STATEMENT   |        |   221 |
|*  1 |  HASH JOIN         |        |   221 |
|   2 |   TABLE ACCESS FULL| LOOKUP |     3 |
|   3 |   TABLE ACCESS FULL| DATA   |   440 |
---------------------------------------------

Cardinality = 221 = 440 / 2 + 1 (due to rounding to 1 of the "not-populars subtable" contrib)

It might be useful to compute the histograms for other reasons, since they may have a positive effect on other statements, but it's worth knowing this strange "halving" effect, especially since it's counterintuitive (I would have expected a better estimate with an histogram, not a 50% underestimation for every lookup table in the join statement).

To get back to the exact estimate, it's enough to erase the histogram on the PK, and the CBO will fall back to the standard formula. As Wolfgang Breitling noticed in his papers, if you use SIZE SKEWONLY or SIZE AUTO you will greatly lessen the probability to compute an histogram on the PK by mistake, and one wouldn't want to compute an histogram on a column with unique values in the first place - but one would also expect, intuitively, that the histogram will be at worst useless, not pejorative; and so one may think about collecting one if e.g. she frequently issues statements such as select ... from lookup where id between 2 and 3.

Have you noticed the "it depends" pattern here ... this strange halving is here to complicate things.

For multi-column FKs to lookup tables, I haven't checked the effect of histograms
(I ran out of steam ... it has been a mammoth investigation). Should not be a much frequent scenario anyway.

> I found your comments about possible bugs (loop overruns and weighing high
> values more than low values) very interesting.

Thanks, appreciate that :) Actually is *ignoring* low values below the first matching one - I think you'll find the scenario on page 19, "Detecting peaks", quite interesting if you haven't read it yet. It could easily make for quite unstable card estimations.

regards
Alberto

-- 
Alberto Dell'Era
"dulce bellum inexpertis"
--
http://www.freelists.org/webpage/oracle-l

Received on Wed Apr 25 2007 - 17:04:20 CDT