Re: Trying to understand histograms

From: Jonathan Lewis <jonathan_at_jlcomp.demon.co.uk>
Date: Thu, 19 May 2005 18:05:44 +0000 (UTC)
Message-ID: <d6ikho$h5j$1@nwrdmz02.dmz.ncs.ea.ibs-infra.bt.com>

<BigBoote66_at_hotmail.com> wrote in message news:1116518402.335652.136050_at_f14g2000cwb.googlegroups.com...
> Using Oracle 9.2.0.4.0, Solaris 9.
>
>
> Specifically, we have a column that has around 4000 distinct values.
> If you were to construct a frequency histogram of these values, and
> order the columns from largest to smallest, you'd have a graph that
> would resemble a kind of hypotenuse, with a few values having
> relatively large counts, and a very long "tail" of low count values.
> The distribution is such that the most popular 400 entries make up 99%
> of the rows in the table - the remaining 3600 entries appear very
> times. For example, consider, it would look something like this:
>
> VALUE PERCENTAGE OF TABLE OCCUPIED
> ----- ----------------------------
> foo 10%
> bar 5%
> baz 2%
> qux 1%
> corgi .5%
> corgi2 .25%
> corgi3 .1%
> corgi4 .1%
> ...
> penultimate (1 row)
> ultimate (1 row)
>
> The result of this distribution is that if a 255 bucket histogram was
> constructed, only a few (maybe the top 5 or 6) values would actually
> span 2 or more buckets (foo=23 buckets, bar=11 buckets, baz=5 buckets,
> qux=3 buckets, corgi=2 buckets, corg2 & all the rest, 1 bucket). So,
> as far as the histogram was concerned, 5 of the values are "popular"
> (and have a selectivity ranging from 10% to .5%), and all the rest have
> the same distribution, with a selectivity of about 80% * 1/4000 =
> .0002.
>
> My conclusion is that histogram for high cardinality columns are only
> useful for cases where you have a relatively uniform distibution, but
> with a few "popular" values, as opposed to a few hundred "common"
> values with a large number of rarely-used & rarely occuring values.
>
> Is this correct, or am I missing something?
>

There are various other reasons for having histograms that you have not seen - but for your example, your conclusions are about right.

Your optimum solution is to construct a suitable histogram, and set the column stats directly. You may have to rig a set of numbers that only identifies 75 of your most popular values with some degree of discrimination - and then you'll have to set the column density to be 1/400 to catch the rest of the popular values. (Check dbms_stats.set_column_stats)

The trouble is, no matter what Oracle offered, someone would say it wasn't enough. If they could catch 512 values, someone would want 600.

You may want to look into the dynamic_sampling() hint. Since you are clearly on a data warehouse, the overhead of a dynamic check against that particular table may be a price worth paying to get the right indexes used.

-- 
Regards

Jonathan Lewis

http://www.jlcomp.demon.co.uk/faq/ind_faq.html
The Co-operative Oracle Users' FAQ

http://www.jlcomp.demon.co.uk/seminar.html
Public Appearances - schedule updated April 5th 2005

Received on Thu May 19 2005 - 13:05:44 CDT