Oracle FAQ Your Portal to the Oracle Knowledge Grid
HOME | ASK QUESTION | ADD INFO | SEARCH | E-MAIL US
 

Home -> Community -> Usenet -> c.d.o.server -> Trying to understand histograms

Trying to understand histograms

From: <BigBoote66_at_hotmail.com>
Date: 19 May 2005 09:00:02 -0700
Message-ID: <1116518402.335652.136050@f14g2000cwb.googlegroups.com>


Using Oracle 9.2.0.4.0, Solaris 9.

After reading as much as I can about histograms, I'm still confused about histograms on columns that contain many more distinct values than the 255-bucket limit imposed by Oracle (referred to as height-based histograms by the Performance & Tuning Manual). Value-based histograms seem ideal - one bucket per distinct value, with each bucket recording the count for that value - but this obviously only works with relatively low-cardinality columns.
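To illustrate what I mean by a value-based (frequency) histogram, here's a toy sketch in plain Python - nothing Oracle-specific, just the concept of one bucket per distinct value, with equality selectivity being count divided by total:

```python
from collections import Counter

# toy column with a skewed distribution (made-up values)
column = ["foo"] * 10 + ["bar"] * 5 + ["baz"] * 2 + ["qux"]

# a frequency histogram: one bucket per distinct value
freq_hist = Counter(column)
total = len(column)

# selectivity of an equality predicate is simply count / total
sel_foo = freq_hist["foo"] / total
print(freq_hist, sel_foo)
```

With a bucket per value the optimizer's estimate is exact, which is why this only breaks down once the number of distinct values exceeds the bucket limit.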

Specifically, we have a column that has around 4000 distinct values. If you were to construct a frequency histogram of these values and order them by count from largest to smallest, you'd have a graph that would resemble a kind of hypotenuse, with a few values having relatively large counts and a very long "tail" of low-count values. The distribution is such that the most popular 400 entries make up 99% of the rows in the table - the remaining 3600 entries each appear only a handful of times. For example, it would look something like this:

VALUE        PERCENTAGE OF TABLE OCCUPIED
-----        ----------------------------
foo          10%
bar          5%
baz          2%
qux          1%
corgi        .5%
corgi2       .25%
corgi3       .1%
corgi4       .1%

...
penultimate (1 row)
ultimate (1 row)

The result of this distribution is that if a 255-bucket histogram were constructed, only a few values (maybe the top 5 or 6) would actually span 2 or more buckets (foo=23 buckets, bar=11 buckets, baz=5 buckets, qux=3 buckets, corgi=2 buckets; corgi2 and all the rest, 1 bucket). So, as far as the histogram is concerned, 5 of the values are "popular" (with selectivities ranging from 10% down to .5%), and all the rest look uniformly distributed, with a selectivity of about 80% * 1/4000 = .0002.
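To make that concrete, here's a rough back-of-the-envelope sketch using the made-up percentages from the table above. It just divides each value's share of the table by the bucket width, which won't exactly match how Oracle places endpoints, so the counts come out slightly different from my guesses, but the popular/not-popular split is the same:

```python
buckets = 255  # the bucket limit discussed above

# fraction of the table occupied by each named head value (hypothetical)
head = {"foo": 0.10, "bar": 0.05, "baz": 0.02, "qux": 0.01,
        "corgi": 0.005, "corgi2": 0.0025, "corgi3": 0.001, "corgi4": 0.001}

bucket_width = 1.0 / buckets  # each bucket covers ~0.39% of the rows

# how many buckets each value would fill
spans = {value: frac / bucket_width for value, frac in head.items()}

for value, spanned in spans.items():
    # crudely, a value is "popular" only if it fills 2 or more buckets
    print(f"{value:7s} fills ~{spanned:.1f} buckets "
          f"-> {'popular' if spanned >= 2 else 'not popular'}")
```

Everything below the top handful of values falls inside a single bucket and so gets lumped into the uniform remainder.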

However, this is really not the case. In fact, because 400 entries make up 99% of the table, the real selectivity of the index is about 1/400 on average - the remaining 3600 values are used so infrequently that they can be discounted.

Why does this matter? Because when these indexes are bitmap indexes, the optimizer uses the selectivity of the index to figure out which indexes to use to produce the ideal plan. If our table above had sixteen million rows, and we were constraining on a value in the "not popular" group (the top 400 minus the top 5), the optimizer would think that a constraint on that column would reduce our selection to about 4000 rows, when in reality it's more like 40,000. If there were two columns with similar distributions, it would think that constraining on both would have a selectivity of 1/4000 * 1/4000, or about 1 row, when in reality it would be more like 1/400 * 1/400 = 100 rows.
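The arithmetic behind those numbers, using the hypothetical 16-million-row table:

```python
rows = 16_000_000
ndv = 4000        # distinct values the histogram treats as equally rare
effective = 400   # values that actually carry ~99% of the rows

# one equality predicate on a "not popular" value
est_one = rows / ndv            # what the optimizer would estimate
real_one = rows / effective     # closer to what actually comes back

# two predicates on columns with similar distributions
est_two = rows / ndv / ndv
real_two = rows / effective / effective

print(est_one, real_one)   # 4000 estimated vs ~40,000 actual
print(est_two, real_two)   # ~1 row estimated vs ~100 actual
```

A 10x (or, with two predicates, 100x) cardinality error is easily enough to push the optimizer onto the wrong plan.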

The problem would exist for B-Trees as well. Consider a column that also had a skewed distribution, with 1000 distinct values but not such a long tail - say the top 900 values represented 95% of the data. The "true" selectivity of this index would be closer to 1/900, making it roughly twice as selective as the example above, but according to the optimizer it would appear to be only 1/4 as good.
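Spelling out that comparison between the two hypothetical columns:

```python
# "true" effective selectivities of the two columns
true_a = 1 / 400    # the bitmap-index column above
true_b = 1 / 900    # the B-Tree column with the shorter tail

# what the optimizer would assume from the distinct-value counts
est_a = 1 / 4000
est_b = 1 / 1000

# column B is really about 2x more selective than column A...
print(true_a / true_b)   # 2.25
# ...but by the optimizer's numbers it looks 4x less selective
print(est_b / est_a)     # 4.0
```

So the optimizer's ranking of the two indexes would be exactly backwards.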

My conclusion is that histograms on high-cardinality columns are only useful in cases where you have a relatively uniform distribution with a few "popular" values, as opposed to a few hundred "common" values followed by a large number of rarely-occurring ones.

Is this correct, or am I missing something?

Received on Thu May 19 2005 - 11:00:02 CDT

