Oracle-L: RE: Statistical sampling and representative stats collection

From: Jack Silvey <jack_silvey_at_yahoo.com>
Date: Sat, 25 May 2002 06:18:23 -0800
Message-ID: <F001.0046B8CB.20020525061823@fatcity.com>

Larry,

You are too self-effacing. I am trying to deal with the political and territorial aspects of this db so *we* can do some analyze testing.

I would just like to say for public consumption that Larry has database tuning skills that he undoubtedly will end up going to hell for. Did you have to sign that contract with Old Scratch in blood to get those skills, or was there a forehead branding involved, or anything?

;)

Two things that occur to me that I would like to share for discussion, comments, and outright contradiction. Didn't realize this was going to be so long, fair warning:

Unnecessary histograms should be avoided since they add to parse load and CPU usage. If the data is not skewed but histograms are built anyway, Oracle will still search those buckets to determine skewness. That search is unnecessary and burns more CPU.
Histograms are indicated when data is skewed. Data skewness is a lack of symmetry of data around a central point.

One measure of a central point in a data set is the average - adding all the data points together and dividing by the number of elements. Average is the mathematical central point of a data set.

Another measure is the median, which is the middle value once the data points are sorted (not the midrange, in which the beginning and end data points are added together and divided by 2). Median is the point at which half of the data points are on one side and half on the other.

In a dataset with normal distribution (non-skewed), measures of centrality will be the same point. If these two (or any) measures of data centrality are not the same, your data is skewed.
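
To make the mean-versus-median comparison concrete, here is a minimal Python sketch. The two data sets and the helper name centrality_gap are made up for illustration, and the median is the middle value of the sorted data, as the standard statistics module computes it.

```python
from statistics import mean, median

def centrality_gap(values):
    """Return (mean, median, mean - median) for a list of numbers."""
    m = mean(values)
    med = median(values)
    return m, med, m - med

# Illustrative data only: one symmetric set, one with a long right tail.
symmetric = [1, 2, 3, 4, 5, 6, 7, 8, 9]
skewed = [1, 1, 1, 2, 2, 3, 4, 10, 30]

print(centrality_gap(symmetric))  # mean and median coincide, gap is 0
print(centrality_gap(skewed))     # mean is pulled toward the tail, gap is non-zero
```

A gap of zero only says the data is symmetric around its center. How large a gap matters is still the access-dependent question.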

Now, data skewness should not be an automatic trigger for histograms. The percentage of data skewness needs to be taken into account and the cost of suboptimal query access paths identified. For instance, if you have a large dataset that has a column with unique values, and you insert one duplicate, this will introduce skewness, but will not necessarily mean that you should add the extra parse overhead of histograms.
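
The single-duplicate case can be sketched with a simplified model of a height-balanced histogram: sort the values, cut them into equal-sized buckets, and record the last value in each bucket. A value that closes more than one bucket is "popular". This is only an illustration of the idea, not Oracle's actual implementation, and the function names and sample data are hypothetical.

```python
from collections import Counter

def bucket_endpoints(values, buckets):
    """Sort the values, cut them into equal-sized buckets, return each bucket's last value."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // buckets - 1] for i in range(1, buckets + 1)]

def popular_values(values, buckets):
    """Values that close more than one bucket, i.e. occupy more than one bucket's worth of rows."""
    counts = Counter(bucket_endpoints(values, buckets))
    return sorted(v for v, c in counts.items() if c > 1)

# Illustrative data only, 1,000 rows each.
nearly_unique = list(range(1, 1000)) + [500]   # unique values plus one duplicate
skewed = [1] * 600 + list(range(2, 402))       # value 1 fills 60 percent of the rows

print(popular_values(nearly_unique, 10))  # no endpoint repeats
print(popular_values(skewed, 10))         # value 1 closes several buckets
```

In this model the lone duplicate never shows up as a popular value, so a histogram on that column would add parse work without telling the optimizer anything new.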

So, what level of skewness should trigger histogram creation? This is going to be access dependent (didn't you know I was going to say that.)

In the absence of histograms, Oracle *must* assume that the data is uniformly (evenly) distributed across the column's range, which can lead to incorrect access paths.

Let's assume a table of 100,000 people. The range of ages will be from 0 to 120 years old.

Since we will have few people in the 110-120 range *relative to the other ranges* the data is skewed. However, in the absence of histograms, Oracle will assume that the number of people in this range is equal to the number of people in each of the other ten-year ranges.

Let's assume that we have a query for people in the 80-120 year old range. Under a uniform distribution this range would cover 40 of the 120 years, or 33 percent of the table, so Oracle may choose to do a full table scan. However, we know from experience in the real world that there are not many people in this range, so Oracle should be using an index for this lookup.
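
The arithmetic behind that 33 percent can be sketched as follows. The estimate is a simplified even-spread model rather than Oracle's exact costing formula, and the age counts are made up for illustration (not real demographics); the function name is hypothetical.

```python
def uniform_range_selectivity(lo, hi, col_min, col_max):
    """Fraction of rows expected in [lo, hi] if values were spread evenly over [col_min, col_max]."""
    return (hi - lo) / (col_max - col_min)

# Made-up age counts for 100,000 people (illustration only).
age_counts = {age: 1195 for age in range(0, 80)}         # ages 0-79
age_counts.update({age: 109 for age in range(80, 120)})  # ages 80-119
age_counts[120] = 40                                     # age 120

total = sum(age_counts.values())
in_range = sum(n for age, n in age_counts.items() if 80 <= age <= 120)

estimated = uniform_range_selectivity(80, 120, 0, 120)
actual = in_range / total

print(f"estimated {estimated:.1%}, actual {actual:.1%}")
```

With these invented counts the even-spread estimate is about a third of the table while the real fraction is under 5 percent, which is the kind of gap that pushes the optimizer toward a full table scan when an index would do.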

Are you an insurance company that wants to evaluate how many people will expire next year? Histograms are in the cards for you. Are you an HR person that wants to know how many people have 30 or more working years left? Since this range will be relatively evenly distributed, perhaps a histogram will cost more than it is worth.

The key to determining which columns need histograms should be a balancing act between extra parse cost and cost of incorrect access paths.

At least, *I* think so. However, I also think that Star Trek is a historical document sent back through time by our future selves. Amazing how much the future looks just like the 1960's.

;)

Picard-Riker in 2004.

/jack silvey

Larry Elkins <elkinsl_at_flash.net> wrote:
> > Hi Jack,
> >
> > > One question - you mention that an index analyze
> > > provides better data distribution. Could you discuss
> > > what you found in more detail?
> >
> > What I meant was that the Histograms that are created during an
> > ANALYZE/COMPUTE on Indexes will provide an almost perfect picture of the
> > data distribution in such columns. Under _some_ circumstances, the CBO will
> > be able to use this information to decide the best path (FTS or Indexed
> > read).
>
> And stats on the non-indexed columns can also play a large role in deciding
> driving table order and join methods. Ok, touched on that in an earlier
> email ;-)
>
> > On the other hand, and simply stated, when bind variables are used in
> > a cursor, this information about data distribution is not used since the
> > value of the bind variable is not used during the parse prior to 9i.
>
> In my case, and Jack's (I'm now doing some work with a DB where Jack is
> dealing with the analyze strategies), the bind thing isn't an issue.
> Everything is ad-hoc, and, literals *are* used. But, there really isn't much
> of an opportunity for sharing SQL even if binds were used. One user might
> specify 5 values for one column, 3 values for another, 2 values for five
> other columns. The combinations of the criteria specified, and the number of
> values specified for each of those columns, not to mention the tables
> specified, very few, if any, of the SQL statements could be shared even if
> using binds. Plus, in this case, with histograms being very valuable, one
> could live with less cursor sharing even if there were some that could be
> shared when using binds. In this case, the literals are needed and their use
> is not causing any shared pool or library cache contention.
>
> >
> > Btw: Searching for 'bucket' in the 8i SQL reference came up with the NTILE
> > function (new in 8i), and I said "Wow!" because I was looking for such a
> > function. Goes to say that we need to read the fine manuals more than we
> > normally do!
>
> The analytic functions are great. The analytic functions first came about in
> 8.1.6, a few more functions added in 8.1.7, and taken even further in 9i. A
> lot of the traditional ways we might have done things, often times including
> self joins, or, procedural code, are thrown out the window. I've found all
> kinds of uses for them that (1) improve performance over the old approaches,
> and (2) are simpler to understand. Then again, some of the analytic function
> examples leave my head spinning. I'm still working through a lot of them for
> better understanding. But yeah, analytic functions like NTILE are very, very
> nice.
>
> >
> > John Kanagaraj
> > Oracle Applications DBA
> > DBSoft Inc
> > (W): 408-970-7002
>


-- 
Please see the official ORACLE-L FAQ: http://www.orafaq.com
-- 
Author: Jack Silvey
  INET: jack_silvey_at_yahoo.com

Fat City Network Services    -- (858) 538-5051  FAX: (858) 538-5051
San Diego, California        -- Public Internet access / Mailing Lists
--------------------------------------------------------------------
To REMOVE yourself from this mailing list, send an E-Mail message
to: ListGuru_at_fatcity.com (note EXACT spelling of 'ListGuru') and in
the message BODY, include a line containing: UNSUB ORACLE-L
(or the name of mailing list you want to be removed from).  You may
also send the HELP command for other information (like subscribing).

Received on Sat May 25 2002 - 09:18:23 CDT