Oracle FAQ Your Portal to the Oracle Knowledge Grid
HOME | ASK QUESTION | ADD INFO | SEARCH | E-MAIL US
 

Home -> Community -> Mailing Lists -> Oracle-L -> RE: Statistical sampling and representative stats collection

RE: Statistical sampling and representative stats collection

From: Jack Silvey <jack_silvey_at_yahoo.com>
Date: Sat, 25 May 2002 06:18:23 -0800
Message-ID: <F001.0046B8CB.20020525061823@fatcity.com>


Larry,

You are too self-effacing. I am trying to deal with the policital and territorial aspects of this db so *we* can do some analyze testing.

I would just like to say for public consumption that Larry has database tuning skills that he undoubtedly will end up going to hell for. DID you have to sign that contract with Old Scratch in blood to get those skills, or was there a forehead branding involved, or anything?

;)

Two things that occur to me that I would like to share for discussion, comments, and outright contradiction. Didn't realize this was going to be so long, fair warning:

  1. unnecessary histograms should be avoided since they add to parse load and CPU usage. If the data is not skewed, but histos are put into buckets anyway, Oracle will search those buckets to determine skewness. This is an unnecessary search and will burn more CPU.
  2. Histograms are indicated when data is skewed. Data skewness is a lack of symmetry of data around a central point.

One measure of a central point in a data set is the average - adding all the data points together and dividing by the number of elements. Average is the mathematical central point of a data set.

Another measure is median, in which the beginning and end data points are added together and dividied by 2. Median is the point at which half of the data points are on one side and half on the other.

In a dataset with normal distribution (non-skewed), measures of centrality will be the same point. If these two (or any) measures of data centrality are not the same, your data is skewed.

Now, data skewness should not be an automatic trigger for histograms. The percentage of data skewness needs to be taken into account and the cost of suboptimal query access paths identified. For instance, if you have a large dataset that has a column with unique values, and you insert one duplicate, this will introduce skewness, but will not necessarily mean that you should add the extra parse overhead of histograms.

So, what level of skewness should trigger histogram creation? This is going to be access dependent (didn't you know I was going to say that.)

In the absence of histograms, Oracle *must* assume that the data is normally distributed, which can lead to incorrect access paths.

Let's assume a table of 100,000 people. The range of ages will be from 0 to 120 years old.

Since we will have few people in the 110-120 range *relative to the other ranges* the data is skewed. However, in the absence of histograms, Oracle will assume that the number of people in this range is equal to the number of people in all the other ranges.

Let's assume that we have a query for people in the 80-120 year old range. Since in a normal distribution this would represent 33 percent of the table, Oracle may choose to do a full table scan. However, we know from experience in the real world that there are not many people in this range, so Oracle should be using an index for this lookup.

Are you an insurance company that wants to evaluate how many people will expire next year? Histograms are in the cards for you. Are you an HR person that wants to know how many people have 30 or more working years left? Since this range will be relatively normal perhaps a histogram will cost more than it is worth.

The key to determining which colums need histograms should be a balancing act between extra parse cost and cost of incorrect access paths.

At least, *I* think so, however, I also think that Star Trek is a historical document sent back through time by our future selves. Amazing how much the future looks just like the 1960's.

;)

Picard-Riker in 2004.

/jack silvey


Do You Yahoo!?
Yahoo! - Official partner of 2002 FIFA World Cup http://fifaworldcup.yahoo.com
-- 
Please see the official ORACLE-L FAQ: http://www.orafaq.com
-- 
Author: Jack Silvey
  INET: jack_silvey_at_yahoo.com

Fat City Network Services    -- (858) 538-5051  FAX: (858) 538-5051
San Diego, California        -- Public Internet access / Mailing Lists
--------------------------------------------------------------------
To REMOVE yourself from this mailing list, send an E-Mail message
to: ListGuru_at_fatcity.com (note EXACT spelling of 'ListGuru') and in
the message BODY, include a line containing: UNSUB ORACLE-L
(or the name of mailing list you want to be removed from).  You may
also send the HELP command for other information (like subscribing).
Received on Sat May 25 2002 - 09:18:23 CDT

Original text of this message

HOME | ASK QUESTION | ADD INFO | SEARCH | E-MAIL US