Oracle FAQ | Your Portal to the Oracle Knowledge Grid |
Home -> Community -> Mailing Lists -> Oracle-L -> RE: Calculating Distinct counts from a sample
The issue is when oracle uses a SAMPLE - e.g. 10%, how does it then
calculate the number of distinct values.
e.g
Table ORDERS has 1,000,000 rows.
Column CREATED_DT has 35,000 distinct values.
If we create a table with a 10% sample we get the following figures -
Number of rows = 100,000
Number of distinct values for CREATED_DT = 25,000
If we were to simply scale the distinct count up (as with the num_rows) then we would get a wrong answer.
My question is - what calculation does oracle do to get this figure reasonably accurate (it does get it reasonably accurate).
The only way I can see to get this right is to sample the sample...
-----Original Message-----
From: sol beach [mailto:sol.beach_at_gmail.com]
Sent: 03 July 2005 16:30
To: Stephen.Barr_at_bskyb.com
Subject: Re: Calculating Distinct counts from a sample
HUH? I see your words, but don't understand what problem you are trying to solve.
What makes you think that "distinct counts" are "calculated"; as opposed to actually counted?
It appears that you are "assuming" a linear distribution of values. What happens if the data distribution is non-linear or skewed?
What is wrong with simply COUNTING the distinct values?
On 7/3/05, Barr, Stephen <Stephen.Barr_at_bskyb.com> wrote:
>
>
>
> How does oracle calculate distinct counts from a sample of data?
>
>
>
> I have traced dbms_stats but it doesn't give much of a clue as to how it
> does this. Initially I thought I could just use a simple calculation based
> on the percentage of distincts in any sample set compared to the number of
> rows in the sample set. So if the distinct count for the sample set was
30%
> of the number of rows in a 10% sample, then I would just scale this
up....but
> I now know how obviously flawed this method is.
>
>
>
> The only real way I can see of making this work is to apply some sort of
> linear regression function to the natural log of a number of additional
> samples from the sample set - which would allow me to come up with a
> "reasonable" scaling factor.
>
>
>
> Any mathematicians out there who could shed some light on this?
>
>
>
> Many thanks,
>
>
>
> Stephen Barr.
>
>
>
> ________________________________
>
>
>
>
> Information in this email may be privileged, confidential and is intended
> exclusively for the addressee. The views expressed may not be official
> policy, but the personal views of the originator. If you have received it
in
> error, please notify the sender by return e-mail and delete it from your
> system. You should not reproduce, distribute, store, retransmit, use or
> disclose its contents to anyone.
>
> Please note we reserve the right to monitor all e-mail communication
through
> our internal and external networks.
-- http://www.freelists.org/webpage/oracle-lReceived on Sun Jul 03 2005 - 10:41:12 CDT