RE: Calculating Distinct counts from a sample

From: Barr, Stephen <Stephen.Barr_at_BSkyB.com>
Date: Sun, 3 Jul 2005 16:38:56 +0100
Message-ID: <4127F81F6CAFC245A18BC49054EFB06302EDDCBB@ssslexchusr6.sssl.bskyb.com>

The issue is when oracle uses a SAMPLE - e.g. 10%, how does it then calculate the number of distinct values.

e.g

Table ORDERS has 1,000,000 rows.
Column CREATED_DT has 35,000 distinct values.

If we create a table with a 10% sample we get the following figures -

Number of rows = 100,000
Number of distinct values for CREATED_DT = 25,000

If we were to simply scale the distinct count up (as with the num_rows) then we would get a wrong answer.

My question is - what calculation does oracle do to get this figure reasonably accurate (it does get it reasonably accurate).

The only way I can see to get this right is to sample the sample...

-----Original Message-----
From: sol beach [mailto:sol.beach_at_gmail.com] Sent: 03 July 2005 16:30
To: Stephen.Barr_at_bskyb.com
Subject: Re: Calculating Distinct counts from a sample

HUH? I see your words, but don't understand what problem you are trying to solve.

What makes you think that "distinct counts" are "calculated"; as opposed to actually counted?

It appears that you are "assuming" a linear distribution of values. What happens if the data distribution is non-linear or skewed?

What is wrong with simply COUNTING the distinct values?

On 7/3/05, Barr, Stephen <Stephen.Barr_at_bskyb.com> wrote:
>
>
>
> How does oracle calculate distinct counts from a sample of data?
>
>
>
> I have traced dbms_stats but it doesn't give much of a clue as to how it
> does this. Initially I thought I could just use a simple calculation based
> on the percentage of distincts in any sample set compared to the number of
> rows in the sample set. So if the distinct count for the sample set was
30%
> of the number of rows in a 10% sample, then I would just scale this
up....but
> I now know how obviously flawed this method is.
>
>
>
> The only real way I can see of making this work is to apply some sort of
> linear regression function to the natural log of a number of additional
> samples from the sample set - which would allow me to come up with a
> "reasonable" scaling factor.
>
>
>
> Any mathematicians out there who could shed some light on this?
>
>
>
> Many thanks,
>
>
>
> Stephen Barr.
>
>
>
> ________________________________
>
>
>
>
> Information in this email may be privileged, confidential and is intended
> exclusively for the addressee. The views expressed may not be official
> policy, but the personal views of the originator. If you have received it
in
> error, please notify the sender by return e-mail and delete it from your
> system. You should not reproduce, distribute, store, retransmit, use or
> disclose its contents to anyone.
>
> Please note we reserve the right to monitor all e-mail communication
through
> our internal and external networks.

--
http://www.freelists.org/webpage/oracle-l

Received on Sun Jul 03 2005 - 10:41:12 CDT