Calculating Distinct counts from a sample

From: Barr, Stephen <Stephen.Barr_at_BSkyB.com>
Date: Sun, 3 Jul 2005 15:48:32 +0100
Message-ID: <4127F81F6CAFC245A18BC49054EFB06302EDDCA9@ssslexchusr6.sssl.bskyb.com>

How does Oracle calculate distinct counts from a sample of data?

I have traced dbms_stats, but the trace doesn't give much of a clue as to how it does this. Initially I thought I could use a simple calculation based on the ratio of distinct values to rows in the sample set. So if the distinct count in a 10% sample was 30% of the number of rows in that sample, I would just scale this up to the full table. I now see how obviously flawed this method is.
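To make the flaw concrete, here is a minimal Python sketch of that linear scaling applied to a low-cardinality column. The column contents, sizes, and function names are made up for illustration; this is not how dbms_stats works, only a demonstration of why the simple ratio breaks down.

```python
import random

def naive_scaled_ndv(sample_distinct, sample_rows, total_rows):
    """Scale the sample's distinct count linearly up to the full row count."""
    return sample_distinct * total_rows / sample_rows

def demo(seed=0):
    rng = random.Random(seed)
    # Hypothetical column: 100,000 rows but only 10 possible values.
    column = [rng.randrange(10) for _ in range(100_000)]
    # A 10% random sample of the rows.
    sample = rng.sample(column, 10_000)
    true_ndv = len(set(column))
    estimate = naive_scaled_ndv(len(set(sample)), len(sample), len(column))
    return true_ndv, round(estimate)

if __name__ == "__main__":
    true_ndv, estimate = demo()
    print("true distinct count:", true_ndv)
    print("naive scaled estimate:", estimate)
```

The sample already contains every value, so multiplying its distinct count by ten overstates the true figure by a factor of ten. The scaling is only right when every row is distinct, which is the one case where no estimate is needed.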

The only real way I can see of making this work is to take a number of additional sub-samples from the sample set and apply some sort of linear regression function to the natural log of their distinct counts, which would allow me to come up with a "reasonable" scaling factor.
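One possible reading of this idea, sketched in Python: draw sub-samples of increasing size from the sample, fit a straight line to log(distinct count) against log(rows), and extrapolate that line to the full table size. The power-law growth this implies is an assumption of the sketch, the function names and sub-sample fractions are hypothetical, and nothing here is claimed to match what Oracle does internally.

```python
import math
import random

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def extrapolate_ndv(sample, total_rows,
                    fractions=(0.2, 0.4, 0.6, 0.8, 1.0), seed=0):
    """Estimate the table's distinct count from sub-samples of `sample`.

    Assumes log(distinct) grows linearly with log(rows), which is only
    an approximation and may not hold for a given column.
    """
    rng = random.Random(seed)
    xs, ys = [], []
    for fraction in fractions:
        size = max(2, round(len(sample) * fraction))
        sub = rng.sample(sample, size)
        xs.append(math.log(size))
        ys.append(math.log(len(set(sub))))
    slope, intercept = fit_line(xs, ys)
    estimate = math.exp(intercept + slope * math.log(total_rows))
    # A column cannot have more distinct values than rows.
    return min(float(total_rows), estimate)

if __name__ == "__main__":
    rng = random.Random(1)
    few_values = [rng.randrange(10) for _ in range(10_000)]
    print(extrapolate_ndv(few_values, 100_000))
    print(extrapolate_ndv(list(range(10_000)), 100_000))
```

The two extremes behave sensibly: a column whose sub-samples all show the same distinct count gives a slope near zero and no scaling, while a unique column gives a slope near one and scales with the row count. Columns in between are where the power-law assumption is most likely to mislead, so the result should be treated as a rough guide only.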

Any mathematicians out there who could shed some light on this?

Many thanks,

Stephen Barr.

--
http://www.freelists.org/webpage/oracle-l

Received on Sun Jul 03 2005 - 09:51:45 CDT