From oracle-l-bounce@freelists.org Sun Jul 3 09:51:45 2005 Return-Path: Received: from air891.startdedicated.com (root@localhost) by orafaq.com (8.12.10/8.12.10) with ESMTP id j63EpiKm016513 for ; Sun, 3 Jul 2005 09:51:44 -0500 X-ClientAddr: 206.53.239.180 Received: from turing.freelists.org (freelists-180.iquest.net [206.53.239.180] (may be forged)) by air891.startdedicated.com (8.12.10/8.12.10) with ESMTP id j63EpaIP016500 for ; Sun, 3 Jul 2005 09:51:36 -0500 Received: from localhost (localhost [127.0.0.1]) by turing.freelists.org (Avenir Technologies Mail Multiplex) with ESMTP id 6FF281C7ADC; Sun, 3 Jul 2005 09:51:34 -0500 (EST) Received: from turing.freelists.org ([127.0.0.1]) by localhost (turing [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id 09928-09; Sun, 3 Jul 2005 09:51:34 -0500 (EST) Received: from turing (localhost [127.0.0.1]) by turing.freelists.org (Avenir Technologies Mail Multiplex) with ESMTP id E167E1C7AD1; Sun, 3 Jul 2005 09:51:33 -0500 (EST) Message-ID: <4127F81F6CAFC245A18BC49054EFB06302EDDCA9@ssslexchusr6.sssl.bskyb.com> From: "Barr, Stephen" To: oracle-l@freelists.org Subject: Calculating Distinct counts from a sample Date: Sun, 3 Jul 2005 15:48:32 +0100 MIME-Version: 1.0 Content-Type: multipart/alternative; boundary="----_=_NextPart_001_01C57FDE.487BC230" X-archive-position: 22025 X-ecartis-version: Ecartis v1.0.0 Sender: oracle-l-bounce@freelists.org Errors-To: oracle-l-bounce@freelists.org X-original-sender: Stephen.Barr@BSkyB.com Precedence: normal Reply-To: Stephen.Barr@BSkyB.com X-list: oracle-l X-Virus-Scanned: by amavisd-new-20030616-p9 (Debian) at avenirtech.net X-mailscan-MailScanner-Information: Please contact the ISP for more information X-mailscan-MailScanner: Found to be clean X-MailScanner-From: oracle-l-bounce@freelists.org X-Spam-Level: X-Spam-Checker-Version: SpamAssassin 2.63 (2004-01-11) on air891.startdedicated.com X-Spam-Status: No, hits=-3.1 required=5.0 tests=AWL,BAYES_00,HTML_MESSAGE autolearn=ham version=2.63 ------_=_NextPart_001_01C57FDE.487BC230 Content-Type: text/plain How does oracle calculate distinct counts from a sample of data? I have traced dbms_stats but it doesn't give much of a clue as to how it does this. Initially I thought I could just use a simple calculation based on the percentage of distincts in any sample set compared to the number of rows in the sample set. So if the distinct count for the sample set was 30% of the number of rows in a 10% sample, then I would just scale this up....but I now know how obviously flawed this method is. The only real way I can see of making this work is to apply some sort of linear regression function to the natural log of a number of additional samples from the sample set - which would allow me to come up with a "reasonable" scaling factor. Any mathematicians out there who could shed some light on this? Many thanks, Stephen Barr. ----------------------------------------- Information in this email may be privileged, confidential and is intended exclusively for the addressee. The views expressed may not be official policy, but the personal views of the originator. If you have received it in error, please notify the sender by return e-mail and delete it from your system. You should not reproduce, distribute, store, retransmit, use or disclose its contents to anyone. Please note we reserve the right to monitor all e-mail communication through our internal and external networks. ------_=_NextPart_001_01C57FDE.487BC230 Content-Type: text/html Content-Transfer-Encoding: quoted-printable =0D=0A=0D=0A=0D=0A=0D=0A=0D=0A=0D=0A=0D=0A=0D=0A=0D=0A=0D=0A=0D= =0A=0D=0A=0D=0A=0D=0A=0D=0A=0D=0A=0D=0A
=0D=0A=0D=0A

How does oracle= calculate=0D=0Adistinct counts from a sample of data?

=0D=0A=0D=0A

 

=0D=0A=0D=0A

I have = traced dbms_stats but it doesn’t give much of a=0D=0Aclue as to how i= t does this=2E Initially I thought I could just use a simple=0D=0Acalculati= on based on the percentage of distincts in any sample set compared to=0D=0A= the number of rows in the sample set=2E So if the distinct count for the sa= mple=0D=0Aset was 30% of the number of rows in a 10% sample, then I would j= ust scale this=0D=0Aup…=2Ebut I now know how obviously flawed this me= thod is=2E

=0D=0A=0D=0A

 

=0D=0A=0D=0A

The only real way I can see of making this work is to apply= =0D=0Asome sort of linear regression function to the natural log of a numbe= r of=0D=0Aadditional samples from the sample set – which would allow = me to come up=0D=0Awith a “reasonable” scaling factor=2E

=0D=0A=0D=0A

&n= bsp;

=0D=0A=0D=0A

An= y mathematicians out there who could shed some light on=0D=0Athis?

=0D=0A=0D=0A

 = ;

=0D=0A=0D=0A

Many = thanks,

=0D=0A=0D=0A

 

=0D=0A=0D=0A

<= font size=3D2 face=3DArial>Stephen Barr=2E

=0D=0A=0D=0A

 

=0D=0A=0D=0A=0D=0A=0D=0A=0D=0A=0D=0A=0D=0A=0D=0A=0D=0A


=

=0D=0A

=0D=0AInformation in this email may be privileged, con= fidential and is intended=0D=0Aexclusively for the addressee=2E The views e= xpressed may not be official=0D=0Apolicy, but the personal views of the ori= ginator=2E If you have received it=0D=0Ain error, please notify the sender = by return e-mail and delete it from your=0D=0Asystem=2E You should not repr= oduce, distribute, store, retransmit, use or=0D=0Adisclose its contents to = anyone=2E

Please note we reserve the right=0D=0Ato monitor all e-ma= il communication through our internal and external=0D=0Anetworks=2E=0D=0A

=0D=0A=0D=0A=0D=0A ------_=_NextPart_001_01C57FDE.487BC230-- -- http://www.freelists.org/webpage/oracle-l