Oracle FAQ Your Portal to the Oracle Knowledge Grid
HOME | ASK QUESTION | ADD INFO | SEARCH | E-MAIL US
 

Home -> Community -> Usenet -> c.d.o.misc -> Re: does select with sample clause return variable record count

Re: does select with sample clause return variable record count

From: William Robertson <williamr2019_at_googlemail.com>
Date: Wed, 07 Nov 2007 11:24:57 -0800
Message-ID: <1194463497.410589.164410@50g2000hsm.googlegroups.com>


On Nov 7, 1:03 am, yf..._at_vtn1.victoria.tc.ca (Malcolm Dew-Jones) wrote:
> fitzjarr..._at_cox.net (fitzjarr..._at_cox.net) wrote:
>
> : Comments embedded.
> : On Nov 5, 5:06 am, tamsun <tam..._at_gmail.com> wrote:
> : > hi,
> : >
> : > Database version: Oracle Database 10.2.0.2.0
> : > OS: RedHat Enterprise 4
> : >
> : > I find select statement with sample clause return random count records,
>
> : As it should.
>
> : > e.g.
> : >
> : > select count(*) from ADS_DENDOC;
> : > select count(*) from (select * from ADS_DENDOC sample(1)) a;
> : >
> : > 1st run:
> : > -----------------
> : > 1302; -- the record count of table ADS_DENDOC
> : > 9; -- the record count of sample recordset
> : >
> : > 2nd run:
> : > -----------------
> : > 1302;
> : > 12;
> : >
> : > 3rd run:
> : > -----------------
> : > 1302;
> : > 14;
> : >
> : > 4th run:
> : > -----------------
> : > 1302;
> : > 10;
> : >
> : > As Oracle sql reference written,sample_percent is a number specifying
> : > the percentage of the total row or block count to be included in the
> : > sample. The value must be in the range .000001 to (but not including)
> : > 100. Why the record count of sample recordset is variable
>
> : You didn't read far enough into that paragraph, the rest of which
> : reads:
>
> : "This percentage indicates the probability of each row, or each
> : cluster of rows in the case of block sampling, being selected as part
> : of the sample. It does not mean that the database will retrieve
> : exactly sample_percent of the rows of table."
>
> : This is an ESTIMATE, not an absolute. Your 'understanding' is
> : actually misunderstanding, apparently based up on reading only what
> : you want to read, rather than what you should be reading. If you
> : think you understand a function, but it's behaviour isn't what you
> : expect, then it's highly likely you don't fully understand all that
> : you should about that function. Selective reading of the
> : documentation is a prime source for such confusion.
>
> : >
> : > Best regards
> : > tamsun
>
> There are other sligthly subtle things to notice about sample.
>
> If you choose a sample size that you expect will return about 1 row then
> there is a large probability on any one run that you will get no rows, so
> if you want to be sure to have at least one row then you need to ask for
> more than one row. (Of course you can't ask for any specific number of
> rows, what I mean is what you choose as a sample probability to get a
> desired size.)
>
> But if you choose a sample size that likely returns more than one row, for
> the sake of example lets say 3 rows, then the first row in the sample will
> almost always come from amongst the first 1/3 of the data (what ever
> "first" means, but it doesn't seem to vary from run to run within a single
> table). So choosing a sample of (e.g.) about three rows and taking the
> first row will never be a good sample of the entire dataset.
>
> However, sample does seem to run extremely quickly compared to other
> methods of getting random selections of data, so its nice to try to use
> it. To get a well chosen single row I think you can union a few
> individual sampling queries each of which should return about one row.
> That gives you a higher chance of having one row and it being from any
> where in the data.
>
> I assume there are better techniques, but sample and union are both super
> easy to use - meaning I can type them at the sql prompt with a high chance
> of no mistakes.

You can also use a SEED(n) expression if you want to repeat the same random sample, or attempt to at least:

"SEED seed_value Specify this clause to instruct the database to attempt to return the same sample from one execution to the next. The seed_value must be an integer between 0 and 4294967295. If you omit this clause, then the resulting sample will change from one execution to the next." Received on Wed Nov 07 2007 - 13:24:57 CST

Original text of this message

HOME | ASK QUESTION | ADD INFO | SEARCH | E-MAIL US