Re: does select with sample clause return variable record count

From: Malcolm Dew-Jones <yf110_at_vtn1.victoria.tc.ca>
Date: 6 Nov 2007 17:03:30 -0800
Message-ID: <47310ee2$1@news.victoria.tc.ca>

fitzjarrell_at_cox.net (fitzjarrell_at_cox.net) wrote:
: Comments embedded.
: On Nov 5, 5:06 am, tamsun <tam..._at_gmail.com> wrote:
: > hi,
: >
: > Database version: Oracle Database 10.2.0.2.0
: > OS: RedHat Enterprise 4
: >
: > I find select statement with sample clause return random count records,

: As it should.

: > e.g.
: >
: > select count(*) from ADS_DENDOC;
: > select count(*) from (select * from ADS_DENDOC sample(1)) a;
: >
: > 1st run:
: > -----------------
: > 1302; -- the record count of table ADS_DENDOC
: > 9; -- the record count of sample recordset
: >
: > 2nd run:
: > -----------------
: > 1302;
: > 12;
: >
: > 3rd run:
: > -----------------
: > 1302;
: > 14;
: >
: > 4th run:
: > -----------------
: > 1302;
: > 10;
: >
: > As Oracle sql reference written,sample_percent is a number specifying
: > the percentage of the total row or block count to be included in the
: > sample. The value must be in the range .000001 to (but not including)
: > 100. Why the record count of sample recordset is variable

: You didn't read far enough into that paragraph, the rest of which
: reads:

: "This percentage indicates the probability of each row, or each
: cluster of rows in the case of block sampling, being selected as part
: of the sample. It does not mean that the database will retrieve
: exactly sample_percent of the rows of table."

: This is an ESTIMATE, not an absolute. Your 'understanding' is
: actually misunderstanding, apparently based up on reading only what
: you want to read, rather than what you should be reading. If you
: think you understand a function, but it's behaviour isn't what you
: expect, then it's highly likely you don't fully understand all that
: you should about that function. Selective reading of the
: documentation is a prime source for such confusion.

: >
: > Best regards
: > tamsun

There are other sligthly subtle things to notice about sample.

If you choose a sample size that you expect will return about 1 row then there is a large probability on any one run that you will get no rows, so if you want to be sure to have at least one row then you need to ask for more than one row. (Of course you can't ask for any specific number of rows, what I mean is what you choose as a sample probability to get a desired size.)

But if you choose a sample size that likely returns more than one row, for the sake of example lets say 3 rows, then the first row in the sample will almost always come from amongst the first 1/3 of the data (what ever "first" means, but it doesn't seem to vary from run to run within a single table). So choosing a sample of (e.g.) about three rows and taking the first row will never be a good sample of the entire dataset.

However, sample does seem to run extremely quickly compared to other methods of getting random selections of data, so its nice to try to use it. To get a well chosen single row I think you can union a few individual sampling queries each of which should return about one row. That gives you a higher chance of having one row and it being from any where in the data.

I assume there are better techniques, but sample and union are both super easy to use - meaning I can type them at the sql prompt with a high chance of no mistakes. Received on Tue Nov 06 2007 - 19:03:30 CST