Re: Stochastic Queries

From: -CELKO- <jcelko212_at_earthlink.net>
Date: Thu, 20 Sep 2007 16:01:56 -0700
Message-ID: <1190329316.117411.111390_at_n39g2000hsh.googlegroups.com>


>> I suggest the OP read "Massive Stochastic Testing of SQL" by Don Slutz of Microsoft. <<

I need to get a copy, too.

You might want to look at a recent article by Itzik Ben-Gan on the use of RAND() and NEWID() and the differences between SQL Server 2000 and 2005. He covers the problems with how a CASE expression uses RAND(), deterministic functions, duplicate values, etc.

>> I am VERY interested in understanding why these 'proprietary kludges' (either to select random samples, or to generate random numbers) 'have problems' with skewed distributions. (Disclosure - I worked on one.) <<

The SQL products I have worked with use traditional linear congruential algorithms, which they inherited from C and UNIX. As you draw larger and larger samples, you get skewing and duplicate values. Knuth, Volume 2, Chapter 3 has a good history of these generators and some remarks on their problems.
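To make the linear congruential point concrete, here is a minimal Python sketch. The parameters (a=5, c=3, m=16) are toy values picked for illustration only, not those of any SQL product or C library. It shows two classic symptoms: the sequence repeats exactly once the period is used up, and with a power-of-two modulus the lowest bit simply alternates.

```python
from itertools import islice


def lcg(seed, a=5, c=3, m=16):
    """Yield x_{n+1} = (a * x_n + c) mod m forever.

    The default parameters are toy values (an assumption for this sketch).
    """
    x = seed
    while True:
        x = (a * x + c) % m
        yield x


def period(seed, a=5, c=3, m=16):
    """Count steps until the generator returns to its first output.

    Assumes gcd(a, m) == 1, so the sequence is purely periodic.
    """
    gen = lcg(seed, a, c, m)
    first = next(gen)
    for steps, value in enumerate(gen, start=1):
        if value == first:
            return steps


# Draw twice as many values as the period allows.
draws = list(islice(lcg(0), 32))

# Lowest bit of each draw; for odd a and c with m a power of two it alternates.
low_bits = [x & 1 for x in draws]

print("period:", period(0))
print("second half repeats first half:", draws[:16] == draws[16:])
print("low bits:", low_bits[:8])
```

A production generator has a far larger modulus, but the structure is the same: once the sample size approaches the period, or once you lean on the low-order bits, the output stops looking random.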

The best (worst?) horror story of the 1970s was the discovery that an IBM FORTRAN routine was not valid. It was the most popular research tool in those days, and the flaw trashed quite a few PhD projects.

>> Repeated use of the same set of random numbers will generate an identical sample each time it's used. <<

Not if the population changes each time. But I understand your point.

However, consider that one of the advantages of an RNG is that you can repeat an experiment. More decades ago than I really like to remember, I worked with some lab equipment on an old DEC PDP-11 that had a special circuit card. This thing had a speck of radioactive material and a simple Geiger counter tube to create **quantum level** random digits. Of course, people did not like seeing that black & yellow radiation symbol on equipment in those days (Cold War era), so I am not sure if it is still available. You can probably use background radiation or radio noise today with sensitive equipment.
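The repeatability argument can be sketched in a few lines of Python. The standard library's `random.Random` stands in for whatever generator a SQL product uses, and the row ids and seed are made up for illustration. The same seed over the same population gives the same sample; a hardware noise source, by contrast, has no seed to replay.

```python
import random


def draw_sample(population, k, seed):
    """Return k items chosen with a generator seeded by `seed`."""
    rng = random.Random(seed)  # private generator, so global state is untouched
    return rng.sample(population, k)


# Hypothetical row ids for two states of the same table.
rows_monday = list(range(1, 101))
rows_tuesday = list(range(51, 151))  # some rows deleted, some inserted

first_run = draw_sample(rows_monday, 5, seed=2007)
second_run = draw_sample(rows_monday, 5, seed=2007)
changed_run = draw_sample(rows_tuesday, 5, seed=2007)

print(first_run == second_run)   # same seed, same population
print(changed_run)               # same seed, different population
```

Rerunning with the seed fixed reproduces the experiment exactly, which is the property you give up with a physical source unless you record its output.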

But if I want a fixed table, I think most statisticians would agree that the RAND Corporation's "A Million Random Digits with 100,000 Normal Deviates" has been tested in every possible way for mathematical correctness. That is a fixed table available on diskette!
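For readers wondering how a fixed digit table can drive a sample, here is one possible approach in Python. It is a sketch under assumptions, not a description of any product: the function name is hypothetical, and the digit string is a short made-up placeholder, not an excerpt from the RAND table. Digits are read from the front in fixed-width groups; a group is skipped if it falls outside the row range (which keeps the selection uniform) or repeats an earlier pick.

```python
def sample_from_digit_table(digits, population_size, k):
    """Pick k distinct row numbers in [0, population_size) from a digit string.

    Reads the table front to back in groups just wide enough to cover the
    largest row number, rejecting out-of-range and duplicate values.
    """
    width = len(str(population_size - 1))
    chosen = []
    seen = set()
    for start in range(0, len(digits) - width + 1, width):
        value = int(digits[start:start + width])
        if value >= population_size or value in seen:
            continue  # rejection step: skip rather than wrap, to avoid bias
        seen.add(value)
        chosen.append(value)
        if len(chosen) == k:
            return chosen
    raise ValueError("digit table exhausted before the sample was complete")


# Placeholder digits, invented for this example (NOT taken from the RAND table).
PLACEHOLDER_DIGITS = "73916402855019283746"

rows = sample_from_digit_table(PLACEHOLDER_DIGITS, population_size=50, k=3)
print(rows)
```

The rejection step is the design choice that matters: reducing a group modulo the population size instead would favor the low row numbers whenever the population size does not divide the group range evenly.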

>> From a practical point of view, you would be required to generate a new set of random values for each query. I am also curious to know how you use this table to restrict the rows being returned in the general case to any kind of uniform random sample. <<

I start at the front of the RAND Corporation table and pull digits until I get to the end. I can go to some university sites and get bigger validated tables, too.

Received on Fri Sep 21 2007 - 01:01:56 CEST
