Re: Stochastic Queries

From: DBMS_Plumber <paul_geoffrey_brown_at_yahoo.com>
Date: Thu, 20 Sep 2007 16:49:28 -0700
Message-ID: <1190332168.478714.113200_at_t8g2000prg.googlegroups.com>


On Sep 20, 4:01 pm, -CELKO- <jcelko..._at_earthlink.net> wrote:
> >> I suggest the OP read "Massive Stochastic Testing of SQL" by Don Slutz of Microsoft. <<
>
> I need to get a copy, too.

 It doesn't talk about this subject; it's only about generating lots of independent SQL queries to test the robustness of the SQL processor. Fascinating, but not apropos.

> You might want to look at a current article by Ben Gan on the use of
> RAND() and NEWID() and differences in SQL Server 2000 and 2005. He
> covers the problems with how a CASE expression uses RAND(),
> deterministic functions, duplicate values, etc.

 Familiar with it.

 As an aside, Postgres had random number generation functions in 1993. In fact, that system developed an entire class of optimizer modifications to cope with the general class of 'non-deterministic' functions, which includes 'rand()', but just as easily might apply to 'has_the_missile_been_fired_yet()'. It was pleasing to see SQL Server arrive at the same conclusions 15 years later. <snark off>
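
 To make the issue concrete, here is the kind of query that forces the planner's hand. A minimal sketch, assuming a Postgres-style volatile random() that returns a fresh uniform value in [0,1) on every evaluation; 'orders' is a made-up table:

      -- The intent is a per-row coin flip: keep each row with probability 0.5.
      -- A planner that treated random() as an ordinary constant expression and
      -- evaluated it once per query would return either every row or none.
      SELECT *
      FROM   orders
      WHERE  random() < 0.5;

 An optimizer that hoists, caches, or reorders that predicate without knowing the function is non-deterministic quietly changes what the query means.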

> >> I am VERY interested in understanding why these 'proprietary kludges' (either to select random samples, or to generate random numbers) 'have problems' with skewed distributions. (Disclosure - I worked on one.) <<
>
> The SQL products I have worked with use traditional Linear
> Congruential algorithms, which they inherited from C and UNIX. As you
> get larger and larger samples, you get skewing and duplicate values.
> Knuth Vol #2, Chapter 3 has a good history and some remarks about the
> history.
>
> The best (worst?) horror story in the 1970's was the discovery that an
> IBM FORTRAN routine was not valid. It trashed quite a few PhD
> projects. That was the most popular tool in those days for
> research.

   I'm intimately familiar with this literature. There's also an excellent discussion in the more recent _Numerical_Recipes_ books.

   But how do you know what RNG these products use? Did you see their source code?

   Writing a very good RNG is not hard. It's 10% scholarship, 10% code effort, and 80% testing.
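
   To give a flavour of what that testing looks like, here is the crudest possible first check, sketched in Postgres-flavoured SQL (generate_series() and width_bucket() assumed; real test suites go on to chi-square tests, spectral tests, and so on):

      -- Draw 100,000 values from random() and count how many land in each of
      -- ten equal-width buckets over [0,1). A healthy uniform generator puts
      -- roughly 10,000 in each bucket.
      SELECT width_bucket(random(), 0, 1, 10) AS bucket,
             count(*)                         AS hits
      FROM   generate_series(1, 100000)
      GROUP  BY bucket
      ORDER  BY bucket;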

 [ snip ]

> However, consider that one of the advantages of RNG is that you can
> repeat an experiment.

   There are a large number of RNG algorithms which, when handed the same seed, generate the same sequence of values, making it possible to reuse the same list of random values as often as one would like.
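
   In Postgres terms, a minimal sketch (setseed() takes a value between -1 and 1 and pins the state of the session's random() stream):

      SELECT setseed(0.42);
      SELECT random() FROM generate_series(1, 5);
      -- Running the pair of statements again reproduces exactly the same
      -- five values, which is what lets an experiment be repeated.

   Most library generators offer the same facility under one name or another.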

 [ snip ]

> But if I want a fixed table, I think most statisticians would agree
> that the RAND corporation "Table of One Million Random Digits" has
> been tested in every possible way for mathematical correctness. That
> is a fixed table available on diskette!

 And it is the same set of numbers, used repeatedly. Which means subsequent samples derived using that table of values, or samples anyone else derives from other data using the same table, are no longer independent.

 [ snip ]

  A raft of other practical problems present themselves with the use of a static table. I'll describe one of them.

  To use that table of random numbers, you need to plug it into a query somehow. This means either a) adding a column of random values to the data table, or b) writing a join against the table of random numbers. The problem with either approach is that the query processor, which has no idea anything random is going on, will move the restriction or the join operation to wherever the physical planner sees fit to place it.
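
  For example (every name here is made up; rand_table is assumed to hold pre-generated uniform values in [0,1), keyed by the same id as the data table):

      -- Intent: a 1% Bernoulli sample of orders, driven by the static table.
      SELECT o.*
      FROM   orders     AS o
      JOIN   rand_table AS r ON r.id = o.id
      WHERE  r.u < 0.01;
      -- Nothing tells the planner that r.u is a sampling step. It treats the
      -- filter as just another predicate and places it wherever the plan comes
      -- out cheapest, so there is no way to say "draw the sample here, then do
      -- the expensive work"; in a bigger query the 'sample' can end up taken
      -- over a different intermediate result than the one the analyst intended.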

  Confronted with these problems (customers wanting Bernoulli samples, for instance, but getting samples whose size distribution was all over the shop), DBMS vendors built random number generators and sampling algorithms into their products in very deep ways. It's not just the quality of the RNG that counts: the whole query processor needs to change.
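
  That is what dedicated syntax like the SQL:2003 TABLESAMPLE clause is for (support and exact semantics vary by product and version; 'orders' is the same made-up table as above):

      -- A vendor-level Bernoulli sample: each row is kept independently with
      -- probability 1%, and the engine knows a sample is being drawn.
      SELECT *
      FROM   orders TABLESAMPLE BERNOULLI (1);

  Because the engine knows that clause is a sampling step, the optimizer can pin it to the scan and reason about it, instead of shuffling it around like an ordinary filter.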

  And even then, users wonder why the sampling query is no faster than the original. So the vendors added yet another layer of complexity to make sampling go fast as well as generate "statistically rigorous" samples. (Note: they aren't rigorous. But they're usually good enough for government work.)

  In a nutshell, DBMS vendors (well, the one or two I am most familiar with) added random() functions and sampling with enormous care. They deal with a small number of very, very sophisticated customers who subject these products to tremendous amounts of validation. If those customers are disappointed, they get upset. (If they're happy, they never say so. Such is life.)
