Re: Randomize data from SQL statement

From: Gus <gus_goose_at_hotmail.com>
Date: 1998/09/21
Message-ID: <36064925.4096_at_hotmail.com>


Tricia Kha wrote:
>
> Hi,
>
> I want to generate a 30% random sample of the Order_number field
> of my Order_item table. Is there any technique or function that I could use? I
> would like it to be done in a SQL statement, not in PL/SQL.
>
> Thanks
> Tricia Kha
> tkha_at_accesscom.com

Sounds like a nice problem....

Try this .... it will work, but may be slowish:

select random_column = Identity(10), <columns to be returned>
into #temp_table
from source_table
order by DateDiff (ms, Convert (datetime, Convert (varchar (30), create_tm, 106)), create_tm)

Order by some intentionally meaningless thing. For example, if you have a datetime column such as a "created date", the DateDiff expression above will sort the rows by the number of milliseconds past midnight at which each record was created.

In #temp_table we now have all the data in a meaningless order (read: random order). Two things can happen: we can either select the first 30% (or the n'th 30%) like this ....

declare @rows int -- (declare this before the select-into)
select @rows = @@rowcount -- obviously, put this immediately after the select-into

select *
from #temp_table
where random_column <= 0.3 * @rows
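
If you want the n'th 30% instead of the first, a sketch along the same lines (my addition, reusing #temp_table and @rows from above) is to shift the window:

select *
from #temp_table
where random_column > 0.3 * @rows
  and random_column <= 0.6 * @rows -- the second 30% slice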

OR do a randomish 30%. I like the idea of using the modulo, like this:

select * from #temp_table
where Convert (integer, random_column) % 100 < 30

This will obviously only work properly if there are at least 100 records. Since 30 percent is exactly 3 in 10, you can also take a larger granularity, and have:

select * from #temp_table
where Convert (integer, random_column) % 10 < 3

Either way, you get what you want. Using the modulo, you incorporate two random elements, which should give you a better sample than if you only use some randomish sort order.

The principle is sound, I think. There are glaring problems, like the performance, which will be terrible, but I presume this is not going to be a regular event. If it is, then you should probably build some random element into the data as it is inserted.
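
Something like this, as a rough sketch (the random_key column is my invention, not part of your schema):

-- Stamp each row with a random-ish key at insert time:
alter table Order_item add random_key int null

insert into Order_item (Order_number, random_key)
values (12345, DatePart (ms, Getdate())) -- ms past the second: 0..999

-- Sampling then needs no temp table at all:
select Order_number
from Order_item
where random_key % 100 < 30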

Now I realise that I have overlooked the obvious ... what about the random element .....

Try this ......

select *
from source_table
where rand (DatePart (ms, update_tm)) < 0.3

The rand() function will require a seed, otherwise it is evaluated *before* the query is run, and you will get either *everything* or *nothing*. The seed for the rand() function must therefore reference a column in some way. I used the update_tm column of the test table I was using. The results I got are interesting, to say the least. With the above query, I *always* get the same records retrieved, and I *always* get about 60% of the records, not the expected 30%. Maybe there should be a second level of randomness thrown in. Use the current time ... like so:

select *
from source_table
where rand (DatePart (ms, update_tm) + DatePart (ms, Getdate())) < 0.3

I am getting seriously weird results ... from what I see, ... shit .... ;(

I have been confused by this rand() function. It is really quite shit for this purpose, anyway. It seems that rand (x) will always be < rand (x + 1). This will lead to interesting results. I know that random number generators are prone to not being random, but I would never have expected one to generate a number which is proportional to the seed.
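
If you want to see this on your own server, a one-liner like the following should show it. The actual values will differ, but the claim is that they grow with the seed:

select rand (1), rand (2), rand (3), rand (100), rand (200)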

I have had a look at the documentation (System 10), and I see a bit of text ... "The rand function uses the output of a 32-bit pseudo-random integer generator. The integer is divided by the maximum 32-bit integer to give a double value between 0.0 and 1.0"

So, the only real way to use the rand () function is to add a further element of randomness. Take the last significant digits and use those, or something like it.

select colA, colB, ....
from source_table
where Convert (integer, Right (Convert (varchar (40), 1000 * rand (Convert (integer, unique_column))), 2)) < 30

Note, the results from this process will always be the same no matter how many times you run it, if the underlying data does not change.

Add in Getdate(), which will be "unique" at all times, and you get something like this, which will always return one of 1000 possible sets of data ....

select colA, colB, ....
from source_table
where Convert (integer, Right (Convert (varchar (40), 1000 * rand (DatePart (ms, Getdate()) + rand (Convert (integer, unique_column)))), 2)) < 30

That should be enough for your purposes ....

Thus, what I would put my weight behind is either the modulo of a "random" identity column, or the above method of adding two "random" numbers together, one based on the data and the other based on the current time. Take the last two significant figures from this, and if they are less than 30, the row is in your sample ...

The benefit of the second one is that you are guaranteed about 1000 *random* sets of data (it is 1000 because that is how many milliseconds there are in a second). Thinking about it, it is silly to call the rand () function twice, so the above can be re-written as follows:

select colA, colB, ....
from source_table
where (DatePart (ms, Getdate()) + Convert (integer, Right (Convert (varchar (40), 1000 * rand (Convert (integer, unique_column))), 3))) % 100 < 30

What this does is as follows:

1. Generate a "random" number using some data-dependent value as a seed. I use unique_column (any identity column will do), which has to be converted to an integer.
2. Take the last three digits of this number. You have to multiply it by 1000 first, so that the varchar conversion comes out in the format "xxx.xxx" and not "x.xxx e yy".
3. Convert those last three digits back to an integer, and add the number of milliseconds past the current second.
4. If the last two digits of the sum (found by "% 100") are less than 30, the row falls into our sample.

That is about as random as I think you should need.
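
A worked trace with made-up numbers, just to make the mechanics concrete:

-- All values invented for illustration:
-- rand (42)                        -> 0.123456
-- 1000 * 0.123456                  -> 123.456
-- Convert (varchar (40), 123.456)  -> "123.456"
-- Right ("123.456", 3)             -> "456"
-- Convert (integer, "456")         -> 456
-- DatePart (ms, Getdate())         -> say, 789
-- (789 + 456) % 100 = 45, and 45 < 30 is false, so this row is excluded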

How is that?

This code should work for you directly now:

select Order_number
from Order_item
where (DatePart (ms, Getdate()) + Convert (integer, Right (Convert (varchar (40), 1000 * rand (Convert (integer, Order_number))), 3))) % 100 < 30

As far as I can see, that is probably the most efficient method: only one scan is required, and it may be an index scan if all the required data is covered ...
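
For example, an index on Order_number alone would cover the query, letting the optimizer scan the (much smaller) index instead of the table. The index name here is my invention:

create index Order_item_order_nbr on Order_item (Order_number)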

One problem I can see is the conversion of identity columns to integers, where there is potential for overflow. If Order_number > 2,147,483,647 you have a problem.
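
One possible workaround, which I have not tested on huge values: seed rand() with only the last six digits of Order_number, which always fit in an int:

select Order_number
from Order_item
where (DatePart (ms, Getdate()) + Convert (integer, Right (Convert (varchar (40), 1000 * rand (Convert (integer, Right (Convert (varchar (30), Order_number), 6)))), 3))) % 100 < 30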

Also, I had problems with the convert to varchar of double values, which was a bit temperamental, and thus I had to put in the "1000 * ". This is a less than ideal fix ...

Anyway, good luck.

gus
