weighted random selection of record

From: Stuart McGraw <smcgraw_no_spam_me_at_frii.com>
Date: Tue, 20 May 2003 09:47:04 -0600
Message-ID: <3eca4df9$0$200$75868355_at_news.frii.net>



I'm not sure this is the right group but since it is a somewhat theoretical question and involves databases... Pointers to more appropriate groups welcome.

I want a query that will randomly select a row from a table but with a probability determined by a "weight" value stored in the rows. For example, if I have the table

         id    weight

-- ------
1 .6 2 .1 3 .3

then I want to be able to run the query 100 times and get the first row back about 60 times, the second row about 10 times, and the third row about 30 times. (The values of the weights don't matter, only the ratios between them; the weight numbers above happen to be normalized to a sum of 1.0)

What I tried was

    SELECT TOP 1 * FROM tbl ORDER BY rnd()*[weight]

where rnd() is a function that returns a uniformly distributed real number between 0 and 1. This almost does the right thing. The problem is that the resulting distribution is not what one would (naively) hope. Running that query on the table

         id    weight

-- ------
1 1 2 3

one might hope would give row #1 25% (1/4) of the time and row #2 75% (3/4) of the time. In fact, it gives row #1 16% (1/6) of the time and row #2 84% (5/6) of the time. For two rows (assuming weight(1) is 1 and weight(2) is W), row #1 is returned with a probability of 1/(2*W) (this is easy to determine analytically.)

I have tried to work out the math for 3 rows but my analytical results don't agree with my empirical ones, and N rows is beyond my primitive skills. Still, it seems obvious to me that the weights can be adjusted by some amount that will result in the query producing any desired distribution of rows.

So the question is, if I want to select one row from rows (1,2,3,...,N) with probabilities (p1,p2,p3,...,pn), how do I find the values (w1,w2,w3,...,wn) that should go in column [weight]?

It seems to me that this is such a basic problem that someone must have encountered/solved it before. Received on Tue May 20 2003 - 17:47:04 CEST

Original text of this message