Re: weighted random selection of record

From: John Gilson <jag_at_acm.org>
Date: Wed, 21 May 2003 05:02:05 GMT
Message-ID: <hLDya.43430$h42.29572_at_twister.nyc.rr.com>


"Stuart McGraw" <smcgraw_no_spam_me_at_frii.com> wrote in message news:3eca4df9$0$200$75868355_at_news.frii.net...
> I'm not sure this is the right group but since it is a
> somewhat theoretical question and involves databases...
> Pointers to more appropriate groups welcome.
>
> I want a query that will randomly select a row from
> a table but with a probability determined by a "weight"
> value stored in the rows. For example, if I have the
> table
>
> id weight
> -- ------
> 1 .6
> 2 .1
> 3 .3
>
> then I want to be able to run the query 100 times and
> get the first row back about 60 times, the second
> row about 10 times, and the third row about 30
> times. (The values of the weights don't matter, only
> the ratios between them; the weight numbers above
> happen to be normalized to a sum of 1.0)
>
> What I tried was
>
> SELECT TOP 1 * FROM tbl ORDER BY rnd()*[weight]
>
> where rnd() is a function that returns a uniformly
> distributed real number between 0 and 1. This almost
> does the right thing. The problem is that the resulting
> distribution is not what one would (naively) hope.
> Running that query on the table
>
> id weight
> -- ------
> 1 1
> 2 3
>
> one might hope would give row #1 25% (1/4) of the
> time and row #2 75% (3/4) of the time. In fact,
> it gives row #1 16% (1/6) of the time and row #2
> 84% (5/6) of the time. For two rows (assuming
> weight(1) is 1 and weight(2) is W), row #1 is
> returned with a probability of 1/(2*W) (this is
> easy to determine analytically.)
>
> I have tried to work out the math for 3 rows but my
> analytical results don't agree with my empirical ones,
> and N rows is beyond my primitive skills. Still, it
> seems obvious to me that the weights can be adjusted
> by some amount that will result in the query producing
> any desired distribution of rows.
>
> So the question is, if I want to select one row from
> rows (1,2,3,...,N) with probabilities (p1,p2,p3,...,pn),
> how do I find the values (w1,w2,w3,...,wn) that should
> go in column [weight]?
>
> It seems to me that this is such a basic problem that
> someone must have encountered/solved it before.

CREATE TABLE T
(
id INT NOT NULL PRIMARY KEY,
weight FLOAT NOT NULL CHECK (weight >= 0) )

  • Sample data INSERT INTO T VALUES (1, 75) INSERT INTO T VALUES (2, 25)
CREATE VIEW WeightedRandomSelection
AS
SELECT TOP 1 Ranges.id,
                           Ranges.weight,
                           Ranges.lower_bound,
                           Ranges.upper_bound,
                           RND.n AS random
FROM (SELECT W1.id,
                             W1.weight,
                             SUM(W2.normalized_weight) -
                               W1.normalized_weight AS lower_bound,
                             SUM(W2.normalized_weight) AS upper_bound
              FROM (SELECT id,
                                           weight,
                                           COALESCE(CAST(weight AS FLOAT) /
                                              (SELECT NULLIF(SUM(weight), 0) FROM T), 0)
                            FROM T) AS W1(id, weight, normalized_weight)
                          INNER JOIN
                          (SELECT COALESCE(CAST(weight AS FLOAT) /
                                             (SELECT NULLIF(SUM(weight), 0) FROM T), 0)
                            FROM T) AS W2(normalized_weight)
                          ON W2.normalized_weight <= W1.normalized_weight
              GROUP BY W1.id, W1.weight, W1.normalized_weight) AS Ranges
            CROSS JOIN
            (SELECT RAND()) AS RND(n)

WHERE RND.n BETWEEN lower_bound AND upper_bound

Let's see if we get roughly a 3:1 ratio between id = 1 and id = 2 if we execute the above SELECT 100 times.

  • _at_id1 will count the number of times the id = 1 row is chosen.
  • _at_id2 will count the number of times the id = 2 row is chosen. DECLARE _at_counter INT, @id1 INT, @id2 INT SELECT _at_counter = 1, @id1 = 0, @id2 = 0 WHILE (_at_counter <= 100) BEGIN SELECT _at_id1 = @id1 + CASE WHEN id = 1 THEN 1 ELSE 0 END, _at_id2 = @id2 + CASE WHEN id = 2 THEN 1 ELSE 0 END FROM WeightedRandomSelection SELECT _at_counter = @counter + 1 END SELECT 'id = 1 count is ' + CAST(_at_id1 AS VARCHAR), 'id = 2 count is ' + CAST(_at_id2 AS VARCHAR)

Running the above code 5 times I get the following results:

id = 1 count is 69 id = 2 count is 31
id = 1 count is 84 id = 2 count is 16
id = 1 count is 79 id = 2 count is 21
id = 1 count is 74 id = 2 count is 26
id = 1 count is 72 id = 2 count is 28

So the average number of times id = 1 for the 5 runs is 75.6 and the average number of times id = 2 for the 5 runs is 24.4.

Regards,
jag Received on Wed May 21 2003 - 07:02:05 CEST

Original text of this message