Re: weighted random selection of record

From: Stuart McGraw <smcgraw_no_spam_me@frii.com>
Date: Wed, 21 May 2003 11:58:47 -0600
Message-ID: <3ecbbe58$0$197$75868355@news.frii.net>


I came up with a query in another response that also solves the problem, given that I am willing to redefine the meaning of the [weight] attribute (i.e. change it to cumulative values). Your query does not require doing that (and also uses some other methods I hadn't thought of). Thanks!
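
For the archives, the cumulative variant I had in mind looks roughly like this. It is only a sketch: the table Tc and its [cum_weight] column are made-up names, and [cum_weight] must hold running totals of the weights, so .6/.1/.3 is stored as .6/.7/1.0:

-- Sketch, not tested: Tc(id, cum_weight) is hypothetical, with
-- cum_weight holding the running totals .6, .7, 1.0.
SELECT TOP 1 Tc.id
FROM Tc
     CROSS JOIN
     -- one uniform draw in [0, total), fixed for the whole query
     (SELECT RAND() * (SELECT MAX(cum_weight) FROM Tc)) AS R(n)
WHERE Tc.cum_weight > R.n
ORDER BY Tc.cum_weight

The first running total that exceeds the draw identifies the row, so each row comes back with probability proportional to its original weight; the derived table pins RAND() to a single value per query, the same trick your view uses.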

"John Gilson" <jag_at_acm.org> wrote in message news:hLDya.43430$h42.29572_at_twister.nyc.rr.com...
> "Stuart McGraw" <smcgraw_no_spam_me_at_frii.com> wrote in message
> news:3eca4df9$0$200$75868355_at_news.frii.net...
> > I'm not sure this is the right group but since it is a
> > somewhat theoretical question and involves databases...
> > Pointers to more appropriate groups welcome.
> >
> > I want a query that will randomly select a row from
> > a table but with a probability determined by a "weight"
> > value stored in the rows. For example, if I have the
> > table
> >
> > id weight
> > -- ------
> > 1 .6
> > 2 .1
> > 3 .3
> >
> > then I want to be able to run the query 100 times and
> > get the first row back about 60 times, the second
> > row about 10 times, and the third row about 30
> > times. (The values of the weights don't matter, only
> > the ratios between them; the weight numbers above
> > happen to be normalized to a sum of 1.0)
> >
> > What I tried was
> >
> > SELECT TOP 1 * FROM tbl ORDER BY rnd()*[weight]
> >
> > where rnd() is a function that returns a uniformly
> > distributed real number between 0 and 1. This almost
> > does the right thing. The problem is that the resulting
> > distribution is not what one would (naively) hope.
> > Running that query on the table
> >
> > id weight
> > -- ------
> > 1 1
> > 2 3
> >
> > one might hope would give row #1 25% (1/4) of the
> > time and row #2 75% (3/4) of the time. In fact,
> > it gives row #1 16% (1/6) of the time and row #2
> > 84% (5/6) of the time. For two rows (assuming
> > weight(1) is 1 and weight(2) is W), row #1 is
> > returned with a probability of 1/(2*W) (this is
> > easy to determine analytically.)
> >
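
A quick check on that two-row figure: the observed counts imply the query keeps the row with the *larger* rnd()*[weight] key, so row #1 wins exactly when rnd1 > W*rnd2 for two independent uniform draws. For W >= 1 that probability is the average of rnd1/W over [0, 1], i.e. 1/(2*W), which gives the 1/6 observed for W = 3.
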
> > I have tried to work out the math for 3 rows but my
> > analytical results don't agree with my empirical ones,
> > and N rows is beyond my primitive skills. Still, it
> > seems obvious to me that the weights can be adjusted
> > by some amount that will result in the query producing
> > any desired distribution of rows.
> >
> > So the question is, if I want to select one row from
> > rows (1,2,3,...,N) with probabilities (p1,p2,p3,...,pn),
> > how do I find the values (w1,w2,w3,...,wn) that should
> > go in column [weight]?
> >
> > It seems to me that this is such a basic problem that
> > someone must have encountered/solved it before.
>
> CREATE TABLE T
> (
>     id INT NOT NULL PRIMARY KEY,
>     weight FLOAT NOT NULL CHECK (weight >= 0)
> )
>
> -- Sample data
> INSERT INTO T
> VALUES (1, 75)
> INSERT INTO T
> VALUES (2, 25)
>
> CREATE VIEW WeightedRandomSelection
> AS
> SELECT TOP 1 Ranges.id,
>              Ranges.weight,
>              Ranges.lower_bound,
>              Ranges.upper_bound,
>              RND.n AS random
> FROM (SELECT W1.id,
>              W1.weight,
>              SUM(W2.normalized_weight) -
>                W1.normalized_weight AS lower_bound,
>              SUM(W2.normalized_weight) AS upper_bound
>       FROM (SELECT id,
>                    weight,
>                    COALESCE(CAST(weight AS FLOAT) /
>                             (SELECT NULLIF(SUM(weight), 0) FROM T), 0)
>             FROM T) AS W1(id, weight, normalized_weight)
>            INNER JOIN
>            (SELECT id,
>                    COALESCE(CAST(weight AS FLOAT) /
>                             (SELECT NULLIF(SUM(weight), 0) FROM T), 0)
>             FROM T) AS W2(id, normalized_weight)
>            -- accumulate in id order so equal weights still get disjoint ranges
>            ON W2.id <= W1.id
>       GROUP BY W1.id, W1.weight, W1.normalized_weight) AS Ranges
>      CROSS JOIN
>      (SELECT RAND()) AS RND(n)
> WHERE RND.n BETWEEN lower_bound AND upper_bound
>
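
To make the mechanics concrete: for the sample data the normalized weights are .75 and .25, and the Ranges derived table carves [0, 1] into one interval per row, accumulating in id order:

id  lower_bound  upper_bound
--  -----------  -----------
1   0.0          0.75
2   0.75         1.0

The single RAND() draw, fixed once per query by the CROSS JOIN derived table, lands in exactly one row's interval, so id = 1 is selected with probability .75 and id = 2 with probability .25.
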
> Let's see if we get roughly a 3:1 ratio between id = 1 and id = 2
> if we execute the above SELECT 100 times.
>
> -- @id1 will count the number of times the id = 1 row is chosen.
> -- @id2 will count the number of times the id = 2 row is chosen.
> DECLARE @counter INT, @id1 INT, @id2 INT
> SELECT @counter = 1, @id1 = 0, @id2 = 0
> WHILE (@counter <= 100)
> BEGIN
>     SELECT @id1 = @id1 + CASE WHEN id = 1 THEN 1 ELSE 0 END,
>            @id2 = @id2 + CASE WHEN id = 2 THEN 1 ELSE 0 END
>     FROM WeightedRandomSelection
>     SELECT @counter = @counter + 1
> END
> SELECT 'id = 1 count is ' + CAST(@id1 AS VARCHAR),
>        'id = 2 count is ' + CAST(@id2 AS VARCHAR)
>
> Running the above code 5 times I get the following results:
> id = 1 count is 69 id = 2 count is 31
> id = 1 count is 84 id = 2 count is 16
> id = 1 count is 79 id = 2 count is 21
> id = 1 count is 74 id = 2 count is 26
> id = 1 count is 72 id = 2 count is 28
>
> So the average number of times id = 1 for the 5 runs is 75.6 and
> the average number of times id = 2 for the 5 runs is 24.4.
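
(Sanity check on those counts: with 100 draws at p = .75 the standard deviation is sqrt(100 * .75 * .25), about 4.3, so per-run counts of 69 to 84 for id = 1 sit within roughly two standard deviations of the expected 75.)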
>
> Regards,
> jag
>
>
Received on Wed May 21 2003 - 19:58:47 CEST
