Re: Histogram/Bucketizing resultset

From: Mark Townsend <markbtownsend_at_attbi.com>
Date: Sat, 02 Nov 2002 23:54:09 GMT
Message-ID: <B9E9A58D.DC9%markbtownsend@attbi.com>

in article B9E96DF2.D86%markbtownsend_at_attbi.com, Mark Townsend at markbtownsend_at_attbi.com wrote on 11/2/02 11:56 AM:

The corollary example is interesting too - If you want this done as equiheight histograms, the NTILE capabilities simplify life considerably

SELECT min(b.age)||' - '||max(b.age) Range,

count(*) Count
FROM(SELECT age,

            ntile(&i) over(order by age) as bucket
     FROM t) b

GROUP BY b.bucket
/

Gives:-

Enter Number of Buckets Required: 10

RANGE                     COUNT
-------------------- ----------
17 - 22                     123
22 - 27                     123
27 - 30                     123
30 - 34                     123
34 - 38                     123
38 - 42                     123
42 - 46                     122
46 - 51                     122
51 - 58                     122
58 - 90                     122

10 rows selected.

Enter Number of Buckets Required: 20

RANGE                     COUNT
-------------------- ----------
17 - 20                      62
20 - 22                      62
23 - 25                      62
25 - 27                      62
27 - 29                      62
29 - 30                      62
30 - 32                      61
32 - 34                      61
34 - 36                      61
36 - 38                      61
38 - 40                      61
40 - 42                      61
42 - 44                      61
44 - 46                      61
46 - 48                      61
48 - 51                      61
51 - 54                      61
54 - 58                      61
58 - 63                      61
63 - 90                      61

20 rows selected.

> in article 3DC352BB.C4BF69FA_at_drew.edu, Steve Kass at skass_at_drew.edu wrote on
> 11/1/02 8:21 PM:
>

>> No reason not to have a database in front of you - the enterprise evaluation
>> edition of SQL Server is free and lasts for 120 days...
>> 
>> http://www.microsoft.com/sql/evaluation/trial/

>
> Hmm - won't run on my MAC's at home, unfortunately :-)

>> 
>> Anyway, one way or another, if you find a better solution to the bucketing,
>> let us know.
>>

>
> Here's the Oracle one - note the following (self imposed) rules
>
> 1) A single SQL Statement
> 2) Upper and Lower bounds derived from the actual data
> 3) Must be able to deal with a differing number of buckets
>
> Here's the single SQL statement
>
> SELECT TO_CHAR(b.l+((b.u-b.l)/&i * (b.n-1)))
> ||' - '||
> TO_CHAR(b.l+((b.u-b.l)/&i * b.n )) Range,
> b.c Count
> FROM (SELECT
> max(bound.upper) u,
> min(bound.lower) l,
> width_bucket(age,bound.lower,bound.upper,&i) n,
> count(*) c
> FROM (SELECT max(age)+1 upper,min(age) lower FROM t) bound, t
> GROUP BY width_bucket(age,bound.lower,bound.upper,&i)) b
>
> Testing it on Oracle9i Release 2 shows as follows
>
> -- Create a table t with some data - use the supplied census data
>
> DROP TABLE t;
>
> Table dropped.
>
> CREATE TABLE t AS SELECT age FROM odm_mtr.census_2d_apply_unbinned;
>
> Table created.
>
> -- Show the ranges and total number of rows
>
> SELECT min(age), max(age), count(*) FROM t;
>
> MIN(AGE) MAX(AGE) COUNT(*)
> ---------- ---------- ----------
> 17 90 1226
>
> -- Set up some display parameters for SQL*Plus
> COLUMN RANGE FORMAT A20
> SET SHOWMODE OFF
> SET VERIFY OFF
> SET PAGESIZE 99
>
> -- Accept the number of bins required
> -- Outside of SQL*Plus, i would be a bind variable
> ACCEPT i NUMBER FORMAT '999' DEFAULT 10 -
> PROMPT 'Enter Number of Buckets Required: '
>
> Enter Number of Buckets Required: 10
>
> RANGE COUNT
> -------------------- ----------
> 17 - 24.4 179
> 24.4 - 31.8 232
> 31.8 - 39.2 241
> 39.2 - 46.6 218
> 46.6 - 54 160
> 54 - 61.4 116
> 61.4 - 68.8 53
> 68.8 - 76.2 17
> 76.2 - 83.6 7
> 83.6 - 91 3
>
> 10 rows selected.
>
> Enter Number of Buckets Required: 20
>
> RANGE COUNT
> -------------------- ----------
> 17 - 20.7 77
> 20.7 - 24.4 102
> 24.4 - 28.1 118
> 28.1 - 31.8 114
> 31.8 - 35.5 123
> 35.5 - 39.2 118
> 39.2 - 42.9 101
> 42.9 - 46.6 117
> 46.6 - 50.3 107
> 50.3 - 54 53
> 54 - 57.7 64
> 57.7 - 61.4 52
> 61.4 - 65.1 40
> 65.1 - 68.8 13
> 68.8 - 72.5 11
> 72.5 - 76.2 6
> 76.2 - 79.9 3
> 79.9 - 83.6 4
> 83.6 - 87.3 1
> 87.3 - 91 2
>
> 20 rows selected.
>

>> 
>> SK
>> 
>> Mark Townsend wrote:
>> 
>>> in article B9E741D5.C3D%markbtownsend_at_attbi.com, Mark Townsend at
>>> markbtownsend_at_attbi.com wrote on 10/31/02 8:24 PM:
>>> 
>>>> Hence the WIDTH_BUCKET function - something like (I don't have Oracle at
>>>> home to test this) -
>>>> 
>>>> SELECT WIDTH_BUCKET(age, lbound, ubound, nbr_buckets) "Bucket Nbr",
>>>> count(*)
>>>> FROM t
>>>> GROUP BY "Bucket Nbr";
>>>> 
>>>> where lbound is min(age), ubound is max(age), and nbr_buckets is the number
>>>> of buckets you want (10)
>>>> 
>>>> Note that Bucket Nbr would be a number i.e 1 for the first bucket, 2 for
>>>> the
>>>> second bucket etc.
>>>> 
>>>> To turn this number into the actual range statement, you would need
>>>> something like an inline view
>>>> 
>>>> SELECT lbound+(((ubound-lboundb)/nbr_buckets) * (b.bucket_nbr-1))||' - '||
>>>> lbound+(((ubound-lboundb)/nbr_buckets) *  b.bucket_nbr) "Bucket",
>>>> val "Count"
>>>> FROM  (SELECT WIDTH_BUCKET(age, lbound, ubound, nbr_buckets) "Bucket_Nbr",
>>>> count(*) val
>>>> FROM t
>>>> GROUP BY "Bucket") b
>>>> 
>>>> Note that this will work (assuming the theory is correct) for any range of
>>>> values, but more importantly for any number of buckets. YMMV, as there is
>>>> bound to be some debugging required for the above pseudo code.
>>> 
>>> Apologies - I just realized I missed the most significant part of your
>>> follow up post
>>> 
>>>>> Assume that the lower bound and the upper bound is not preset
>>> 
>>> I'll need to think about this a bit more, at the office, with a database in
>>> front of me.

>
Received on Sat Nov 02 2002 - 17:54:09 CST