Re: Histogram/Bucketizing resultset

From: Mark Townsend <markbtownsend_at_attbi.com>
Date: Sat, 02 Nov 2002 19:56:54 GMT
Message-ID: <B9E96DF2.D86%markbtownsend@attbi.com>

in article 3DC352BB.C4BF69FA_at_drew.edu, Steve Kass at skass_at_drew.edu wrote on 11/1/02 8:21 PM:

> No reason not to have a database in front of you - the enterprise evaluation
> edition of SQL Server is free and lasts for 120 days...
> 
> http://www.microsoft.com/sql/evaluation/trial/

Hmm - won't run on my MAC's at home, unfortunately :-)

> 
> Anyway, one way or another, if you find a better solution to the bucketing,
> let us know.
>

Here's the Oracle one - note the following (self imposed) rules

A single SQL Statement
Upper and Lower bounds derived from the actual data
Must be able to deal with a differing number of buckets

Here's the single SQL statement

SELECT TO_CHAR(b.l+((b.u-b.l)/&i * (b.n-1)))

       ||' - '|| 
       TO_CHAR(b.l+((b.u-b.l)/&i *  b.n   )) Range,
       b.c Count
FROM (SELECT 
        max(bound.upper) u,
        min(bound.lower) l,
        width_bucket(age,bound.lower,bound.upper,&i) n,
        count(*) c
      FROM (SELECT max(age)+1 upper,min(age) lower FROM t) bound, t
      GROUP BY width_bucket(age,bound.lower,bound.upper,&i)) b

Testing it on Oracle9i Release 2 shows as follows

Create a table t with some data - use the supplied census data

DROP TABLE t;

Table dropped.

CREATE TABLE t AS SELECT age FROM odm_mtr.census_2d_apply_unbinned;

Table created.

Show the ranges and total number of rows

SELECT min(age), max(age), count(*) FROM t;

MIN(AGE) MAX(AGE) COUNT(*)
---------- ---------- ----------

17 90 1226

Set up some display parameters for SQL*Plus COLUMN RANGE FORMAT A20 SET SHOWMODE OFF SET VERIFY OFF SET PAGESIZE 99
Accept the number of bins required
Outside of SQL*Plus, i would be a bind variable ACCEPT i NUMBER FORMAT '999' DEFAULT 10 - PROMPT 'Enter Number of Buckets Required: '

Enter Number of Buckets Required: 10

RANGE                     COUNT
-------------------- ----------
17 - 24.4                   179
24.4 - 31.8                 232
31.8 - 39.2                 241
39.2 - 46.6                 218
46.6 - 54                   160
54 - 61.4                   116
61.4 - 68.8                  53
68.8 - 76.2                  17
76.2 - 83.6                   7
83.6 - 91                     3

10 rows selected.

Enter Number of Buckets Required: 20

RANGE                     COUNT
-------------------- ----------
17 - 20.7                    77
20.7 - 24.4                 102
24.4 - 28.1                 118
28.1 - 31.8                 114
31.8 - 35.5                 123
35.5 - 39.2                 118
39.2 - 42.9                 101
42.9 - 46.6                 117
46.6 - 50.3                 107
50.3 - 54                    53
54 - 57.7                    64
57.7 - 61.4                  52
61.4 - 65.1                  40
65.1 - 68.8                  13
68.8 - 72.5                  11
72.5 - 76.2                   6
76.2 - 79.9                   3
79.9 - 83.6                   4
83.6 - 87.3                   1
87.3 - 91                     2

20 rows selected.

> 
> SK
> 
> Mark Townsend wrote:
> 
>> in article B9E741D5.C3D%markbtownsend_at_attbi.com, Mark Townsend at
>> markbtownsend_at_attbi.com wrote on 10/31/02 8:24 PM:
>>

>>> Hence the WIDTH_BUCKET function - something like (I don't have Oracle at
>>> home to test this) -
>>>
>>> SELECT WIDTH_BUCKET(age, lbound, ubound, nbr_buckets) "Bucket Nbr", count(*)
>>> FROM t
>>> GROUP BY "Bucket Nbr";
>>>
>>> where lbound is min(age), ubound is max(age), and nbr_buckets is the number
>>> of buckets you want (10)
>>>
>>> Note that Bucket Nbr would be a number i.e 1 for the first bucket, 2 for the
>>> second bucket etc.
>>>
>>> To turn this number into the actual range statement, you would need
>>> something like an inline view
>>>
>>> SELECT lbound+(((ubound-lboundb)/nbr_buckets) * (b.bucket_nbr-1))||' - '||
>>> lbound+(((ubound-lboundb)/nbr_buckets) * b.bucket_nbr) "Bucket",
>>> val "Count"
>>> FROM (SELECT WIDTH_BUCKET(age, lbound, ubound, nbr_buckets) "Bucket_Nbr",
>>> count(*) val
>>> FROM t
>>> GROUP BY "Bucket") b
>>>
>>> Note that this will work (assuming the theory is correct) for any range of
>>> values, but more importantly for any number of buckets. YMMV, as there is
>>> bound to be some debugging required for the above pseudo code.

>> 
>> Apologies - I just realized I missed the most significant part of your
>> follow up post
>> 
>>>> Assume that the lower bound and the upper bound is not preset
>> 
>> I'll need to think about this a bit more, at the office, with a database in
>> front of me.

Received on Sat Nov 02 2002 - 13:56:54 CST