Top 5 values for each column

From: Alexander Kuznetsov <Alexander.Kuznetsov_at_marshmc.com>
Date: 30 Jan 2003 08:34:19 -0800
Message-ID: <ac5bc7c1.0301300834.27cbff36_at_posting.google.com>


Hi all,
we need to report top 5 most frequent not null values for each column in relatively large tables having up to 200 columns. The draft solution we have right now is:

  1. Create a histogramm table create table histogramm(colname char(30), VALUE CHAR(10), freq number(5));
  2. Populate it 2.1 Using pure SQL, like this (performance may suck):

 2 select 'claim' as fieldname, claim as value, count(*) as freq from XXX group by claim
  3 union all
  4 select 'trans_dt' as fieldname, cast(trans_dt as char(20)) as value, count(*) as freq from XXX group by trans_dt   5 union all
  6 select 'seq' as fieldname, cast(seq as char(20)) as value, count(*) as freq from XXX group by seq;

and so on, number of columns up to 200 and maybe more

2.2 or do the same using a cursor

3. Retrieve the results:

SQL> SELECT * FROM
  2 (SELECT ROWNUM AS RN, COLNAME, VALUE, FREQ FROM   3 (SELECT COLNAME, VALUE, FREQ FROM histogramm ORDER BY COLNAME, FREQ DESC) T0
  4 ) T1
  5 WHERE RN-5 < (SELECT MIN(RN) FROM
  6 (SELECT ROWNUM AS RN, COLNAME, VALUE, FREQ FROM   7 (SELECT COLNAME, VALUE, FREQ FROM histogramm ORDER BY COLNAME, FREQ DESC) T3
  8 ) T4
  9 WHERE T4.COLNAME = T1.COLNAME
 10 );

Row number may be added to histogramm table to simplify / speed up the query, but step 3 does not concern me too much.

Theoretically it works but my concern is performance of step 2.

What do you think?
I'd appreciate any comments.

P.S. it is very easy to accomplish in DB2, just run statistics with distribution and top values for every column are right over there in a system view ... Received on Thu Jan 30 2003 - 17:34:19 CET

Original text of this message