Top 5 values for each column
Date: 30 Jan 2003 08:34:19 -0800
Message-ID: <ac5bc7c1.0301300834.27cbff36_at_posting.google.com>
Hi all,
we need to report top 5 most frequent not null values for each column
in relatively large tables having up to 200 columns. The draft
solution we have right now is:
- Create a histogramm table create table histogramm(colname char(30), VALUE CHAR(10), freq number(5));
- Populate it 2.1 Using pure SQL, like this (performance may suck):
2 select 'claim' as fieldname, claim as value, count(*) as freq from
XXX group by claim
3 union all
4 select 'trans_dt' as fieldname, cast(trans_dt as char(20)) as
value, count(*) as freq from XXX group by trans_dt
5 union all
6 select 'seq' as fieldname, cast(seq as char(20)) as value,
count(*) as freq from XXX group by seq;
and so on, number of columns up to 200 and maybe more
2.2 or do the same using a cursor
3. Retrieve the results:
SQL> SELECT * FROM
2 (SELECT ROWNUM AS RN, COLNAME, VALUE, FREQ FROM
3 (SELECT COLNAME, VALUE, FREQ FROM histogramm ORDER BY COLNAME,
FREQ DESC) T0
4 ) T1
5 WHERE RN-5 < (SELECT MIN(RN) FROM
6 (SELECT ROWNUM AS RN, COLNAME, VALUE, FREQ FROM
7 (SELECT COLNAME, VALUE, FREQ FROM histogramm ORDER BY COLNAME,
FREQ DESC) T3
8 ) T4
9 WHERE T4.COLNAME = T1.COLNAME
10 );
Row number may be added to histogramm table to simplify / speed up the query, but step 3 does not concern me too much.
Theoretically it works but my concern is performance of step 2.
What do you think?
I'd appreciate any comments.
P.S. it is very easy to accomplish in DB2, just run statistics with distribution and top values for every column are right over there in a system view ... Received on Thu Jan 30 2003 - 17:34:19 CET