Re: Group by of 400 million rows to deal with dupes is it best way?
You already mentioned a very good idea:
select bill_month
, account
, cons_type
, business_id
, busine_spec
, sum (cons_amount) as cons_amount
from bigtab
group by bill_month, account, cons_type, business_id, busine_spec
order by 1, 2, 3
/
If you set your sort_area_size to a fairly large value and can parallelize the query, this should not be a challenge for your system. Assume your data is packed as densely as possible into your blocks, and assume a block size of 16 KB. With rows of about 30 bytes each you get roughly 16,000/30, or about 530 rows per block, so you have to read about 400,000,000/530 = 755,000 blocks.
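Spelled out, that could look like the sketch below, assuming old-style manual sort areas; the 100 MB sort area and the degree of parallelism of 8 are made-up example values, not figures from this thread:

alter session set sort_area_size = 104857600;  -- 100 MB, illustrative value

select /*+ full(t) parallel(t, 8) */  -- DOP 8 is an example value
  bill_month
, account
, cons_type
, business_id
, busine_spec
, sum (cons_amount) as cons_amount
from bigtab t
group by bill_month, account, cons_type, business_id, busine_spec
order by 1, 2, 3
/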
Assume you have striped your disks so that the SCSI channel is the bottleneck, with about 80 MB/s of bandwidth; that is roughly 5120 blocks/s, so reading the complete table in a full table scan takes about 755,000/5120 = 147 s.
If you assume the complete table has to be written and read five times over for grouping and sorting, that is ten passes of about 147 s each, or something like 1500 s, i.e. well under half an hour. So what are you afraid of? 400 Mrows are not *that* many.
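If you want to check the arithmetic in SQL*Plus, here are the same estimates as a single query against dual (all inputs are the assumed figures from above):

select round(16000/30)              as rows_per_block  -- ~530
, round(400000000/530)              as blocks_to_scan  -- ~755,000
, round(80*1024*1024/16384)         as blocks_per_sec  -- 5120
, round(400000000/530/5120)         as secs_per_scan   -- ~147
, round(400000000/530/5120*10)      as secs_total      -- 5 x (write + read)
from dual
/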
Martin

Received on Sun May 06 2001 - 12:34:46 CDT