comp.databases.oracle.server: Group by of 400 million rows to deal with dupes - is it the best way?
Problem: what is the best way to deal with dupes (~1-2%) occurring in a 400 million row table (4-8 million rows are dupes)? Is a GROUP BY with the duped column listed last the best way?
We have a table with 402 million rows. It is 8 columns wide and 14 GB in size. Structure (note: cons_type is the column with the dupes):
bill_month    date
account       number(10)
business_id   varchar2(3)
busine_spec   varchar2(5)
cons_type     varchar2(2)
cons_amount   number(6,2)
There will be several rows for each combination of bill_month and account; however, each one should have a unique cons_type. So, for example, a correct set of rows looks like this (note the unique t1, t2, t3):
12/01/00  3433  ed  we  t1   43.22
12/01/00  3433  ed  we  t2   53.22
12/01/00  3433  ed  we  t3  563.22
However, around 1-2% of the rows are dupes on cons_type, and we need to group them to get rid of the duplicated cons_type.
A BAD set of rows looks like this (note that there are two t2 rows):
12/01/00  3433  ed  we  t1   43.22
12/01/00  3433  ed  we  t2   53.22
12/01/00  3433  ed  we  t2  412.00
should become:
12/01/00  3433  ed  we  t1   43.22
12/01/00  3433  ed  we  t2  465.22
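Before rewriting anything, the offending combinations can be located with a HAVING clause. A sketch only; the table name "bills" is an assumption, since the post never names the table:

```sql
-- Find combinations that occur more than once on cons_type.
-- "bills" is an assumed table name.
SELECT bill_month, account, business_id, busine_spec, cons_type,
       COUNT(*) AS dupe_count
FROM   bills
GROUP BY bill_month, account, business_id, busine_spec, cons_type
HAVING COUNT(*) > 1;
```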
It seems we need to
group by bill_month, account, business_id, busine_spec, cons_type
and sum(cons_amount) to get rid of the dupes. But with 400 million rows, it is a big job.
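The grouping described above would look roughly like this (again assuming the table is named "bills"):

```sql
-- Collapse duplicate cons_type rows by summing cons_amount.
-- "bills" is an assumed table name.
SELECT bill_month, account, business_id, busine_spec, cons_type,
       SUM(cons_amount) AS cons_amount
FROM   bills
GROUP BY bill_month, account, business_id, busine_spec, cons_type;
```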
Does anybody have a better idea, one that can be done in a reasonable amount of time?
We can do a create table on the result for performance, and an order by would be very nice since the current physical order of the rows is not optimal. (Even with indexes, finding 10,000 rows spread over 400 million is a big job: lots of jumping from disk to disk even when Oracle knows exactly where each row is. Having them ordered by month and account would be best.)
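One sketch of that create-table approach, assuming the source table is named "bills", that the new table is called "bills_dedup", and that the Oracle version in use accepts ORDER BY in a CTAS subquery; the NOLOGGING and PARALLEL options are possible speed-ups, not something the post itself specifies:

```sql
-- Build a deduped, pre-ordered copy of the table in one pass.
-- Table names and the NOLOGGING/PARALLEL options are assumptions.
CREATE TABLE bills_dedup
  NOLOGGING
  PARALLEL
AS
SELECT   bill_month, account, business_id, busine_spec, cons_type,
         SUM(cons_amount) AS cons_amount
FROM     bills
GROUP BY bill_month, account, business_id, busine_spec, cons_type
ORDER BY bill_month, account;
```

After verifying row counts against the expected 1-2% reduction, the old table could be dropped and the new one renamed into place.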
Any ideas?

Received on Sun May 06 2001 - 08:42:08 CDT