Oracle FAQ Your Portal to the Oracle Knowledge Grid
HOME | ASK QUESTION | ADD INFO | SEARCH | E-MAIL US
 

Home -> Community -> Usenet -> c.d.o.server -> Re: Group by of 400 million rows to deal with dupes is it best way?

Re: Group by of 400 million rows to deal with dupes is it best way?

From: Steve Long <steven.long_at_erols.com>
Date: Sun, 6 May 2001 21:02:29 -0400
Message-ID: <9d4scj$ncs$1@bob.news.rcn.net>

write a PL/SQL program to perform this process. in PL/SQL you can commit every N rows using a counter. i believe this is an optimal solution, all things being considered.

"sdfsd" <sdfes_at_dsf.com> wrote in message news:3af55420$0$88682$45beb828_at_newscene.com...
>
> problem: what is best way to deal with dupes (~1-2%) occuring in a 400
> million row table (4-8 million rows are dupes). Is a group by with duped
> column the last one best??
>
> we have a table with 402 million rows. It is 8 columns wide 14 gigs in
 size.
> structure: note cons_type is the column with dupes.
>
> bill_month date
> account number(10)
> business_id varchar2(3)
> busine_spec varchar2(5)
> cons_type varchar2(2)
> cons_amount number(6,2)
>
> there will be several rows for each combiantion of bill_month and account,
> however each one should have a unique cons_type. so for example the
 correct
> one will be: note unique t1, t2, t3
>
> 12/01/00 3433 ed we t1 43.22
> 12/01/00 3433 ed we t2 53.22
> 12/01/00 3433 ed we t3 563.22
>
> However around 1-2% are dupes of cons_type and we need them to group them
 to
> get rid of duped cons_type
>
> BAD row is : note that there are 2 t2
>
> 12/01/00 3433 ed we t1 43.22
> 12/01/00 3433 ed we t2 53.22
> 12/01/00 3433 ed we t2 412.00
>
> should become:
>
> 12/01/00 3433 ed we t1 43.22
> 12/01/00 3433 ed we t2 465.22
>
>
> It seems we need to
> group by bill_month,account,business_id,busine_spec,cons_type
> and sum(cons_amount) to get rid of dupes. But with 400 million rows its
 big
> job
>
> Does anybody have a better idea? That can be done in reasonable amount of
> time?
>
> We can do a create table on result for performance and an order by would
 be
> very nice since the current sequence of rows is not optimal. (even with
> indexes finding 10,000 rows spread over 400 million rows is a big job.
 lots
> of jumping from disk to disk even when oracle knows exactly where it is.
> having them ordered by month and account would be best)
>
> any ideas?
>
>
>
>
Received on Sun May 06 2001 - 20:02:29 CDT

Original text of this message

HOME | ASK QUESTION | ADD INFO | SEARCH | E-MAIL US