Re: Group by of 400 million rows to deal with dupes is it best way?
You may want to think strategically about this one rather than just finding a solution to the first problem.
14 GB is a nuisance - can you partition the table by bill_month to get a more manageable unit size without adding too much overhead? De-duplicating one month at a time might be a lot easier to handle. This could also improve your query performance if your queries tend to be focused on bill_month.
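As a rough illustration only - the table name, partition names and date boundaries below are assumptions, and a real split would need one partition per bill_month you actually hold:

    create table bills_part (
        bill_month   date,
        account      number(10),
        business_id  varchar2(3),
        busine_spec  varchar2(5),
        cons_type    varchar2(2),
        cons_amount  number(6,2)
    )
    partition by range (bill_month) (
        partition p_2000_12 values less than (to_date('01-JAN-2001','DD-MON-YYYY')),
        partition p_2001_01 values less than (to_date('01-FEB-2001','DD-MON-YYYY')),
        partition p_max     values less than (maxvalue)
    );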
After you have eliminated the duplicates, do you want to maintain uniqueness over the given columns? If so, you probably need a unique/primary key constraint, and this could affect the way you solve the current problem.
One detail to consider is that a GROUP BY that reduces the data volume by only a small amount is usually not very effective - and tends to get worse if you make it run parallel.
One idea (a rough SQL sketch of these steps follows the list):

1. Create a unique or primary key constraint on the target columns - make the constraint deferrable, so that it is built with a non-unique index (an 8.1 feature), and create it DISABLE NOVALIDATE.

2. Try to enable the constraint, reporting exceptions into another table. This will copy the rowids of all the rows which breach uniqueness.

3. Create a table which does the GROUP BY, joining the rowid list to the base table - so the sort/group targets only the 4-8M rows. (Aiming for a hash join might be the best path.)

4. Delete from the base table where rowid in (select stored rowid).

5. Insert into the base table the results from the created sub-total table.

6. Enable the unique / primary key constraint.
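A minimal sketch of those steps, assuming the base table is called bills, the constraint bills_uq, and the work tables exceptions and bills_fix (all names are illustrative); the exceptions table has the layout created by the standard utlexcpt.sql script:

    -- exceptions table as created by ?/rdbms/admin/utlexcpt.sql
    create table exceptions (
        row_id      rowid,
        owner       varchar2(30),
        table_name  varchar2(30),
        constraint  varchar2(30)
    );

    -- deferrable, so the constraint is policed by a non-unique index
    alter table bills add constraint bills_uq
        unique (bill_month, account, business_id, busine_spec, cons_type)
        deferrable disable novalidate;

    -- this attempt is expected to fail, but it loads the exceptions
    -- table with the rowids of every row that breaches uniqueness
    alter table bills enable constraint bills_uq
        exceptions into exceptions;

    -- re-aggregate only the flagged rows (4-8M rows, not 400M)
    create table bills_fix as
    select  bill_month, account, business_id, busine_spec, cons_type,
            sum(cons_amount) cons_amount
    from    bills
    where   rowid in (select row_id from exceptions)
    group by
            bill_month, account, business_id, busine_spec, cons_type;

    -- remove every flagged row, then put the summed versions back
    delete from bills
    where   rowid in (select row_id from exceptions);

    insert into bills
    select  * from bills_fix;

    alter table bills enable constraint bills_uq;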
--
Jonathan Lewis

Yet another Oracle-related web site: http://www.jlcomp.demon.co.uk

Practical Oracle 8i: Building Efficient Databases
Publishers: Addison-Wesley
Reviews at: http://www.jlcomp.demon.co.uk/book_rev.html

sdfsd wrote in message <3af55420$0$88682$45beb828_at_newscene.com>...
>
>problem: what is the best way to deal with dupes (~1-2%) occurring in a 400
>million row table (4-8 million rows are dupes). Is a group by, with the duped
>column listed last, the best way?
>
>we have a table with 402 million rows. It is 8 columns wide and 14 gigs in
>size.
>structure: note cons_type is the column with dupes.
>
>bill_month date
>account number(10)
>business_id varchar2(3)
>busine_spec varchar2(5)
>cons_type varchar2(2)
>cons_amount number(6,2)
>
>there will be several rows for each combination of bill_month and account,
>however each one should have a unique cons_type. So for example a correct
>set will be (note the unique t1, t2, t3):
>
>12/01/00 3433 ed we t1 43.22
>12/01/00 3433 ed we t2 53.22
>12/01/00 3433 ed we t3 563.22
>
>However around 1-2% are dupes of cons_type and we need to group them to
>get rid of the duped cons_type.
>
>BAD rows: note that there are 2 t2 rows
>
>12/01/00 3433 ed we t1 43.22
>12/01/00 3433 ed we t2 53.22
>12/01/00 3433 ed we t2 412.00
>
>should become:
>
>12/01/00 3433 ed we t1 43.22
>12/01/00 3433 ed we t2 465.22
>
>
>It seems we need to
>group by bill_month,account,business_id,busine_spec,cons_type
>and sum(cons_amount) to get rid of dupes. But with 400 million rows it's a
>big job.
>
>Does anybody have a better idea? One that can be done in a reasonable
>amount of time?
>
>We can do a create table on the result for performance, and an order by would
>be very nice since the current sequence of rows is not optimal. (Even with
>indexes, finding 10,000 rows spread over 400 million rows is a big job - lots
>of jumping from disk to disk even when Oracle knows exactly where they are.
>Having them ordered by month and account would be best.)
>
>any ideas?
>
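For comparison, the single-pass approach described in the question - rebuilding the table with a CREATE TABLE AS SELECT, a GROUP BY over all the columns, and an ORDER BY to get a better physical row order - would look roughly like the following sketch (the table names and the nologging option are assumptions):

    create table bills_new nologging as
    select  bill_month, account, business_id, busine_spec, cons_type,
            sum(cons_amount) cons_amount
    from    bills
    group by
            bill_month, account, business_id, busine_spec, cons_type
    order by
            bill_month, account;

This sorts and aggregates the full 400 million rows in one pass, which is the cost the reply above tries to avoid by restricting the GROUP BY to the flagged duplicates.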
Received on Mon May 07 2001 - 04:38:25 CDT