Re: Group by of 400 million rows to deal with dupes is it best way?
You may want to think strategically about this one rather than just finding a solution to the first problem.
14 GB is a nuisance - can you partition the table by bill_month to get a more manageable unit size without adding too much overhead? De-duplicating one month at a time might be a lot easier to handle. This could also improve your query performance if your queries tend to be focused on bill_month.
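As a rough illustration only - the table name, partition names and date boundaries below are assumptions, and a real split would need one partition per bill_month you actually hold:

    create table bills_part (
        bill_month   date,
        account      number(10),
        business_id  varchar2(3),
        busine_spec  varchar2(5),
        cons_type    varchar2(2),
        cons_amount  number(6,2)
    )
    partition by range (bill_month) (
        partition p_2000_12 values less than (to_date('01-JAN-2001','DD-MON-YYYY')),
        partition p_2001_01 values less than (to_date('01-FEB-2001','DD-MON-YYYY')),
        partition p_max     values less than (maxvalue)
    );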
After you have eliminated the duplicates, do you want to maintain uniqueness over the given columns? If so, you probably need a unique/primary key constraint, and this could affect the way you solve the current problem.
One detail to consider is that a GROUP BY that reduces the data volume by only a small amount is usually not very effective - and tends to get worse if you make it run parallel.
One idea (a rough SQL sketch of these steps follows the list):

1. Create a unique or primary key constraint on the target columns - make the constraint deferrable, so that it is built with a non-unique index (an 8.1 feature), and create it DISABLE NOVALIDATE.

2. Try to enable the constraint, reporting exceptions into another table. This will copy the rowids of all the rows which breach uniqueness.

3. Create a table which does the GROUP BY, joining the rowid list to the base table - so the sort/group targets only the 4-8M rows. (Aiming for a hash join might be the best path.)

4. Delete from the base table where rowid in (select stored rowid).

5. Insert into the base table the results from the created sub-total table.

6. Enable the unique / primary key constraint.
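A minimal sketch of those steps, assuming the base table is called bills, the constraint bills_uq, and the work tables exceptions and bills_fix (all names are illustrative); the exceptions table has the layout created by the standard utlexcpt.sql script:

    -- exceptions table as created by ?/rdbms/admin/utlexcpt.sql
    create table exceptions (
        row_id      rowid,
        owner       varchar2(30),
        table_name  varchar2(30),
        constraint  varchar2(30)
    );

    -- deferrable, so the constraint is policed by a non-unique index
    alter table bills add constraint bills_uq
        unique (bill_month, account, business_id, busine_spec, cons_type)
        deferrable disable novalidate;

    -- this attempt is expected to fail, but it loads the exceptions
    -- table with the rowids of every row that breaches uniqueness
    alter table bills enable constraint bills_uq
        exceptions into exceptions;

    -- re-aggregate only the flagged rows (4-8M rows, not 400M)
    create table bills_fix as
    select  bill_month, account, business_id, busine_spec, cons_type,
            sum(cons_amount) cons_amount
    from    bills
    where   rowid in (select row_id from exceptions)
    group by
            bill_month, account, business_id, busine_spec, cons_type;

    -- remove every flagged row, then put the summed versions back
    delete from bills
    where   rowid in (select row_id from exceptions);

    insert into bills
    select  * from bills_fix;

    alter table bills enable constraint bills_uq;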
--
Jonathan Lewis

Yet another Oracle-related web site: http://www.jlcomp.demon.co.uk

Practical Oracle 8i: Building Efficient Databases
Publishers: Addison-Wesley
Reviews at: http://www.jlcomp.demon.co.uk/book_rev.html

sdfsd wrote in message <3af55420$0$88682$45beb828_at_newscene.com>...
>
>problem: what is the best way to deal with dupes (~1-2%) occurring in a 400
>million row table (4-8 million rows are dupes). Is a group by, with the duped
>column listed last, the best way?
>
>we have a table with 402 million rows. It is 8 columns wide and 14 gigs in
>size.
>structure: note cons_type is the column with dupes.
>
>bill_month date
>account number(10)
>business_id varchar2(3)
>busine_spec varchar2(5)
>cons_type varchar2(2)
>cons_amount number(6,2)
>
>there will be several rows for each combination of bill_month and account,
>however each one should have a unique cons_type. So for example a correct
>set will be (note the unique t1, t2, t3):
>
>12/01/00 3433 ed we t1 43.22
>12/01/00 3433 ed we t2 53.22
>12/01/00 3433 ed we t3 563.22
>
>However around 1-2% are dupes of cons_type and we need to group them to
>get rid of the duped cons_type.
>
>BAD rows: note that there are 2 t2 rows
>
>12/01/00 3433 ed we t1 43.22
>12/01/00 3433 ed we t2 53.22
>12/01/00 3433 ed we t2 412.00
>
>should become:
>
>12/01/00 3433 ed we t1 43.22
>12/01/00 3433 ed we t2 465.22
>
>
>It seems we need to
>group by bill_month,account,business_id,busine_spec,cons_type
>and sum(cons_amount) to get rid of dupes. But with 400 million rows it's a
>big job.
>
>Does anybody have a better idea? One that can be done in a reasonable
>amount of time?
>
>We can do a create table on the result for performance, and an order by would
>be very nice since the current sequence of rows is not optimal. (Even with
>indexes, finding 10,000 rows spread over 400 million rows is a big job - lots
>of jumping from disk to disk even when Oracle knows exactly where they are.
>Having them ordered by month and account would be best.)
>
>any ideas?
>
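For comparison, the single-pass approach described in the question - rebuilding the table with a CREATE TABLE AS SELECT, a GROUP BY over all the columns, and an ORDER BY to get a better physical row order - would look roughly like the following sketch (the table names and the nologging option are assumptions):

    create table bills_new nologging as
    select  bill_month, account, business_id, busine_spec, cons_type,
            sum(cons_amount) cons_amount
    from    bills
    group by
            bill_month, account, business_id, busine_spec, cons_type
    order by
            bill_month, account;

This sorts and aggregates the full 400 million rows in one pass, which is the cost the reply above tries to avoid by restricting the GROUP BY to the flagged duplicates.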
Received on Mon May 07 2001 - 04:38:25 CDT