Re: de-dup process
OK, I've created a test case on our development box with 3 striped SATA
disks, 2 CPUs and 2G of RAM, of which only 500M are devoted to this 9.2
Oracle instance.
base_table was created from dba_source, with one unique column and a text
column just to fill up some space. Then I multiplied base_table again and
again until the count reached 119,258,304 rows; it took about 30GB of space.
Then I created temp_table with 20M rows to simulate your loaded table: half of them, i.e. 10M rows, were different from those in base_table, the other 10M were the same.
Then I created a unique index on the base_table varchar2 column, simulating your unique key, and altered this column to not null.
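A rough sketch of that setup, assuming sys_guid()-based key generation
(the real script is in the spool file linked at the end of this message):

create table base_table as
select rawtohex(sys_guid()) as ukrow,  -- unique varchar2 key
       text                            -- filler column
from dba_source;

-- double the row count repeatedly (commit between rounds)
-- until ~119M rows are reached
insert /*+ append */ into base_table
select rawtohex(sys_guid()), text from base_table;
commit;

-- 10M rows that already exist in base_table plus 10M new ones
create table temp_table as
select ukrow, text from base_table where rownum <= 10000000
union all
select rawtohex(sys_guid()), text from base_table where rownum <= 10000000;

create unique index base_table_uk on base_table (ukrow);
alter table base_table modify (ukrow not null);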
OK. Now I was ready to do the insert.
First, I used the following insert:
insert /*+ append */ into base_table
select * from temp_table where ukrow in (
select ukrow from temp_table
minus
select ukrow from base_table);
with the plan you can find in the link at the very end of this message.
It took 26 minutes and 11 seconds to complete, and it consisted of the following steps shown in v$session_longops:

Seconds  Opname                Target
    281  Sort Output
     18  Sort/Merge
    316  Hash Join
    246  Sort Output
    261  Index Fast Full Scan  GINTS.BASE_TABLE
    525  Sort Output
     37  Sort/Merge
    177  Table Scan            GINTS.TEMP_TABLE
    217  Table Scan            GINTS.TEMP_TABLE
The thing I didn't like was that GINTS.TEMP_TABLE was scanned twice. Writing SQL I tend to follow the principle: scan each necessary row as few times as possible, at best only once. After the first scan of GINTS.TEMP_TABLE I actually knew all the data for the insert, so the second scan was a simple waste.
Then I rewrote my insert as follows:
insert /*+ append */ into base_table
select ukrow, text from (
select t.ukrow, t.text,
case when b.ukrow = t.ukrow then 0 else 1 end flag
from temp_table t, base_table b
where t.ukrow = b.ukrow(+)
)
where flag = 1;
with the plan you can find in the link at the very end of this message.
It took 16 minutes and 35 seconds to complete, and it consisted of the following steps shown in v$session_longops:

Seconds  Opname                Target
    196  Sort Output
     39  Sort/Merge
    363  Hash Join
    244  Index Fast Full Scan  GINTS.BASE_TABLE
    147  Table Scan            GINTS.TEMP_TABLE
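As a side note, the same "keep only the non-matching rows" logic can be
written as an explicit anti-join. A sketch of two equivalent formulations
(untested against this data set, and the plans may differ from the one above):

-- 1. outer join with an IS NULL filter instead of the CASE flag
insert /*+ append */ into base_table
select t.ukrow, t.text
from temp_table t, base_table b
where t.ukrow = b.ukrow(+)
  and b.ukrow is null;

-- 2. NOT EXISTS, which the optimizer can unnest into a hash anti-join
insert /*+ append */ into base_table
select t.ukrow, t.text
from temp_table t
where not exists (select null from base_table b where b.ukrow = t.ukrow);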
So the summary is: inserting 10M rows out of 20M into a ~120M-row table took
16.5 minutes, and that included one UK index update.
Looking at the insert we can see that it depends directly on the temp_table size and the base_table UK index size, and involves one hash join and one sort. To my mind it should scale roughly linearly, or at least at the same rate as the worse of the sort and the hash join.
One more thing - consider using workarea_size_policy = manual with big sort_area_size and hash_area_size instead of workarea_size_policy = auto and a [possibly] miserable pga_aggregate_target.
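For example, something like this in the session before running the insert
(the sizes are illustrative only; fit them to your available memory):

alter session set workarea_size_policy = manual;
alter session set sort_area_size = 104857600;  -- bytes, ~100M per sort
alter session set hash_area_size = 104857600;  -- bytes, ~100M per hash join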
Gints Plivna
http://www.gplivna.eu
P.S. I resent this message because the body was too big. You can find the full spool file at http://www.gplivna.eu/TabInsert.txt
2006/12/14, A Ebadi <ebadi01_at_yahoo.com>:
> Biggest problem we've faced in coming up with a solution is none of the
> solutions so far scale.
--
http://www.freelists.org/webpage/oracle-l

Received on Thu Dec 14 2006 - 08:19:29 CST