Re: Non-duplicate test on LARGE table
From: DA Morgan <damorgan_at_psoug.org>
Date: Wed, 28 May 2008 09:26:40 -0700
Message-ID: <1211992012.855030@bubbleator.drizzle.com>
>
>
> Thank you for replying Danel ! I agree: I'll simply try it.
> Anyway, for the missing points :-) here few lines:
>
> 1. Version information:
> Oracle9i Enterprise Edition Release 9.2.0.7.0 - 64bit Production
> With the Partitioning, OLAP and Oracle Data Mining options
> JServer Release 9.2.0.7.0 - Production
>
> 2. DDL:
> create table interface_detail
> (ID_BATCH NUMBER(18) NOT NULL
> ,ID_ROW NUMBER(18) NOT NULL
> ,COD_ERRORE VARCHAR2(8)
> ,ID_SUBSYSTEM NUMBER(2)
> ,SEQ_ARCHIVE NUMBER(18)
> ,DATA_AREA VARCHAR2(4000) NOT NULL );
> alter table interface_detail
> add constraint ipd_pk PRIMARY KEY (id_batch, id_riga);
>
> 3. DML:
> select A.id_batch, A.id_row, B.id_batch, B.id_row
> from interface_detail A, interface_detail B
> where A.id_batch = :IDBATCH
> and B.id_batch <> A.id_batch
> and A.data_area = B.data_area;
>
> 4. Plan table
> ID PARENT OPERATION OPTIONS OPTIMIZER COST
> -- ------ ---------------- -------------- ---------- --------
> 1 0 NESTED LOOPS 45286544
> 2 1 TABLE ACCESS FULL ANALYZED 14954
> 3 1 TABLE ACCESS BY INDEX ROWID ANALYZED 161
> 4 3 INDEX RANGE SCAN ANALYZED 4
> 0 SELECT STATEMENT FIRST_ROWS 45286544
>
> Regards.
> M.
Date: Wed, 28 May 2008 09:26:40 -0700
Message-ID: <1211992012.855030@bubbleator.drizzle.com>
mattsteel wrote:
>
> DA Morgan wrote:
>
>> mattsteel wrote: >>> Hello. >>> My question is a little broad but maybe you already saw something like >>> this ;-) >>> >>> Here is my scenario: >>> Let us have a large table which contains over 20 million records and >>> search for rows that are duplicated under some rule. >>> The column involved in that comparison rule, named AREA, is declared >>> VARCHAR2(4000) NOT NULL. >>> >>> Seeing that an usual index built on column AREA would not be so good, >>> I tried to add a column, HASHVALUE, which always contains a suitable >>> hash-value given by the standard "dbms_utility.get_hash_value" >>> function. >>> >>> Now that I have a much smaller column to search through for >>> duplicates, should I put an index on column HASHVALUE, do you think >>> overall performance would be improved or not? >>> >>> Thank you in advance. >>> >>> Matt. >> Let's see what we are missing: >> 1. Version information >> 2. DDL >> 3. DML >> 4. Explain Plan output >> >> About the only thing I can say is that 20M rows is not that large and >> the answer to your question with respect to overall performance is >> complex and can only be solved through testing. >> >> That said ... if all you want to do is find duplicates create a >> disabled unique constraint, use DBMS_ERRLOG to create an error table >> and try enabling it ... or do a simple query using GROUP BY with >> HAVING COUNT(*) > 1. >> -- >> Daniel A. Morgan >> Oracle Ace Director & Instructor >> University of Washington >> damorgan_at_x.washington.edu (replace x with u to respond) >> Puget Sound Oracle Users Group >> www.psoug.org
>
>
> Thank you for replying Danel ! I agree: I'll simply try it.
> Anyway, for the missing points :-) here few lines:
>
> 1. Version information:
> Oracle9i Enterprise Edition Release 9.2.0.7.0 - 64bit Production
> With the Partitioning, OLAP and Oracle Data Mining options
> JServer Release 9.2.0.7.0 - Production
>
> 2. DDL:
> create table interface_detail
> (ID_BATCH NUMBER(18) NOT NULL
> ,ID_ROW NUMBER(18) NOT NULL
> ,COD_ERRORE VARCHAR2(8)
> ,ID_SUBSYSTEM NUMBER(2)
> ,SEQ_ARCHIVE NUMBER(18)
> ,DATA_AREA VARCHAR2(4000) NOT NULL );
> alter table interface_detail
> add constraint ipd_pk PRIMARY KEY (id_batch, id_riga);
>
> 3. DML:
> select A.id_batch, A.id_row, B.id_batch, B.id_row
> from interface_detail A, interface_detail B
> where A.id_batch = :IDBATCH
> and B.id_batch <> A.id_batch
> and A.data_area = B.data_area;
>
> 4. Plan table
> ID PARENT OPERATION OPTIONS OPTIMIZER COST
> -- ------ ---------------- -------------- ---------- --------
> 1 0 NESTED LOOPS 45286544
> 2 1 TABLE ACCESS FULL ANALYZED 14954
> 3 1 TABLE ACCESS BY INDEX ROWID ANALYZED 161
> 4 3 INDEX RANGE SCAN ANALYZED 4
> 0 SELECT STATEMENT FIRST_ROWS 45286544
>
> Regards.
> M.
DBMS_ERRLOG didn't show up until 10.2 (that's why version info is always important) so it is not an option. I'd suggest loading a table based on records retrieved with the GROUP BY / HAVING query.
-- Daniel A. Morgan Oracle Ace Director & Instructor University of Washington damorgan_at_x.washington.edu (replace x with u to respond) Puget Sound Oracle Users Group www.psoug.orgReceived on Wed May 28 2008 - 11:26:40 CDT