Re: Non-duplicate test on LARGE table

From: DA Morgan <damorgan_at_psoug.org>
Date: Wed, 28 May 2008 09:26:40 -0700
Message-ID: <1211992012.855030@bubbleator.drizzle.com>


mattsteel wrote:
>
> DA Morgan wrote:
>

>> mattsteel wrote:
>>> Hello.
>>> My question is a little broad but maybe you already saw something like
>>> this ;-)
>>>
>>> Here is my scenario:
>>> Let us have a large table which contains over 20 million records and
>>> search for rows that are duplicated under some rule.
>>> The column involved in that comparison rule, named AREA, is declared
>>> VARCHAR2(4000) NOT NULL.
>>>
>>> Seeing that an usual index built on column AREA would not be so good,
>>> I tried to add a column, HASHVALUE, which always contains a suitable
>>> hash-value given by the standard "dbms_utility.get_hash_value"
>>> function.
>>>
>>> Now that I have a much smaller column to search through for
>>> duplicates, should I put an index on column HASHVALUE, do you think
>>> overall performance would be improved or not?
>>>
>>> Thank you in advance.
>>>
>>> Matt.
>> Let's see what we are missing:
>> 1. Version information
>> 2. DDL
>> 3. DML
>> 4. Explain Plan output
>>
>> About the only thing I can say is that 20M rows is not that large and
>> the answer to your question with respect to overall performance is
>> complex and can only be solved through testing.
>>
>> That said ... if all you want to do is find duplicates create a
>> disabled unique constraint, use DBMS_ERRLOG to create an error table
>> and try enabling it ... or do a simple query using GROUP BY with
>> HAVING COUNT(*) > 1.
>> --
>> Daniel A. Morgan
>> Oracle Ace Director & Instructor
>> University of Washington
>> damorgan_at_x.washington.edu (replace x with u to respond)
>> Puget Sound Oracle Users Group
>> www.psoug.org

>
>
> Thank you for replying Danel ! I agree: I'll simply try it.
> Anyway, for the missing points :-) here few lines:
>
> 1. Version information:
> Oracle9i Enterprise Edition Release 9.2.0.7.0 - 64bit Production
> With the Partitioning, OLAP and Oracle Data Mining options
> JServer Release 9.2.0.7.0 - Production
>
> 2. DDL:
> create table interface_detail
> (ID_BATCH NUMBER(18) NOT NULL
> ,ID_ROW NUMBER(18) NOT NULL
> ,COD_ERRORE VARCHAR2(8)
> ,ID_SUBSYSTEM NUMBER(2)
> ,SEQ_ARCHIVE NUMBER(18)
> ,DATA_AREA VARCHAR2(4000) NOT NULL );
> alter table interface_detail
> add constraint ipd_pk PRIMARY KEY (id_batch, id_riga);
>
> 3. DML:
> select A.id_batch, A.id_row, B.id_batch, B.id_row
> from interface_detail A, interface_detail B
> where A.id_batch = :IDBATCH
> and B.id_batch <> A.id_batch
> and A.data_area = B.data_area;
>
> 4. Plan table
> ID PARENT OPERATION OPTIONS OPTIMIZER COST
> -- ------ ---------------- -------------- ---------- --------
> 1 0 NESTED LOOPS 45286544
> 2 1 TABLE ACCESS FULL ANALYZED 14954
> 3 1 TABLE ACCESS BY INDEX ROWID ANALYZED 161
> 4 3 INDEX RANGE SCAN ANALYZED 4
> 0 SELECT STATEMENT FIRST_ROWS 45286544
>
> Regards.
> M.

DBMS_ERRLOG didn't show up until 10.2 (that's why version info is always important) so it is not an option. I'd suggest loading a table based on records retrieved with the GROUP BY / HAVING query.

-- 
Daniel A. Morgan
Oracle Ace Director & Instructor
University of Washington
damorgan_at_x.washington.edu (replace x with u to respond)
Puget Sound Oracle Users Group
www.psoug.org
Received on Wed May 28 2008 - 11:26:40 CDT

Original text of this message