Re: Non-duplicate test on LARGE table
Date: Wed, 28 May 2008 01:53:38 -0700 (PDT)
Message-ID: <66960d95-d288-41bb-8266-9f3e71d31633@k37g2000hsf.googlegroups.com>
DA Morgan wrote:
> mattsteel wrote:
> > Hello.
> > My question is a little broad but maybe you already saw something like
> > this ;-)
> >
> > Here is my scenario:
> > Let us have a large table which contains over 20 million records and
> > search for rows that are duplicated under some rule.
> > The column involved in that comparison rule, named AREA, is declared
> > VARCHAR2(4000) NOT NULL.
> >
> > Seeing that an usual index built on column AREA would not be so good,
> > I tried to add a column, HASHVALUE, which always contains a suitable
> > hash-value given by the standard "dbms_utility.get_hash_value"
> > function.
> >
> > Now that I have a much smaller column to search through for
> > duplicates, should I put an index on column HASHVALUE, do you think
> > overall performance would be improved or not?
> >
> > Thank you in advance.
> >
> > Matt.
>
> Let's see what we are missing:
> 1. Version information
> 2. DDL
> 3. DML
> 4. Explain Plan output
>
> About the only thing I can say is that 20M rows is not that large and
> the answer to your question with respect to overall performance is
> complex and can only be solved through testing.
>
> That said ... if all you want to do is find duplicates create a
> disabled unique constraint, use DBMS_ERRLOG to create an error table
> and try enabling it ... or do a simple query using GROUP BY with
> HAVING COUNT(*) > 1.
> --
> Daniel A. Morgan
> Oracle Ace Director & Instructor
> University of Washington
> damorgan_at_x.washington.edu (replace x with u to respond)
> Puget Sound Oracle Users Group
> www.psoug.org
Thank you for replying Danel ! I agree: I'll simply try it. Anyway, for the missing points :-) here few lines:
- Version information: Oracle9i Enterprise Edition Release 9.2.0.7.0 - 64bit Production With the Partitioning, OLAP and Oracle Data Mining options JServer Release 9.2.0.7.0 - Production
- DDL: create table interface_detail (ID_BATCH NUMBER(18) NOT NULL ,ID_ROW NUMBER(18) NOT NULL ,COD_ERRORE VARCHAR2(8) ,ID_SUBSYSTEM NUMBER(2) ,SEQ_ARCHIVE NUMBER(18) ,DATA_AREA VARCHAR2(4000) NOT NULL ); alter table interface_detail add constraint ipd_pk PRIMARY KEY (id_batch, id_riga);
- DML: select A.id_batch, A.id_row, B.id_batch, B.id_row from interface_detail A, interface_detail B where A.id_batch = :IDBATCH and B.id_batch <> A.id_batch and A.data_area = B.data_area;
- Plan table ID PARENT OPERATION OPTIONS OPTIMIZER COST -- ------ ---------------- -------------- ---------- -------- 1 0 NESTED LOOPS 45286544 2 1 TABLE ACCESS FULL ANALYZED 14954 3 1 TABLE ACCESS BY INDEX ROWID ANALYZED 161 4 3 INDEX RANGE SCAN ANALYZED 4 0 SELECT STATEMENT FIRST_ROWS 45286544
M. Received on Wed May 28 2008 - 03:53:38 CDT