Re: [SQL] Comparing records, selecting them by two

From: -=< a q u a b u b b l e >=- <aquabubble_at_Remove.No.Parma.Ham.Remove.geocities.com>
Date: Fri, 29 Oct 1999 20:14:40 +0100
Message-ID: <7vcrn9$af9$1@news4.svr.pol.co.uk>

Jean-Christophe Boggio <cat_at_creaweb.fr> wrote in message news:7vbnn6$3ja$1_at_jaydee.iway.fr...
> Hello,
>
> I have a 'clients' database that has the following main fields :
>
> id number (unique)
> name char(25)
> postcode char(10)
> phone char(20)
> valid char
>
> I want to find the duplicate records but not to wildly delete them : I
must
> compare the records two by two and keep one or the two of them but only
> human can decide. The user can decide which fields will be compared : for
> example, persons who have the same phone number are probably the same
> persons but sometimes phone is left blank so the user has to compare on
> other fields.
> The valid field means that some records are already validated and the new
> ones are compared to the validated ones.
>
> For the moment, my query looks like :
>
> CREATE TABLE foo (N1,N2) as
> SELECT c.id as N1,d.id as N2
> FROM clients c, clients d
> WHERE (c.id > d.id) and (c.name=d.name) and (c.postcode=d.postcode) and
> (c.valid='Y') and (d.valid='N');
>
> This creates a table FOO with two fields being the unique IDs for the
> records to compare so I can load the two records and the user can visually
> compare them. The result is good but it takes ages to complete. My client
> says that the same duplicate search in M$ACCESS takes 40seconds and 4hours
> on Oracle :-((( (Same machine : P3/500, 512MB RAM)
>
> There are indexes for the ID, NAME, POSTCODE and PHONE fields. Being new
to
> Oracle, I don't now how to tune it, maybe I should do some simple things
> like modify the cache size ? I've read a few things about analyzing the
> tables but I don't know how it works at all. Is the tool supplied with
> Oracle ?

You can analyze tables by:

ANALYZE TABLE (ESTIMATE|COMPUTE) STATISTICS SAMPLE NNNNN; This will give the cost based optimiser a better idea as to what you table is like (number of rows... number of distinct values etc...). It may be that the CBO is choosing to use indexes when a full table scan would be better suited... the statistics will help it choose better. If you know you are only retrieving a small set of the data (< 20%) and your table is not very wide, then an index lookup is better. More rows and a full table scan would give faster results. Look at the explain plan for your queries to find out. If this doesn't give you the correct execution plan then you may be forced to use an optimiser hint, to tell the optimiser how it should access your tables. Oh, and you need to have the ANALYZE ANY TABLE privelages.

You should only increase your buffer size if you know that it is not performing well already. You should look for buffer hit ratios of above 90% for an optimally performing query.

> Well, I started digging a little more in my SQL code and came up with this
> starting query :
>
> select id,name,postcode from clients where name||postcode in
> (select name||postcode from clients group by name||postcode having
> count(*)>1);
>
> This works but :
> - doesn't check for the valid field
> - does not return two IDs per row
> - don't know about speed (I work on a 50000 recs subset of my client's
> data).

Using the concatenation as you have would disable the indexes anyway. Perhaps you should consider the following:

SELECT id,

         name,
         postcode,
         valid,
         phone

FROM clients c1
WHERE EXISTS (SELECT c2.name, c2.postcode

                   FROM clients c2
                  WHERE c1.name = c2.name
                    AND c1.postcode = c2.postcode
               GROUP BY c2.name, c2.postcode
                 HAVING count(c2.id) = 2)

ORDER BY name, postcode;

Would give you rows in pairs of your duplicates.

SELECT c1.id,

c2.id
FROM clients c1,

clients c2
WHERE c1.name = c2.name

     AND c1.postcode = c2.postcode
     AND EXISTS (SELECT c3.name, c3.postcode
                   FROM clients c3
                  WHERE c1.name = c3.name
                    AND c1.postcode = c3.postcode
               GROUP BY c3.name, c3.postcode
                 HAVING count(c3.id) = 2);

Will give you your duplicate ids side by side.

> I thought about this also (pseudocode) :
>
> create view foo as
> select name,postcode from clients group by name,postcode having
> count(*)>1;
>
> This gives me a list of duplicate values (I don't get all the records,
only
> the duplicated values).
> What I need then is some way to create the table containing all the
> duplicated couples : The idea is to separate the two processes :
> - one process creates the duplicate values (select * from foo)
> - the user selects one of them
> - the program will then search for all the records that match the
duplicate
> values. Something like :
> SELECT c.id,d.id
> FROM clients c, clients d
> WHERE c.id>d.id
> AND (c.name in (select name from foo)) AND (c.postcode in (select postcode
> from foo))
> AND (d.name in (select name from foo)) AND (d.postcode in (select postcode
> from foo));
>
> This is ugly and probably VERY VERY slow (and probably won't work) but you
> see the point.
>
> I hope you get an idea of what I want to do and you can help me. Maybe
> there's a much simpler solution or another way to design it ?

HTH Received on Fri Oct 29 1999 - 14:14:40 CDT