[SQL] Comparing records, selecting them by two

From: Jean-Christophe Boggio <cat_at_creaweb.fr>
Date: Fri, 29 Oct 1999 11:00:18 +0200
Message-ID: <7vbnn6$3ja$1@jaydee.iway.fr>

Hello,

I have a 'clients' database that has the following main fields :

id number (unique)
name char(25)
postcode char(10)
phone char(20)
valid char

I want to find the duplicate records but not to wildly delete them : I must compare the records two by two and keep one or the two of them but only human can decide. The user can decide which fields will be compared : for example, persons who have the same phone number are probably the same persons but sometimes phone is left blank so the user has to compare on other fields.
The valid field means that some records are already validated and the new ones are compared to the validated ones.

For the moment, my query looks like :

CREATE TABLE foo (N1,N2) as
SELECT c.id as N1,d.id as N2
FROM clients c, clients d
WHERE (c.id > d.id) and (c.name=d.name) and (c.postcode=d.postcode) and (c.valid='Y') and (d.valid='N');

This creates a table FOO with two fields being the unique IDs for the records to compare so I can load the two records and the user can visually compare them. The result is good but it takes ages to complete. My client says that the same duplicate search in M$ACCESS takes 40seconds and 4hours on Oracle :-((( (Same machine : P3/500, 512MB RAM)

There are indexes for the ID, NAME, POSTCODE and PHONE fields. Being new to Oracle, I don't now how to tune it, maybe I should do some simple things like modify the cache size ? I've read a few things about analyzing the tables but I don't know how it works at all. Is the tool supplied with Oracle ?

Well, I started digging a little more in my SQL code and came up with this starting query :

select id,name,postcode from clients where name||postcode in (select name||postcode from clients group by name||postcode having count(*)>1);

This works but :
- doesn't check for the valid field

does not return two IDs per row
don't know about speed (I work on a 50000 recs subset of my client's data).

I thought about this also (pseudocode) :

create view foo as
select name,postcode from clients group by name,postcode having count(*)>1;

This gives me a list of duplicate values (I don't get all the records, only the duplicated values).
What I need then is some way to create the table containing all the duplicated couples : The idea is to separate the two processes :
- one process creates the duplicate values (select * from foo)

the user selects one of them
- the program will then search for all the records that match the duplicate
values. Something like : SELECT c.id,d.id FROM clients c, clients d WHERE c.id>d.id AND (c.name in (select name from foo)) AND (c.postcode in (select postcode from foo)) AND (d.name in (select name from foo)) AND (d.postcode in (select postcode from foo));

This is ugly and probably VERY VERY slow (and probably won't work) but you see the point.

I hope you get an idea of what I want to do and you can help me. Maybe there's a much simpler solution or another way to design it ?

*ANY* help will be appreciated.

Thanks in advance.

P.S.: We are running Oracle 8. Received on Fri Oct 29 1999 - 04:00:18 CDT