Suggestions for Advanced Detection of Duplicates

From: Michael C. Wilkerson <Mike.Wilkerson_at_uscreative.com>
Date: 2000/04/10
Message-ID: <38F232EE.F7A2A159@uscreative.com>#1/1

I appologize for not clarifying exactly what I was looking for in my previous post.

I was hoping someone could either provide some of their strategies or suggest resources that covered the detection of duplicate records in a database. Now the fields in these records may not be EXACTLY identical, but still refer to the same person.

For example, a possible scenerio would be: person A enters his information:
Name: Mike Wilkerson
Address: 810 Somewhere St. Apt. 100
City: Dallas
State: Texas
Phone: (512) 555-5555

Several weeks later the person returns but, having forgotten his username/password, he creates another new record so that he can access the app/site/whatever again (the jerk). This time, however, he enters: Name: Mike Wilkerson
Address: 810 Somewhere #100
City: Dallas
State: Texas
Phone: 512-555-5555

Now, it's the same person, but if you were to attempt to use the following query to detect if this user was already in the database, it would no return any rows:
SELECT name, address, city, state, phone from tblpeople
where name = 'Mike Wilkerson'

    and address = '810 Somewhere #100'
    and city = 'Dallas'
    and state = 'Texas'
    and phone = '512-555-5555'

There are many different ways users could enter in their information. In the above example I could have put Michael instead of Mike or TX instead of Texas, etc...

I suspect that the detection of duplicates in a database has to be a serious concern for anyone who wants to track how many distinct customers/users/etc that they have. Budgets, funding, etc could be based off of these kinds of counts, so I was hoping that there were books covering methods for this kind of thing. I've already looked at using TRANSLATE to eliminate all spaces and punctuation from the phone field. For the example above they would both get translated to just 5125555555 and it would be a strong indication that they were for the same individual. I don't really expect any method to catch 100% of the duplicates, just some documentation/suggestions that might cover 1) things to consider/watch out for and 2) example methods/strategies for detecting these kinds of duplications.

Thanks for your help,
Michael Received on Mon Apr 10 2000 - 00:00:00 CDT