Re: Suggestions for Advanced Detection of Duplicates

From: Stanton W. Schmidt <sschmidt_at_wctc.net>
Date: 2000/04/12
Message-ID: <wiRI4.756$q2.26092459_at_news.wctc.net>#1/1


    The best you can do is the translate stuff you already talked about. But that will only give you possible duplicates. I have seen a married couple both with the first name of Jo (his was actually Joe). There might also be a father and son living at the same address with the same name (see George Foreman). If the phone number is not a required field (it often isn't) you could have people living in an apartment building with the same name. The best you can do is print out a report of some kind and have someone manually check out possible duplicates and delete the appropriate one (you may also have to change related records to relate to the correct record).
Good luck
Michael C. Wilkerson <Mike.Wilkerson_at_uscreative.com> wrote in message news:38F232EE.F7A2A159_at_uscreative.com...
> I appologize for not clarifying exactly what I was looking for in my
> previous post.
>
> I was hoping someone could either provide some of their strategies or
> suggest resources that covered the detection of duplicate records in a
> database. Now the fields in these records may not be EXACTLY identical,
> but still refer to the same person.
>
> For example, a possible scenerio would be:
> person A enters his information:
> Name: Mike Wilkerson
> Address: 810 Somewhere St. Apt. 100
> City: Dallas
> State: Texas
> Phone: (512) 555-5555
>
> Several weeks later the person returns but, having forgotten his
> username/password, he creates another new record so that he can access
> the app/site/whatever again (the jerk). This time, however, he enters:
> Name: Mike Wilkerson
> Address: 810 Somewhere #100
> City: Dallas
> State: Texas
> Phone: 512-555-5555
>
> Now, it's the same person, but if you were to attempt to use the
> following query to detect if this user was already in the database, it
> would no return any rows:
> SELECT name, address, city, state, phone
> from tblpeople
> where name = 'Mike Wilkerson'
> and address = '810 Somewhere #100'
> and city = 'Dallas'
> and state = 'Texas'
> and phone = '512-555-5555'
>
> There are many different ways users could enter in their information.
> In the above example I could have put Michael instead of Mike or TX
> instead of Texas, etc...
>
> I suspect that the detection of duplicates in a database has to be a
> serious concern for anyone who wants to track how many distinct
> customers/users/etc that they have. Budgets, funding, etc could be
> based off of these kinds of counts, so I was hoping that there were
> books covering methods for this kind of thing. I've already looked at
> using TRANSLATE to eliminate all spaces and punctuation from the phone
> field. For the example above they would both get translated to just
> 5125555555 and it would be a strong indication that they were for the
> same individual. I don't really expect any method to catch 100% of the
> duplicates, just some documentation/suggestions that might cover 1)
> things to consider/watch out for and 2) example methods/strategies for
> detecting these kinds of duplications.
>
> Thanks for your help,
> Michael
>
Received on Wed Apr 12 2000 - 00:00:00 CEST

Original text of this message