Path: dp-news.maxwell.syr.edu!spool.maxwell.syr.edu!news-spur1.maxwell.syr.edu!news.maxwell.syr.edu!postnews.google.com!p10g2000cwp.googlegroups.com!not-for-mail
From: "EdStevens" <quetico_man@yahoo.com>
Newsgroups: comp.databases.oracle.server
Subject: Re: slightly OT - cleaning up "dirty" keys?
Date: 3 Mar 2006 10:27:04 -0800
Organization: http://groups.google.com
Lines: 40
Message-ID: <1141410424.275604.240620@p10g2000cwp.googlegroups.com>
References: <4405a556$0$3558$ed2619ec@ptn-nntp-reader03.plus.net>
   <1141337909.206964.158370@t39g2000cwt.googlegroups.com>
   <4408291a$0$70321$ed2619ec@ptn-nntp-reader03.plus.net>
   <1141395060.427234.4320@j33g2000cwa.googlegroups.com>
   <44086cbe$0$70293$ed2619ec@ptn-nntp-reader03.plus.net>
NNTP-Posting-Host: 209.64.87.68
Mime-Version: 1.0
Content-Type: text/plain; charset="iso-8859-1"
X-Trace: posting.google.com 1141410429 21399 127.0.0.1 (3 Mar 2006 18:27:09 GMT)
X-Complaints-To: groups-abuse@google.com
NNTP-Posting-Date: Fri, 3 Mar 2006 18:27:09 +0000 (UTC)
In-Reply-To: <44086cbe$0$70293$ed2619ec@ptn-nntp-reader03.plus.net>
User-Agent: G2/0.2
X-HTTP-UserAgent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.10) Gecko/20050716 Firefox/1.0.6,gzip(gfe),gzip(gfe)
Complaints-To: groups-abuse@google.com
Injection-Info: p10g2000cwp.googlegroups.com; posting-host=209.64.87.68;
   posting-account=U6cb-w0AAABHLCv5oskKmlFGXS7ueXvN
Xref: dp-news.maxwell.syr.edu comp.databases.oracle.server:262722


bugbear wrote:
> EdStevens wrote:
> > And that means looking at the human factors as well.  Not knowing anything about
> > the application, I'm wondering WHY the operators are failing to find an
> > existing record and end up creating a duplicate.
>
> There's an automatic feed from external sources, which
> include all the "secondary" data (address, phone etc).
>
> If the (primary)name doesn't match, a new record is created,
> from all the fields in the external feed.
>
> Combine this with multiple external sources,
> run for 5 years, and you have a mess.
>
> That's where I'm STARTING.
>
> Now I have to "make it better".
>
>     BugBear

OK.  There are obviously a lot of things about this app we don't know.
In an earlier post you said "Different operators are FAILING to locate
the original record, and are entering a new one. "  Now you're saying
it's an automatic feed from multiple external sources.  In any event,
I'd want to know what is being done to change the way the data is
initially captured.  Because if nothing is being done to fix that, your
attempts to clean the data will be futile.

That being said, given the immediate task of identifying duplicates, I'd
go back to my original suggestion of looking at EVERY column in the
database and giving thought to how it might be used.  I've already
described potential normalization of address and phone number.  Perhaps
something along the same lines for name.  Are all of the names in the
format of <some variation of first name, middle initial><last name>?
If so, perhaps you could pick out the last name, then concatenate it
with the first letter of whatever passes for the first name.  Maybe
concatenate that with a normalized phone and/or address ...
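
Just to make that concrete, here's a rough sketch of the sort of query
I have in mind.  The table and column names (customers, cust_id,
cust_name, phone) are invented for illustration; substitute whatever
your real schema uses.  The phone column is stripped down to its digits
with the old TRANSLATE trick, so it should work even pre-10g:

  SELECT match_key, COUNT(*) AS dup_count
    FROM (SELECT cust_id,
                 -- last word of the name ...
                 UPPER(SUBSTR(cust_name, INSTR(cust_name, ' ', -1) + 1))
                 || '|' ||
                 -- ... plus the first initial ...
                 UPPER(SUBSTR(cust_name, 1, 1))
                 || '|' ||
                 -- ... plus the phone stripped to digits only
                 TRANSLATE(phone, '0123456789' || phone, '0123456789')
                   AS match_key
            FROM customers)
   GROUP BY match_key
  HAVING COUNT(*) > 1
   ORDER BY dup_count DESC;

From there you could feed the keys that come back with dup_count > 1
into whatever review/merge process you settle on.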

