Path: dp-news.maxwell.syr.edu!spool.maxwell.syr.edu!news-spur1.maxwell.syr.edu!news.maxwell.syr.edu!postnews.google.com!p10g2000cwp.googlegroups.com!not-for-mail
From: "EdStevens" <quetico_man@yahoo.com>
Newsgroups: comp.databases.oracle.server
Subject: Re: slightly OT - cleaning up "dirty" keys?
Date: 3 Mar 2006 10:27:04 -0800
Organization: http://groups.google.com
Lines: 40
Message-ID: <1141410424.275604.240620@p10g2000cwp.googlegroups.com>
References: <4405a556$0$3558$ed2619ec@ptn-nntp-reader03.plus.net>
   <1141337909.206964.158370@t39g2000cwt.googlegroups.com>
   <4408291a$0$70321$ed2619ec@ptn-nntp-reader03.plus.net>
   <1141395060.427234.4320@j33g2000cwa.googlegroups.com>
   <44086cbe$0$70293$ed2619ec@ptn-nntp-reader03.plus.net>
NNTP-Posting-Host: 209.64.87.68
Mime-Version: 1.0
Content-Type: text/plain; charset="iso-8859-1"
X-Trace: posting.google.com 1141410429 21399 127.0.0.1 (3 Mar 2006 18:27:09 GMT)
X-Complaints-To: groups-abuse@google.com
NNTP-Posting-Date: Fri, 3 Mar 2006 18:27:09 +0000 (UTC)
In-Reply-To: <44086cbe$0$70293$ed2619ec@ptn-nntp-reader03.plus.net>
User-Agent: G2/0.2
X-HTTP-UserAgent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.10) Gecko/20050716 Firefox/1.0.6,gzip(gfe),gzip(gfe)
Complaints-To: groups-abuse@google.com
Injection-Info: p10g2000cwp.googlegroups.com; posting-host=209.64.87.68;
   posting-account=U6cb-w0AAABHLCv5oskKmlFGXS7ueXvN
Xref: dp-news.maxwell.syr.edu comp.databases.oracle.server:262722


bugbear wrote:
> EdStevens wrote:
> > And that means looking at the human factors as well.  Not knowing anything about
> > the application, I'm wondering WHY the operators are failing to find an
> > existing record and end up creating a duplicate.
>
> There's an automatic feed from external sources, which
> include all the "secondary" data (address, phone etc).
>
> If the (primary)name doesn't match, a new record is created,
> from all the fields in the external feed.
>
> Combine this with multiple external sources,
> run for 5 years, and you have a mess.
>
> That's where I'm STARTING.
>
> Now I have to "make it better".
>
>     BugBear

OK.  There are obviously a lot of things about this app we don't know.
In an earlier post you said "Different operators are FAILING to locate
the original record, and are entering a new one. "  Now you're saying
it's an automatic feed from multiple external sources.  In any event,
I'd want to know what is being done to change the way the data is
initially captured.  Because if nothing is being done to fix that, your
attempts to clean the data will be futile.

That being said, given the immediate task of identifying duplicates, I'd
go back to my original suggestion of looking at EVERY column in the
database and giving thought to how it might be used.  I've already
described potential normalization of address and phone number.  Perhaps
something along the same lines for name.  Are all of the names in the
format of <some variation of first name, middle initial><last name>?
If so, perhaps you could pick out the last name, then concatenate it
with the first letter of whatever passes for the first name.  Maybe
concatenate that with a normalized phone and/or address ...
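
Just to make that concrete, here's a rough sketch of the sort of query
I have in mind.  The table and column names (customers, cust_id,
cust_name, phone) are invented for illustration; substitute whatever
your real schema uses.  The phone column is stripped down to its digits
with the old TRANSLATE trick, so it should work even pre-10g:

  SELECT match_key, COUNT(*) AS dup_count
    FROM (SELECT cust_id,
                 -- last word of the name ...
                 UPPER(SUBSTR(cust_name, INSTR(cust_name, ' ', -1) + 1))
                 || '|' ||
                 -- ... plus the first initial ...
                 UPPER(SUBSTR(cust_name, 1, 1))
                 || '|' ||
                 -- ... plus the phone stripped to digits only
                 TRANSLATE(phone, '0123456789' || phone, '0123456789')
                   AS match_key
            FROM customers)
   GROUP BY match_key
  HAVING COUNT(*) > 1
   ORDER BY dup_count DESC;

From there you could feed the keys that come back with dup_count > 1
into whatever review/merge process you settle on.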

