Path: dp-news.maxwell.syr.edu!spool.maxwell.syr.edu!news-spur1.maxwell.syr.edu!news.maxwell.syr.edu!postnews.google.com!t39g2000cwt.googlegroups.com!not-for-mail
From: "Joel Garry" <joel-garry@home.com>
Newsgroups: comp.databases.oracle.server
Subject: Re: slightly OT - cleaning up "dirty" keys?
Date: 3 Mar 2006 11:38:53 -0800
Organization: http://groups.google.com
Lines: 49
Message-ID: <1141414733.240545.310130@t39g2000cwt.googlegroups.com>
References: <4405a556$0$3558$ed2619ec@ptn-nntp-reader03.plus.net>
   <1141337909.206964.158370@t39g2000cwt.googlegroups.com>
   <4408291a$0$70321$ed2619ec@ptn-nntp-reader03.plus.net>
   <1141395060.427234.4320@j33g2000cwa.googlegroups.com>
   <44086cbe$0$70293$ed2619ec@ptn-nntp-reader03.plus.net>
NNTP-Posting-Host: 67.112.255.226
Mime-Version: 1.0
Content-Type: text/plain; charset="iso-8859-1"
X-Trace: posting.google.com 1141414738 31934 127.0.0.1 (3 Mar 2006 19:38:58 GMT)
X-Complaints-To: groups-abuse@google.com
NNTP-Posting-Date: Fri, 3 Mar 2006 19:38:58 +0000 (UTC)
In-Reply-To: <44086cbe$0$70293$ed2619ec@ptn-nntp-reader03.plus.net>
User-Agent: G2/0.2
X-HTTP-UserAgent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0),gzip(gfe),gzip(gfe)
X-HTTP-Via: 1.0 ISA-OC2
Complaints-To: groups-abuse@google.com
Injection-Info: t39g2000cwt.googlegroups.com; posting-host=67.112.255.226;
   posting-account=YRNZ5wwAAAAg-yYjMFwy3FyWUbPiwNdq
Xref: dp-news.maxwell.syr.edu comp.databases.oracle.server:262723


bugbear wrote:
> EdStevens wrote:
> > And that means looking at the human factors as well.  Not knowing anything about
> > the application, I wonder WHY the operators are failing to find an
> > existing record and end up creating a duplicate.
>
> There's an automatic feed from external sources, which
> include all the "secondary" data (address, phone etc).
>
> If the (primary) name doesn't match, a new record is created,
> from all the fields in the external feed.
>
> Combine this with multiple external sources,
> run for 5 years, and you have a mess.
>
> That's where I'm STARTING.
>
> Now I have to "make it better".

Charge by the hour!  :-)

Ed has good ideas.  I'd add that there are really a number of disparate
problems here, and they should be addressed individually.  To start, you
need to define the range of possibilities, based on your feeds (I'm not
asking you to post them!).

So, for example, my initial HAVING suggestion would work on a subset of
the dirt: those rows where the external feed happens to give the same
address (or phone number or whatever) when creating a row with a new
name.  It would also work for identical names.  Err, how do you handle
identical names from different people, anyway?
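
To make that concrete, here is a rough sketch of the grouping idea, in
Python rather than SQL.  The function name, the (name, address) row
layout, the trim-and-lowercase normalisation, and the sample rows are
all made up for illustration; your feeds will need their own rules.

```python
from collections import defaultdict

def names_sharing_address(rows):
    """Return {address: sorted names} for addresses with more than one distinct name.

    Roughly what GROUP BY address HAVING COUNT(DISTINCT name) > 1 would select.
    """
    names_by_address = defaultdict(set)
    for name, address in rows:
        # Assumption: trimming and lowercasing is enough to compare addresses.
        names_by_address[address.strip().lower()].add(name.strip())
    return {
        address: sorted(names)
        for address, names in names_by_address.items()
        if len(names) > 1
    }

# Made-up sample rows, not taken from any real feed.
rows = [
    ("J. Smith", "1 High St"),
    ("John Smith", "1 high st "),
    ("A. Jones", "2 Low Rd"),
]

for address, names in names_sharing_address(rows).items():
    print(address, names)
```

This only flags rows whose secondary data agree after normalisation; it
says nothing about rows where both the name and the address differ.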

The pseudocode in the edit distance link you gave would translate quite
simply to awk, which I would highly recommend for these sorts of
cleansings.  I've always been amazed at how efficient it is, going back
to the olden days, and it is a language designed for this type of
thing.  If you don't already know it, see the awk book by Aho, Kernighan,
and Weinberger (and I'm sure there are a few others if you need them).
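
For anyone who wants to see the shape of it first, here is a minimal
sketch in Python rather than awk.  I'm assuming the link describes the
usual Levenshtein distance (single-character inserts, deletes and
substitutions); if it describes a different metric, this is not it.  The
near_duplicates helper, its threshold of 2, and the sample names are
hypothetical.

```python
from itertools import combinations

def edit_distance(a, b):
    """Levenshtein distance between strings a and b (row-by-row dynamic programming)."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]

def near_duplicates(names, max_distance=2):
    """Return (name, other, distance) for distinct names within max_distance edits."""
    pairs = []
    for first, second in combinations(sorted(set(names)), 2):
        distance = edit_distance(first.lower(), second.lower())
        if distance <= max_distance:
            pairs.append((first, second, distance))
    return pairs

# Made-up names, purely for illustration.
for pair in near_duplicates(["John Smith", "Jon Smith", "Jane Doe"]):
    print(pair)
```

Comparing every pair is quadratic in the number of names, so on a large
table you would probably want to compare only within some cheaper
grouping first.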

The goal, of course, is to winnow down to a list of possibilities that
someone can look at and say yea or nay.  Then make a more reasonable
key!

jg
--
@home.com is bogus.
http://64.233.179.104/custom?q=cache:OUIfJJWtJI8J:www.phpbbserver.com/phpbb/viewtopic.php%3Ft%3D189%26mforum%3Ddizwellforum%26+natural+keys&hl=en&gl=us&ct=clnk&cd=1&ie=UTF-8

