Re: slightly OT - cleaning up "dirty" keys?

From: EdStevens <quetico_man_at_yahoo.com>
Date: 3 Mar 2006 06:11:00 -0800
Message-ID: <1141395060.427234.4320@j33g2000cwa.googlegroups.com>

bugbear wrote:
> Joel Garry wrote:
> > Is there any other information that might give a clue, ie, address?
> > group by address having count(*) > 1
> >
> > Then all you have to worry about is wrong addresses :-)
>
> Heh. That's the problem.
>
> Different operators are FAILING to locate
> the original record, and are entering a new one.
> The address fields are just as "dirty" as
> th so-called (name) primary key.
>
> All the data is a freakin' dirty mess.
>
> I was hoping to cluster "likely"
> primary-keys, so a human
> could merge them, and also merge/fix the associated
> non-primary data.
>
> BugBear

Ok, I can see where names and addresses can be pretty messey. How about something else? Phone number, perhaps? Ok, so sometimes it gets area code, sometimes not, sometimes with hyphens, sometimes not. A little function to remove anything not 0-9, then just look at the remaining rightmost 7 chars?

Other random thoughts ...

Cleaning up a mess like this won't be easy, and no matter what it will require a fair amount of manual, human intelligence. But think outside the box. Surely there is *some* column that is less dirty but also row-identifiable. Look the table def over closely and give some real consideration to each column.

Might consider creating a 'shadow' table with all the columns of the original, plus some columns to hold 'normalize' values of the columns being analyzed.

Thinking back to addresses ... are they all street addresses, or are some PO boxes? I'm thinking if street addresses, perhaps you could: 1- convert to all uppercase
2 - pick off the first 'word' -- which would be the 'house number' part of the address
3 - pick out the bulk of the street name -- dropping off qualifiers like 'Street', 'Road', "Lane', etc. and their abbreviations. So that an address like '123 E. Elm St.' would yeild a normalized value of '123ELM'

I'm sure you realize by now (if you didn't already!) that setting "name" as a primary key field means you have no primary key, and this needs to be addressed in the application. (actually, if you have a column 'name' instead of 'first name' and 'last name', it suggests you've probably got a whole raft of other design problems!) And that means looking at the human factors as well. Not knowing anything about the application, I'm wonder WHY the operators are failing to find an existing record and end up createing a duplicate. Perhaps the app should always start with a lookup of possible existing records, base on a search of LAST NAME (not just 'name') or phone number. Of course, if the organization isn't willing to entertain such ideas, they need to be reminded that if they continue to do things the same way, they are going to continue to get the same unacceptable results. Received on Fri Mar 03 2006 - 08:11:00 CST