Re: Selecting SIMILAR, not the same records (PROBABLE) duplicates

From: DA Morgan <damorgan_at_psoug.org>
Date: Wed, 06 Sep 2006 11:15:05 -0700
Message-ID: <1157566502.884700@bubbleator.drizzle.com>

kroger wrote:

>> And if you have this what do you do?
>>
>> ID
>> 1   aaa
>> 2   aaa h
>> 3   h
>> 4   h aaa
>>
>> The request makes no business sense. Here's what I would suggest:

>
> In that case (1,2) and (3,4) are candidate duplicates. It makes business
> sense, looking at a simple example (data entered by dumb or dumber user into
> name field):
>
> -JohnSmith
> -John
> -John Smith Jr
>
> In the app we are developing, we must be able to display that those three
> entries MAY refer to the same person...
>

>> SELECT DISTINCT name
>> FROM table;
>>
>> Spool the output, send it to the manager of the department, and ask
>> them to sort it out.

>
> This wouldn't make sense because
> 1. above
> 2. data changing approximately 5-20k records per day

<MAJOR RANT>
So essentially you have no data integrity and you get to do this forever.

You are being asked to clean up a memo field by hand.

I'd be working on my resume full-time. Some lucky soul is going to get to
babysit this for what will seem to be an eternity. Right now the target is
painted on the front of your shirt.
</MAJOR RANT>

-- 
Puget Sound Oracle Users Group

Received on Wed Sep 06 2006 - 13:15:05 CDT