Re: Selecting SIMILAR, not the same records (PROBABLE) duplicates

From: Ed Prochak <edprochak_at_gmail.com>
Date: 6 Sep 2006 08:59:33 -0700
Message-ID: <1157558372.992706.41310@e3g2000cwe.googlegroups.com>

kroger wrote:
> > kroger wrote:
> >> Hi,
> >>
> >> I've been struggling with that for two days now...
> >> There is a simple solution for finding duplicates - with GROUP BY and
> >> HAVING COUNT(*)>1 but it is not enough in my case...
> >>
> >> For the example table as follows:
> >>
> >> id || name
> >> 1 || aaa
> >> 2 || aaa xxx
> >> 3 || aaa
> >> 4 || aaah
> >> 5 || bbb
> >> 6 || bbb p
> >> 7 || ccc
>
> > If you think ID1 = ID2 and ID1=ID4 you are not looking for duplicates.
> > It is no wonder no one's proposed solution worked.
>
> I know they are not duplicates. That;s why I put SIMILAR and PROBABLE
> duplicates in the topic...
> Bad thing is, I need to have some way to display say element A (of whatever
> id 1-4) and all its PROBABLE duplicates... Tricky, but requested...
>
> > My immediate thought is that you have two choices. One is regular
> > expressions if you are in 10g or a PL/SQL procedure.
>
> I'm using 9i unfortunately :( I was considering doing that in PL/SQL but -as
> I mentioned, I don't rely on myself as it comes to Oracle and its goodies...
> That's why I was posting this problem...
>
> >
> > Without knowing the business rules it is impossible to tell what
> > would be required.
>
> Since I cannot move to 10g, PL/SQL, I suppose..
>
> >
> > But by any normal definition the normal the only duplicates are
> > ID1 and ID3 and I'd solve that problem by creating a unique constraint
> > and dumping violations to an error table.
>
> As I mentioned, it's not about dumping duplicates, but displaying possible
> user faults when entering data for further verification...
>
> Thanks a lot, anyway!
> BR,
> Kroger

having done similar work for a client, you will be best off using PL/SQL or other procedural language (I also used PERL on that project.)

IF you must use only a SQL solution, then you need some preparation work. Assuming this is a spelling issue, then you create a spelling correction table. One column is the misspelling, and the second column is the correct spelling. As long as the mispelling always maps to only one correct spelling then this works. Otherwise you need other intervention to bring more context into play, which means at least more columns to the spelling table. (Consider that if the data you are trying to match is a set of street names, then the context of the street name is the city. So City would be a column in the spelling table and in your query. For example the misspelled street FAR is FAIR in A city, but it is FARE in B city.)

HTH,
Ed
(Send me an email if this discussion does not help.) Received on Wed Sep 06 2006 - 10:59:33 CDT