Oracle FAQ Your Portal to the Oracle Knowledge Grid
HOME | ASK QUESTION | ADD INFO | SEARCH | E-MAIL US
 

Home -> Community -> Usenet -> c.d.o.server -> Re: Detecting typos

Re: Detecting typos

From: CE <charlie3101_at_hotmail.com>
Date: 7 Jan 2005 02:38:45 -0800
Message-ID: <1105094325.518342.283170@z14g2000cwz.googlegroups.com>

CE wrote:
> Hi,
>
> I'm working on a project which involves matching customers against an
> existing database using (amonst other things) address, date of birth
> and name.
>
> One problem I have to overcome is allowing for typographical errors.
> E.g. matching "CATHERINE" with "CSTHERINE". So what I though I'd do
is
> have a little function to compare 2 strings and return the number of
> differences:
>
> CREATE OR REPLACE function string_compare
> (string1 IN VARCHAR2,
> string2 IN VARCHAR2) RETURN NUMBER IS
> diffs BINARY_INTEGER := 0;
> BEGIN
> IF least(length(string1),length(string2)) > 0 THEN
> FOR i IN 1..least(length(string1),length(string2)) LOOP
> IF substr(string1,i,1) <> substr(string2,i,1) THEN
> diffs := diffs + 1;
> END IF;
> END LOOP;
> END IF;
> RETURN diffs;
> END;
> /
>
> This works ok (I've got to watch out for comparing MARY and MARK etc,
> though), but doesn't handle a comparison where there might be a
missing
> letter (e.g. "CATHERINE" and "CTHERINE"). Can anybody think of
> anything a little bit cleverer?
>
> Also if anyone could recommend a book on the rationale/logic behind
> such customer matching/de-duplicating, I'd also be grateful.
> Thanks
>
> CE

For completeness sake, this is the function I knocked up in the end. It returns a 1 (as opposed to a 0) if it's an exact match of if there is a single typographical error which I've defined as one of

1) Missing or added character (e.g. CHARLIE and CHARLI)
2) One character incorrect (e.g. CHARLIE and CHARLIW)
3) Pair of characters transposed (e.g. CHARLIE and CHALRIE)

CREATE OR REPLACE function is_typo
(p_string1 IN VARCHAR2,

p_string2 IN VARCHAR2) RETURN NUMBER IS
v_differences BINARY_INTEGER := 0;
v_string1 VARCHAR2(100) := p_string1;
v_string2 VARCHAR2(100) := p_string2;
v_is_typo NUMBER := 0;

BEGIN
IF v_string1 = v_string2 THEN
v_is_typo := 1;
ELSIF LENGTH(v_string1) = LENGTH(v_string2) THEN FOR i IN 1..LENGTH(v_string1) LOOP
IF substr(v_string1,i,1) <> substr(v_string2,i,1) THEN v_differences := v_differences + 1;
IF v_differences > 1 THEN
EXIT; -- Quit the loop as soon as > 1 difference END IF;
END IF;
END LOOP;
IF v_differences < 2 THEN
v_is_typo := 1;
END IF;
IF v_is_typo <> 1 THEN
FOR i IN 1..LENGTH(v_string1)-1 LOOP
IF SUBSTR(v_string1,1,i-1)||
SUBSTR(v_string1,i+1,1)||
SUBSTR(v_string1,i,1)||
SUBSTR(v_string1,i+2) = v_string2 THEN

v_is_typo := 1;
EXIT; -- Quit the loop as soon as a match is found END IF;
END LOOP;
END IF;
ELSIF ABS(LENGTH(v_string1) - LENGTH(v_string2)) = 1 THEN -- Put the longer string in v_string1
IF LENGTH(v_string1) < LENGTH(v_string2) THEN v_string1 := p_string2;
v_string1 := p_string1;
END IF;
FOR i IN 1..LENGTH(v_string1) LOOP
IF SUBSTR(v_string1,1,i-1)||
SUBSTR(v_string1,i+1) = v_string2 THEN
v_is_typo := 1;
EXIT; -- Quit the loop as soon as a match is found END IF;
END LOOP;
END IF;
   RETURN v_is_typo;
END is_typo;
/

Regards

CE Received on Fri Jan 07 2005 - 04:38:45 CST

Original text of this message

HOME | ASK QUESTION | ADD INFO | SEARCH | E-MAIL US