Oracle FAQ Your Portal to the Oracle Knowledge Grid
HOME | ASK QUESTION | ADD INFO | SEARCH | E-MAIL US
 

Home -> Community -> Usenet -> c.d.o.server -> Re: slightly OT - cleaning up "dirty" keys?

Re: slightly OT - cleaning up "dirty" keys?

From: Frank van Bortel <frank.van.bortel_at_gmail.com>
Date: Thu, 02 Mar 2006 21:09:25 +0100
Message-ID: <du7ili$r7g$1@news4.zwoll1.ov.home.nl>


bugbear wrote:
> sybrandb_at_yahoo.com wrote:

>> Try the SOUNDEX function available in Oracle on your data.
>> It is heavily English-oriented, but that doesn't seem to be a problem
>> in your case.
>> SOUNDEX will provide the 'phonetic' representation of a name.

>
> That would be a useful "distance" measure.
>
> That would give me matching (of user input) under dirty data,
> but doesn't help me cluster (and re-normalise) the existing
> cruddy data.
>
> BugBear

Hardly - soundex only takes the first four characters in account, if these differ, and result in different mappings.

The algorithm is known, and public; this was taken from http://www.blogger.com/publish.g?blogID=12361937&inprogress=true:

# Capitalize all letters in the word and drop all punctuation marks. Pad the word with rightmost blanks as needed during each procedure step. # Retain the first letter of the word.
# Change all occurrence of the following letters to '0' (zero):   'A', E', 'I', 'O', 'U', 'H', 'W', 'Y'. # Change letters from the following sets into the digit given:

# Remove all pairs of digits which occur beside each other from the string that resulted after step (4).
# Remove all zeros from the string that results from step 5.0 (placed there in step 3)
# Pad the string that resulted from step (6) with trailing zeros and return only the first four positions, which will be of the form <uppercase letter> <digit> <digit> <digit>.

So, your "J Smith", "John Smith" and "J K Smith" will result in totally different soundex values.
 1* select soundex('J Smith'), soundex('John Smith'),soundex('J K Smith') from dual
SQL> / SOUN SOUN SOUN
---- ---- ----
J253 J525 J225

-- 
Regards,
Frank van Bortel

Top-posting is one way to shut me up...
Received on Thu Mar 02 2006 - 14:09:25 CST

Original text of this message

HOME | ASK QUESTION | ADD INFO | SEARCH | E-MAIL US