WE8ISO8859P1 and quasi-"conventional" conversion to US7ASCII for searches

From: Ken Ho <hoke_at_gse.harvard.edu>
Date: 3 Oct 2002 11:39:25 -0700
Message-ID: <88f8c6ea.0210031039.2fd8eb1d@posting.google.com>

Our database character set is WE8ISO8859P1, with the bulk of data in basic US7ASCII letters. One exception is names.

Different users have different knowledge of how to create accented characters (from the range chr(128) through chr(255)) on the keyboard, so for the same name, one might enter "Jose" (no accents) while another might enter "José" (e with acute accent). Even between two people who happen to have that name, one might be in the habit of entering it with the accent and the other without. We want to respect those choices, so we cannot standardize on "always with accents" or "always without accents".

This inconsistency complicates looking up data, since a search for "Jose" would not turn up "José" if the search mechanism matches names on exact string equality.

We are considering using

CONVERT(<string>, 'US7ASCII', 'WE8ISO8859P1')

but only *after* applying several

REPLACE(<string>, <source letter>, <replacement letter(s)>)

operations (described below) as part of the search mechanism so that searches will succeed regardless of whether the user typing in the search query enters accented characters and whether the data they're looking for has accented characters.

The CONVERT operation simply strips off the accent mark for many characters (e.g., é becomes e), but it does not convert ã, ñ, or õ (with tildes) to a, n, or o, respectively, so those would be handled by the preliminary REPLACE.

We would also change æ (the a-e ligature, a and e joined together) to ae (two letters), by using a preliminary REPLACE.

Thus, with this type of search mechanism, if someone looked up "Jose" or "José", any entry for either would be returned.
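
To illustrate the intended behavior outside of Oracle, here is a minimal Python sketch of the same fold-then-compare idea. It is not the CONVERT/REPLACE implementation itself: the names fold, search, and PRE_REPLACEMENTS are made up for illustration, and Python's Unicode decomposition stands in for CONVERT (unlike CONVERT as described above, it also strips tildes, so only the ligature needs an explicit replacement here).

```python
import unicodedata

# Multi-letter substitutions applied first, mirroring the preliminary REPLACE step.
# Only the lowercase pair comes from the post; the uppercase entry is an assumption.
PRE_REPLACEMENTS = {"æ": "ae", "Æ": "AE"}


def fold(name):
    """Return an unaccented comparison key for a name (case is left unchanged)."""
    for source, replacement in PRE_REPLACEMENTS.items():
        name = name.replace(source, replacement)
    # NFD splits e.g. "é" into "e" plus a combining acute accent;
    # dropping the combining marks leaves only the base letters.
    decomposed = unicodedata.normalize("NFD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def search(names, query):
    """Return every stored name whose folded form equals the folded query."""
    key = fold(query)
    return [n for n in names if fold(n) == key]


stored = ["Jose", "José", "Muñoz", "Josef"]
print(search(stored, "Jose"))
print(search(stored, "José"))
```

The stored names are never modified; folding is applied only at comparison time, so each person's preferred spelling is preserved.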

All the conversions mentioned so far are informal, but they seem sufficiently conventional here in the U.S.A. that anyone who did not know how to create the "special" character on the keyboard would typically use them.

I am not sufficiently familiar with the languages from which the following characters come (I have described each one in case it does not come through properly), so I am less sure about appropriate equivalents [in square brackets] and would appreciate comments on them:

ß (capital B with a long "tail") I believe this is a German letter which is often converted (perhaps only informally) to "ss" for US7ASCII. [ss? sz?]

Ð (D with a short dash through the left vertical line) [D? TH?]

ð (curly d with a short dash through its upper line, apparently the lowercase version of the immediately preceding letter) [d? th?]

Ø (O with a forward slash in it) [O? OE?]

ø (o with a short forward slash in it) [o? oe?]
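
Until the right equivalents are settled, one possible way to defer the choice is to treat two names as matching if they agree under any of the bracketed candidates. The Python sketch below only illustrates that idea; the names CANDIDATES, candidate_keys, and matches are hypothetical, and none of the listed equivalents is confirmed as correct.

```python
from itertools import product

# Candidate ASCII equivalents copied from the bracketed suggestions above.
# Which one (if any) is appropriate is exactly the open question.
CANDIDATES = {
    "ß": ["ss", "sz"],
    "Ð": ["D", "TH"],
    "ð": ["d", "th"],
    "Ø": ["O", "OE"],
    "ø": ["o", "oe"],
}


def candidate_keys(name):
    """Return every spelling obtained by picking one candidate per special letter."""
    # Letters without an entry in CANDIDATES pass through unchanged.
    options = [CANDIDATES.get(ch, [ch]) for ch in name]
    return {"".join(choice) for choice in product(*options)}


def matches(stored, query):
    """True if the stored name and the query share at least one candidate spelling."""
    return bool(candidate_keys(stored) & candidate_keys(query))


print(sorted(candidate_keys("Strauß")))
print(matches("Søren", "Soeren"))
```

The number of spellings doubles with each special letter in a name, which should be acceptable for short names but is worth keeping in mind.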

Does anyone know of a preferable solution to this issue, keeping in mind that both names stored in the database and names entered for search queries could be accented or not?

-Ken Ho

Received on Thu Oct 03 2002 - 13:39:25 CDT