Re: Character Set Problems

From: Laurenz Albe <invite_at_spam.to.invalid>
Date: 15 Jun 2007 13:08:08 GMT
Message-ID: <1181912853.399014@proxy.dienste.wien.at>

Paul <paulwragg2323_at_hotmail.com> wrote:

>>Is is clear to you why you always get the UTF-8 sequence as two
>>characters, no matter how you set NLS_LANG?

>
> If I am completely honest probably not. [...]
>
> The problem I have is that I believe that the client has changed from
> WE8ISO8859P1, to WE8MSWIN1252 (whereas on Oracle 8 the client matched
> the DB server). I think that the upgrade was done by exporting from 8
> and importing into 9 (which of course could have had an effect). As
> WE8MSWIN1252 is a superset of WE8ISO8859P1 I don't believe it would
> have caused a problem?

Unfortunately it is a problem, although WIN1252 is a superset of ISO8859-1, as you corretly observe.
The problem is that if client and server character set are different, character conversion takes place. Before there was no conversion, and Oracle did not perform any validity checks.

Let me illustrate this with an example:
Suppose that both database and client are WE8ISO8859P1, and you store the UTF-8 sequence like 0xe2 0x82 0xac (which is the Euro symbol). These three bytes will be stored in the database, and since the server character set is ISO8859-1, they will resemble three characters: lower case a with circumflex accent, an undefined character (0x82 has no meaning in ISO8859-1!) and the logical not symbol.

Now when you retrieve this sequence with a client that has WE8MSWIN1252 configured, these three bytes will be converted. The first and last byte will not be a problem, because as you observed, WIN-1252 is a superset of ISO8859-1. But the middle character does not correspond to any WIN-1252 character! Oracle, ill-guided, as I would say, does not give you an error, but replaces the questionable character with the "default replacement character", in that case 0xbf (inverted question mark).

You end up with the sequence 0xe2 0xbf 0xac which has no meaning (it is UNICODE 0x2fec which is undefined).

> I do not expect anybody to give me a solution, but some pointers/ideas
> on how other people would handle storing UTF8 data in a Western
> European Character Set DB would be great! I am not asking for a quick
> fix, more of a pointer so I can then go off an look into this in more
> detail to get a soltution. I just hope character sets are covered more
> in the 2nd exam!!

You can only do that by using the National Character Set. This is always a UTF-8 character set in Oracle 10g. You can then define a column als NVARCHAR2 or NCLOB and store UTF-8 data in it, even if the rest of the database has ISO8859-1.

There is no other way to store UTF-8 data in a database with a single byte character set - this is basically a contradiction in terms.

Yours,
Laurenz Albe Received on Fri Jun 15 2007 - 08:08:08 CDT