Re: WE8ISO8859P1 convert to AL32UTF8 unicode character set question

From: Laurenz Albe <invite_at_spam.to.invalid>
Date: Wed, 8 Apr 2009 10:31:31 +0200
Message-ID: <1239179516.842170_at_proxy.dienste.wien.at>

lsllcm wrote:
>>> I have one question about CSSCAN when I do conversion from
>>> WE8ISO8859P1 to AL32UTF8 unicode character set.
>>>
>>> create table aaa (c1 varchar2(10));
>>> insert into aaa values ('sys.�Med');
>>>
>>> After csscan the data, get the application exception data as below
>>>

[...]
>>>AAAYtgAAEAAAEKPAAA lossy conversion sys..Med
[...]
>>> Below is dump of data
>>>

[...]
>>> select dump(c1) from aaa
>>> DUMP(C1)
>>> ----------------------------------------------------
>>> Typ=1 Len=8:
>>> 115,121,115,46,133,77,101,100
>>>
>>> Is report of cssacn not correct?
>>
>> The report of csscan is correct.
>>
>> The problem is the fifth character. In the insert statement above it looks
>> like an ellipsis (three dots), and it ends up as 133 or hex 85 in your
>> database, so I assume that you mean UNICODE hex 2026, the horizontal
>> ellipsis, which is hex 85 in the Windows character set WE8MSWIN1252.
>>
>> The problem is that your database character set is not WE8MSWIN1252,
>> where everything would be fine, but WE8ISO8859P1, where hex 85
>> does not exist.
>>
>> So csscan warns you that if you convert this to AL32UTF8, hex 85,
>> which is an illegal value that has no meaning in WE8ISO8859P1,
>> will be converted to the default replacement character �
>>
>> Your mistake is probably a wrong client character set.
>> Your NLS_LANG contains WE8ISO8859P1, and your client operating
>> system is Windows, correct?
>> That would mean that you are suffering from an Oracle "bug", namely
>> that character input is not checked for validity when client character set
>> and server character set are the same.
>> That way you end up with currupt data in your database.
>
>
> Thanks for your comments:
>
> Below is my test case
>

[...]
> 1.3 exp userid=jacky/jacky file=aaa.dmp tables=aaa
>
> 2. in AL32UTF characterset db on another linux platform
> 2.1 set NLS_LANG=AMERICAN_AMERICA.WE8ISO8859P1
> 2.2 imp userid=system file=aaa.dmp fromuser=jacky touser=jacky ignore=y
> 2.3 select dump(convert(c1,'AL32UTF8')) from aaa
> DUMP(CONVERT(C1,'AL32UTF8'))
> ----------------------------------------------------
> Typ=1 Len=9:
> 115,121,115,46,194,133,77,101,100

Hmm, why convert()?
If the database where you imported is AL32UTF8, the data should already be in UTF-8 encoding.
Do you get the same result if you omit convert()? I would expect so.

> After converted, the char with dump number "133" is converted to char
> with dump number "194,133". It is not same as the char default
> replacement character �.

You are right (I admit that I didn't test it). Obviously Oracle interprets hex 85 in WE8ISO8859P1 as UNICODE hex 85, which is a control character ("next line").

> I am not sure if "133" in WE8ISO8859P1 can map to "194,133" in
> AL32UTF8.
> OR
> if "133" in WE8MSWIN1252 can map to "194,133" in AL32UTF8.

UNICODE hex 85 can be converted to AL32UTF8, and the result is hex C285 or binary 194,133.

The problem is that this is NOT the "horizontal ellipsis" character that you want.
And this is what csscan complains about: Hex 85 is not a correct ISO8859-1 character. In the light of your exp/imp experiment, I would say that you are right to be suspicious of the csscan warning, because Oracle contradicts itself here: on the one hand it warns you (correctly) that hex 85 is not a valid character, on the other hand it happily converts it into UNICODE "next line".

> But from result in UI, they are same.
>
> Could you help review it again?

Ok, the case hase become a little more convoluted, but the basic things have not changed.

You see "correct" results in your user interface, but the data in the database are wrong.

You store a Windows "horizontal ellipsis" character in a WE8ISO8859P1 database, but your client character set is not set to the (correct) WE8MSWIN1252 character set, so Oracle interprets the hex 85 that it gets from Windows as a WE8ISO8859P1 character, that is (depending on viewpoint) either garbage or a control character, but certainly not a horizontal ellipsis.

When csscan sees this, it judges the character as garbage. exp/imp, however, interprets it as control character and translates it to AL32UTF8 as such.

As long as you keep your Oracle client character set to WE8ISO8859P1 *and* interpret the resulting character as Windows character, everything will appear to be fine. Basically you're doing the reverse of the above.

But it is a sleeping time bomb.
For example, you will never be able to use this database with JDBC, because Java cannot use WE8ISO8859P1.

The character you want (horizontal ellipsis) is hex E280A6 (or decimal 226,128,166) in AL32UTF8, and that is what should be in your database if everything had gone correctly.

I recommend that you do the following (provided you have only Windows characters in your database:

Backup the original database.
Change the database character set from WE8ISO8859P1 to WE8MSWIN1252 using CSALTER as described in http://download-uk.oracle.com/docs/cd/B19306_01/server.102/b14225/ch12scanner.htm#sthref1702
Verify that csscan does not report any problems now.
Now use exp/imp to transfer the data into your AL32UTF8 database.
Change your client character set to WE8MSWIN1252.

Then everything should look ok again, and in addition it *is* ok.

Yours,
Laurenz Albe Received on Wed Apr 08 2009 - 03:31:31 CDT

This message: [ Message body ]
Next message: Hemant K Chitale: "Re: UNDO: 10g-style behaviour in 9.2.0.8?"
Previous message: Phil H: "Re: Query to find first missing integer in a field"
Maybe in reply to: lsllcm: "WE8ISO8859P1 convert to AL32UTF8 unicode character set question"
Next in thread: lsllcm: "Re: WE8ISO8859P1 convert to AL32UTF8 unicode character set question"

Contemporary messages sorted: [ by date ] [ by thread ] [ by subject ] [ by author ]

Original text of this message