Re: Unexpected conversion from AL32UTF8 to WE8ISO8859P1

From: Jaap W. van Dijk <j.w.vandijk.removethis_at_hetnet.nl>
Date: Wed, 14 Jan 2009 07:48:50 GMT
Message-ID: <496d969d.309812_at_news.hetnet.nl>



On 13 Jan 2009 16:38:54 GMT, Laurenz Albe <invite_at_spam.to.invalid> wrote:

>Jaap W. van Dijk <j.w.vandijk_at_hetnet.nl> wrote:
>> This is my testcase (comments between square brackets):
>>
>> [Logging on to the source database:]
>>
>> SQL*Plus: Release 10.1.0.2.0 - Production on Vr Jan 9 08:40:14 2009
>>
>> Copyright (c) 1982, 2004, Oracle. All rights reserved.
>>
>>
>> Connected to:
>> Oracle Database 10g Enterprise Edition Release 10.2.0.3.0 - Production
>> With the Partitioning, OLAP and Data Mining options
>>
>> SQL> select *
>> 2 from v$nls_parameters
>> 3 where parameter = 'NLS_CHARACTERSET';
>>
>> PARAMETER
>> ----------------------------------------------------------------
>> VALUE
>> ----------------------------------------------------------------
>> NLS_CHARACTERSET
>> AL32UTF8
>
>[...]
>
>> [Now I create and fill the table at the source database:]
>>
>> SQL> create table test_jvd (teken varchar2(1 character),tekst varchar2
>> (100));
>>
>> Table created.
>>
>> SQL> insert into test_jvd (teken,tekst) values (chr(49765),'Small E
>> acute');
>>
>> 1 row created.
>>
>> SQL> commit;
>>
>> Commit complete.
>>
>> SQL> select dump(teken)
>> 2 from test_jvd;
>>
>> DUMP(TEKEN)
>> ------------------------------------------------------------------------
>> Typ=1 Len=2: 194,101
>
>Ok, now I know what your problem is!
>
>CHR(49765) is *NOT* a small e acute.
>
>It is in fact an invalid UTF-8 sequence, and if Oracle were not so
>incredibly sloppy about encoding, it would report an error.
>
>As it is, it stores garbage in the table, and this garbage will turn to
>question marks as soon as a conversion takes place.
>
>A small e acute in UTF-8 is actually 50089.
>
>Try again with CHR(50089) and voila! it will work.
>
>Yours,
>Laurenz Albe

Hi Laurenz,

Thanks for diving into this! In an other posting in this thread I came to the same conclusion. The list with erroneous codes that was handed to me put me on the wrong track. I had some trouble determining that the codes were wrong because I could not find a list of correct utf-8 codes for 1 and 2 bytes on the internet. Unicode code point are easily found, but it wasn't until I found a document with the conversion algorithm from unicode codepoint to UTF-8 byte representations that I discovered the the codes handed to me were wrong.

How did you come by your information on valid UTF-8 codes?

Regards,
Jaap. Received on Wed Jan 14 2009 - 01:48:50 CST

Original text of this message