Re: Problems with UTF8

From: Martin Berger <martin.a.berger_at_gmail.com>
Date: Tue, 14 Oct 2008 08:01:31 +0200
Message-ID: <cd8f74560810132301v354a2ac7i87c116c1026723d1@mail.gmail.com>


Hi Gokul,
I'm afraid, yo missed one part. Let me explain the (possible) character conversion chain (and how oracle acts on its part): In most environments you have more than 2 involved modules. e.g. PC <-> Terminal on DB-Server <-> sqlplus <-> Database all of them might use different character sets, some of them try to do a more or less correct conversion, others do not.

I do not know much about the conversion between PC and Terminal, but let's face the part Terminal <-> sqlplus <-> DB:

Oracle supports different character sets. It also supports different character sets on client (sqlplus) and server. in such a case one of them tries to convert the characters. If the client can do it, he is supposed to do so (for performance reasons), if he can't (for any reason) the server process is doing so.
From Oracles point of view it's much easier if the client and DB have the same character set setting. So NO conversion and NO formal check (if the character is valid [1]) is done. It's often hard to find such situations: let's assume there is a Terminal with ISO8859P1 character set, sqlplus and DB has both AL32UTF8. you try to insert a Paragraph (§ - UNICODE: 0xA7, UTF8: 0xC2A7) the terminal generates 0xA7. As there is no conversion between terminal and sqlplus, sqlplus just takes this character and passes it to the DB. DB and sqlplus have same character set, therefore no one converts here. If the statement is an insert, you inserted a wrong character into the DB, and no one raised an error (!). To make it worse, if you select from this table, it's all the same way back, you will see a § on the terminal, and also that way no one knows, something goes wrong.

To check if there is such a problem, you can use the dump() function.

In this particular case, you can run

SELECT dump('AíB',1016) from dual;

and get

DUMP('AíB',1016)



Typ=1 Len=4 CharacterSet=AL32UTF8: 41,c3,ad,42

you see 4 bytes: A => 0x41, í => 0xC3AD, B => 0x 42

If you get any other results, it's time to investigate (in an AL32UTF8 DB!).

Now to your solution:

> We finally got this working using NLS_LANG=AMERICAN_AMERICA.WE8ISO8859P1 and LANG=en_US.ISO8859-1

You still does not mention the part PC <-> Terminal. If you are using PuTTY or some similar terminal, check its settings. I'm pretty sure it is also ISO8859-1. :-)

I hope I could help you understanding this issue.

best regards,
 Martin

[1] in multibyte character sets, there are ranges which are NOT matched to characters and therefore are invalid. e.g. there is NO character 0x80: UNICODE 0x7F is matched to UTF8 0x7F, UNICODE 0x80 is matched to UTF8 0xC280
You can use http://berx.at/tools/UNICODE.HTML to convert around a little bit. it's not perfect (it accepts also false' codes) but it only generates correct ones.

--
Martin Berger
http://berxblog.blogspot.com
--
http://www.freelists.org/webpage/oracle-l
Received on Tue Oct 14 2008 - 01:01:31 CDT

Original text of this message