Re: UTF-8 issue with PHP app in Windows environment.

From: Laurenz Albe <invite_at_spam.to.invalid>
Date: 26 Apr 2007 12:02:39 GMT
Message-ID: <1177588955.107950@proxy.dienste.wien.at>

roman_coder <clayton.bolz_at_gmail.com> wrote:
> I have a UTF8 PHP application that is writing a string containing
> special characters to Oracle through an ODBC connection. The Oracle
> database is set up for UTF8 support.
>
> Here is the issue. I have a simple string, "louis de funès". When
> the data is moved manually in UTF8, it comes up
> correctly. The Oracle dump() shows:
>
> WORKING DATA:
> Typ=1 Len=15: l,o,u,i,s, ,d,e, ,f,u,n,c3,a8,s
>
>
> However, when the same string is Inserted through the PHP application
> the data shows up in the db like this.
>
> NOT - WORKING:
> Typ=1 Len=17: l,o,u,i,s, ,d,e, ,f,u,n,c3,83,c2,a8,s
>
> Anyone know why I would get 2 extra bytes (83,c2) added in the middle
> of the è character? Is the Oracle client doing some other type of
> character set conversion before I insert it into the database?

I cannot tell you where exactly you have something misconfigured, but I can tell you what happens:

The e with grave accent (è) is the single byte 0xE8 in ISO8859-1 and the two bytes 0xC3 0xA8 in UTF-8.

But 0xC3 0xA8 can also be interpreted as two separate characters, 0xC3 (Ã) and 0xA8 (¨), if you use a single-byte encoding like ISO8859-1.

If you convert these two characters from ISO8859-1 to UTF-8, you end up with 0xC3 0x83 and 0xC2 0xA8, which are exactly the four bytes you see.

So somewhere in your code you erroneously convert data that is already in UTF-8 into UTF-8 a second time.
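The double conversion described above can be reproduced outside of PHP and Oracle. The following is only a sketch in Python, not your application's code; the helper oracle_style_dump is a made-up name that merely imitates the layout of the dump() output quoted above.

```python
# The test string; \u00e8 is the e with grave accent.
text = "louis de fun\u00e8s"

# Correct: encode once to UTF-8.
correct = text.encode("utf-8")

# Faulty: the UTF-8 bytes are mistaken for ISO8859-1 characters
# and then converted to UTF-8 a second time.
double = correct.decode("iso-8859-1").encode("utf-8")


def oracle_style_dump(data: bytes) -> str:
    """Hypothetical helper: mimic the dump() layout shown in the post.

    Printable ASCII bytes are shown as characters, all others as hex.
    """
    parts = [chr(b) if 0x20 <= b < 0x7F else format(b, "02x") for b in data]
    return "Typ=1 Len={}: {}".format(len(data), ",".join(parts))


print(oracle_style_dump(correct))
print(oracle_style_dump(double))
```

The first line printed should have Len=15 with the bytes c3,a8 for the accented character; the second should have Len=17 with c3,83,c2,a8 in the same place.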

You'll have to figure out where that happens and fix it.
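While you look for the faulty conversion, a check like the following may help you tell good rows from damaged ones. It is a heuristic sketch in Python under the assumption that the damage is exactly one extra ISO8859-1 to UTF-8 conversion; undo_double_utf8 is a hypothetical name, and a string that legitimately contains "Ã¨" would be misdetected.

```python
def undo_double_utf8(data: bytes) -> bytes:
    """Reverse one extra ISO8859-1 -> UTF-8 conversion, if one is detectable.

    Returns the input unchanged when it does not look double-encoded.
    """
    try:
        # Undo the second conversion: every character must fit in one byte.
        candidate = data.decode("utf-8").encode("iso-8859-1")
        # The result must itself be valid UTF-8, otherwise the input
        # was not double-encoded in the first place.
        candidate.decode("utf-8")
    except (UnicodeDecodeError, UnicodeEncodeError):
        return data
    return candidate


good = "louis de fun\u00e8s".encode("utf-8")          # ... c3 a8 ...
bad = good.decode("iso-8859-1").encode("utf-8")       # ... c3 83 c2 a8 ...

print(undo_double_utf8(bad) == good)   # damaged data is repaired
print(undo_double_utf8(good) == good)  # correct data is left alone
```

This only repairs the symptom. The conversion that causes it still has to be found and removed.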

> I have also noticed that when I change the NLS_LANG from
> AMERICAN_AMERICA.UTF8 to AMERICAN_AMERICA.WE8MSWIN1252 that the 4 byte
> 'è' character works and the 2 byte character doesn't.

I do not quite understand this. What do you mean, 'the 4 byte character works'?

Yours,
Laurenz Albe

Received on Thu Apr 26 2007 - 07:02:39 CDT