Re: WE8ISO8859P1 convert to AL32UTF8 unicode character set question
Date: Tue, 14 Apr 2009 10:14:15 +0200
Message-ID: <1239696878.711595_at_proxy.dienste.wien.at>
lsllcm wrote:
[...]
> 3.3 to connect to AL32UTF8 db, I insert two rows, one is from
> WE8ISO8859P1 and another is from WE8MSWIN1252. it prints below in web
> page,
> SQL> select dump (c1) from aaa;
>
> DUMP(C1)
> --------------------------------------------------------------------------------
> Typ=1 Len=9: 115,121,115,46,194,133,77,101,100
194,133 is the control character hex 85 from WE8ISO8859P1, converted to UTF-8, right.
> Typ=1 Len=10: 115,121,115,46,226,128,166,77,101,100
And 226,128,166 is the correct UTF-8 representation of the "horizontal ellipsis" character.
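The two dumps can be reproduced without a database. The following is only a sketch: it assumes that Java's ISO-8859-1 and windows-1252 charsets behave like Oracle's WE8ISO8859P1 and WE8MSWIN1252 for the byte hex 85, and the class and method names are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Hex85Demo {
    // Decode one byte in the given source charset, re-encode it as UTF-8,
    // and return the bytes as unsigned values, the way DUMP() prints them.
    static int[] toUtf8(byte b, Charset source) {
        String s = new String(new byte[] { b }, source);
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        int[] unsigned = new int[utf8.length];
        for (int i = 0; i < utf8.length; i++) {
            unsigned[i] = utf8[i] & 0xFF;
        }
        return unsigned;
    }

    public static void main(String[] args) {
        byte hex85 = (byte) 0x85;
        // Java's ISO-8859-1 standing in for WE8ISO8859P1 (assumption)
        System.out.println(Arrays.toString(toUtf8(hex85, StandardCharsets.ISO_8859_1)));
        // Java's windows-1252 standing in for WE8MSWIN1252 (assumption)
        System.out.println(Arrays.toString(toUtf8(hex85, Charset.forName("windows-1252"))));
    }
}
```

This prints [194, 133] for the ISO-8859-1 case and [226, 128, 166] for the windows-1252 case, matching the bytes in the two dumps.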
> print in web page---------------------------------------------
> sys. Med sys....Med
>
> Your suggest is right, we need to make data correct even if it is
> wrong currectly.
I think that we have finally understood each other completely.
What I had been missing out on is that there is some post-processing of the characters to display them in your web application. That is of course also a place where things can go wrong, even if you get the right characters out from the database.
Now that I have seen your Java code, I also understand how
you got the results you displayed:
You were using java.lang.String.getBytes(), which uses the
platform's default character set, in your case WINDOWS-1252.
So the correct result for your test case would be "-123" for the weird character, which is the hex 85 you started with.
I tried with your string and the following Java code on Windows:
    ResultSet rs = stmt.executeQuery("....");
    rs.next();

    // Bytes as the driver returns them, without character conversion
    System.out.println("Using getBytes():");
    byte[] arr = rs.getBytes(1);
    for (int i = 0; i < arr.length; ++i)
        System.out.println("Byte " + i + " = " + arr[i]);

    // Fetch as a Java String, then encode in the platform default charset
    System.out.println("Using getString().getBytes():");
    arr = rs.getString(1).getBytes();
    for (int i = 0; i < arr.length; ++i)
        System.out.println("Byte " + i + " = " + arr[i]);

    // Show which charset String.getBytes() used
    System.out.println(Charset.defaultCharset());
and the result was:
Using getBytes():
Byte 0 = 115
Byte 1 = 121
Byte 2 = 115
Byte 3 = 46
Byte 4 = -30
Byte 5 = -128
Byte 6 = -90
Byte 7 = 77
Byte 8 = 101
Byte 9 = 100
Using getString().getBytes():
Byte 0 = 115
Byte 1 = 121
Byte 2 = 115
Byte 3 = 46
Byte 4 = -123
Byte 5 = 77
Byte 6 = 101
Byte 7 = 100
windows-1252
Right.
The java.sql.ResultSet.getBytes() method did not convert anything and got the bytes in database encoding (AL32UTF8). java.sql.ResultSet.getString() got the characters as a Java String (UTF-16 internally), and java.lang.String.getBytes() then converted that into Windows encoding.
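The second result can be reproduced without JDBC. This is only a sketch: it assumes the ten bytes from the first output are the column's UTF-8 content and names windows-1252 explicitly instead of relying on the platform default. The class and method names are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ReencodeDemo {
    // Decode UTF-8 bytes into a Java String, then encode that String as
    // windows-1252, which is what getString().getBytes() amounts to on a
    // platform whose default charset is windows-1252.
    static byte[] utf8ToWindows1252(byte[] utf8) {
        String s = new String(utf8, StandardCharsets.UTF_8);
        return s.getBytes(Charset.forName("windows-1252"));
    }

    public static void main(String[] args) {
        // "sys." + horizontal ellipsis + "Med" in UTF-8, as signed Java bytes
        byte[] fromDb = { 115, 121, 115, 46, -30, -128, -90, 77, 101, 100 };
        System.out.println(Arrays.toString(utf8ToWindows1252(fromDb)));
    }
}
```

This prints [115, 121, 115, 46, -123, 77, 101, 100]: the three UTF-8 bytes of the ellipsis collapse into the single byte hex 85, shown as -123 because Java bytes are signed.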
That will of course only work if the character exists in Windows encoding. If you have anything else, e.g. a "ř" (Unicode hex 0159, Czech r with hacek), the result would be a question mark because Java cannot convert the character to Windows 1252.
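A small sketch of that failure case, again naming windows-1252 explicitly rather than relying on the platform default. The class and method names are made up for illustration.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

class UnmappableDemo {
    static final Charset WIN1252 = Charset.forName("windows-1252");

    // String.getBytes(Charset) silently substitutes the charset's
    // replacement byte for characters it cannot encode.
    static byte[] encode(String s) {
        return s.getBytes(WIN1252);
    }

    // Ask the encoder up front whether a character is representable.
    static boolean representable(char c) {
        return WIN1252.newEncoder().canEncode(c);
    }

    public static void main(String[] args) {
        String rHacek = "\u0159"; // Czech r with hacek
        System.out.println(Arrays.toString(encode(rHacek)));
        System.out.println(representable('\u0159'));
        System.out.println(representable('\u2026')); // horizontal ellipsis
    }
}
```

This prints [63] (the question mark), then false, then true. The substitution happens without any exception, so a check like canEncode() is one way to notice the loss before it reaches the web page.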
Yours,
Laurenz Albe
Received on Tue Apr 14 2009 - 03:14:15 CDT