Re: WE8ISO8859P1 convert to AL32UTF8 unicode character set question

From: Laurenz Albe <invite_at_spam.to.invalid>
Date: Tue, 14 Apr 2009 10:14:15 +0200
Message-ID: <1239696878.711595_at_proxy.dienste.wien.at>



lsllcm wrote:
[...]
> 3.3 to connect to AL32UTF8 db, I insert two rows, one is from
> WE8ISO8859P1 and another is from WE8MSWIN1252. it prints below in web
> page,
> SQL> select dump (c1) from aaa;
>
> DUMP(C1)
> --------------------------------------------------------------------------------
> Typ=1 Len=9: 115,121,115,46,194,133,77,101,100

194,133 is the control character hex 85 from WE8ISO8859P1, converted to UTF-8, right.

> Typ=1 Len=10: 115,121,115,46,226,128,166,77,101,100

And 226,128,166 is the correct UTF-8 representation of the "horizontal ellipsis" character.
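If you want to check those two byte sequences without a database, here is a minimal sketch in plain Java. The class name Utf8Dump and the helper dump() are made up for illustration; I am assuming that byte hex 85 in ISO-8859-1 corresponds to Unicode U+0085, which is how Java's ISO-8859-1 decoder treats it.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Dump {
    // Hypothetical helper: UTF-8 bytes of s as unsigned decimals,
    // formatted like the output of Oracle's DUMP().
    static String dump(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < bytes.length; i++) {
            if (i > 0) sb.append(',');
            sb.append(bytes[i] & 0xFF); // mask to get 0..255 instead of a signed byte
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Byte hex 85 read as ISO-8859-1 is the control character U+0085.
        String fromLatin1 = new String(new byte[] {(byte) 0x85}, StandardCharsets.ISO_8859_1);
        System.out.println(dump("sys." + fromLatin1 + "Med"));
        // U+2026 is the horizontal ellipsis.
        System.out.println(dump("sys." + (char) 0x2026 + "Med"));
    }
}
```

The first line printed should end in 194,133 before "Med" (77,101,100), the second in 226,128,166.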

> print in web page---------------------------------------------
> sys. Med sys....Med
>
> Your suggest is right, we need to make data correct even if it is
> wrong currectly.

I think that we have finally understood each other completely.

What I had been missing is that your web application post-processes the characters before displaying them. That is of course another place where things can go wrong, even if you get the right characters out of the database.

Now that I have seen your Java code, I also understand how you got the results you displayed:
You were using java.lang.String.getBytes(), which uses the platform's default character set, in your case WINDOWS-1252.

So the correct result for your test case would be "-123" for the weird character, which is the hex 85 you started with.

I tried with your string and the following Java code on Windows:

    ResultSet rs = stmt.executeQuery("....");
    rs.next();

    // raw bytes as delivered by the driver
    System.out.println("Using getBytes():");
    byte[] arr = rs.getBytes(1);
    for (int i=0; i<arr.length; ++i)
        System.out.println("Byte " + i + " = " + arr[i]);

    // string re-encoded in the platform's default character set
    System.out.println("Using getString().getBytes():");
    arr = rs.getString(1).getBytes();
    for (int i=0; i<arr.length; ++i)
        System.out.println("Byte " + i + " = " + arr[i]);

    System.out.println(Charset.defaultCharset());

and the result was:

    Using getBytes():

    Byte 0 = 115
    Byte 1 = 121
    Byte 2 = 115
    Byte 3 = 46
    Byte 4 = -30
    Byte 5 = -128
    Byte 6 = -90
    Byte 7 = 77
    Byte 8 = 101
    Byte 9 = 100

    Using getString().getBytes():
    Byte 0 = 115
    Byte 1 = 121
    Byte 2 = 115
    Byte 3 = 46
    Byte 4 = -123
    Byte 5 = 77
    Byte 6 = 101
    Byte 7 = 100

    windows-1252

Right.

The java.sql.ResultSet.getBytes() method did not convert anything and returned the bytes in database encoding (AL32UTF8). java.sql.ResultSet.getString() returned the characters in Java's internal UTF-16 encoding, and java.lang.String.getBytes() then converted them to the Windows encoding.
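The same two steps can be reproduced without JDBC. This is only a sketch under assumptions: the class GetBytesSketch and the method viaString() are invented names, the byte array stands in for what the driver delivers, and I name the target character set explicitly instead of relying on the platform default so that the result is the same on every machine.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class GetBytesSketch {
    // Stand-in for the raw AL32UTF8 bytes of the test string "sys.<ellipsis>Med".
    static final byte[] RAW = {115, 121, 115, 46, (byte) 226, (byte) 128, (byte) 166, 77, 101, 100};

    // Hypothetical helper: decode UTF-8 bytes into a Java String,
    // then encode that String in the target character set.
    static byte[] viaString(byte[] utf8, Charset target) {
        String s = new String(utf8, StandardCharsets.UTF_8);
        return s.getBytes(target);
    }

    public static void main(String[] args) {
        byte[] arr = viaString(RAW, Charset.forName("windows-1252"));
        for (int i = 0; i < arr.length; ++i) {
            // Java bytes are signed, so hex 85 (133) is printed as -123.
            System.out.println("Byte " + i + " = " + arr[i] + " (unsigned " + (arr[i] & 0xFF) + ")");
        }
    }
}
```

Passing the character set explicitly, as in getBytes(target), also avoids the surprise that the argument-less getBytes() gives different results on different platforms.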

That will of course only work if the character exists in Windows encoding. If you have anything else, e.g. a "ř" (Unicode hex 0159, Czech r with hacek), the result would be a question mark because Java cannot convert the character to Windows-1252.
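A small sketch of that failure case, again with invented names (UnmappableSketch, fits(), encode()) and assuming your JVM provides the windows-1252 character set. CharsetEncoder.canEncode() lets you test a character before the silent replacement happens.

```java
import java.nio.charset.Charset;

public class UnmappableSketch {
    static final Charset CP1252 = Charset.forName("windows-1252");

    // True if the character has a representation in Windows-1252.
    static boolean fits(char c) {
        return CP1252.newEncoder().canEncode(c);
    }

    // String.getBytes(Charset) silently substitutes the charset's
    // replacement byte for characters it cannot encode.
    static byte[] encode(String s) {
        return s.getBytes(CP1252);
    }

    public static void main(String[] args) {
        System.out.println("ellipsis fits: " + fits('\u2026'));
        System.out.println("r with hacek fits: " + fits('\u0159'));
        System.out.println("r with hacek becomes byte " + encode("\u0159")[0]); // 63 is '?'
    }
}
```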

Yours,
Laurenz Albe

Received on Tue Apr 14 2009 - 03:14:15 CDT