Re: WE8ISO8859P1 convert to AL32UTF8 unicode character set question

From: lsllcm <lsllcm_at_gmail.com>
Date: Tue, 14 Apr 2009 18:35:00 -0700 (PDT)
Message-ID: <9fa0037e-c52f-494f-95ce-f26ca869a09a_at_b6g2000pre.googlegroups.com>



On Apr 14, 4:14 pm, "Laurenz Albe" <inv..._at_spam.to.invalid> wrote:
> lsllcm wrote:
>
> [...]
>
> > 3.3 Connected to the AL32UTF8 db, I inserted two rows, one from
> > WE8ISO8859P1 and another from WE8MSWIN1252. Below are the dump and
> > what is printed in the web page:
> > SQL> select dump (c1) from aaa;
>
> > DUMP(C1)
> > --------------------------------------------------------------------------------
> > Typ=1 Len=9: 115,121,115,46,194,133,77,101,100
>
> 194,133 is the control character hex 85 from WE8ISO8859P1,
> converted to UTF-8, right.
>
> > Typ=1 Len=10: 115,121,115,46,226,128,166,77,101,100
>
> And 226,128,166 is the correct UTF-8 representation of
> the "horizontal ellipsis" character.
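
For reference, a minimal sketch of how the same source byte (hex 85) ends up as the two different DUMP() results above, depending on which client character set it is interpreted in. The class and method names (DumpBytes, dump) are made up for illustration, and the sketch assumes the JVM ships the "windows-1252" charset:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DumpBytes {
    // Render bytes as unsigned decimals, comma-separated, similar to Oracle's DUMP().
    static String dump(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < bytes.length; i++) {
            if (i > 0) sb.append(',');
            sb.append(bytes[i] & 0xFF);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] raw = {(byte) 0x85};
        // The same byte, decoded under two different client character sets.
        String fromIso = new String(raw, StandardCharsets.ISO_8859_1);      // U+0085, a control character
        String fromWin = new String(raw, Charset.forName("windows-1252"));  // U+2026, horizontal ellipsis
        System.out.println(dump(fromIso.getBytes(StandardCharsets.UTF_8))); // 194,133
        System.out.println(dump(fromWin.getBytes(StandardCharsets.UTF_8))); // 226,128,166
    }
}
```
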
>
> > print in web page---------------------------------------------
> > sys. Med sys....Med
>
> > Your suggestion is right, we need to make the data correct even if
> > it is currently wrong.
>
> I think that we have finally understood each other completely.
>
> What I had been missing out on is that there is some post-processing
> of the characters to display them in your web application.
> That is of course also a place where things can go wrong,
> even if you get the right characters out from the database.
>
> Now that I have seen your Java code, I also understand how
> you got the results you displayed:
> You were using java.lang.String.getBytes(), which uses the
> platform's default character set, in your case WINDOWS-1252.
>
> So the correct result for your test case would be "-123"
> for the weird character, which is the hex 85 you started with.
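
The "-123" is simply hex 85 printed as a signed Java byte. A small sketch of the arithmetic (the class name SignedByteDemo and the helper toUnsigned are hypothetical):

```java
class SignedByteDemo {
    // Java's byte is signed (-128..127); masking with 0xFF gives the 0..255 value.
    static int toUnsigned(byte b) {
        return b & 0xFF;
    }

    public static void main(String[] args) {
        byte b = (byte) 0x85;
        System.out.println(b);                                  // -123
        System.out.println(toUnsigned(b));                      // 133
        System.out.println(Integer.toHexString(toUnsigned(b))); // 85
    }
}
```
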
>
> I tried with your string and the following Java code on Windows:
>
>     ResultSet rs = stmt.executeQuery("....");
>     rs.next();
>     System.out.println("Using getBytes():");
>     byte[] arr = rs.getBytes(1);
>     for (int i=0; i<arr.length; ++i)
>         System.out.println("Byte " + i + " = " + arr[i]);
>     System.out.println("Using getString().getBytes():");
>     arr = rs.getString(1).getBytes();
>     for (int i=0; i<arr.length; ++i)
>         System.out.println("Byte " + i + " = " + arr[i]);
>     System.out.println(Charset.defaultCharset());
>
> and the result was:
>
>     Using getBytes():
>     Byte 0 = 115
>     Byte 1 = 121
>     Byte 2 = 115
>     Byte 3 = 46
>     Byte 4 = -30
>     Byte 5 = -128
>     Byte 6 = -90
>     Byte 7 = 77
>     Byte 8 = 101
>     Byte 9 = 100
>     Using getString().getBytes():
>     Byte 0 = 115
>     Byte 1 = 121
>     Byte 2 = 115
>     Byte 3 = 46
>     Byte 4 = -123
>     Byte 5 = 77
>     Byte 6 = 101
>     Byte 7 = 100
>     windows-1252
>
> Right.
>
> The java.sql.ResultSet.getBytes() method did not convert
> anything and got the bytes in database encoding (AL32UTF8).
> java.sql.ResultSet.getString() got the characters as a Java
> String (UTF-16 internally), and java.lang.String.getBytes()
> converted that into Windows encoding.
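
One way to avoid depending on the platform default is to name the charset explicitly in getBytes(). This is only a sketch: the literal string stands in for rs.getString(1), and the class name ExplicitCharset is invented for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ExplicitCharset {
    static final Charset CP1252 = Charset.forName("windows-1252");

    public static void main(String[] args) {
        // Stand-in for rs.getString(1): "sys." + horizontal ellipsis + "Med".
        String s = "sys.\u2026Med";
        // Explicit UTF-8: same bytes as the database stores in AL32UTF8.
        System.out.println(Arrays.toString(s.getBytes(StandardCharsets.UTF_8)));
        // Explicit windows-1252: same bytes as the platform default produced on Windows.
        System.out.println(Arrays.toString(s.getBytes(CP1252)));
    }
}
```

With the charset spelled out, the result no longer changes when the code moves to a machine with a different default encoding.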
>
> That will of course only work if the character exists
> in Windows encoding. If you have anything else, e.g.
> a "ř" (UNICODE hex 0159, Czech r with hacek), the result
> would be a question mark because Java cannot convert the
> character to Windows 1252.
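
A short sketch of that failure mode, and of one possible way to detect it up front with CharsetEncoder.canEncode(). The class name UnmappableDemo and the helper fits are hypothetical:

```java
import java.nio.charset.Charset;

class UnmappableDemo {
    static final Charset CP1252 = Charset.forName("windows-1252");

    // True if every character of s has a windows-1252 representation.
    static boolean fits(String s) {
        return CP1252.newEncoder().canEncode(s);
    }

    public static void main(String[] args) {
        String s = "\u0159"; // Czech r with hacek, not in windows-1252
        byte[] b = s.getBytes(CP1252);
        // getBytes() silently substitutes the replacement byte 63, which is '?'.
        System.out.println(b.length + " byte(s), first = " + b[0]);
        System.out.println(fits(s));        // false
        System.out.println(fits("\u2026")); // true
    }
}
```
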
>
> Yours,
> Laurenz Albe

Yes, it is clear now.

Thanks
Jacky

Received on Tue Apr 14 2009 - 20:35:00 CDT