Re: WE8ISO8859P1 convert to AL32UTF8 unicode character set question

From: lsllcm <lsllcm_at_gmail.com>
Date: Sat, 11 Apr 2009 05:04:32 -0700 (PDT)
Message-ID: <b5e20960-5d3f-40c5-a332-14fa49026770_at_f41g2000pra.googlegroups.com>



On Apr 10, 8:23 pm, "Laurenz Albe" <inv..._at_spam.to.invalid> wrote:
> lsllcm wrote:
> > But a little more complex.
>
> > I use java to read the both WE8ISO8859P1 and WE8MSWIN1252 dbs
>
> > 1 in WE8ISO8859P1 db, I print rs.getBytes("c1") array, the result is
> > as below, so it should be already unicode, and it does not do any
> > conversion.
> > 125
> > 115
> > 121
> > 115
> > 46
> > -123 ===========>> as same as 256-123 = 133
> > 77
> > 101
> > 100
> > 0
>
> rs.getBytes()?? Why would you select a string as Bytes?
> Can you show your code?
>
> I inserted your byte sequence into a WE8ISO8859P1 database,
> and selected it with getString() with JDBC, and I got what I expected:
>
> sys.żMed
>
>
>
>
>
> > 2 in WE8MSWIN1252 db, I print rs.getBytes("c1") array, the result is
> > as below, looks it is converted after it is read.
> > 125
> > 115
> > 121
> > 115
> > 46
> > -65 ===========>> as same as 256-65 = 191
> > 77
> > 101
> > 100
> > 0
>
> > If we use item 1 to convert, they are all wrong, but the UI are same
> > even they are wrong.
> > If we use item 2 to convert, the before is wrong, but after convert,
> > it is correct. but the UI will be different.
>
> I do not understand.
> Your example 2 does not seem to be correct, but I don't know what you did.
>
> > To be consistent, I choose item 1. At least, the data is not lost from
> > UI in both before and after..
>
> > If client cannot afford it, he/she should correct it at very early
> > time.
>
> > This is not one technical question, it is choose question.
>
> As I said, it's a time bomb.
> Choose to either fix it now or maybe blow up later.
>
> For example as soon as the customer wishes to insert a character that
> is not in the Windows character set.
>
> You can choose to ignore it, but I would at least tell my customer
> that there are corrupt data in the database.
>
> Yours,
> Laurenz Albe

Thanks for your comments:

  1. I use rs.getBytes() to get the raw data from the db, before any conversion. My test on WE8MSWIN1252 was not correct: it happened because I used the 10.2.0.1 jdbc driver with a 10.2.0.4 db. After switching to the 10.2.0.4 jdbc driver, it prints the same bytes as are stored in the db.
  2. rs.getString() appears to do some conversion, but I am not clear how it converts between the oracle db and the java client.
  3. Because System.out.println() may also do some conversion, I chose a jsp to test the case.

<-----------------------------------code begin ------------------------------>
        HttpServletRequest h_req = (HttpServletRequest) request;
        HttpServletResponse h_res = (HttpServletResponse) response;
        h_res.setContentType("text/html");
        h_res.setHeader("Pragma", "No-cache");
        h_res.addHeader("Cache-Control", "no-cache");
        h_res.addHeader("Expires", "Thu, 01 Jan 1970 00:00:01 GMT");

        // request and response both use utf-8
        h_req.setCharacterEncoding("utf-8");
        h_res.setCharacterEncoding("utf-8");

        // look up the data source and open a connection
        Context initContext = new InitialContext();
        Context envContext = (Context) initContext.lookup("java:/comp/env");
        DataSource ds = (DataSource) envContext.lookup("jdbc/myoracle");
        Connection conn = ds.getConnection();

        String sql = "SELECT * from aaa ";
        PreparedStatement ps = conn.prepareStatement(sql);
        ResultSet rs = ps.executeQuery();

        byte[] aaa;
        String bb = "";
        while (rs.next())
        {
            aaa = rs.getBytes("c1");   // raw bytes as stored in the db
            bb = rs.getString("c1");   // value after the driver's conversion
            out.println(bb);
        }
<-----------------------------------code end ------------------------------>

3.1 Connecting to the WE8ISO8859P1 db, the web page shows the line below; the hex 85 byte is displayed like a space char.
sys. Med

3.2 Connecting to the WE8MSWIN1252 db, the web page shows the line below; the hex 85 byte is displayed as the correct char.
sys.…Med
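
The difference between 3.1 and 3.2 can be reproduced without a database. The following is only a sketch, assuming the stored byte is 0x85 in both databases; the class name DecodeByte85 and the helper decode() are made up for illustration. It decodes the same byte once as ISO-8859-1 and once as windows-1252.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DecodeByte85 {
    // Decode one raw byte with the given charset and return the Unicode code point.
    public static int decode(int rawByte, Charset cs) {
        String s = new String(new byte[] { (byte) rawByte }, cs);
        return s.codePointAt(0);
    }

    public static void main(String[] args) {
        int asLatin1 = decode(0x85, StandardCharsets.ISO_8859_1);
        int asCp1252 = decode(0x85, Charset.forName("windows-1252"));

        // ISO-8859-1 maps 0x85 to the control character U+0085 (NEL)
        System.out.printf("ISO-8859-1:   U+%04X%n", asLatin1);
        // windows-1252 maps 0x85 to U+2026 (horizontal ellipsis)
        System.out.printf("windows-1252: U+%04X%n", asCp1252);
    }
}
```

This matches what the web page shows: a control character that the browser renders as blank in 3.1, and the ellipsis in 3.2.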

3.3 Connecting to the AL32UTF8 db, I inserted two rows, one from the WE8ISO8859P1 db and another from the WE8MSWIN1252 db.
SQL> select dump (c1) from aaa;

DUMP(C1)

Typ=1 Len=9: 115,121,115,46,194,133,77,101,100
Typ=1 Len=10: 115,121,115,46,226,128,166,77,101,100

print in web page---------------------------------------------
sys. Med
sys.…Med
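
To read the DUMP output, the decimal values can be decoded as UTF-8 and printed as code points. This is a minimal sketch, not part of the original test; the class DumpDecode and its helpers fromDump() and codePoints() are hypothetical names.

```java
import java.nio.charset.StandardCharsets;

public class DumpDecode {
    // Build a String from the unsigned decimal values printed by DUMP(), assuming UTF-8.
    public static String fromDump(int... values) {
        byte[] raw = new byte[values.length];
        for (int i = 0; i < values.length; i++) {
            raw[i] = (byte) values[i];
        }
        return new String(raw, StandardCharsets.UTF_8);
    }

    // Render each code point of the string as U+XXXX.
    public static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String row1 = fromDump(115, 121, 115, 46, 194, 133, 77, 101, 100);
        String row2 = fromDump(115, 121, 115, 46, 226, 128, 166, 77, 101, 100);
        System.out.println(codePoints(row1)); // fifth code point is U+0085
        System.out.println(codePoints(row2)); // fifth code point is U+2026
    }
}
```

So the row that came from the WE8ISO8859P1 db holds the control character U+0085 (bytes 194,133), while the row from the WE8MSWIN1252 db holds the ellipsis U+2026 (bytes 226,128,166).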

Your suggestion is right: we need to correct the data, because it is currently wrong.

4. rs.getString() appears to do some conversion, but I am not clear how it converts between the oracle db and the java client. The document B19306_01/server.102/b14225/ch7progrunicode.htm#sthref924 says:

Data Conversion for Thin Drivers
SQL statements are always converted to either the database character set or to UTF-8 by the driver before they are sent to the database for processing. When the database character set is either US7ASCII or WE8ISO8859P1, the driver converts the SQL statement to the database character set. Otherwise, the driver converts the SQL statement to UTF-8 and notifies the database that a SQL statement requires further conversion before being processed. The database, in turn, converts the SQL statements from UTF-8 to the database character set.

Based on the document, I wrote a java application to verify it:

==============================================Code begin =====================================
      // requires: import java.nio.charset.Charset;
      String bb = "";
      byte[] aaa;
      while (rs.next())
      {
        // value after the jdbc driver's character set conversion
        bb = rs.getString("c1");

        // getBytes() without an argument encodes with the default java character set
        aaa = bb.getBytes();
        for (int mm = 0; mm < aaa.length; mm++)
        {
          System.out.println(aaa[mm]);
        }
      }
      // print default java character set
      System.out.println(Charset.defaultCharset());
===================================code end =============================================

4.1 test on WE8ISO8859P1 db
115
121
115
46
63 ---- looks like it was converted to a US7ASCII char ('?'); verified against convert (c1,'US7ASCII')
77
101
100
--default java character set
windows-1252

4.2 test on WE8MSWIN1252 db.
115
121
115
46
-123 -- does not do any conversion
77
101
100
windows-1252

4.3 test on AL32UTF8 db
--first row
115
121
115
46
63 ---- looks like it was converted to a US7ASCII char ('?'); verified against convert (c1,'US7ASCII')
77
101
100
--second row
115
121
115
46
-123 -- does not do any conversion
77
101
100
15
--default java character set
windows-1252
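
The negative values in these listings appear because a java byte is signed. A small sketch of how the same bytes can be printed as unsigned decimal and hex, which makes them easier to compare with DUMP() output (the class UnsignedBytes and its helpers are illustrative names, not part of the test program):

```java
public class UnsignedBytes {
    // Map a signed java byte (-128..127) to its unsigned value (0..255).
    public static int unsigned(byte b) {
        return b & 0xFF;
    }

    // Two-digit upper-case hex form of the byte.
    public static String hex(byte b) {
        return String.format("%02X", b & 0xFF);
    }

    public static void main(String[] args) {
        byte[] sample = { 46, -123, -65, 77 };
        for (byte b : sample) {
            // e.g. -123 -> 133 (0x85)
            System.out.println(b + " -> " + unsigned(b) + " (0x" + hex(b) + ")");
        }
    }
}
```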

There is something I still cannot understand:

java uses UCS-2 (UTF-16) encoding internally, so one char should take two bytes, but String.getBytes() converts to windows-1252, and the result is one byte printed per char.
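
A sketch of what is probably happening here, assuming the strings returned by getString() contain U+0085 (from the WE8ISO8859P1 data) and U+2026 (from the WE8MSWIN1252 data): String.getBytes() with no argument encodes the internal UTF-16 chars into the default character set, and a char that has no mapping there is replaced by '?' (63). The class GetBytesDemo and its helper encode() are hypothetical names.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

public class GetBytesDemo {
    // Encode a String into the named character set, as getBytes() does
    // with the default character set when called without an argument.
    public static byte[] encode(String s, String charsetName) {
        return s.getBytes(Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        String nel = "\u0085";      // control char, not representable in windows-1252
        String ellipsis = "\u2026"; // horizontal ellipsis, byte 0x85 in windows-1252

        // unmappable char is replaced by '?'
        System.out.println(Arrays.toString(encode(nel, "windows-1252")));      // [63]
        // ellipsis maps to the single byte 0x85, printed as a signed byte
        System.out.println(Arrays.toString(encode(ellipsis, "windows-1252"))); // [-123]
        // the same char in UTF-16 takes two bytes
        System.out.println(Arrays.toString(encode(ellipsis, "UTF-16BE")));     // [32, 38]
    }
}
```

So the two bytes per char only show up when the String is encoded as UTF-16; with windows-1252 as the default character set, getBytes() produces one byte per char.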

I will run the test again and post the results here.

Thanks again for your suggestions --:)

Received on Sat Apr 11 2009 - 07:04:32 CDT
