Oracle FAQ Your Portal to the Oracle Knowledge Grid
HOME | ASK QUESTION | ADD INFO | SEARCH | E-MAIL US
 

Home -> Community -> Usenet -> c.d.o.server -> Characters beyond ASCII in database ( US7ASCII and UTF8 character set )

Characters beyond ASCII in database ( US7ASCII and UTF8 character set )

From: Jesus \(John\) Salvo Jr. <john_at_softgame.com.au>
Date: Wed, 05 Dec 2001 04:58:27 GMT
Message-ID: <TPhP7.415385$bY5.1746486@news-server.bigpond.net.au>

Oracle 8.1.7.x.x on Solaris 8.

I am currently testing a conversion of an instance from the US7ASCII character set to a UTF8 character set. The client applications are all Java ( which by default uses UCS2 ) via the JDBC 1.2 thin drivers.

What I did was I exported the data from the production database using the US7ASCII char set and imported the data into a new database also using US7ASCII, thereby having a duplicate of the existing production database. I then changed the character set of the database where I imported the data into UTF8, also changed the national character set, as stated in the Oracle docs.

So far, the conversion of the character set was fine.

However ....

... just today, after tyring to find out why I was getting a "java.sql.SQLException: Fail to convert between UTF8 and UCS2: failUTF8Conv" for one particular row, I narrowed it down to a single row where one column had a single character beyond the ASCII set ( an e with an acute on top ). It was a plain VARCHAR2(15) column. ( These messages and data comes from SMS messages ).

I looked at the production database ( US7ASCII ) to see if it had been like that before, and true enough, it was.

Here is how it looks in the US7ASCII database:

SQL> ;
  1* select nickname, length(nickname), dump(nickname) from player where player_id = 48589
SQL> / NICKNAME LENGTH(NICKNAME) DUMP(NICKNAME)
--------------- ---------------- ----------------------------------------

DÉVIL13                        7 Typ=1 Len=7: 68,201,86,73,76,49,51



Here is how it looks in the UTF8 database:

SQL> column dump(nickname) format a40
SQL> select nickname, length(nickname), dump(nickname) from player where player_id = 48589
  2
SQL> / NICKNAME LENGTH(NICKNAME) DUMP(NICKNAME)
--------------- ---------------- ----------------------------------------

D?IL13                         6 Typ=1 Len=7: 68,201,86,73,76,49,51



As you can see, they both have the same byte sequence, but Oracle treats them as different strings ( one is 7 characters, the other is 6 characters ).

Of course, the original problem is that the non-ASCII character got inserted into the database in the first place, when the character set was still US7ASCII. The JDBC driver then receives ( 201 ) and ( 86 ) as a single character, thereby failing because it is not a valid UTF8 character(!?).

If I then try to insert the same character into the UTF8 database, the byte sequence that I get is:

SQL> ;
  1* select text, length(text), dump(text) from messages where message_id = 6023026
SQL> /

TEXT                 LENGTH(TEXT) DUMP(TEXT)
-------------------- ------------ --------------------
E                               1 Typ=1 Len=2: 195,137

.... which is how the character would be saved in the database IF the DB character was already UTF8. Retrieving this row from the database using JDBC does NOT throw the exception noted abover.

So my questions are:

  1. Shouldn't Oracle have complained in the first place when someone was trying to insert a character beyond the ASCII when its DB character set is US7ASCII?
  2. Does anybody know of any way to look for non-ASCII characters for a specified set of tables .... and then convert them into the proper UTF8 encoding??

I could have created a separate UTF8 database from the onset and then import the data exported from the US7ASCII database, and probably then the import would have converted that non-ASCII character set to the proper UTF8 encoding ... would it?

Anyway, I did what I did above ( i.e.: create a separate US7ASCII db, import, then convert to UTF8 ) as an exercise on what will be done later on to the production database ... and to minimise the downtime required.

Regards, Received on Tue Dec 04 2001 - 22:58:27 CST

Original text of this message

HOME | ASK QUESTION | ADD INFO | SEARCH | E-MAIL US