Oracle FAQ Your Portal to the Oracle Knowledge Grid

Re: Characters beyond ASCII in database ( US7ASCII and UTF8 character set )

From: Jesus (John) Salvo Jr. <john_at_softgame.com.au>
Date: Wed, 05 Dec 2001 05:33:55 GMT
Message-ID: <7liP7.415425$bY5.1747145@news-server.bigpond.net.au>


I just tried exporting that table from the US7ASCII database again, this time with NLS_LANG=English_Australia.UTF8 and NLS_CHAR=UTF8 set ... and then imported that data into the UTF8 database, also with NLS_LANG=English_Australia.UTF8 and NLS_CHAR=UTF8 set during the import.

That non-ASCII character, the e with an acute, became the letter I.
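One plausible explanation for the "I" (an assumption on my part, not something I have confirmed in the Oracle docs): when the conversion routines treat data as US7ASCII, bytes outside the 7-bit range reportedly have their high bit stripped, and 201 with the high bit dropped is 73, which is ASCII "I". A quick sketch in Python:

```python
# Hypothetical illustration of high-bit stripping during a US7ASCII
# conversion: the Latin-1 byte for e-acute (201) loses its 8th bit.
stored_byte = 201               # e with acute in the Latin-1 range
stripped = stored_byte & 0x7F   # drop the high bit: 201 & 127 = 73
print(chr(stripped))            # -> I
```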

So my alternative of exporting the US7ASCII data as UTF8 will not work. I tried exporting the US7ASCII data as US7ASCII and then importing it as UTF8 but it does not allow me to ( I got some IMP error ).
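For what it's worth, the two byte sequences from the DUMPs quoted below behave very differently when treated as UTF8. A quick sketch in Python (outside Oracle, so only illustrative):

```python
# Bytes as stored in the US7ASCII database (from DUMP): D,e-acute,V,I,L,1,3
stored = bytes([68, 201, 86, 73, 76, 49, 51])
try:
    stored.decode('utf-8')
except UnicodeDecodeError:
    # 0xC9 (201) starts a 2-byte UTF-8 sequence, but 0x56 (86, 'V')
    # is not a valid continuation byte -- hence failUTF8Conv.
    print('not valid UTF-8')

# The proper UTF-8 encoding of e-acute really is the two bytes 195,137:
print(list('\u00c9'.encode('utf-8')))   # -> [195, 137]

# Reinterpreting the stored bytes as Latin-1 recovers the intended string:
print(stored.decode('latin-1'))         # -> DÉVIL13
```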

Any other ideas?

Going back to my original questions:

  1. Shouldn't Oracle have complained in the first place when someone tried to insert a character beyond the ASCII range while its DB character set was US7ASCII?
  2. Does anybody know of any way to look for non-ASCII characters in a specified set of tables ... and then convert them into the proper UTF8 encoding?
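In case it helps with question 2, here is the kind of scan-and-convert step I have in mind, sketched in Python. Both helper names are made up, and it rests on two assumptions: (a) the stray bytes really are Latin-1 / WE8ISO8859P1, and (b) the column values can be fetched as raw bytes (e.g. via getBytes() in JDBC) rather than as already-converted strings:

```python
def has_non_ascii(raw: bytes) -> bool:
    """True if any byte falls outside the 7-bit US7ASCII range."""
    return any(b > 127 for b in raw)

def to_utf8(raw: bytes) -> bytes:
    """Reinterpret the stored bytes as Latin-1 and re-encode as UTF-8."""
    return raw.decode('latin-1').encode('utf-8')

# DUMP output for the problem row:
row = bytes([68, 201, 86, 73, 76, 49, 51])
if has_non_ascii(row):
    fixed = to_utf8(row)
    print(list(fixed))   # 201 becomes the two-byte sequence 195,137
```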

"Jesus (John) Salvo Jr." <john_at_softgame.com.au> wrote in message news:TPhP7.415385$bY5.1746486_at_news-server.bigpond.net.au...
>
> Oracle 8.1.7.x.x on Solaris 8.
>
> I am currently testing a conversion of an instance from the US7ASCII
> character set to a UTF8 character set. The client applications are all Java
> ( which by default uses UCS2 ) via the JDBC 1.2 thin drivers.
>
> What I did was I exported the data from the production database using the
> US7ASCII char set and imported the data into a new database also using
> US7ASCII, thereby having a duplicate of the existing production database. I
> then changed the character set of the database where I imported the data
> into UTF8, also changed the national character set, as stated in the Oracle
> docs.
>
> So far, the conversion of the character set was fine.
>
> However ....
>
> ... just today, after trying to find out why I was getting a
> "java.sql.SQLException: Fail to convert between UTF8 and UCS2: failUTF8Conv"
> for one particular row, I narrowed it down to a single row where one column
> had a single character beyond the ASCII set ( an e with an acute on top ).
> It was a plain VARCHAR2(15) column. ( These messages and data come from SMS
> messages ).
>
> I looked at the production database ( US7ASCII ) to see if it had been like
> that before, and true enough, it was.
>
>
> Here is how it looks in the US7ASCII database:
>
> SQL> ;
> 1* select nickname, length(nickname), dump(nickname) from player where
> player_id = 48589
> SQL> /
>
> NICKNAME LENGTH(NICKNAME) DUMP(NICKNAME)
> --------------- ---------------- ----------------------------------------
> DÉVIL13 7 Typ=1 Len=7: 68,201,86,73,76,49,51
>
>
>
> Here is how it looks in the UTF8 database:
>
> SQL> column dump(nickname) format a40
> SQL> select nickname, length(nickname), dump(nickname) from player where
> player_id = 48589
> 2
> SQL> /
>
> NICKNAME LENGTH(NICKNAME) DUMP(NICKNAME)
> --------------- ---------------- ----------------------------------------
> D?IL13 6 Typ=1 Len=7: 68,201,86,73,76,49,51
>
>
>
> As you can see, they both have the same byte sequence, but Oracle treats
> them as different strings ( one is 7 characters, the other is 6
> characters ).
>
> Of course, the original problem is that the non-ASCII character got inserted
> into the database in the first place, when the character set was still
> US7ASCII. The JDBC driver then receives ( 201 ) and ( 86 ) as a single
> character, thereby failing because it is not a valid UTF8 character(!?).
>
> If I then try to insert the same character into the UTF8 database, the byte
> sequence that I get is:
>
> SQL> ;
> 1* select text, length(text), dump(text) from messages where message_id =
> 6023026
> SQL> /
>
> TEXT LENGTH(TEXT) DUMP(TEXT)
> -------------------- ------------ --------------------
> E 1 Typ=1 Len=2: 195,137
>
> .... which is how the character would be saved in the database IF the DB
> character set was already UTF8. Retrieving this row from the database using
> JDBC does NOT throw the exception noted above.
>
> So my questions are:
>
> 1) Shouldn't Oracle have complained in the first place when someone was
> trying to insert a character beyond the ASCII when its DB character set is
> US7ASCII?
>
> 2) Does anybody know of any way to look for non-ASCII characters for a
> specified set of tables .... and then convert them into the proper UTF8
> encoding??
>
>
> I could have created a separate UTF8 database from the outset and then
> imported the data exported from the US7ASCII database, and probably then the
> import would have converted that non-ASCII character to the proper UTF8
> encoding ... would it?
>
> Anyway, I did what I did above ( i.e.: create a separate US7ASCII db,
> import, then convert to UTF8 ) as an exercise on what will be done later on
> to the production database ... and to minimise the downtime required.
>
>
> Regards,
>
>
>
>
>
Received on Tue Dec 04 2001 - 23:33:55 CST
