Re: Strange Euro sign handling

From: Peter J. Holzer <hjp-usenet2_at_hjp.at>
Date: Sun, 12 Aug 2012 23:58:08 +0200
Message-ID: <slrnk2g9ni.f75.hjp-usenet2_at_hrunkner.hjp.at>



On 2012-08-11 14:49, Franz Kruse <Franz.Kruse_at_t-online.de> wrote:
> maybe one of you can explain a phenomenon we observe:
>
> We are using an Oracle 9i database on a Solaris server as part of a web
> based system with both Perl and Java programs accessing the database.
> The database uses the WEISO8859P1 (Latin-1) character set, in which the
> Euro sign does not exist.
>
> Nevertheless, the system does handle the Euro sign. We can insert Euro
> signs through web forms and get them properly displayed in web pages
> generated from database queries - as long as done through Perl programs.

Let me guess:

  1. The web forms have the content type "text/html; charset=iso-8859-1".
  2. The Perl scripts have NLS_LANG set to WEISO8859P1 and just shuffle byte strings between the database and the browser.

> When doing queries from Java, the situation is different:
>
> - For VARCHAR fields, the ResultSet method getString() yields
> 0x0080 for a Euro sign (which is the Windows-1252 encoding).

For compatibility with Internet Explorer, most browsers assume charset=windows-1252 when they encounter charset=iso-8859-1. Therefore, when a user enters a Euro sign in a form, the browser will transmit it as "%80" (instead of rejecting it). Similarly, when the browser reads the byte 0x80 in a page, it will display a Euro sign (instead of an "unknown character" symbol). So the Perl code will receive a character 0x80 from the form, store it in the database, retrieve it again, send it to the browser and the browser will display a Euro sign. As long as the Perl code doesn't try to do anything with that character it won't notice that it isn't really a Euro sign.

That would probably work in Java, too, if you just passed on that 0x0080 character to the browser and claimed that the encoding is ISO-8859-1. Since you noticed that it isn't working, you are probably doing something different - maybe use utf8 encoding, maybe just some more complex processing.

> - For CLOBs (CLOB object obtained by the OracleResultSet
> method getCLOB()), the result depends on the kind of access:
>
> - When using getAsciiStream() and creating an InputStreamReader
> with encoding ISO-8859-1 from it, we get 0x00FD for a Euro sign.

I assume that 0x00FD is just 0xFFFD truncated to 8 bits.

> - When using getCharacterStream(), we get 0xFFFD (which is the
> Unicode representation of an invalid character).

This is strange. It should also be 0x0080, it that is what is stored in the database (did you check that? - use Perl or the SQL dump() function). It is possible that getCharacterStream() decides that the characters 0x80-0x9F in ISO-8859-1 are "invalid". It is also possible that the database doesn't contain 0x80, because the Java code which inserted the CLOB already decided that this isn't a valid character, but that seems even more unlikely - for starters, what would it contain? There is no way to convert 0xFFFD into ISO-8859-1.

        hp

-- 
   _  | Peter J. Holzer    | Deprecating human carelessness and
|_|_) | Sysadmin WSR       | ignorance has no successful track record.
| |   | hjp_at_hjp.at         | 
__/   | http://www.hjp.at/ |  -- Bill Code on asrg_at_irtf.org
Received on Sun Aug 12 2012 - 16:58:08 CDT

Original text of this message