Re: JDBC Character Set Problems

From: Sybrand Bakker <postbus_at_sybrandb.demon.nl>
Date: Fri, 31 Aug 2001 15:28:10 +0200
Message-ID: <tovvg2ka2kh3c1@news.demon.nl>

"Richard Lister" <pbsis_at_hotmail.com> wrote in message news:4e0a7502.0108310458.1bbe7ae0_at_posting.google.com...
> We are modifying a system built with the following technologies:
>
> Weblogic 4.5
> Oracle 8
> Java + JDBC
>
> The system is a web based application that stores large text
> announcements in oracle CLOB columns.
> The announcement text is inserted into the CLOB column via JDBC
> through a weblogic connection pool to the Oracle server.
> Currently announcement text only contains ASCII characters (0-127).
>
> We have a requirement to also store the Windows-1252 Euro, Bullet and
> Pound Sign characters in the database. These have character codes 128,
> 149 and 163 respectively.
>
> Unfortunately our database has been configured with an ISO-8859-1
> (Latin 1) character set. The Latin-1 character set does not have any
> characters in the range 128-160. Therefore when we submit a string
> containing the above characters to JDBC to be written to a CLOB
> column, something in the process decides to convert our Euro and
> Bullet characters to code 191 characters (an upside down question
> mark).
>
> I have come to the conclusion it is probably the NLS (national
> language support) functionality that is doing this conversion.
>
> We know that the database can physically store characters with codes
> of 128 and 149 as we can insert them using programs like SQL*Plus,
> TOAD and MS Access. These tools do not cause the characters to be
> converted to code 191.
>
> It does not appear to be an option to change the character set of our
> database as this is a live system and would involve a great deal of
> work.
>
> So my question is: Does anybody know how to get Java / JDBC / Oracle
> to accept characters that are not part of the Database's character set
> without converting them into upside down question marks (code 191).
>
> We either need to disable the translation routine or trick it into
> taking the characters.
>
> I've tried changing the weblogic.properties file to include a
> 'codeset' property in the connection string for cp1252 and cp932 - but
> this had no effect.
>
> Thanks for any advise,
>
> Richard Lister

From Metalink

Doc ID: Note:137127.1
Subject: Character sets, code pages, fonts and the NLS_LANG value Type: BULLETIN
Status: REVIEWED
Content Type: TEXT/PLAIN
Creation Date: 19-MAR-2001
Last Revision Date: 12-JUL-2001

This article is being delivered in Draft form and may contain errors. Please use the MetaLink "Feedback" button to advise Oracle of any issues related to this article.

PURPOSE

This article describes the relation between code pages, font files and Oracle character sets.

SCOPE & APPLICATION

    Audience         : Oracle users and DBAs
    Use              : Setting the proper character set value used with
                       NLS_LANG

Level of detail : medium
Limitation on use: none

CHARACTER SETS, CODE PAGES, FONTS AND THE NLS_LANG VALUE

Introduction:

People often speak of character sets (as in "Latin 1 character set", or "Simplified Chinese character set") as being the collection of characters

which can be displayed or printed.

However, user's don't view or print characters. Instead they view or print

glyphs (using the same meaning as in "hieroglyphs"). The character "Capital

Letter A" is represented by the glyph "A" in for example Arial.

A typeface is a collection of glyphs that share common characteristics, such as stroke width and the presence or absence of serifs.

A font family is a group of typefaces with similar characteristics. For example, the sans serif typefaces Arial, Arial Bold, Arial Bold Italic and

MS Sans Serif are all part of the Swiss font family.

A font is a collection of glyphs for a certain typeface at one particular

type size, excluding attributes such as bold or italic. When using for example the "Character Set Map" utility on Windows or the "xfontsel" utility in an X Window System environment, you can see which glyphs are part of a certain font. For example, a font for use in Greece will include

glyphs representing the Greek characters.

A character set is only a collection of characters without any glyphs associated with them. Characters are represented by so-called character codes. Character codes are generated and stored when a user inputs data into for example a document. Single-Byte character sets (SBCS) provide 256

character codes. Most SBCS-es are in use for American and European coutries.

For some coutries 8 bits are not sufficient to contain all characters used,

as in for example the Far East. In these countries it can occur that about

12,000 characters may be addressed at any one time. For these countries Multi-Byte character sets (MBCS) are used. When dealing with MBCS-es you encounter both fixed-width and variable-width encodings. Fixed-width encodings use a fixed number of <n> bytes for each stored character, where

n>=2. Variable-width encodings store characters in 1 or more bytes, where

some character can be stored in less bytes than another character of the same character set. For example, Unicode is a 16-bit encoding that encompasses many characters used in general text interchange throughout the world.

A font file contains the definition of a particular font and is used for rendering on a particular device. Examples of font files for the Windows platform are 'ArialBD.ttf' (which implements the Bold representation of the

Arial typeface), and 'WingDing.ttf' (implementing a symbolic character set).

Font files exist for implementing the same font with different rendering characteristics (for example, the Helvetica font in an X Window System environment can be implemented in separate font files for a 75 DPI and 100

DPI display).

Encoding scheme:

Besides the problem of visualizing the characters, there also needs to be

a standard of how to interpret the data stored in documents, databases and

such.

A character set defines characters to be placed in a particular order and

with a particular character code; for example, the capital letter A in the

U.S. ASCII character set has character code 65 and the capital letter S in

the same character set has code 83). The result of assigning a character code to each individual character of a character set is called an encoding

scheme.

Code pages:

In Windows environments you hear a lot about code pages. A code page is simply a particular encoding scheme. Windows code pages are usually defined

to support specific languages or groups of languages which share common writing systems. For example, code page 1253 provides character codes required in the Greek writing system.

A font implements a particular code page or set of code pages. For example,

the Arial font implements the code pages 1252, 1250, 1251, 1253, 1254, 1257 and the Macintosh character set.

Setting the character set in NLS_LANG:

When dealing with an Oracle environment, you normally want to set up support for a particular character set. You do this by setting the NLS_LANG
variable to contain a reference to a particular character set to be used by either the Oracle client or by creating a database using a particular character set.

The character set chosen merely defines the encoding scheme to be used by the Oracle product. For example, when using WE8ISO8859P1 Oracle character set, you indicate that the character codes will be interpreted using the "ISO/IEC 8859-1:1998 8-bit single-byte coded graphic character sets -- Part 1: Latin alphabet No. 1" encoding scheme.

Especially at the Oracle client side, the encoding scheme used by Oracle must match the encoding scheme used by the I/O system itself. This means that the font definition used by the OS and the character set used by the

Oracle client must use the same character codes for the same characters.

For example, the Windows fonts files for Western Europe implement at least

the code page 1252 (called "Latin 1"). The latest font files contain the Euro glyph with character code 128. Usually, when you want to use the Euro

glyph in an Oracle environment, you want to use the WE8ISO8859P15 character

set. However, in this character set the Euro glyph has character code 164.

This means the encoding schemes of the Windows code page 1252 and Oracle character set WE8ISO8859P15 do not match.

You can best check the definitions of the encoding schemes to clarify what

is meant above. As can be seen on the Microsoft site, the layout for code

page 1252 (which Microsoft Corp. has published on the http://www.microsoft.com/typography/unicode/1252.gif web site) indicates that the Euro symbol can be found at character code 0x80 (= 128 decimal).

However, the WE8ISO8859P15 character set (which implements the ISO Latin 9

standard) uses character code 0xA4 (= 164 decimal) for the same character,

as shown at http://czyborra.com/charsets/iso8859.html#Future.

So for example when on Windows you use the Oracle client character set WE8ISO8859P15 and type <Alt>+0128 to indicate an Euro glyph, this means that you will actually send the character code 128 to the database and not

character code 164.

When the encoding schemes of the font (read: code page) and Oracle character set do not match, an incorrect character code will be sent by the

client to store in the server. This will lead to logical data corruptions.

In this particular example you should use the Oracle character set WE8MSWIN1252 which uses exactly the same encoding scheme as the Windows code page does.

Please note that this has nothing to do with the database character set in

use. When you want to store e.g. the Euro symbol in the database, you can

still set the database character set to WE8ISO8859P15, whereas the client

uses WE8MSWIN1252. The Oracle software will correctly translate any symbols

to the new character code values of the database character set when storing

data into the database or retrieving data from the database.

Example:

The following example describes how incompatibility between the font and Oracle client character set (the NLS_LANG setting) used can cause problems

to occur. The example uses a font implementing the Windows Code Page 1252

and the Oracle client character set WE8ISO8859P15. The database has been created using a character set capable of storing the Euro symbol.

On behalf of the example the Euro symbol needs to be stored in the database. The table below shows some code points and their visible representations for both Windows Code Page 1252 and ISO Latin 9 (Oracle character set WE8ISO8859P15):

    Code point        Code Page 1252    WE8ISO8859P15
    ----------        --------------    -------------
    128               ?                 <not printable>
    164               ¤                 ?

When using a font implementing Code Page 1252 and setting the NLS_LANG to use WE8ISO8859P15, you can input the Euro sign in e.g. SQL*Plus by entering <Alt>+0128. However, code point 128 will be used to convert to the database character set. As this character is not recognized by Oracle to indicate the Euro symbol (as WE8ISO8859P15 indicates that the Euro symbol has code point 164), the Euro sign will get lost.

On the other hand, when the Euro symbol has somehow been stored in the database successfully (using code point 164), then querying this character using SQL*Plus on Windows will display the '¤' character (as this is the font glyph corresponding to code point 164).

Conclusion:

You must make sure that the font (read: code page used by the font) always

matches the encoding scheme of the Oracle character set. If not, you will

end up with logically corrupt data.

Glossary:

character: name for a particular location in an encoding scheme (such

                     as "Capital Letter A")
    encoding scheme: ordered set of characters
    glyph:           visual representation of a character
    font:            collection of glyphs for implementing a particular
                     character set. It also contains information like the
                     typeface
    font file:       representation of a font with particular
characteristics
                     such as weight, slant, size, language information,
                     attachment points for diacritical marks and underline
and
                     strikethrough information
    SBCS:            Single-Byte character set; characters in this set are
                     stored using a single byte
    MBCS:            Multi-Byte character set; characters in this set are
                     stored using 1 or more bytes
    writing system:  a set of character representations for a particular
                     country (such as the set of Greek glyphs for Greece)

RELATED DOCUMENTS

Oracle8i National Language Support Guide

http://www.unicode.org
http://www.iso.ch
http://www.microsoft.com/OpenType/unicode/cscp.htm
http://www.microsoft.com/
http://www.microsoft.com/globaldev

Please always type those three digits and two dots (so 5! keystrokes) consituting the version in *any* of your posts

Hth,

Sybrand Bakker, Senior Oracle DBA Received on Fri Aug 31 2001 - 08:28:10 CDT