Oracle FAQ | Your Portal to the Oracle Knowledge Grid |
Home -> Community -> Usenet -> c.d.o.misc -> Re: JDBC Character Set Problems
"Richard Lister" <pbsis_at_hotmail.com> wrote in message
news:4e0a7502.0108310458.1bbe7ae0_at_posting.google.com...
> We are modifying a system built with the following technologies:
>
> Weblogic 4.5
> Oracle 8
> Java + JDBC
>
> The system is a web based application that stores large text
> announcements in oracle CLOB columns.
> The announcement text is inserted into the CLOB column via JDBC
> through a weblogic connection pool to the Oracle server.
> Currently announcement text only contains ASCII characters (0-127).
>
> We have a requirement to also store the Windows-1252 Euro, Bullet and
> Pound Sign characters in the database. These have character codes 128,
> 149 and 163 respectively.
>
> Unfortunately our database has been configured with an ISO-8859-1
> (Latin 1) character set. The Latin-1 character set does not have any
> characters in the range 128-160. Therefore when we submit a string
> containing the above characters to JDBC to be written to a CLOB
> column, something in the process decides to convert our Euro and
> Bullet characters to code 191 characters (an upside down question
> mark).
>
> I have come to the conclusion it is probably the NLS (national
> language support) functionality that is doing this conversion.
>
> We know that the database can physically store characters with codes
> of 128 and 149 as we can insert them using programs like SQL*Plus,
> TOAD and MS Access. These tools do not cause the characters to be
> converted to code 191.
>
> It does not appear to be an option to change the character set of our
> database as this is a live system and would involve a great deal of
> work.
>
> So my question is: Does anybody know how to get Java / JDBC / Oracle
> to accept characters that are not part of the Database's character set
> without converting them into upside down question marks (code 191).
>
> We either need to disable the translation routine or trick it into
> taking the characters.
>
> I've tried changing the weblogic.properties file to include a
> 'codeset' property in the connection string for cp1252 and cp932 - but
> this had no effect.
>
> Thanks for any advise,
>
> Richard Lister
From Metalink
Doc ID: Note:137127.1
Subject: Character sets, code pages, fonts and the NLS_LANG value
Type: BULLETIN
Status: REVIEWED
Content Type: TEXT/PLAIN
Creation Date: 19-MAR-2001
Last Revision Date: 12-JUL-2001
PURPOSE
This article describes the relation between code pages, font files and Oracle character sets.
SCOPE & APPLICATION
Audience : Oracle users and DBAs Use : Setting the proper character set value used with NLS_LANG
CHARACTER SETS, CODE PAGES, FONTS AND THE NLS_LANG VALUE
Introduction:
People often speak of character sets (as in "Latin 1 character set", or "Simplified Chinese character set") as being the collection of characters
which can be displayed or printed.
However, user's don't view or print characters. Instead they view or print
glyphs (using the same meaning as in "hieroglyphs"). The character "Capital
Letter A" is represented by the glyph "A" in for example Arial.
A typeface is a collection of glyphs that share common characteristics, such as stroke width and the presence or absence of serifs.
A font family is a group of typefaces with similar characteristics. For example, the sans serif typefaces Arial, Arial Bold, Arial Bold Italic and
MS Sans Serif are all part of the Swiss font family.
A font is a collection of glyphs for a certain typeface at one particular
type size, excluding attributes such as bold or italic. When using for example the "Character Set Map" utility on Windows or the "xfontsel" utility in an X Window System environment, you can see which glyphs are part of a certain font. For example, a font for use in Greece will include
glyphs representing the Greek characters.
A character set is only a collection of characters without any glyphs associated with them. Characters are represented by so-called character codes. Character codes are generated and stored when a user inputs data into for example a document. Single-Byte character sets (SBCS) provide 256
character codes. Most SBCS-es are in use for American and European coutries.
For some coutries 8 bits are not sufficient to contain all characters used,
as in for example the Far East. In these countries it can occur that about
12,000 characters may be addressed at any one time. For these countries Multi-Byte character sets (MBCS) are used. When dealing with MBCS-es you encounter both fixed-width and variable-width encodings. Fixed-width encodings use a fixed number of <n> bytes for each stored character, where
n>=2. Variable-width encodings store characters in 1 or more bytes, where
some character can be stored in less bytes than another character of the same character set. For example, Unicode is a 16-bit encoding that encompasses many characters used in general text interchange throughout the world.
A font file contains the definition of a particular font and is used for rendering on a particular device. Examples of font files for the Windows platform are 'ArialBD.ttf' (which implements the Bold representation of the
Arial typeface), and 'WingDing.ttf' (implementing a symbolic character set).
Font files exist for implementing the same font with different rendering characteristics (for example, the Helvetica font in an X Window System environment can be implemented in separate font files for a 75 DPI and 100
DPI display).
Encoding scheme:
Besides the problem of visualizing the characters, there also needs to be
a standard of how to interpret the data stored in documents, databases and
such.
A character set defines characters to be placed in a particular order and
with a particular character code; for example, the capital letter A in the
U.S. ASCII character set has character code 65 and the capital letter S in
the same character set has code 83). The result of assigning a character code to each individual character of a character set is called an encoding
scheme.
Code pages:
In Windows environments you hear a lot about code pages. A code page is simply a particular encoding scheme. Windows code pages are usually defined
to support specific languages or groups of languages which share common writing systems. For example, code page 1253 provides character codes required in the Greek writing system.
A font implements a particular code page or set of code pages. For example,
the Arial font implements the code pages 1252, 1250, 1251, 1253, 1254, 1257 and the Macintosh character set.
Setting the character set in NLS_LANG:
When dealing with an Oracle environment, you normally want to set up
support for a particular character set. You do this by setting the
NLS_LANG
variable to contain a reference to a particular character set to be used
by either the Oracle client or by creating a database using a particular
character set.
The character set chosen merely defines the encoding scheme to be used by the Oracle product. For example, when using WE8ISO8859P1 Oracle character set, you indicate that the character codes will be interpreted using the "ISO/IEC 8859-1:1998 8-bit single-byte coded graphic character sets -- Part 1: Latin alphabet No. 1" encoding scheme.
Especially at the Oracle client side, the encoding scheme used by Oracle must match the encoding scheme used by the I/O system itself. This means that the font definition used by the OS and the character set used by the
Oracle client must use the same character codes for the same characters.
For example, the Windows fonts files for Western Europe implement at least
the code page 1252 (called "Latin 1"). The latest font files contain the Euro glyph with character code 128. Usually, when you want to use the Euro
glyph in an Oracle environment, you want to use the WE8ISO8859P15 character
set. However, in this character set the Euro glyph has character code 164.
This means the encoding schemes of the Windows code page 1252 and Oracle character set WE8ISO8859P15 do not match.
You can best check the definitions of the encoding schemes to clarify what
is meant above. As can be seen on the Microsoft site, the layout for code
page 1252 (which Microsoft Corp. has published on the http://www.microsoft.com/typography/unicode/1252.gif web site) indicates that the Euro symbol can be found at character code 0x80 (= 128 decimal).
However, the WE8ISO8859P15 character set (which implements the ISO Latin 9
standard) uses character code 0xA4 (= 164 decimal) for the same character,
as shown at http://czyborra.com/charsets/iso8859.html#Future.
So for example when on Windows you use the Oracle client character set WE8ISO8859P15 and type <Alt>+0128 to indicate an Euro glyph, this means that you will actually send the character code 128 to the database and not
character code 164.
When the encoding schemes of the font (read: code page) and Oracle character set do not match, an incorrect character code will be sent by the
client to store in the server. This will lead to logical data corruptions.
In this particular example you should use the Oracle character set WE8MSWIN1252 which uses exactly the same encoding scheme as the Windows code page does.
Please note that this has nothing to do with the database character set in
use. When you want to store e.g. the Euro symbol in the database, you can
still set the database character set to WE8ISO8859P15, whereas the client
uses WE8MSWIN1252. The Oracle software will correctly translate any symbols
to the new character code values of the database character set when storing
data into the database or retrieving data from the database.
Example:
The following example describes how incompatibility between the font and Oracle client character set (the NLS_LANG setting) used can cause problems
to occur. The example uses a font implementing the Windows Code Page 1252
and the Oracle client character set WE8ISO8859P15. The database has been created using a character set capable of storing the Euro symbol.
On behalf of the example the Euro symbol needs to be stored in the database. The table below shows some code points and their visible representations for both Windows Code Page 1252 and ISO Latin 9 (Oracle character set WE8ISO8859P15):
Code point Code Page 1252 WE8ISO8859P15 ---------- -------------- ------------- 128 ? <not printable> 164 ¤ ?
When using a font implementing Code Page 1252 and setting the NLS_LANG to use WE8ISO8859P15, you can input the Euro sign in e.g. SQL*Plus by entering <Alt>+0128. However, code point 128 will be used to convert to the database character set. As this character is not recognized by Oracle to indicate the Euro symbol (as WE8ISO8859P15 indicates that the Euro symbol has code point 164), the Euro sign will get lost.
On the other hand, when the Euro symbol has somehow been stored in the database successfully (using code point 164), then querying this character using SQL*Plus on Windows will display the '¤' character (as this is the font glyph corresponding to code point 164).
Conclusion:
You must make sure that the font (read: code page used by the font) always
matches the encoding scheme of the Oracle character set. If not, you will
end up with logically corrupt data.
Glossary:
character: name for a particular location in an encoding scheme (such
as "Capital Letter A") encoding scheme: ordered set of characters glyph: visual representation of a character font: collection of glyphs for implementing a particular character set. It also contains information like the typeface font file: representation of a font with particular characteristics such as weight, slant, size, language information, attachment points for diacritical marks and underline and strikethrough information SBCS: Single-Byte character set; characters in this set are stored using a single byte MBCS: Multi-Byte character set; characters in this set are stored using 1 or more bytes writing system: a set of character representations for a particular country (such as the set of Greek glyphs for Greece)
RELATED DOCUMENTS
Oracle8i National Language Support Guide
http://www.unicode.org http://www.iso.ch http://www.microsoft.com/OpenType/unicode/cscp.htm http://www.microsoft.com/ http://www.microsoft.com/globaldev
Please always type those three digits and two dots (so 5! keystrokes) consituting the version in *any* of your posts
Hth,
Sybrand Bakker, Senior Oracle DBA Received on Fri Aug 31 2001 - 08:28:10 CDT