Oracle FAQ Your Portal to the Oracle Knowledge Grid
HOME | ASK QUESTION | ADD INFO | SEARCH | E-MAIL US
 

Home -> Community -> Usenet -> c.d.o.misc -> Re: JDBC Character Set Problems

Re: JDBC Character Set Problems

From: Sybrand Bakker <postbus_at_sybrandb.demon.nl>
Date: Fri, 31 Aug 2001 15:28:10 +0200
Message-ID: <tovvg2ka2kh3c1@news.demon.nl>

"Richard Lister" <pbsis_at_hotmail.com> wrote in message news:4e0a7502.0108310458.1bbe7ae0_at_posting.google.com...
> We are modifying a system built with the following technologies:
>
> Weblogic 4.5
> Oracle 8
> Java + JDBC
>
> The system is a web based application that stores large text
> announcements in oracle CLOB columns.
> The announcement text is inserted into the CLOB column via JDBC
> through a weblogic connection pool to the Oracle server.
> Currently announcement text only contains ASCII characters (0-127).
>
> We have a requirement to also store the Windows-1252 Euro, Bullet and
> Pound Sign characters in the database. These have character codes 128,
> 149 and 163 respectively.
>
> Unfortunately our database has been configured with an ISO-8859-1
> (Latin 1) character set. The Latin-1 character set does not have any
> characters in the range 128-160. Therefore when we submit a string
> containing the above characters to JDBC to be written to a CLOB
> column, something in the process decides to convert our Euro and
> Bullet characters to code 191 characters (an upside down question
> mark).
>
> I have come to the conclusion it is probably the NLS (national
> language support) functionality that is doing this conversion.
>
> We know that the database can physically store characters with codes
> of 128 and 149 as we can insert them using programs like SQL*Plus,
> TOAD and MS Access. These tools do not cause the characters to be
> converted to code 191.
>
> It does not appear to be an option to change the character set of our
> database as this is a live system and would involve a great deal of
> work.
>
> So my question is: Does anybody know how to get Java / JDBC / Oracle
> to accept characters that are not part of the Database's character set
> without converting them into upside down question marks (code 191).
>
> We either need to disable the translation routine or trick it into
> taking the characters.
>
> I've tried changing the weblogic.properties file to include a
> 'codeset' property in the connection string for cp1252 and cp932 - but
> this had no effect.
>
> Thanks for any advise,
>
> Richard Lister

From Metalink

Doc ID: Note:137127.1
Subject: Character sets, code pages, fonts and the NLS_LANG value Type: BULLETIN
Status: REVIEWED
 Content Type: TEXT/PLAIN
Creation Date: 19-MAR-2001
Last Revision Date: 12-JUL-2001



This article is being delivered in Draft form and may contain errors. Please use the MetaLink "Feedback" button to advise Oracle of any issues related to this article.

PURPOSE


    This article describes the relation between code pages, font files and     Oracle character sets.

SCOPE & APPLICATION


    Audience         : Oracle users and DBAs
    Use              : Setting the proper character set value used with
                       NLS_LANG

    Level of detail : medium
    Limitation on use: none

CHARACTER SETS, CODE PAGES, FONTS AND THE NLS_LANG VALUE


Introduction:


    People often speak of character sets (as in "Latin 1 character set", or     "Simplified Chinese character set") as being the collection of characters

    which can be displayed or printed.

    However, user's don't view or print characters. Instead they view or print

    glyphs (using the same meaning as in "hieroglyphs"). The character "Capital

    Letter A" is represented by the glyph "A" in for example Arial.

    A typeface is a collection of glyphs that share common characteristics,     such as stroke width and the presence or absence of serifs.

    A font family is a group of typefaces with similar characteristics. For     example, the sans serif typefaces Arial, Arial Bold, Arial Bold Italic and

    MS Sans Serif are all part of the Swiss font family.

    A font is a collection of glyphs for a certain typeface at one particular

    type size, excluding attributes such as bold or italic. When using for     example the "Character Set Map" utility on Windows or the "xfontsel"     utility in an X Window System environment, you can see which glyphs are     part of a certain font. For example, a font for use in Greece will include

    glyphs representing the Greek characters.

    A character set is only a collection of characters without any glyphs     associated with them. Characters are represented by so-called character     codes. Character codes are generated and stored when a user inputs data     into for example a document. Single-Byte character sets (SBCS) provide 256

    character codes. Most SBCS-es are in use for American and European coutries.

    For some coutries 8 bits are not sufficient to contain all characters used,

    as in for example the Far East. In these countries it can occur that about

    12,000 characters may be addressed at any one time. For these countries     Multi-Byte character sets (MBCS) are used. When dealing with MBCS-es you     encounter both fixed-width and variable-width encodings. Fixed-width     encodings use a fixed number of <n> bytes for each stored character, where

    n>=2. Variable-width encodings store characters in 1 or more bytes, where

    some character can be stored in less bytes than another character of the     same character set. For example, Unicode is a 16-bit encoding that     encompasses many characters used in general text interchange throughout     the world.

    A font file contains the definition of a particular font and is used for     rendering on a particular device. Examples of font files for the Windows     platform are 'ArialBD.ttf' (which implements the Bold representation of the

    Arial typeface), and 'WingDing.ttf' (implementing a symbolic character set).

    Font files exist for implementing the same font with different rendering     characteristics (for example, the Helvetica font in an X Window System     environment can be implemented in separate font files for a 75 DPI and 100

    DPI display).

Encoding scheme:


    Besides the problem of visualizing the characters, there also needs to be

    a standard of how to interpret the data stored in documents, databases and

    such.

    A character set defines characters to be placed in a particular order and

    with a particular character code; for example, the capital letter A in the

    U.S. ASCII character set has character code 65 and the capital letter S in

    the same character set has code 83). The result of assigning a character     code to each individual character of a character set is called an encoding

    scheme.

Code pages:


    In Windows environments you hear a lot about code pages. A code page is     simply a particular encoding scheme. Windows code pages are usually defined

    to support specific languages or groups of languages which share common     writing systems. For example, code page 1253 provides character codes     required in the Greek writing system.

    A font implements a particular code page or set of code pages. For example,

    the Arial font implements the code pages 1252, 1250, 1251, 1253, 1254,     1257 and the Macintosh character set.

Setting the character set in NLS_LANG:


    When dealing with an Oracle environment, you normally want to set up     support for a particular character set. You do this by setting the NLS_LANG
    variable to contain a reference to a particular character set to be used     by either the Oracle client or by creating a database using a particular     character set.

    The character set chosen merely defines the encoding scheme to be used     by the Oracle product. For example, when using WE8ISO8859P1 Oracle     character set, you indicate that the character codes will be interpreted     using the "ISO/IEC 8859-1:1998 8-bit single-byte coded graphic character     sets -- Part 1: Latin alphabet No. 1" encoding scheme.

    Especially at the Oracle client side, the encoding scheme used by Oracle     must match the encoding scheme used by the I/O system itself. This means     that the font definition used by the OS and the character set used by the

    Oracle client must use the same character codes for the same characters.

    For example, the Windows fonts files for Western Europe implement at least

    the code page 1252 (called "Latin 1"). The latest font files contain the     Euro glyph with character code 128. Usually, when you want to use the Euro

    glyph in an Oracle environment, you want to use the WE8ISO8859P15 character

    set. However, in this character set the Euro glyph has character code 164.

    This means the encoding schemes of the Windows code page 1252 and Oracle     character set WE8ISO8859P15 do not match.

    You can best check the definitions of the encoding schemes to clarify what

    is meant above. As can be seen on the Microsoft site, the layout for code

    page 1252 (which Microsoft Corp. has published on the     http://www.microsoft.com/typography/unicode/1252.gif web site) indicates     that the Euro symbol can be found at character code 0x80 (= 128 decimal).

    However, the WE8ISO8859P15 character set (which implements the ISO Latin 9

    standard) uses character code 0xA4 (= 164 decimal) for the same character,

    as shown at http://czyborra.com/charsets/iso8859.html#Future.

    So for example when on Windows you use the Oracle client character set     WE8ISO8859P15 and type <Alt>+0128 to indicate an Euro glyph, this means     that you will actually send the character code 128 to the database and not

    character code 164.

    When the encoding schemes of the font (read: code page) and Oracle     character set do not match, an incorrect character code will be sent by the

    client to store in the server. This will lead to logical data corruptions.

    In this particular example you should use the Oracle character set     WE8MSWIN1252 which uses exactly the same encoding scheme as the Windows     code page does.

    Please note that this has nothing to do with the database character set in

    use. When you want to store e.g. the Euro symbol in the database, you can

    still set the database character set to WE8ISO8859P15, whereas the client

    uses WE8MSWIN1252. The Oracle software will correctly translate any symbols

    to the new character code values of the database character set when storing

    data into the database or retrieving data from the database.

Example:


    The following example describes how incompatibility between the font and     Oracle client character set (the NLS_LANG setting) used can cause problems

    to occur. The example uses a font implementing the Windows Code Page 1252

    and the Oracle client character set WE8ISO8859P15. The database has been     created using a character set capable of storing the Euro symbol.

    On behalf of the example the Euro symbol needs to be stored in the     database. The table below shows some code points and their visible     representations for both Windows Code Page 1252 and ISO Latin 9 (Oracle     character set WE8ISO8859P15):

    Code point        Code Page 1252    WE8ISO8859P15
    ----------        --------------    -------------
    128               ?                 <not printable>
    164               ¤                 ?

    When using a font implementing Code Page 1252 and setting the NLS_LANG     to use WE8ISO8859P15, you can input the Euro sign in e.g. SQL*Plus     by entering <Alt>+0128. However, code point 128 will be used to convert     to the database character set. As this character is not recognized by     Oracle to indicate the Euro symbol (as WE8ISO8859P15 indicates that     the Euro symbol has code point 164), the Euro sign will get lost.

    On the other hand, when the Euro symbol has somehow been stored in the     database successfully (using code point 164), then querying this     character using SQL*Plus on Windows will display the '¤' character (as     this is the font glyph corresponding to code point 164).

Conclusion:


    You must make sure that the font (read: code page used by the font) always

    matches the encoding scheme of the Oracle character set. If not, you will

    end up with logically corrupt data.

Glossary:


    character: name for a particular location in an encoding scheme (such

                     as "Capital Letter A")
    encoding scheme: ordered set of characters
    glyph:           visual representation of a character
    font:            collection of glyphs for implementing a particular
                     character set. It also contains information like the
                     typeface
    font file:       representation of a font with particular
characteristics
                     such as weight, slant, size, language information,
                     attachment points for diacritical marks and underline
and
                     strikethrough information
    SBCS:            Single-Byte character set; characters in this set are
                     stored using a single byte
    MBCS:            Multi-Byte character set; characters in this set are
                     stored using 1 or more bytes
    writing system:  a set of character representations for a particular
                     country (such as the set of Greek glyphs for Greece)

RELATED DOCUMENTS


Oracle8i National Language Support Guide

http://www.unicode.org
http://www.iso.ch
http://www.microsoft.com/OpenType/unicode/cscp.htm
http://www.microsoft.com/
http://www.microsoft.com/globaldev


Please always type those three digits and two dots (so 5! keystrokes) consituting the version in *any* of your posts

Hth,

Sybrand Bakker, Senior Oracle DBA Received on Fri Aug 31 2001 - 08:28:10 CDT

Original text of this message

HOME | ASK QUESTION | ADD INFO | SEARCH | E-MAIL US