Oracle FAQ Your Portal to the Oracle Knowledge Grid


Re: Oracle 8.i --> Oracle 9i + Unicode

From: jallan <jallan_at_smrtytrek.com>
Date: 22 Sep 2003 14:45:45 -0700
Message-ID: <299f1138.0309221345.6f75b4a8@posting.google.com>


дамјан г. <mk_at_net.mail.penguinista> wrote in message news:<3f6f42c9_at_news.mt.net.mk>...
> >>> I wish I knew the characters you were talking about! My newsreader can't
> >>> cope!!
> >>
> >> ok, I was talking about "o" with two dots above it and "a" with two dots
> >> above it :)
> >> Both very much used in my native language :)
> >>
> >> Tanel.
> >
> > I could've sworn o-umlaut and a-umlaut were double-byte in UTF8. But as
> > they're your natives and not mine, I'll bow to your superior knowledge on
> > the matter!!
>
> Everything out of ASCII is at least double byte in UTF-8 - most of the
> additional latin letters, cyrillics, and many others take up two bytes in
> UTF-8.
Totally correct.

O-umlaut and a-umlaut are double-byte in UTF-8, not triple-byte.
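
This is easy to check directly. The following is a minimal sketch in Python 3 (not part of the original thread; the helper name `utf8_bytes` is made up for illustration) that encodes the two letters and counts the bytes:

```python
def utf8_bytes(ch: str) -> bytes:
    """Return the UTF-8 encoding of a single character."""
    return ch.encode("utf-8")


# o-umlaut is U+00F6 and a-umlaut is U+00E4.
for ch in ("\u00f6", "\u00e4"):
    encoded = utf8_bytes(ch)
    print(f"U+{ord(ch):04X}: {encoded.hex()} ({len(encoded)} bytes)")
```

Both letters come out as two bytes, as stated above.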

From http://www.unicode.org/versions/Unicode4.0.0/ch02.pdf:

<< Beyond the ASCII range of Unicode, many of the non-ideographic scripts are represented by two bytes per code point in UTF-8; all nonsurrogate code points between U+0800 and U+FFFF are represented by three bytes; and supplementary code points above U+FFFF require four bytes. >>

Search for "3-6. W" in http://www.unicode.org/versions/Unicode4.0.0/ch03.pdf to find a chart showing the number of bytes needed for each range of Unicode code points.
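
The ranges quoted above can be written down as a small function. This is only a sketch in Python 3; the name `utf8_byte_count` is an illustrative choice, and the boundaries are the ones given in the quotation from the standard:

```python
def utf8_byte_count(code_point: int) -> int:
    """Number of bytes UTF-8 needs for one Unicode code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points are not encoded in UTF-8")
    if code_point <= 0x7F:      # ASCII
        return 1
    if code_point <= 0x7FF:     # U+0080 to U+07FF
        return 2
    if code_point <= 0xFFFF:    # U+0800 to U+FFFF, surrogates excluded above
        return 3
    return 4                    # supplementary code points above U+FFFF


# Cross-check the ranges against Python's own UTF-8 encoder.
for cp in (0x41, 0xF6, 0x7FF, 0x800, 0x20AC, 0x10000):
    assert utf8_byte_count(cp) == len(chr(cp).encode("utf-8"))
```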

One-byte characters are ASCII (U+0000 to U+007F).

Two-byte characters (U+0080 to U+07FF) cover all the Latin-x extended characters, many other Latin characters (including IPA characters), most western diacritical marks, unaccented Greek, Cyrillic, Armenian, Hebrew, Arabic, Syriac and Thaana.

You can probably find the information in the standard at http://www.unicode.org/versions/Unicode4.0.0 to estimate your needs.
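
One rough way to make such an estimate is to run a representative sample of your own data through a tally like the one below. This is a sketch in Python 3 only; `utf8_width_histogram` and the sample string are hypothetical, and a real estimate should use real data:

```python
from collections import Counter


def utf8_width_histogram(text: str) -> Counter:
    """Count how many characters of text need 1, 2, 3 or 4 bytes in UTF-8."""
    return Counter(len(ch.encode("utf-8")) for ch in text)


# Hypothetical sample mixing ASCII, Latin letters with diaeresis and Cyrillic.
sample = "k\u00f6\u00f6k, m\u00e4gi, \u0421\u043a\u043e\u043f\u0458\u0435"
hist = utf8_width_histogram(sample)
total_bytes = sum(width * count for width, count in hist.items())

for width in sorted(hist):
    print(f"{width}-byte characters: {hist[width]}")
print(f"{len(sample)} characters, {total_bytes} bytes in UTF-8")
```

The ratio of bytes to characters in the sample gives a rough multiplier for sizing columns that are measured in bytes.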

Note also that Unicode can represent many accented characters in either composed or decomposed form. For example, á (a-acute) can be coded either as the single character LATIN SMALL LETTER A WITH ACUTE (U+00E1) or as the letter a followed by COMBINING ACUTE ACCENT (U+0301).

Jim Allan

Received on Mon Sep 22 2003 - 16:45:45 CDT
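
The composed and decomposed forms of a-acute described in the post can be compared with Python's standard `unicodedata` module. This is a sketch, not something from the original thread; it assumes Python 3, and the helper `describe` is made up for illustration:

```python
import unicodedata

# Start from each form and normalize to the other.
composed = unicodedata.normalize("NFC", "a\u0301")   # a + combining acute
decomposed = unicodedata.normalize("NFD", "\u00e1")  # precomposed a-acute


def describe(text: str) -> list:
    """List the code point and character name of each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


for label, form in (("composed", composed), ("decomposed", decomposed)):
    size = len(form.encode("utf-8"))
    print(f"{label}: {describe(form)} -> {size} bytes in UTF-8")
```

The composed form is one code point and two bytes in UTF-8. The decomposed form is two code points and three bytes, so the normalization form in use affects storage estimates.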
