Re: charset for kangi

From: Frank van Bortel <frank.van.bortel_at_gmail.com>
Date: Wed, 31 Jan 2007 20:16:07 +0100
Message-ID: <epqpto$1os$1@news3.zwoll1.ov.home.nl>

MTNorman schreef:
> On Jan 30, 2:37 pm, Frank van Bortel <frank.van.bor..._at_gmail.com>
> wrote:

>> I would not go for different fields at all when designing
>> such an application, but rather have one characterset.
>> I would always opt for the AL16UTF16, because:
>> - it is closest to the Windows code set (like it or not,
>>    most clients use that on the desktop to enter characters)
>> - it is fixed double byte.
>>
>> There may be other considerations, which would make the
>> first option a viable choice.
>> Simple UTF8, as you call it, is not 10G - AL32UTF8 is.
>> And it is a valid choice.
>>

>
> Frank,
>
> Because AL16UTF16 is a double byte character set - it cannot be used
> for the database character set. It can only be used for a national
> character set. I'm not sure what you mean by "fixed" either...

Oops - you are very correct; major slip on my part!

> AL16UTF16 expands two bytes at a time to handle multi-byte UTF
> characters. AL16UTF8 has a single byte base and expands one byte at a
> time to handle multi-byte UTF.

AL16UTF16 handles all European and most Asian characters in 2 bytes each; it is a strict superset of UCS-2 (which I confused it with). Supplementary characters need 4 bytes, but in general it is more compact than UTF8 for Asian characters, which typically take 3 bytes each in UTF8.
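
To put rough numbers on that, here is a minimal sketch using Python's built-in codecs rather than Oracle itself. Treating "utf-8" and "utf-16-be" as stand-ins for AL32UTF8 and AL16UTF16 is an assumption for illustration, and the sample characters are my own choice:

```python
# Compare encoded sizes in UTF-8 and UTF-16 (big-endian, no BOM).
# Python's codecs stand in for Oracle's AL32UTF8 / AL16UTF16 here;
# that equivalence is an assumption made for illustration only.
samples = {
    "ASCII letter": "A",            # U+0041
    "European letter": "\u00e9",    # U+00E9, e with acute accent
    "Kanji": "\u6f22",              # U+6F22
    "Supplementary": "\U0002000b",  # U+2000B, outside the BMP
}

def encoded_sizes(ch):
    """Return (utf8_bytes, utf16_bytes) for a single character."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-be"))

sizes = {name: encoded_sizes(ch) for name, ch in samples.items()}
for name, (u8, u16) in sizes.items():
    print(f"{name:16} UTF-8: {u8} bytes   UTF-16: {u16} bytes")
```

The Kanji sample comes out at 3 bytes in UTF-8 against 2 in UTF-16, plain ASCII at 1 against 2, and the supplementary character at 4 in both.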

Don't know where AL16UTF8 comes in - it's not supported by Oracle, afaik.
>
> Why would you have the same data in two different table columns? All
> fields/columns that would contain Kanji data would use a national data
> type. You can store US7ASCII in the national character set as well as
> Kanji.

You are right - again.
>
> As to whether different data types cause problems in applications...
> which single data type do you use for all your fields/columns now -
> char, varchar2, number, clob, blob? I find no more problems with
> using nvarchar2 than with using char and varchar2. Yes, the developer
> needs to be aware of the characteristics of the different data types,
> particularly when assigning character data to declared variables...
> but then the developer should always be aware of the source and
> destination data types anyway. Just because Oracle can usually
> implicitly convert a string to a number and back to a string without
> problems doesn't mean the developer has not just introduced a "bug"
> into the code that's going to show up as soon as numeric strings with
> leading zeros are used.

You mean "byte" versus "char" length semantics. In such an environment, I'd change the database default (NLS_LENGTH_SEMANTICS) to char.
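
A minimal sketch of why the distinction matters for multi-byte data, assuming a column declared with a limit of 10 (think VARCHAR2(10 BYTE) versus VARCHAR2(10 CHAR)). The fits() helper is hypothetical, not an Oracle API, and UTF-8 stands in for an AL32UTF8 database character set:

```python
# Sketch of byte vs. char length semantics for a column limit of 10.
# fits() is a hypothetical helper, not an Oracle API; UTF-8 stands in
# for an AL32UTF8 database character set.
def fits(value, limit, semantics="byte"):
    """Return True if value fits within limit under the given semantics."""
    if semantics == "char":
        return len(value) <= limit              # count characters
    return len(value.encode("utf-8")) <= limit  # count encoded bytes

kanji = "\u6f22\u5b57" * 3   # six Kanji characters, 18 bytes in UTF-8

print(fits(kanji, 10, "byte"))   # False: 18 bytes exceed the limit
print(fits(kanji, 10, "char"))   # True: 6 characters fit
```

With byte semantics, the same six characters overflow a column that looks roomy enough on paper; with char semantics the declared length means what the developer expects.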
>
> WE8MSWIN1252 is the most supported windows character set, but XP
> supports multiple character sets that can be changed on the fly.
> That's why it's so neat for a dumb single language person like me to
> see the XP character set changed from American English to Canadian
> French to some Indian set and watch the desktop change from something
> I can read to something I sorta recognize (that year of high school
> french was a long time ago), to something only my colleague can
> read.
>
> BTW - having both the database character set and the national
> character set in the UTF space reduces the opportunities of "losing"
> some bytes when accidentally crossing data types.
>
> Regards,
> Margaret
>

So - AL32UTF8 it is - all the way.
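
Margaret's point about "losing" bytes can be seen in a small sketch, again with Python's codecs standing in for Oracle character sets (an assumption for illustration; cp1252 plays the role of WE8MSWIN1252, and the sample string is my own):

```python
text = "Kanji: \u6f22\u5b57"

# Round trip through a single-byte Windows code page: the Kanji have no
# mapping in cp1252, so errors="replace" substitutes "?" for each one.
lossy = text.encode("cp1252", errors="replace").decode("cp1252")

# Round trip between two Unicode encodings keeps every character.
utf8_bytes = text.encode("utf-8")
via_utf16 = utf8_bytes.decode("utf-8").encode("utf-16-be").decode("utf-16-be")

print(lossy)              # Kanji: ??
print(via_utf16 == text)  # True
```

When both sides are Unicode encodings the conversion is only a re-encoding, so nothing can fall outside the target repertoire.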

-- 
Regards,
Frank van Bortel

Top-posting is one way to shut me up...

Received on Wed Jan 31 2007 - 13:16:07 CST