Re: Storage requirements of NCHAR columns

From: Ross <rossfreemantle_at_yahoo.co.uk>
Date: 11 Jul 2006 02:19:06 -0700
Message-ID: <1152609546.866731.250130@m79g2000cwm.googlegroups.com>

Andreas Piesk wrote:
> Kenneth wrote:
> > On 10 Jul 2006 09:59:49 -0700, "Ross" <rossfreemantle_at_yahoo.co.uk>
> > wrote:
> >
> > >
> > >"When you use NCHAR and NVARCHAR2 datatypes for storing multilingual
> > >data, the
> > >column size specified for a column is defined in number of characters.
> > >(The number of
> > >characters means the number of Unicode code units.)"
> > >
> > >This would seem to suggest that an NCHAR(30) column would actually
> > >require 30 bytes of storage (as UTF8 has a single-byte code unit).
> > >However, an example in Chapter 7 explicitly states that 90 bytes are
> > >required. I don't think the NLS_LENGTH_SEMANTICS parameter affects
> > >NCHAR, so which is correct?
> > >
> >
> > UTF8 is not a single-byte charset. It is a varying-width charset.
>
> he didn't say UTF8 is single-byte. he said UTF8 uses 8bit code units
> which is, according to unicode.org, true.
>
> > If you define an NCHAR(30) column, it will contain 30 characters
> > (unless NULL). Always. Period.
>
> if the statement "The number of characters means the number of Unicode
> code units." is true, then his assumption is correct because for UTF8
> the number of characters == number of 8bit code units == number of
> bytes.
>
> i _think_ you're right, but Ross too. maybe the documentation is wrong
> or at least misleading.
>
> regards,
> -ap

Thanks, Andreas. I'm glad someone can see the source of my confusion. According to the Unicode glossary, the definition of 'Code Unit' is as follows:

"The minimal bit combination that can represent a unit of encoded text for processing or interchange. The Unicode Standard uses 8-bit code units in the UTF-8 encoding form, 16-bit code units in the UTF-16 encoding form, and 32-bit code units in the UTF-32 encoding form. (See definition D28a in Section 3.9, Unicode Encoding Forms.)"

This matches the definition in the Oracle Globalization Support Guide glossary and confirms that UTF8 has a single-byte code unit.
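
To make the code-unit definition concrete, here is a minimal Python sketch (nothing Oracle-specific; the helper name code_unit_counts is made up for illustration) that counts how many code units a single character needs in each encoding form. It relies only on Python's built-in codecs:

```python
def code_unit_counts(text: str) -> dict:
    """Return the number of code units needed for text in each encoding form."""
    return {
        "utf8": len(text.encode("utf-8")),            # 8-bit units, 1 byte each
        "utf16": len(text.encode("utf-16-le")) // 2,  # 16-bit units, 2 bytes each
        "utf32": len(text.encode("utf-32-le")) // 4,  # 32-bit units, 4 bytes each
    }

# One ASCII letter, one accented letter, and the euro sign.
for ch in ("A", "\u00e9", "\u20ac"):
    print(repr(ch), code_unit_counts(ch))
```

So the code unit in UTF-8 is always one byte wide, but a single character may need one, two, or three of them (or four outside the Basic Multilingual Plane).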

The Oracle docs clearly suggest (in several places) that NCHAR columns are defined in terms of the number of code units. If this is true (and I suspect it isn't), an NCHAR(30) could only store 10 three-byte characters.
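
The two readings can be compared with a rough Python sketch. The functions below are hypothetical helpers of my own, not Oracle behaviour: one treats the declared length as a limit on 8-bit UTF-8 code units, the other as a limit on characters (approximated here by Python's len(), which counts code points).

```python
def fits_by_code_units(text: str, n: int) -> bool:
    """Reading 1 (assumed): NCHAR(n) limits the number of UTF-8 code units."""
    return len(text.encode("utf-8")) <= n

def fits_by_characters(text: str, n: int) -> bool:
    """Reading 2 (assumed): NCHAR(n) limits the number of characters."""
    return len(text) <= n

euro = "\u20ac"  # takes three bytes in UTF-8

# Under reading 1, NCHAR(30) holds at most ten euro signs.
print(fits_by_code_units(euro * 10, 30), fits_by_code_units(euro * 11, 30))

# Under reading 2, thirty euro signs fit, and they need this many bytes.
print(fits_by_characters(euro * 30, 30), len((euro * 30).encode("utf-8")))
```

Under the second reading, thirty three-byte characters come to 90 bytes, which would at least be consistent with the figure in the Chapter 7 example.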

It seems unlikely that the same mistake would be made in several places. On the other hand, defining NCHAR columns in terms of code units seems an unnecessary complication and requires the user to be aware of how characters are encoded.

Received on Tue Jul 11 2006 - 04:19:06 CDT