Re: Oracle Bug? RPAD of Japanese (kanji) character in Oracle 10gR2 UTF8 database
Date: Fri, 16 Jan 2009 10:33:19 -0800 (PST)
Message-ID: <bd395586-84ef-4f98-97f0-25996fd3bba2_at_p23g2000prp.googlegroups.com>
On Jan 16, 12:23 pm, Kevin Kirkpatrick <kvnkrkpt..._at_gmail.com> wrote:
> I've searched metalink and not seen a mention, but want to make sure
> I'm not missing anything obvious in calling this a bug. CHR(15121570)
> is the UTF8 character point representing kanji character 'h'. It
> seems like RPAD is not properly padding it out to a full 4 characters
> in the example below:
>
> SELECT RPAD(CHR(15121570),4,'*') FROM DUAL;
> RESULT:
> h**
>
> A LENGTH() function reveals that the RPAD is only creating a string
> with 3 characters:
>
> SELECT LENGTH(RPAD(CHR(15121570),4,'*')) FROM DUAL;
> RESULT:
> 3
>
> The same logic against a different multi-byte UTF8 character, the
> Microsoft ellipse, shows the expected behavior:
>
> SELECT RPAD(CHR(14844070),4,'*') FROM DUAL;
> RESULT:
> ...***
>
> SELECT LENGTH(RPAD(CHR(14844070),4,'*')) FROM DUAL;
> RESULT:
> 4
>
> (note - i'm using '*' to make the RPAD functionality more visible; the
> same behavior occurs with the default blank-space, eg RPAD(CHR
> (15121570),4))
>
> Verifying the UTF8 encoding:
> select * from nls_database_parameters where parameter =
> 'NLS_CHARACTERSET';
> RESULT:
> NLS_CHARACTERSET UTF8
>
> So, the question - does anyone see any obvious oversight on my part,
> or should I consider this a "probable bug" in need of a TAR?
Nevermind, I think I've discovered the cause after using better search terms in Metalink. From a metalink article: LPad and Rpad count in "display units" wich means that a
select lengthb(rpad(bytestst ,10,'x')) lengthb,
lengthc(rpad(bytestst ,10,'x')) lengthc, rpad(bytestst ,10,'x') from
(select '' bytestst from dual);
will result in 10 characters but 11 bytes ! (seen is in UTF8 2 bytes).
In complex scripts where one "character on the display" may be
composed by
several combined characters the difference (nr of chars vs bytes)
may be
even bigger.
With some scripts the number of characters returned may also varry:
For example most Asian characters are 2 "display units" wide, so the Japanese
ס (wich is a FULL WIDTH (or zenkaku) character) will "count for 2" in L/RPAD
select lengthc(rpad(japanesechar ,10,'x')) lengthc,
rpad(japanesechar ,10,'x') from (select UNISTR('\4F4F') japanesechar from dual);
Lengthc will return 9 characters.
The thing is that the Japanese character ס takes 2 times the display
width
of a ascii character like x hence the RPAD will make the string 9
characters
in total (2 time display for one ס + 8 times display for 8 x = 10
display
units)
If you really want the number of characters then you can use something like:
RPAD ( str , n - LENGTHC(str),'c')
Use LENGTHB, if the requested width is in bytes, LENGTHC, if in characters
Or
SUBSTR( str || RPAD( 'c', n, 'c' ), 1, n )
Use SUBSTRB, if the requested width is in bytes, SUBSTR, if in characters
In above the
- str is the string to be padded. *'c' is the fill character (usually blank) -- we a assume singlebyte char
- n is the requested width in bytes or characters
Sorry for posting prematurely - hopefully the above info will be of use to someone! Received on Fri Jan 16 2009 - 19:33:19 CET