Re: Oracle Bug? RPAD of Japanese (kanji) character in Oracle 10gR2 UTF8 database

From: Kevin Kirkpatrick <kvnkrkptrck_at_gmail.com>
Date: Fri, 16 Jan 2009 10:33:19 -0800 (PST)
Message-ID: <bd395586-84ef-4f98-97f0-25996fd3bba2_at_p23g2000prp.googlegroups.com>


On Jan 16, 12:23 pm, Kevin Kirkpatrick <kvnkrkpt..._at_gmail.com> wrote:
> I've searched metalink and not seen a mention, but want to make sure
> I'm not missing anything obvious in calling this a bug. CHR(15121570)
> is the UTF8 character point representing kanji character 'h'. It
> seems like RPAD is not properly padding it out to a full 4 characters
> in the example below:
>
> SELECT RPAD(CHR(15121570),4,'*') FROM DUAL;
> RESULT:
> h**
>
> A LENGTH() function reveals that the RPAD is only creating a string
> with 3 characters:
>
> SELECT LENGTH(RPAD(CHR(15121570),4,'*')) FROM DUAL;
> RESULT:
> 3
>
> The same logic against a different multi-byte UTF8 character, the
> Microsoft ellipse, shows the expected behavior:
>
> SELECT RPAD(CHR(14844070),4,'*') FROM DUAL;
> RESULT:
> ...***
>
> SELECT LENGTH(RPAD(CHR(14844070),4,'*')) FROM DUAL;
> RESULT:
> 4
>
> (note - i'm using '*' to make the RPAD functionality more visible; the
> same behavior occurs with the default blank-space, eg RPAD(CHR
> (15121570),4))
>
> Verifying the UTF8 encoding:
> select * from nls_database_parameters where parameter =
> 'NLS_CHARACTERSET';
> RESULT:
> NLS_CHARACTERSET UTF8
>
> So, the question - does anyone see any obvious oversight on my part,
> or should I consider this a "probable bug" in need of a TAR?

Nevermind, I think I've discovered the cause after using better search terms in Metalink. From a metalink article: LPad and Rpad count in "display units" wich means that a

  select lengthb(rpad(bytestst ,10,'x')) lengthb,

         lengthc(rpad(bytestst ,10,'x')) lengthc,
         rpad(bytestst ,10,'x') from

  (select '' bytestst from dual);

  will result in 10 characters but 11 bytes ! (seen is in UTF8 2 bytes).

  In complex scripts where one "character on the display" may be composed by
  several combined characters the difference (nr of chars vs bytes) may be
  even bigger.

  With some scripts the number of characters returned may also varry:

  For example most Asian characters are 2 "display units" wide, so the Japanese

   ס (wich is a FULL WIDTH (or zenkaku) character) will "count for 2" in L/RPAD

  select lengthc(rpad(japanesechar ,10,'x')) lengthc,

          rpad(japanesechar ,10,'x') from (select UNISTR('\4F4F')   japanesechar from dual);

  Lengthc will return 9 characters.

  The thing is that the Japanese character ס takes 2 times the display width
  of a ascii character like x hence the RPAD will make the string 9 characters
  in total (2 time display for one ס + 8 times display for 8 x = 10 display
  units)

  If you really want the number of characters then you can use something like:

   RPAD ( str , n - LENGTHC(str),'c')

   Use LENGTHB, if the requested width is in bytes, LENGTHC, if in characters

 Or

   SUBSTR( str || RPAD( 'c', n, 'c' ), 1, n )

   Use SUBSTRB, if the requested width is in bytes, SUBSTR, if in characters

   In above the

  • str is the string to be padded. *'c' is the fill character (usually blank) -- we a assume singlebyte char
  • n is the requested width in bytes or characters

Sorry for posting prematurely - hopefully the above info will be of use to someone! Received on Fri Jan 16 2009 - 19:33:19 CET

Original text of this message