Re: Verify Encoding in UTF8 Oracle DB

From: TC <donotuse_at_donotuse.com>
Date: Fri, 28 Mar 2003 21:29:36 GMT
Message-ID: <493ha.263897$6b3.804676_at_rwcrnsc51.ops.asp.att.net>


Here is how UTF8 encoding works

The number of bytes needed for one character depends on the character's Unicode code:

1 byte if it is in range 0000-007f,
2 bytes if it is in range 0080-07ff,
3 bytes if it is in range 0800-ffff.

Thus with UTF8 any character beyond 007f (ASCII 127) will be converted to 2 bytes and those beyond 00ff (ASCII 255) will be converted to 3 bytes.

That being said, interestingly the three characters that are messing up your algorithm, that you have mentioned, are part of the Latin-1 Supplement codeset range, 0080-00ff. The peculiarity of this range being that these are single byte characters in a West European encoding scheme while these are multibyte when it comes to unicode.

Wonder whether these characters are among those single-byte characters in char that have no multi-byte equivalents and appear in the output string as single-byte characters when to_multi_byte function is used.

//tc

"JeepGary" <gary_at_51075.com> wrote in message news:ffbd7fa6.0303281116.77d47085_at_posting.google.com...
> The DB that I work on receives data from many different batch files.
> The group importing the file will specify the encoding of the file,
> but some times they are wrong. Note that this is a global operation
> and the data can be Latin, Asian, etc. The encoding information is
> then used to translate the data to UTF8. If the encoding that was
> provided is wrong the translation will still occur, but with corrupted
> results.
>
> I would like to be able to audit our data to understand if it is
> encoded properly or not. Ideally I would like to be able to do this
> audit using SQL. At one point I had thought that I had even
> identified an algorithm that could do such a thing:
>
> If lengthb(to_multi_byte(fieldx))/length(to_multi_byte(fieldx)) <> 3
> then encoding error
> else no encoding error
>
> This seemed to work well, but then I discovered some extended ascii
> data that it did not work with: Ö, Å, ý, etc. It does however work
> with some Spanish extended ASCII and all of the Asian Multi-byte that
> I have encountered so far. To be honest I do not completely
> understand how all the functions are working (especially
> to_multi_byte()). I think I may be on the right track, but someone
> with expertise may able to save me some time and frustration.
>
> Thanks,
> Gary
Received on Fri Mar 28 2003 - 22:29:36 CET

Original text of this message