Verify Encoding in UTF8 Oracle DB

From: JeepGary <gary_at_51075.com>
Date: 28 Mar 2003 11:16:45 -0800
Message-ID: <ffbd7fa6.0303281116.75dbebcf_at_posting.google.com>


The DB that I work on receives data from many different batch files. The group importing the file will specify the encoding of the file, but some times they are wrong. Note that this is a global operation and the data can be Latin, Asian, etc. The encoding information is then used to translate the data to UTF8. If the encoding that was provided is wrong the translation will still occur, but with corrupted results.

I would like to be able to audit our data to understand if it is encoded properly or not. Ideally I would like to be able to do this audit using SQL. At one point I had thought that I had even identified an algorithm that could do such a thing:

If lengthb(to_multi_byte(fieldx))/length(to_multi_byte(fieldx)) <> 3 then encoding error
else no encoding error

This seemed to work well, but then I discovered some extended ascii data that it did not work with: Ö, Å, ý, etc. It does however work with some Spanish extended ASCII and all of the Asian Multi-byte that I have encountered so far. To be honest I do not completely understand how all the functions are working (especially to_multi_byte()). I think I may be on the right track, but someone with expertise may able to save me some time and frustration.

Thanks,
Gary Received on Fri Mar 28 2003 - 20:16:45 CET

Original text of this message