RE: Database Character set related question

From: Carel-Jan Engel <cjpengel.dbalert_at_xs4all.nl>
Date: Fri, 28 Oct 2005 22:33:01 +0200
Message-Id: <1130531581.18868.14.camel@dbalert199.dbalert.nl>

Actually, there is some (UNSUPPORTED) workaround to get the US7ASCII characterset imported in an 8bit way.
I'd suggest you start reading the (in)famous TOTOT book (also know as 'Tales of the Oak Table', Apress) at page 269, where James Morle explains how they did this by patching the characterset byte in the header of the export file (export 7bit, leaves the 8th bit untouched, import 8-bit). He doesn't explain what the byte settings should be, but that must be fairly easy to find out. Create two toy databases with either (US7ASCII/WE8something) characterset, create an empty schema and export the schema in both of them. The use 'od' to find out the differences. The trick is that the import program thinks it has to import 8-bit characters, and will 'forget' to mask the 8th bit. It's all pretty straightforward, I'd give it a shot in a test-environment. James also used a streaming setup, using 7 parallel export sessions, piping their output through named pipes/gzip/network/remote shell/gzip/named pipes/import to the destination. Somewhere they plumbed in the little C-program that converted the characterset byte in the header into the byte expected by the import process.

Of course this was in Oracle 7 days, so you should test whether the exp/imp behaviour is still the same.

The book is worth every page it's printed on (IMHO), and if you hadn't a reason yet to buy it, you've got one now.

Best regards,

Carel-Jan Engel

===
If you think education is expensive, try ignorance. (Derek Bok) ===

On Fri, 2005-10-28 at 13:52 -0600, Justin Cave (DDBC) wrote:
> Having 8-bit values in CHAR/ VARCHAR2 columns in a database whose
> character set us US7ASCII is not supported by Oracle and can cause
> significant problems, including irretrievably corrupt data.
>
>
>
> If the clientÿs NLS_LANG and the serverÿs character set match, Oracle
> currently bypasses the character set conversion step at the network
> level (it has been threatened that this will change in the future) and
> trusts that the data being sent is, in fact, the declared character
> set. If every application that accesses the data claims that the data
> is US7ASCII but treats it internally as some other character set, it
> can appear that things are working correctly. If any application or
> client machine deviates from that falsehood, very bad things happen.
> If you have two different applications that use different character
> sets internally (i.e. one is using UTF8 and the other is using
> Windows-1252), you can have a situation where half of your data is
> encoded using UTF-8 and the other half is encoded in Windows-1252 and
> Oracle has no way of knowing what values are encoded in what character
> set. Untangling data that has been corrupted this way is a challenge
> to say the least.
>
>
>
> You cannot use export & import to move data to a new machine with a
> different character set. If you do, Oracle will do a character set
> conversion at some point and corrupt your data, since it will assume
> that the data is valid US7ASCII and the conversion will fail for
> values > 128.
>
>
>
> The first step is to figure out what character set the data in the
> database really is. You can do this either by looking at the
> application(s) that connect to the database and seeing what character
> set they use internally or by examining the data. The quick & dirty
> approach is to extract the data (or at least representative subsets)
> to a flat file (ensuring that you do not do a character set
> conversion) and try to open the file in Word, which will hopefully
> prompt you to choose a character set to use to display the data. Play
> around until you find one that works (probably Windows-1252,
> ISO-8859-1, ISO-8859-15, or UTF-8). Note that many of these character
> sets are very similar, so you have to be careful about jumping to
> conclusions. If there are multiple character sets represented, youÿve
> got a major problem.
>
>
>
> If you follow the general instructions for character set conversion
> (i.e. ALTER DATABASE CHARACTER SET), Oracle will allow you to change
> the character set to just about anything, but that wonÿt change the
> data that is being stored and will almost certainly cause massive
> amounts of things to break, particularly if you pick the wrong
> character set. If you pick the right character set (the character set
> that the data is really encoded in), youÿll just need to adjust
> NLS_LANG settings on all the client machines. If you pick the wrong
> character set, conversion for the code points > 127 will universally
> fail regardless of client settings.
>
>
>
> Justin Cave
>
> Distributed Database Consulting, Inc.
>
>
>
>
>
> ______________________________________________________________________
>
> From:oracle-l-bounce_at_freelists.org
> [mailto:oracle-l-bounce_at_freelists.org] On Behalf Of Mandal, Ashoke
> Sent: Thursday, October 27, 2005 12:42 PM
> To: oracle-l_at_freelists.org
> Subject: Database Character set related question
>
>
>
>
>
> Greetings All,
>
> The database CHARACTER SET is US7ASCII. The data in the database
> contains 8 bits (> 127 ASCII ) values. We cannot display these
> values, instead we see "?" character via the sqlplus session.
> We want to upgrade this database to 9.2.0.4, but before we do
> that , we want to make sure that we don't lose any characters that
> are currently stored in the database.
> We are attempting to run CSSCAN to verify. We want to use UTF8 in the
> new 9i
> database. How do we determine what the character set was used to
> put the data into the database.
>
> It's definitely not US7ASCII!
>
> Is there a object that we can look at to see what another session is
> currently using? It is difficult for us to get access to the user 's
> workstation to see there configuration.
>
> Thanks,
>
> Ashoke
>

--
http://www.freelists.org/webpage/oracle-l

Received on Fri Oct 28 2005 - 15:35:55 CDT