Re: Displaying 'umlaut' character

From: Peter J. Holzer <hjp-usenet2_at_hjp.at>
Date: Wed, 22 Sep 2010 09:36:59 +0200
Message-ID: <slrni9jcgt.9om.hjp-usenet2_at_hrunkner.hjp.at>



On 2010-09-22 06:01, Ben Morrow <ben_at_morrow.me.uk> wrote:
> Quoth "dn.perl_at_gmail.com" <dn.perl_at_gmail.com>:
>> My aim is to display the ‘special’ (NON-Ascii) German character/
>> diacritic umlaut or diaeresis correctly on a browser. The browser calls
>> a cgi perl-script which resides on a linux server. The browser which
>> calls the perl-script displays Vietnamese characters correctly (but
>> not the umlaut) without any special setting. The script sets NLS_LANG
>> variable to AMERICAN_AMERICA.UTF8 and uses utf8 module, but that’s
>> about it.
>
> You almost certainly don't want to do either of those. 'use utf8' does
> exactly one thing: it tells Perl your script itself is written in UTF-8.
> If that isn't the case you don't want to use it. Perl also doesn't take
> any notice of NLS_LANG or any of the other locale envvars unless you ask
> it to (and, normally, that's a bad idea). However, it's possible that
> whatever database interface you're using does.
>
>> $ENV{'NLS_LANG'}='AMERICAN_AMERICA.UTF8';
>> Works for Vietnamese characters, but not with umlaut (ö).
>
> I don't think that's usually a valid locale on a Linux system. Usually
> they are of the form 'en_US.UTF-8', but in any case if you need locales
> at all you will want to check which locales are available on your
> system.

The NLS_LANG environment variable is for Oracle; it is not a POSIX locale variable. He does need it if he wants to get anything but US-ASCII out of (or into) an Oracle database. AMERICAN_AMERICA.UTF8 is a valid value for Oracle, but for Oracle 9 or later you should use .AL32UTF8 instead of .UTF8 (.AL32UTF8 is real UTF-8; .UTF8 is a weird mixture of UTF-8 and UTF-16, in which a character outside the BMP is stored as a UTF-16 surrogate pair with each surrogate encoded as three bytes).
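A minimal sketch of that setting in a Perl script. The helper name nls_charset is made up for illustration, and setting the variable in a BEGIN block is just one common way to make sure it is in place before any Oracle client code runs; no database connection is made here.

```perl
use strict;
use warnings;

# Set NLS_LANG at compile time, before any Oracle client code can be loaded.
BEGIN { $ENV{NLS_LANG} = 'AMERICAN_AMERICA.AL32UTF8' }

# Hypothetical helper: return the character-set part of an NLS_LANG value
# of the form LANGUAGE_TERRITORY.CHARSET, or undef if there is none.
sub nls_charset {
    my ($nls_lang) = @_;
    return unless defined $nls_lang;
    return $nls_lang =~ /\.([^.]+)\z/ ? $1 : undef;
}

my $charset = nls_charset($ENV{NLS_LANG});

# Flag the older .UTF8 character set mentioned above.
warn "NLS_LANG uses .UTF8; consider .AL32UTF8 on Oracle 9 or later\n"
    if defined $charset && uc($charset) eq 'UTF8';

print "Oracle client character set: ", $charset // '(not set)', "\n";
```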

>> But even before we get to a perl-script, perhaps the LC_CTYPE env
>> variable needs to be set correctly. From my windows laptop, if I
>> access Oracle through Oracle Query Server, I can see the umlaut. But
>> if I open a linux-window,

Whatever "a linux window" may be. PuTTY? An X server? A VM running on the Windows host? Whatever it is, the character set in NLS_LANG must match the character set used by the terminal emulator in which sqlplus runs.

>> initiate an sqlplus session, and run the same SQL, I do not see the
>> umlaut correctly. I have tried a few values for the env variable
>> LC_CTYPE (like iso_8859_1, en_US, en_US.iso88591), but with no luck.
>> The surprising thing is that 'umlaut' is a much-known alphabet,
>> Vietnamese alphabets are less- known. Yet the Vietnamese characters
>> are being displayed correctly.
>>
>> What settings should I use in a perl-script or for a linux-window to
>> see the umlaut correctly? Please advise.
>
> OK. What is actually stored in the database (what data types are you
> using, and how is the data encoded before being stored)? How are you
> getting the data out of the database (the only correct answer here is
> 'DBI', or possibly a wrapper around that)? Have you read the DBI and
> DBD::Oracle docs for anything concerning character encodings? Have you
> read perlunitut and the other docs that refers you to?
>
> FWIW when I do this sort of thing I use Postgres with DBD::Pg, I set the
> database encoding to UTF-8 (this is a Pg-specific feature, but I
> wouldn't be surprised if Ora has got something similar),

DBD::Oracle does this if NLS_LANG includes a UTF-8-like character set. Since he has set that correctly, he gets wide characters back from the database. The umlauts all have character codes <= 0xFF, so each can be printed as a single byte, and Perl does exactly that. The Vietnamese characters mostly have codes >= 0x0100, so Perl outputs any string containing one as UTF-8 (I bet he has a lot of "Wide character in print" warnings in the log file).
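The behaviour can be reproduced without a database. The following is only a sketch: bytes_written is a helper made up for this example, and the two sample characters (U+00F6 and U+1EC7) stand in for whatever the driver actually returns.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Print $string to a temp file through $layer and return the raw bytes written.
sub bytes_written {
    my ($string, $layer) = @_;
    my ($fh, $path) = tempfile(UNLINK => 1);
    binmode($fh, $layer) or die "binmode failed: $!";
    print {$fh} $string;
    close $fh or die "close failed: $!";
    open my $in, '<:raw', $path or die "open failed: $!";
    my $bytes = do { local $/; <$in> };
    close $in;
    return $bytes;
}

# Collect warnings instead of letting them go to STDERR (the error log in CGI).
my @warnings;
$SIG{__WARN__} = sub { push @warnings, @_ };

my $umlaut = "\x{F6}";     # o with diaeresis, code <= 0xFF
utf8::upgrade($umlaut);    # same character, stored the way a character string from a driver would be
my $viet   = "\x{1EC7}";   # a Vietnamese letter, code >= 0x0100

my $raw_umlaut  = bytes_written($umlaut, ':raw');              # no encoding layer
my $raw_viet    = bytes_written($viet,   ':raw');              # no encoding layer
my $utf8_umlaut = bytes_written($umlaut, ':encoding(UTF-8)');  # with encoding layer

printf "%-32s %s\n", @$_ for
    [ 'umlaut, no layer:',         join ' ', unpack '(H2)*', $raw_umlaut  ],   # f6
    [ 'Vietnamese, no layer:',     join ' ', unpack '(H2)*', $raw_viet    ],   # e1 bb 87
    [ 'umlaut, :encoding(UTF-8):', join ' ', unpack '(H2)*', $utf8_umlaut ];   # c3 b6
print "warnings seen: ", scalar(@warnings), "\n";
```

Without a layer the umlaut goes out as the single byte F6, which is not valid UTF-8 on its own, while the Vietnamese letter goes out as UTF-8 and triggers the warning.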

> I push an :encoding(utf8) layer onto any filehandles, I make sure to
> send a 'Content-type: text/html; charset=utf-8' header, and everything
> Just Works. There are variations on that which work just as well, but
> that's by far the simplest approach.

ACK. The OP is probably missing the :encoding(utf8) layer.
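A minimal sketch of what that looks like in a CGI script, assuming the data arrives from DBI as character strings. emit_page is a hypothetical helper, not part of any module, and $name stands in for a value fetched from the database.

```perl
use strict;
use warnings;

# Hypothetical helper: push a UTF-8 encoding layer onto $fh, then send a
# header declaring the same charset, followed by the page body.
sub emit_page {
    my ($fh, $body) = @_;
    binmode($fh, ':encoding(UTF-8)') or die "binmode failed: $!";
    print {$fh} "Content-Type: text/html; charset=utf-8\n\n";
    print {$fh} $body;
    return;
}

# Stand-in for a character string returned by the database driver.
my $name = "K\x{F6}ln";

emit_page(\*STDOUT, "<p>$name</p>\n");
```

With the layer in place every character is encoded to UTF-8 on output, whether its code is below or above 0xFF, so the bytes sent agree with the charset in the header.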

        hp