Re: Displaying 'umlaut' character

From: Ben Morrow <ben_at_morrow.me.uk>
Date: Wed, 22 Sep 2010 09:19:25 +0100
Message-ID: <dpqom7-io5.ln1_at_osiris.mauzo.dyndns.org>


Quoth "Peter J. Holzer" <hjp-usenet2_at_hjp.at>:
> On 2010-09-22 06:01, Ben Morrow <ben_at_morrow.me.uk> wrote:
> >
> > You almost certainly don't want to do either of those. 'use utf8' does
> > exactly one thing: it tells Perl your script itself is written in UTF-8.
> > If that isn't the case you don't want to use it. Perl also doesn't take
> > any notice of NLS_LANG or any of the other locale envvars unless you ask
> > it to (and, normally, that's a bad idea). However, it's possible that
> > whatever database interface you're using does.
> >
> >> $ENV{'NLS_LANG'}='AMERICAN_AMERICA.UTF8';
> >> Works for Vietnamese characters, but not with umlaut (ö).
> >
> > I don't think that's usually a valid locale on a Linux system. Usually
> > they are of the form 'en_US.UTF-8', but in any case if you need locales
> > at all you will want to check which locales are available on your
> > system.
>
> The NLS_LANG environment variable is for Oracle. He does need that if he
> wants to get anything but US-ASCII out of (or into) an Oracle database.
> AMERICAN_AMERICA.UTF8 is a valid locale for Oracle, but for Oracle 9 or
> later you should use .AL32UTF8 instead of .UTF8 (.AL32UTF8 is real
> UTF-8, .UTF8 is a weird mixture of UTF-8 and UTF-16).

Ah, I see. (I don't use Oracle.) I was getting confused with NLSPATH used by catgets(3), I think.

Weird choice of environment variable: I would expect something prefixed with OC8 or some such. <shrug> I guess it's just part of the 'we own the whole world' Oracle mentality... :)
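
A minimal sketch of the setting Peter describes, untested against a real Oracle install. Putting it in a BEGIN block is an assumption on my part: the idea is that the variable is already in place before any Oracle client code gets loaded.

```perl
use strict;
use warnings;

# Assumption: set NLS_LANG at compile time, before DBI/DBD::Oracle
# are loaded, so the Oracle client sees it when it initialises.
BEGIN {
    # .AL32UTF8 rather than .UTF8, per Peter's advice for Oracle 9 or later.
    $ENV{NLS_LANG} = 'AMERICAN_AMERICA.AL32UTF8';
}

# 'use DBI;' and the DBI->connect(...) call would follow here.
```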

> > FWIW when I do this sort of thing I use Postgres with DBD::Pg, I set the
> > database encoding to UTF-8 (this is a Pg-specific feature, but I
> > wouldn't be surprised if Ora has got something similar),
>
> DBD::Oracle does this if NLS_LANG includes a UTF-8-like character set.

In Pg this is a per-database setting indicating how the strings are stored as well as how they are returned by default; asking for per-connection on-the-fly reencoding is different. (Not really important here, I know.)

> Since he has set that correctly he gets wide characters back from the
> database. The umlauts all have character codes <= 0xFF, so they can be
> printed as a single byte and perl does that. The vietnamese characters
> have codes >= 0x0100, so Perl converts them to UTF-8 (I bet he has a lot
> of "Wide character in print" warnings in log file).

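Yup. This presumably means he *is* correctly sending charset=UTF-8 in the Content-Type header; otherwise the situation would be exactly reversed: a page read as Latin-1 would show the single-byte umlauts correctly and mangle the UTF-8-encoded Vietnamese.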
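
For what it's worth, a minimal sketch of the usual way out, assuming the output is meant to be UTF-8. The two variables are hypothetical stand-ins for character strings coming back from the database; the point is that once the output is encoded explicitly, code points below and above 0xFF are treated the same way.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical stand-ins for character strings fetched from the database.
my $umlaut = "\x{F6}";     # U+00F6, o with diaeresis (code point <= 0xFF)
my $viet   = "\x{1EC7}";   # U+1EC7, a Vietnamese letter (code point >= 0x100)

# Encoding explicitly yields the bytes a UTF-8 page expects.
my $umlaut_utf8 = encode('UTF-8', $umlaut);   # two bytes: 0xC3 0xB6
my $viet_utf8   = encode('UTF-8', $viet);     # three bytes: 0xE1 0xBB 0x87

# More convenient for ordinary output: put an encoding layer on the
# handle so every character is encoded on the way out, and the
# "Wide character in print" warning goes away.
binmode(STDOUT, ':encoding(UTF-8)');
print "$umlaut $viet\n";
```

Without the layer, perl writes the umlaut as the single byte 0xF6, which is not valid UTF-8 on its own, and that would explain the symptom described.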

Ben