Re: nlssort

From: Howard J. Rogers <howardjr2000_at_yahoo.com.au>
Date: Sun, 17 Nov 2002 05:38:46 +1100
Message-ID: <G8wB9.77968$g9.219593@newsfeeds.bigpond.com>

Not that I've ever felt the need to use it myself, but if you are not happy with Oracle's own sorts, you can build your own. There's a utility provided in (I think) 8.1.6 and above that does just this... Locale Builder.

Might be worth checking out.

Regards
HJR <flatline_at_mailbox.hu> wrote in message
news:bc3a43ba.0211160749.36091c16_at_posting.google.com...
> Hi Peter!
>
> Thank you for your interest. I'm a collegue of Hegyvari Krisztian and
> i'm pleased to share my information about oracle nls sorting with you.
> First of all, correct hungarian sort must follow the rules that 'Á' is
> after 'A', 'É' is after 'E', etc. The rule that 'GY' follows 'G' is
> usally not expected in computer programs, but expected in
> dictionaries. Quite bad, that Oracle can not handle proper nls sorting
> due to a very stange .. hmm.. bug.
> Interesting that oracle can perfectly sort ONE characters using
> NLSSORT.
> Moreover it handles 'GY' type rules when NLS_SORT=XHUNGARIAN is used,
> handling it as a single character.
> The problem begins when sorting strings longer than 1 letter.
> We've dug in the problem deeply, and found that returned value from
> nlssort function weights differences BADLY.
> Here are the facts: (nlssort values based on NLS_SORT=HUNGARIAN)
> (Oracle 8.1.7 if i'm not wrong)
>
> 14000100 - A
>
> 14000300 - Á
>
> 19000100 - B
>
> 1E000100 - C
>
> 141E00010100 - AC
> Bitmask:
> 1400**0100 - Modified A
> ..1E00**0100 - Modified C
>
> 141900030100 - ÁB
> 1400**0300 - Modified Á
> ..1900**0100 - Modified B
>
> Interpreting the output of nlssort (of a single character) as a
> doubleword value, you can see nlssort values (of a string) is
> calculated as follows:
> each character's nlssort values is modified by inserting an empty byte
> between the low and high word value (for each length over 1) (marked
> with *) and shifting each character's value with a byte in order
> corresponding to the string (marked with .)
> Therefore:
> 1414190001030100 - AÁB
> 1400****0100 - A
> ..1400****0300 - Á
> ....1900****0100 - B
>
> As you can see, here AC is greater than ÁB, but only because the
> second character's nlssort mask is more dominant than the difference
> of first character beeing A or Á. Therefore the following (correct)
> hungarian sort
> AB
> AD
> AF
> AH
> ÁC
> ÁE
> ÁG
> ÁI
> ordered by oracle will look like:
> AB
> ÁC
> AD
> ÁE
> AF
> ÁG
> AH
> ÁI
> which is BAD. Really bad. I'm really curious whether the same anomaly
> can be experienced in other middle-europian languages. (Sorting one
> character correctly, more than one badly)
> Anyway, nlssort based indexes take up more than 3 times the space than
> the data itself :(
>
> PS: sorry for my english, I read a lot more than I write ;)
>
> Byez,
> Flatline
>
> pgulutzan_at_ocelot.ca (Peter Gulutzan) wrote in message
news:<36c478c6.0211151625.7b79a3b7_at_posting.google.com>...
> > hegyvari_at_ardents.hu (Hegyvari Krisztian) wrote in message
news:<9003d46f.0211100219.120c38a0_at_posting.google.com>...
> > > Greetings,
> > >
> > > I run a 8.1.7. database with the following settings:
> > >
> > > nls_lang=HUNGARIAN
> > > nls_sort=HUNGARIAN
> > > nls_comp=ANSI
> > >
> > > The correct alphabetical order in the Hungarian language is A,Á,B,....
> > > and for basic queries Oracle knows it, I can verify it using queries
> > > like "select * from dual where 'A'>'Á'" which gives no results and
> > > with 'B'>'Á' which results in an X. So far so good. The same query
> > > with 'BÁ'>'BA' results in an X which is good. 'BÁ'>'BAC' gives no
> > > results, which is bad, since Á is the second character in the string,
> > > while C is the third, according to how I order strings BÁ would come
> > > later than BAC. If I check nlssort('BÁ','NLS_SORT=HUNGARIAN') and
> > > nlssort('BAC','NLS_SORT=HUNGARIAN'), the second is greater, I think
> > > that is what causes this misbehaviour.
> > >
> > > Do I know something wrong or is this a bug which may have been
> > > corrected already?
> > >
> > > Any answers are highly appreciated, please cc it to
> > > hegyvari_at_ardents.hu.
> > >
> > > Regards,
> > >
> > > Hegyvari Krisztian
> >
> > I had always understood that the Hungarian exceptional rules are: Á
> > may follow A, CS follows C, Ê may follow E, Ö follows O, Ü follows U,
> > ZS follows Z, and the digraphs (GY, LY, NY, SZ TY) are separate
> > characters but follow (G, L, N, S, T) anyway. The words "may follow"
> > would imply that the rule is optional, followed in some
> > dictionaries/lists but not all. So I too would be interested in
> > knowing which rules Oracle follows.
> >
> > For several other European languages and an overview of what DB2 and
> > Oracle and SQL Server do with them, see my article "SQL Collations" at
> > http://dbazine.com/gulutzen1.html.
> >
> > Peter Gulutzan
> > Co-Author of SQL Peformance Tuning (http://www.ocelot.ca/tuning.htm)
Received on Sat Nov 16 2002 - 12:38:46 CST