10gR2 new feature: Case Insensitive Sorts & Compares

Natalka Roshak's picture
articles: 

One handy new feature of 10gR2 is its vastly expanded ability to do case-insensitive sorts and compares (a subset of its expanded ability with other special sorts and compares, such as special linguistic sorts).

To get us started, here's a simple sample table with a few rows in it.

SQL> select * from invoice_demo order by invoice_id
SQL> /

INVOICE_ID SUPPLIER_NAME
---------- ------------------------------
      1000 Max Books
      1001 MAX BOOKS
      1002 max books
      1003 janus pet supply

The default: Binary sort

The default sort order is always binary, meaning that the characters are sorted in the order of their character number in the character set (aka code point or encoded value). For example, compare the sort order on SUPPLIER_NAME with the character number of the first character in the field:

SQL> select invoice_id, supplier_name, ascii(substr(supplier_name,1,1))
  2  from  invoice_demo 
  3  order by supplier_name
SQL> /

INVOICE_ID SUPPLIER_NAME        ASCII(SUBSTR(SUPPLIER_NAME,1,1))
---------- -------------------- --------------------------------
      1001 MAX BOOKS                                          77
      1000 Max Books                                          77
      1003 janus pet supply                                  106
      1002 max books                                         109

7 rows selected.

All the lower-case characters occur after all the upper-case characters in ASCII, WE89ISO8859P1, and other Latin character sets, so 'M' comes before 'j'. A pretty common requirement is to do a case-insensitive sort, where 'j' would come before 'M'. Oracle has offered various ways to do this since at least 9i, but 10gR2 improves on the offering significantly.

The old way: NLS_COMP=ANSI

In all versions since at least 9i, you can do a case insensitive sort by setting NLS_COMP to ANSI and choosing a case insensitive sort order. (The default setting for NLS_COMP is BINARY, which produces the binary sort we saw in the first query.)

Choosing a case-insensitive sort order is easy. You can take any of the Oracle-defined linguistic sort orders - the default is BINARY - and append "_CI" to it. For example:

SQL> alter session set NLS_COMP='ANSI' ;

Session altered.

SQL> alter session set NLS_SORT='BINARY_CI' ;

Session altered.

SQL> select * from invoice_demo order by supplier_name ;

INVOICE_ID SUPPLIER_NAME
---------- ------------------------------
      1003 janus pet supply
      1001 MAX BOOKS
      1000 Max Books
      1002 max books

Setting NLS_COMP to ANSI causes Oracle to use the sort order specified in NLS_SORT when doing an ORDER BY. But one big limitation is that when NLS_COMP is set to ANSI, only certain SQL functions and operations will use the NLS_SORT sort order. The rest will still use the default BINARY sort order. For example, the = (equality) operator will do a case-insensitive comparison, but the "like" operator will not:

SQL> select * from invoice_demo
  2  where supplier_name='MAX BOOKS';

INVOICE_ID SUPPLIER_NAME
---------- ------------------------------
      1000 Max Books
      1001 MAX BOOKS
      1002 max books

SQL> select * from invoice_demo where supplier_name like 'M%';

INVOICE_ID SUPPLIER_NAME
---------- ------------------------------
      1000 Max Books
      1001 MAX BOOKS

And neither will most other SQL functions and operators; for example:

SQL> select min(supplier_name) from invoice_demo ;

MIN(SUPPLIER_NAME)
------------------------------
MAX BOOKS

SQL> select distinct (supplier_name) from invoice_demo
  2  where substr(supplier_name,1,1)='M' ;

SUPPLIER_NAME
------------------------------
Max Books
max books
MAX BOOKS

The 10gR2 solution: NLS_COMP=LINGUISTIC

10gR2 introduced a new value for NLS_COMP which extends the NLS_SORT sort order to cover all SQL sorts and comparisons. Here are the results of the same queries as above, with NLS_COMP set to LINGUISTIC.

SQL> alter session set nls_comp='LINGUISTIC';

Session altered.

SQL> select * from invoice_demo
  2  where supplier_name like 'M%';

INVOICE_ID SUPPLIER_NAME
---------- ------------------------------
      1000 Max Books
      1001 MAX BOOKS
      1002 max books

SQL> select min(supplier_name) from invoice_demo ;

MIN(SUPPLIER_NAME)
------------------------------------------
janus pet supply

SQL> select distinct (supplier_name) from invoice_demo
  2  where substr(supplier_name,1,1)='M' ;

SUPPLIER_NAME
------------------------------
Max Books

With NLS_COMP set to LINGUISTIC, every SQL sort and comparison uses the NLS_SORT linguistic sort order. So, if you have set your sort order to a case insensitive one by appending _CI, all sorts and comparisons in that session will be case insensitive. Very handy!

Mixing Case-Sensitive Searches with Case-Insensitive Ones

You might want to mix case-sensitive and case-insensitive behaviour in the same query. For example, I might want to group case-sensitively and sort case-insensitively. First I'll add a bit more data to the table to make this example clearer:

SQL> select * from invoice_demo
  2  where supplier_name like 'm%'
  3  order by invoice_id

INVOICE_ID SUPPLIER_NAME
---------- -------------------------------
      1000 Max Books
      1001 MAX BOOKS
      1002 max books
      1010 MAX BOOKS
      1011 max books

In this example, I want to know how many occurrences of each supplier_name show up in the table, case-sensitively, to see how the data is being entered; but I still want 'j' to come before 'M' in the results. So I want to group case-sensitively, and sort case-insensitively. The easiest way to do this is to use the SQL function NLSSORT. This function takes a string and an NLS sort order, and returns a hex value. The following example shows the values returned by NLSSORT for the BINARY sort order, and for the BINARY_CI sort order, side-by-side.

SQL> select invoice_id, supplier_name,
  2  nlssort(supplier_name,'NLS_SORT=BINARY'),
  3  nlssort(supplier_name,'NLS_SORT=BINARY_CI')
  4  from invoice_demo
  5  where supplier_name like 'm%'
  6  order by invoice_id

SQL> /

INVOICE_ID SUPPLIER_NAME        NLSSORT(SUPPLIER_NAME,'NLS_SORT=BIN NLSSORT(SUPPLIER_NAME,'NLS_SORT=BIN
---------- -------------------- ----------------------------------- -----------------------------------
      1000 Max Books            4D617820426F6F6B7300                6D617820626F6F6B7300
      1001 MAX BOOKS            4D415820424F4F4B5300                6D617820626F6F6B7300
      1002 max books            6D617820626F6F6B7300                6D617820626F6F6B7300
      1010 MAX BOOKS            4D415820424F4F4B5300                6D617820626F6F6B7300
      1011 max books            6D617820626F6F6B7300                6D617820626F6F6B7300

5 rows selected.

This function makes it easy to combine case-sensitive and case-insensitive sorts and comparisons in the same query. In this example, the SQL would be:

SQL> alter session set nls_comp='binary' ;

Session altered.

SQL> alter session set nls_sort='binary' ;

Session altered.

SQL> select supplier_name, count(*)
  2  from invoice_demo
  3  group by supplier_name
  4  order by nlssort(supplier_name,'NLS_SORT=BINARY_CI') ;

SUPPLIER_NAME          COUNT(*)
-------------------- ----------
janus pet supply              1
MAX BOOKS                     2
max books                     2
Max Books                     1

4 rows selected.

Case Insensitive Indexes

A normal index uses the default sort order, by default, and so it's useless in a case-insensitive search - and Oracle won't use it. For large tables, the resulting full table scan can be quite a performance penalty. Fortunately, it's easy to create an index that uses a specific sort order. You simply create a function-based index that uses the NLSSORT function we saw above.

SQL> create index ind_idemo_sname_ci
  2  on invoice_demo (nlssort(supplier_name, 'NLS_SORT=BINARY_CI'))
SQL> /

Index created.

The main thing to watch out for here is that because this is a function-based index, it is subject to all the restrictions on function-based indexes. For example, Oracle will not use the index if it's theoretically possible for the result set to include NULLS in the indexed column. Oracle will only use the index if either (1) the column that the index is on is declared as NOT NULL, or (2) you have a where clause in the query that specifies that the function on the indexed column return NOT NULL, such as WHERE NLSSORT(SUPPLIER_NAME,'NLS_SORT=BINARY_CI') IS NOT NULL .

Accent Insensitivity

A related feature is what one might call "accent insensitivity," or the ability to ignore diacritics in the sort order. For example, you might want "à la mode" (note the accent over the "a") and "a la mode" (no accent) to sort to the same position. Just as you can append _CI to any of Oracle's linguistic sort orders, you can append _AI to get case insensitivity and accent insensitivity.

For this example, we'll need some data with accented characters:

INVOICE_ID SUPPLIER_NAME
---------- --------------------
      2002 École Lyonnaise
      2000 ecole lyonnaise
      2001 école lyonnaise

In WE8ISO8859P1, and other Latin character sets, characters with diacritics come after all of the regular ASCII characters. So the "E"s with accents will sort after everything else in the default binary sort order.

SQL> alter session set nls_sort='BINARY';

Session altered.

SQL> select supplier_name from invoice_demo order by supplier_name ;

SUPPLIER_NAME
--------------------
MAX BOOKS
MAX BOOKS
Max Books
ecole lyonnaise
janus pet supply
max books
max books
École Lyonnaise
école lyonnaise

9 rows selected.

If we use the BINARY_CI sort order, "ecole lyonnaise" will sort before "Max Books", but "école lyonnaise" will still come last.

SQL> alter session set nls_sort='BINARY_CI' ;

Session altered.

SQL> select supplier_name
  2  from invoice_demo order by supplier_name ;

SUPPLIER_NAME
--------------------
ecole lyonnaise
janus pet supply
Max Books
max books
max books
MAX BOOKS
MAX BOOKS
École Lyonnaise
école lyonnaise

9 rows selected.

But if we use the "accent insensitive" sort order BINARY_AI, we get both case and accent insensitivity:

SQL> alter session set nls_sort='BINARY_AI' ;

Session altered.

SQL> select supplier_name
  2  from invoice_demo order by supplier_name ;

SUPPLIER_NAME
--------------------
ecole lyonnaise
École Lyonnaise
école lyonnaise
janus pet supply
max books
Max Books
max books
MAX BOOKS
MAX BOOKS

9 rows selected.

Other sort orders

The new NLS_COMP=LINGUISTIC setting doesn't just apply to case-insensitive and accent-insensitive binary searches. Any of Oracle's linguistic sort orders can be used with it. For example, you can choose NLS_SORT=XSPANISH to force all SQL to use traditional Spanish sort and comparison semantics, in which "ch" and "ll" are both treated as one character - so "llaves" does not meet the condition "like 'l%' ", for example. And you can apply the _CI and _AI suffixes to any linguistic sort order, eg. NLS_SORT=XSPANISH_CI .

For more information

Refer to Chapter 5 of the Oracle 10gR2 Database Globalization Support Guide.

About the author

Natalka Roshak is a senior Oracle and Sybase database administrator, analyst, and architect. She is based in Kingston, Ontario, and consults across North America. More of her scripts and tips can be found in her online DBA toolkit at http://toolkit.rdbms-insight.com/.

Comments

Hello,

Thank you in advance for any help on this. I have converted and access db to oracle using vb. All the data is an exact match, the issue is that there are hundreds inbedded selects that are not returning rows because of the case sensativity Oracle brings to the table.

My question is:

Is the a master switch to turn off case sensativity?

Thx again for any help.

Lee

p.s. Let me know when you are in the D.C. area, I will take you out to lunch or dinner.