Case sensitivity = property of column, table or database?

From: Bill Rayer <william.rayer_at_virgin.net>
Date: Mon, 15 Apr 2002 13:24:40 +0100
Message-ID: <oEzu8.19800$tZ1.6004145_at_news2-win.server.ntlworld.com>


Dear Newsgroup

I have a question about case sensitivity in a table. I've reached the limit of my elementary RDB knowledge and would very much appreciate any help.

I've developed a Table datatype for a new computer language. This stores rows of data, columns can have any of the usual numeric and string types, one or more indexes can be defined using one or more column(s), and there are operators to add, delete and find rows in O(log(N)) time, and all table access is done using keys. Strings are compared in a case sensitive way using their ASCII code, by treating them as variable length binary arrays. The table type follows most of the relational requirements (as applicable to one table) and I've got the code working OK.

This table type is used as a symbol table for the language's compiler. The language is case insensitive, so I always convert symbols to lower case before adding them to the symbol table. If I didn't do this, then 'Foo' and 'foo' would be two separate symbols, which is not allowed for a case insensitive language. Again, this all works fine now.

But now I want to change my compiler to preserve the case of the symbols in my symbol table. If the user declares an identifier 'fOO' then 'fOO' goes in the symbol table, but if a duplicate i/d 'foo' is used then 'foo' must not be added. Adding the symbol is no problem, all I have to do is preserve its case. But there are problems when doing symbol look-up. I can see several options:

(1) If the symbol being checked is NewId, and SYMTAB.id is
the symbol table column storing the identifier, do the equivalent of

  SELECT * FROM SYMTAB WHERE UCASE(id) = UCASE(NewId)

But this is very inefficient, because UCASE(id) is not an index so the entire table would have to be scanned each time. And there can be thousands of symbols and thousands of lookups.

(2) Modify the index operators (which use a balanced tree) to
use case insensitive string comparisons. But I'm not sure this would even work. Imagine a symbol table with 3 symbols, ordered in a case sensitive way (using ASCII):

  John
  Richard
  jon

A case insensitive binary search for 'jon' would fail, since the first comparison 'jon' < 'Richard' would return True, then 'jon' < 'John' would return False, and the search code would decide 'jon' isn't in the table.

(3) Define a new case insensitive index. For the example above,
it would contain:

  JOHN (row = 1)
  RICHARD (row = 2)
  JON (row = 3)

Searches would then use this special index, which has row numbers back to the original symbol table.

After some thought, I realized the problem is a general one of how to preserve the case sensitivity in the table, while allowing case insensitive and case sensitive look-ups. (using a compiler's symbol table as an example is not significant, it just happened to be the first time this problem occurred to me).

At this point I got stuck and did some searches on the net. I found an IBM paper on DB2 where they define a special column which always stores the upper case value of another column, then they index the upper-cased column. Also I found some info about Microsoft SQL server, and it seems you have to decide whether strings are case insensitive when installing SQL server. So I can't find a consistent answer.

So to get to my question finally: If I store strings in a table using mixed case, and I want to be able to search for them using a case insensitive search, how can I get efficient O(log(N)) access?

All ideas are very welcome!
Regards
Bill Rayer Received on Mon Apr 15 2002 - 14:24:40 CEST

Original text of this message