Dear Newsgroup
I have a question about case sensitivity in a table. I've
reached the limit of my elementary RDB knowledge and would
very much appreciate any help.
I've developed a Table datatype for a new computer language.
This stores rows of data, columns can have any of the usual
numeric and string types, one or more indexes can be defined
using one or more column(s), and there are operators to add,
delete and find rows in O(log(N)) time, and all table access
is done using keys. Strings are compared in a case sensitive way
using their ASCII code, by treating them as variable length
binary arrays. The table type follows most of the relational
requirements (as applicable to one table) and I've got the
code working OK.
This table type is used as a symbol table for the language's
compiler. The language is case insensitive, so I always convert
symbols to lower case before adding them to the symbol table.
If I didn't do this, then 'Foo' and 'foo' would be two separate
symbols, which is not allowed for a case insensitive language.
Again, this all works fine now.
But now I want to change my compiler to preserve the case
of the symbols in my symbol table. If the user declares an
identifier 'fOO' then 'fOO' goes in the symbol table, but
if a duplicate i/d 'foo' is used then 'foo' must not be
added. Adding the symbol is no problem, all I have to do
is preserve its case. But there are problems when doing
symbol look-up. I can see several options:
(1) If the symbol being checked is NewId, and SYMTAB.id is
the symbol table column storing the identifier, do the
equivalent of
SELECT * FROM SYMTAB WHERE UCASE(id) = UCASE(NewId)
But this is very inefficient, because UCASE(id) is not an
index so the entire table would have to be scanned each time.
And there can be thousands of symbols and thousands of lookups.
(2) Modify the index operators (which use a balanced tree) to
use case insensitive string comparisons. But I'm not sure this
would even work. Imagine a symbol table with 3 symbols, ordered
in a case sensitive way (using ASCII):
John
Richard
jon
A case insensitive binary search for 'jon' would fail, since the
first comparison 'jon' < 'Richard' would return True, then
'jon' < 'John' would return False, and the search code would
decide 'jon' isn't in the table.
(3) Define a new case insensitive index. For the example above,
it would contain:
JOHN (row = 1)
RICHARD (row = 2)
JON (row = 3)
Searches would then use this special index, which has row numbers
back to the original symbol table.
After some thought, I realized the problem is a general one of how
to preserve the case sensitivity in the table, while allowing case
insensitive and case sensitive look-ups. (using a compiler's symbol
table as an example is not significant, it just happened to be the
first time this problem occurred to me).
At this point I got stuck and did some searches on the net. I found
an IBM paper on DB2 where they define a special column which
always stores the upper case value of another column, then they
index the upper-cased column. Also I found some info about Microsoft
SQL server, and it seems you have to decide whether strings are
case insensitive when installing SQL server. So I can't find a
consistent answer.