Re: Case sensitivity = property of column, table or database?

From: Anton Versteeg <Anton_Versteeg_at_nl.ibm>
Date: Mon, 15 Apr 2002 14:50:09 +0200
Message-ID: <3CBACC81.620ED0AB_at_nl.ibm>


Perhaps you can have the column twice:
once in mixed case and once (indexed) in uppercase?

Bill Rayer wrote:

> Dear Newsgroup
>
> I have a question about case sensitivity in a table. I've
> reached the limit of my elementary RDB knowledge and would
> very much appreciate any help.
>
> I've developed a Table datatype for a new computer language.
> This stores rows of data, columns can have any of the usual
> numeric and string types, one or more indexes can be defined
> using one or more column(s), and there are operators to add,
> delete and find rows in O(log(N)) time, and all table access
> is done using keys. Strings are compared in a case sensitive way
> using their ASCII code, by treating them as variable length
> binary arrays. The table type follows most of the relational
> requirements (as applicable to one table) and I've got the
> code working OK.
>
> This table type is used as a symbol table for the language's
> compiler. The language is case insensitive, so I always convert
> symbols to lower case before adding them to the symbol table.
> If I didn't do this, then 'Foo' and 'foo' would be two separate
> symbols, which is not allowed for a case insensitive language.
> Again, this all works fine now.
>
> But now I want to change my compiler to preserve the case
> of the symbols in my symbol table. If the user declares an
> identifier 'fOO' then 'fOO' goes in the symbol table, but
> if a duplicate i/d 'foo' is used then 'foo' must not be
> added. Adding the symbol is no problem, all I have to do
> is preserve its case. But there are problems when doing
> symbol look-up. I can see several options:
>
> (1) If the symbol being checked is NewId, and SYMTAB.id is
> the symbol table column storing the identifier, do the
> equivalent of
>
> SELECT * FROM SYMTAB WHERE UCASE(id) = UCASE(NewId)
>
> But this is very inefficient, because UCASE(id) is not an
> index so the entire table would have to be scanned each time.
> And there can be thousands of symbols and thousands of lookups.
>
> (2) Modify the index operators (which use a balanced tree) to
> use case insensitive string comparisons. But I'm not sure this
> would even work. Imagine a symbol table with 3 symbols, ordered
> in a case sensitive way (using ASCII):
>
> John
> Richard
> jon
>
> A case insensitive binary search for 'jon' would fail, since the
> first comparison 'jon' < 'Richard' would return True, then
> 'jon' < 'John' would return False, and the search code would
> decide 'jon' isn't in the table.
>
> (3) Define a new case insensitive index. For the example above,
> it would contain:
>
> JOHN (row = 1)
> RICHARD (row = 2)
> JON (row = 3)
>
> Searches would then use this special index, which has row numbers
> back to the original symbol table.
>
> After some thought, I realized the problem is a general one of how
> to preserve the case sensitivity in the table, while allowing case
> insensitive and case sensitive look-ups. (using a compiler's symbol
> table as an example is not significant, it just happened to be the
> first time this problem occurred to me).
>
> At this point I got stuck and did some searches on the net. I found
> an IBM paper on DB2 where they define a special column which
> always stores the upper case value of another column, then they
> index the upper-cased column. Also I found some info about Microsoft
> SQL server, and it seems you have to decide whether strings are
> case insensitive when installing SQL server. So I can't find a
> consistent answer.
>
> So to get to my question finally: If I store strings in a table
> using mixed case, and I want to be able to search for them using
> a case insensitive search, how can I get efficient O(log(N)) access?
>
> All ideas are very welcome!
> Regards
> Bill Rayer

--
Anton Versteeg
DB2 Specialist
IBM Netherlands
Received on Mon Apr 15 2002 - 14:50:09 CEST

Original text of this message