Re: Case sensitivity = property of column, table or database?

From: Bill Rayer <william.rayer_at_virgin.net>
Date: Tue, 16 Apr 2002 11:18:34 +0100
Message-ID: <dUSu8.28884$tZ1.6959574_at_news2-win.server.ntlworld.com>


Hi there

> >Thanks for the quick response (from you and Anton)!
> >
> >Reading your replies, and based on other info I've gathered, a
> >separate upper-cased column seems to be the best option. This can be
> >indexed to ensure quick access. Also it should be unique, since the
> >language is case insensitive (ie foo and Foo should never be in the
> >symbol table together).
>
> It's a little hard to believe that you want to make tables of only
> identifiers of your own new language ... if that's the case, I would
> check the validity of the identifier (case insensitively) BEFORE
> storing it in the table, then access it using the least expensive
> means.

Sorry, I may not have explained the purpose of the table very well. In the context of my problem with case sensitivity, it is used as a compiler symbol table. It has 20 columns or so, storing the identifier folded to lower case, scope level, 'sort' of the identifier (constant, function, local, input parameter, module name etc), signature lists for functions, parent type for types and values, line number of declaration, and others I forget. The number of rows in the table varies as the compiler runs, since identifiers are added and deleted as they fall in and out of scope. Typically there are 1000 identifiers in the table (this includes the run time library i/ds).

The table is also used as a datatype in the language. In this context the number of rows and columns depends on the program written in the language. I may have confused the issue by first discussing the table in this context. It is only the case sensitivity issue I am concerned with, though, and it may be best to forget any other purpose of the table for now.

As you suggest, I do check the validity of the identifier in a case insensitive way, then iff it is OK it is stored in the table, folded to lower case. This allows efficient O(log(N)) access, since the identifier column is indexed, and whenever I want to check whether a new i/d is in the table, I convert the new i/d to lower case and do a case sensitive search using the lower-cased index. This is exactly what I do, and it works perfectly. The only problem is, I have lost the original case of the identifier. This means when the debugger displays symbols during a debug session, and gets the symbols from the symbol table, all the symbols appear in lower case.

> Since the number of identifiers of any given programming language
> typically never exceed a hundred or so, you'd be better off caching
> all your identifiers in memory and doing FTS (full-table scans)
> instead of some inefficient index reads which have to be retrieved
> from disk each time you wish to find an entry.

I am a bit lost here. Certainly the number of reserved words (things like 'if' in most languages) may never exceed a hundred, but there are typically thousands of identifiers. At least, there are using C, Pascal or VB under Windows which is what I'm familiar with. Also an index read is not too bad, surely? All index entries are guaranteed findable in O(log(N)) time. Also my symbol table can be stored in memory as well as on disk, and in memory it uses pointers instead of file I/O and is reasonably quick.

> Storing an additional column is resource-expensive; converting the
> data at runtime is much less expensive, but only as long as it is only
> converted once PER SEARCH INSTANCE. ;) (BTW you would need to convert
> it once, anyway, just to store the data in the extra column ... I am
> suggesting that you convert the data only when you need to search for
> it. This should produce minimal overhead, since disk reads are by far
> the most expensive part of any DBMS search operation).

Do you mean that the i/d column in the symbol table should store the i/d in mixed case, then before searching for a new i/d, I convert the whole column to upper case (say)? Or should the i/d column be indexed and the index stored in upper case?

> >Lastly, I am still puzzled why products like SQL server want to know
> >during installation whether they should be case sensitive. It seems
> >as if case sensitivity is most efficiently dealt with by having a new
> >column which can be indexed. So it is really a property of each string
> >column, not the entire database.<
>
> Other products, such as Oracle, only want to know the locale's code
> page ... there are additional options such as "function-based
> indices". In Oracle, unless otherwise specified, all text search is
> case-sensitive.

Does this mean SQL server is being daft in wanting to know at installation time? How does Oracle deal with case-insensitive searches then?

> May I ask, what RDBMS you have actually worked with?
>
> Have you ever had to deal with databases with millions of rows?

Only Access V2, and the table type I developed for the new language. And I have not dealt with more than 1000 rows or so. (In my first post I said my RDB experience was limited!) My viewpoint is that of a language designer, without much practical RDB experience, wanting to know the correct theoretical way to deal with case sensitivity.

All the best
Bill Rayer Received on Tue Apr 16 2002 - 12:18:34 CEST

Original text of this message