Re: Problem with Case Sensitivity in Oracle?

From: Paul Singleton <paul_at_cs.keele.ac.uk>
Date: 4 Mar 1994 16:09:26 -0000
Message-ID: <2l7mfm$dlj_at_sleepy.cs.keele.ac.uk>


From article <bchow.25.000E0E57_at_rogers.com>, by bchow_at_rogers.com (Bob Chow):

> I am running into a problem with case sensitivity of Oracle. As you know,
> Oracle is sensitive to upper or lower case character strings. To accommodate
> this, I upshift the data inputted from the user in my application as well as
> using the Oracle UPPER function in a Select statement. This way , I
> ensure the I am comparing 2 strings with the same Case. The trouble with this
> is that, when one manipulates a column using a function such as UPPER, the
> Optimizer will suppress use of the index associated with that column,
> ...

I think this is one (common) example of a general problem (which is why I have tried to cross-post to comp.databases.theory).

There is a conflict between

  • FANCY data: the way you want to store it for retrieval (e.g. nicely formatted, punctuated, in mixed case)
  • CANONICAL data: the way you want to index that data (on its essential informational content, ignoring layour, punctuation, case etc.)

The best, and most general, solution is (surely?) to store the information twice, in both canonical and fancy forms. You index the former, and retrieve the latter.

There are many aspects other than case which you may wish to "canonise out". E.g. superficial variations in format, e.g.

        Jan van der Eng
        van der Eng, Jan
        der Eng, Jan van
        van der Eng, Jan

variation in transliteration, e.g.

	Dostoieffsky
	Dostoievsky
	Dostojewsky
	Dostoyevsky
	Dostoevskii

I shall try to summarise this strategy:

If a) you can define a canonical form for a domain of data values,

and b) this is acceptable to the application,

then you can convert all data values (whether they are being inserted

        into the database, or appear in a query) to this form, and also
        use it for indexing.

For a), we must define the canonisation algorithmically, so that arbitrary data values, in queries or updates, can be mechanically converted. This can be quite tricky.

For b), we must have an application which is content to let its data be stored in a form determined by implementation considerations. (Fortunately, users have been softened up by decades of being told that their surnames must have no more than 15 characters, that they can have no more than 4 initials, no accents, no dipthongs, 4 lines of address etc. etc. So being told that their names will, for implementation necessity, be stored in ALL CAPS is only adding insult to injury :-)

Alternatively, you can store "fancy" data, but canonise it on-the-fly with SQL built-ins such as UPPER, SOUNDEX etc: this assumes unindexed (linear) search, and is only a good solution where storage space is expensive but query time is cheap.

Looked at another way, we are trying to reconcile two views of "equality": in the real world, clerks are smart at recognising that two forms of a name or address (etc) are equivalent, whereas an SQL database isn't: it

  • has only a few built-in equality operators and functions, e.g.
     SELECT ...
     WHERE  SOUNDEX(ENAME) = SOUNDEX('SMYTHE')

  • has no provision for defining new operators/functions
  • cannot (?) use its built-in operators in creation of an index, e.g.

     CREATE INDEX ... ON EMP( SOUNDEX(ENAME), ... ) Customers require efficient query evaluation, and efficient query evaluation requires indexes, and indexes require purely *literal* equality, not the *literate* equality inherent in the application. It's a wonder there are so many satisfied database customers out there :-)

(please don't say that OODBs are the solution unless you can prove it :-)


  __   __    Paul Singleton (Dr)           JANET: paul_at_uk.ac.keele.cs
 |__) (__    Computer Science Dept.        other: paul_at_cs.keele.ac.uk
 |  .  __).  Keele University, Newcastle,    tel: +44 (0)782 583477
          Staffs ST5 5BG, ENGLAND         fax: +44 (0)782 713082
Received on Fri Mar 04 1994 - 17:09:26 CET

Original text of this message