c.d.theory lexicon overview

From: Senny <sennomo_at_hotmail.com>
Date: Thu, 22 Apr 2004 07:39:47 GMT
Message-ID: <7FKhc.1320$17.155177_at_news1.epix.net>


Having observed this discussion for a while, and having been prodded by one of its participants, I've decided to throw in my two cents. I have a rather different take on the situation from most programmers, so maybe I can help with a little linguistic analysis.

WARNING: Much linguistic pedantry follows.

PRESENT PROBLEM: Database theory lacks a precise terminology.

ROOT PROBLEM: Computer science lacks a precise terminology. This is the real problem we have to tackle if we wish to fix the specific problem of database theory discussion.

PAINFUL EFFECTS:

  1. Much confusion results from this lack of precision, which really annoys us all from time to time. Countless hours and thousands, possibly millions, of dollars are lost worldwide every year dealing with this confusion. (I'm serious.) I'd rather spend time getting a project done than arguing over the meaning of 'object'.
  2. Lack of unified terminology results in different programming environments assigning distinct meanings to the same term. This complicates making those environments work together.
  3. English may be the primary language of computing, but it is not the only one. Other languages usually adopt our words (with phonological/graphological adjustments) or translate them to create new terms. This problem is really messy, but since most people don't care too much about it, I won't go beyond mentioning it.

ASPECTS OF THE PROBLEM:

  1. Semantic overloading (Homonymity): Many word-forms we use have too many meanings. _function_ comes to mind (see the sketch after this list).
  2. Semantic overlap (Synonymity and near-synonymity): To some people, 'domain', 'class', and 'type' mean the same thing; to others, they mean three distinct things; to most, they are different yet have overlapping meanings.
  3. Semantic fuzziness: Some words sort of have one meaning, but nobody is quite sure what that meaning is. 'object' is a prime example. (Compare 'object' to 'pornography'--in both cases, many would say, "I can't tell you what it is, but I know it when I see it.")
  4. Morphological inconsistency: As James L. Ryan hinted at in his "Grammatical Inconsistencies" post, we say "join", "projection", "intersection" and "union" as nouns and "join", "project", "intersect", and ?"union" as verbs. Loosely speaking, in English, we have a marked tendency to use nouns as verbs and adjectives, and verbs as nouns, adjectives as nouns and verbs, etc.
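
To make item 1 concrete, here is a minimal Python sketch (the names are mine, purely illustrative). The single form _function_ covers at least two genuinely different lexemes: a mathematical mapping from inputs to outputs, and a procedure executed for its side effects.

    # Sense 1: 'function' as a mathematical mapping -- pure, no side
    # effects; the same input always yields the same output.
    def square(x):
        return x * x

    # Sense 2: 'function' as a procedure -- it exists only for its side
    # effect (I/O here) and returns nothing meaningful. Mathematically,
    # it is not a function at all.
    def greet(name):
        print("Hello, " + name)

    result = square(4)   # a value: 16
    greet("Senny")       # no value worth keeping; only the printed line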

PROOF THAT THIS PROBLEM DOES NOT HAVE TO EXIST: Some scientific fields, namely biology and medicine, have very well-defined terminologies. While there may be some confusion in their fields, it is very limited. Our lives as programmers would be easier if we could attain that level of clarity.

GOAL: The goal is to come as close as possible to having a one-to-one mapping between word-forms and word-meanings (lexemes), with allowances for morphological variants like plurals. Since I don't have italics in plain text, I'll mark forms by surrounding them with underscore characters. Lexemes are in single quotes. For example, the forms _find_ and _found_ are forms of the lexeme 'find'; likewise, _tuple_ and _tuples_ are forms of the lexeme 'tuple'. When you lack a one-to-one mapping, you have homonymity and/or synonymity. Homonymity is worse than synonymity. So, we want no homonyms and as few synonyms as possible.
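
The goal can even be checked mechanically. Below is a small Python sketch of the idea--the sample (form, lexeme) pairs are my own illustrative guesses, not an authoritative lexicon: build the mapping in both directions and flag every violation of one-to-oneness.

    from collections import defaultdict

    # Toy (form, lexeme) pairs; the sample senses are illustrative only.
    entries = [
        ("function",  "mathematical mapping"),
        ("function",  "callable procedure"),
        ("tuple",     "row of attribute values"),
        ("field",     "component of a record"),
        ("attribute", "component of a record"),
    ]

    form_to_lexemes = defaultdict(set)
    lexeme_to_forms = defaultdict(set)
    for form, lexeme in entries:
        form_to_lexemes[form].add(lexeme)
        lexeme_to_forms[lexeme].add(form)

    # Homonymity (the worse problem): one form, several lexemes.
    homonyms = {f: ls for f, ls in form_to_lexemes.items() if len(ls) > 1}
    # Synonymity: one lexeme, several forms.
    synonyms = {l: fs for l, fs in lexeme_to_forms.items() if len(fs) > 1}

    print("homonyms:", homonyms)   # flags 'function'
    print("synonyms:", synonyms)   # flags 'component of a record'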

HOW TO APPROACH THE GOAL: There is no absolutely right way to solve the problem. However, here are my starting suggestions.

  1. Weed out our current vocabulary:
     a. Kill-file for serious offenders: Some forms are so ambiguous that they have to be removed from the lexicon. Here are just a few: _object_, _function_, _attribute_, _entity_, _domain_. These forms represent so many different things that even in limited contexts they can still be confusing. Using these words in computer science is kind of like a biologist using the word "creature".
     b. Reconsideration of minor offenders: Some forms are somewhat confusing, but do not cause *too* much time-wasting debate, e.g., _relation_, _pointer_, _process_, _thread_, _operating system_, _network_, _array_, _byte_, _window_, _drive_, _binary_, _null_. In most cases, context suffices to remove ambiguity. We can keep such forms, but it would be nice if we could find better ones. Probably at least some of the meanings of each of these forms should be given new forms.
     c. Identification of "good" forms: Some current forms are unambiguous enough to keep around, e.g., _tuple_, _socket_, _integer_, _bit_, _octet_, _signed_, _modem_, _processor_.
  2. Assign forms to replace the ones we got rid of:
     a. Take unused words from English and apply them in a specific sense. Most computing terms in fact come from everyday English. This approach seems convenient at first, but usually causes more confusion in the long run--after all, several times over, somebody thought that assigning the form _function_ to yet another lexeme was a good idea.
     b. Derive new forms from accepted forms. Numerous current terms were created this way: _unsigned integer_, _bit_ (from _binary digit_), _byte_ (from _bit_), _nybble_ (from _byte_), _varchar_ (from _variable character_), _modem_ (from _modulator/demodulator_), _download_ (from _down_ and _load_), etc. Many acronyms which later became accepted as terms in their own right come from this approach, e.g., _FTP_, _DNS_, _MIME_, _SQL_, _RAM_, _grep_, _RFC_, etc. The success of this approach depends on the clarity of the base forms used to create the derived forms.
     c. Adopt words from other languages. English is full of adopted words, mostly from Old French (thanks to the Normans). More recently, science and philosophy have introduced lots of Latin and Greek words. Here are just a few such words we use in computing: 'cache' (from French), 'integer' (from Latin), 'predicate' (from Latin), 'algorithm' (from Arabic), 'algebra' (from Arabic), 'calculus' (from Latin). We tend to adopt words when we can't find one we already have that quite fits.
     d. Make up a new form basically out of nowhere. Such forms usually come from some form of psychological association with an existing idea. Here are a few: 'dongle', '404' (from the HTTP protocol), 'baud' (based on the name Baudot), 'bogon' (based on 'bogus'), 'frob', 'kluge' (perhaps from German or Polish, but not in its current sense), 'munge', 'boolean' (based on the name Boole), 'gaussian' (based on the name Gauss), 'spam', 'swizzle'. Such words are rare, because even though everybody invents new words from time to time, they rarely catch on. After all, if I decide that the procedure sense of _function_ should now be called _meklor_, who would go along with it?

BARRIERS TO SOLVING THE PROBLEM:

  1. Tradition: We inherit words from people already using them, whether they're good or not. Computer science (and especially database theory) inherited much from mathematics, a field with a fuzzy, context-dependent terminology (mostly evident in its notation). There has never been a major concerted effort to clarify our vocabulary. We have thousands of terms in today's computing lexicon, and at least hundreds (including some of the most common ones!) are problematic. How can we fight this?
  2. Ad-hockery: When people come up with a new idea, they often hastily assign some label to it without much consideration for the future. Worse yet, they often usurp an existing form. This is especially the case in our field. For example, a Java 'attribute' is *nothing like* a C# 'attribute'; it corresponds to a C# 'field' (see the sketch after this list). (Aside: If I see one more new use of the forms _attribute_ or _function_, I might puke.) Even if we successfully combat tradition, we have to beware of ad-hockery, or the problem will reappear.
  3. Conceptual confusion: Computing is so new and so constantly changing that in many senses, we don't know what we're doing. It's hard to put a label on a concept that we can't put our finger on, so to speak. Context is a major issue. We have layers of abstraction upon abstraction. The layering effect gives rise to differentiation between terms like 'primitive type' and 'abstract data type', or 'type' and 'class'. A variable's 'value' can be another 'variable'. Are 'type' and 'domain' the same thing? If so, how? I could go on forever on this point, so I'll stop here.
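
To see how badly _attribute_ has been usurped, consider a minimal Python sketch (the names are mine, purely illustrative; Python's own usage differs again from both Java's and C#'s):

    import xml.etree.ElementTree as ET

    class Point:
        def __init__(self, x):
            # 'attribute' sense 1 (Python): an instance member -- what
            # Java folk loosely call an "attribute" and C# calls a "field".
            self.x = x

    # 'attribute' sense 2 (XML): markup metadata attached to an element --
    # much closer in spirit to what C# calls an "attribute".
    elem = ET.fromstring('<point x="3"/>')
    print(elem.attrib["x"])

    # 'attribute' sense 3 (Python again): arbitrary data attached to a
    # function object at runtime.
    def f():
        pass
    f.cost = 10
    print(f.cost)

One form, three unrelated lexemes, inside a dozen lines of code.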

CONCLUDING REMARKS: I hope I have adequately explained the scope of the problem. It's bigger than most people realize. I don't know if it will be conquered, but I know it can be. After all, physicians didn't always have the precise terminology they have today. There are many other details I could discuss, but they're not necessary for an overview of the problem.

For those of you who find the task of restructuring English computing terminology too daunting but still crave clarity, you can learn Lojban (http://www.lojban.org/). I find Lojban a bit too computerish for my human language needs, and nobody I know speaks it, so I'll stick with attempting to improve English.

--Senny