Having observed this discussion for a while, and having been prodded by one
of its participants, I've decided to throw in my two cents. I have a
rather different take on the situation from most programmers, so maybe I
can help with a little linguistic analysis.
WARNING: Much linguistic pedantry follows.
PRESENT PROBLEM:
Database theory lacks a precise terminology.
ROOT PROBLEM:
Computer science lacks a precise terminology. This is the
real problem we have to tackle if we wish to fix the specific problem of
database theory discussion.
PAINFUL EFFECTS:
- Much confusion results from this lack of precision, which really annoys
us all from time to time. Countless hours and thousands, possibly
millions, of dollars per year are lost on dealing with this confusion
worldwide. (I'm serious.) I'd rather spend time getting a project done
than arguing over the meaning of 'object'.
- Lack of unified terminology results in different programming environments
having distinct meanings of the same term. This complicates the joining of
different environments.
- English may be the primary language of computing, but it is not the only
one. Other languages usually adopt our words (with
phonological/graphological adjustments) or translate our words to create
new terms. This problem is really messy, but since most people don't care
too much, I won't go beyond mentioning it.
ASPECTS OF THE PROBLEM:
- Semantic overloading (Homonymity): Many word-forms we use have too many
meanings. _function_ comes to mind.
- Semantic overlap (Synonymity and near-synonymity): To some people,
'domain', 'class' and 'type' mean the same thing; to others, they mean three
distinct things; to most, they are different yet have overlapping meaning.
- Semantic fuzziness: Some words sort-of have one meaning, but nobody is
quite sure what that meaning is. 'object' is a prime example. (Compare
'object' to 'pornography'--in both cases, many would say, "I can't tell you
what it is, but I know it when I see it.")
- Morphological inconsistency: As James L. Ryan hinted at in his
"Grammatical Inconsistencies" post, we say "join", "projection",
"intersection" and "union" as nouns and "join", "project", "intersect", and
?"union" as verbs. Loosely speaking, in English, we have a marked tendency
to use nouns as verbs and adjectives, and verbs as nouns, adjectives as
nouns and verbs, etc.
PROOF THAT THIS PROBLEM DOES NOT HAVE TO EXIST:
Some scientific fields, namely biology and medicine, have very well-defined
terminologies. While there may be some confusion in their fields, it is
very limited. Our lives as programmers would be easier if we could attain
that level of clarity.
GOAL:
The goal is to come as close as possible to having a one-to-one mapping
between word-forms and word-meanings (lexemes), with allowances for
morphological variants like plurals. Since I don't have italics in plain
text, I'll mark forms by surrounding a form with underscore characters.
Lexemes are in single quotes. For example, the forms _find_ and _found_
are forms of the lexeme 'find'; likewise, _tuple_ and _tuples_ are forms of
the lexeme 'tuple'. When you lack a one-to-one mapping, you have
homonymity and/or synonymity. Homonymity is worse than synonymity. So, we
want no homonyms and as few synonyms as possible.
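To make the goal concrete, here is a minimal Python sketch of checking a
vocabulary against it. The glossary entries, the sense labels, and the
function names are all made up for illustration; they are assumptions, not
an agreed-upon lexicon. The sketch also assumes that morphological variants
(plurals and the like) have already been reduced to a single form.

```python
from collections import defaultdict

# Hypothetical glossary of (form, lexeme) pairs.
# The sense labels are illustrative only.
GLOSSARY = [
    ("function", "math-mapping"),
    ("function", "procedure"),
    ("procedure", "procedure"),
    ("tuple", "tuple"),
]


def find_homonyms(pairs):
    """Return each form that maps to more than one lexeme."""
    senses = defaultdict(set)
    for form, lexeme in pairs:
        senses[form].add(lexeme)
    return {form: sorted(lexemes)
            for form, lexemes in senses.items() if len(lexemes) > 1}


def find_synonyms(pairs):
    """Return each lexeme that is expressed by more than one form."""
    forms = defaultdict(set)
    for form, lexeme in pairs:
        forms[lexeme].add(form)
    return {lexeme: sorted(names)
            for lexeme, names in forms.items() if len(names) > 1}


def is_one_to_one(pairs):
    """True when the glossary has neither homonyms nor synonyms."""
    return not find_homonyms(pairs) and not find_synonyms(pairs)
```

With this toy glossary, _function_ is reported as a homonym (two senses),
and the 'procedure' sense is reported as having two synonymous forms, so the
glossary is not one-to-one.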
HOW TO APPROACH THE GOAL:
There is no absolutely right way to solve the problem. However, here are my
starting suggestions.
- Weed out our current vocabulary.
- Kill-file for serious offenders: Some forms are so ambiguous, they
have to be removed from the lexicon. Here are just a few: _object_,
_function_, _attribute_, _entity_, _domain_. These forms represent so many
different things that even in limited contexts they can still be confusing.
Using these words in computer science is kind of like a biologist using the
word "creature".
- Reconsideration of minor offenders: Some forms are somewhat confusing,
but do not cause *too* much time-wasting debate, e.g., _relation_,
_pointer_, _process_, _thread_, _operating system_, _network_, _array_,
_byte_, _window_, _drive_, _binary_, _null_. In most cases, context
suffices to remove ambiguity. We can keep such forms, but it would be nice
if we could find better ones. Probably at least some of the meanings of
each of these forms should be provided new forms.
- Identification of "good" forms: Some current forms are unambiguous
enough to keep around, e.g., _tuple_, _socket_, _integer_, _bit_, _octet_,
_signed_, _modem_, _processor_.
- Assign forms to replace the ones we got rid of.
- Take unused words from English and apply them in a specific sense. Most
computing terms in fact come from everyday English. This approach seems
convenient at first, but usually causes more confusion in the long
run--after all, several times over, somebody thought that assigning the form
_function_ to yet another lexeme was a good idea.
- Derive new forms from accepted forms. Numerous current terms were
created this way: _unsigned integer_, _bit_ (from _binary digit_), _byte_
(from _bit_), _nybble_ (from _byte_), _varchar_ (from _variable
character_), _modem_ (from _modulator/demodulator_), _download_ (from
_down_ and _load_), etc. Many acronyms which later became accepted as
terms in their own right come from this approach, e.g., _FTP_, _DNS_,
_MIME_, _SQL_, _RAM_, _grep_, _RFC_, etc. The success of this approach
depends on the clarity of the base forms used to create the derived forms.
- Adopt words from other languages. English is full of adopted words,
mostly from Old French (thanks to the Normans). More recently, science and
philosophy have introduced lots of Latin and Greek words. Here are just a
few such words we use in computing: 'cache' (from French), 'integer' (from
Latin), 'predicate' (from Latin), 'algorithm' (from Arabic), 'algebra'
(from Arabic), 'calculus' (from Latin). We tend to adopt words when we
can't find one we already have that quite fits.
- Make up a new form basically out of nowhere. Such forms usually come
from some form of psychological association with an existing idea. Here
are a few: 'dongle', '404' (from the HTTP protocol), 'baud' (based on the
name Baudot), 'bogon' (based on 'bogus'), 'frob', 'kluge' (perhaps from
German or Polish, but not in its current sense), 'munge', 'boolean' (based
on the name 'Boole'), 'gaussian' (based on the name 'Gauss'), 'spam',
'swizzle'. Such words are rare, because even though everybody invents new
words from time to time, they rarely catch on. After all, if I decide that
the procedure kind of _function_ should now be called _meklor_, who would
go along with it?
BARRIERS TO SOLVING THE PROBLEM:
- Tradition: We inherit words from people already using them, whether
they're good or not. Computer science (and especially database theory)
inherited much from mathematics, a field with a fuzzy, context-dependent
terminology (mostly evident in its notation). There has never been a major
concerted effort to clarify our vocabulary. We have thousands of terms in
today's computing lexicon, and at least hundreds (including some of the
most common ones!) are problematic. How can we fight this?
- Ad-hockery: When people come up with a new idea, they often hastily
assign some label to it without much consideration for the future. Worse
yet, they often usurp an existing form. This is especially the case in our
field. For example, a Java 'attribute' is *nothing like* a C# 'attribute'
(it is closer to a C# 'field'). (Aside: If I see one more new use of the forms
_attribute_ or _function_, I might puke.) Even if we successfully combat
tradition, we have to beware of ad-hockery, or the problem will reappear.
- Conceptual confusion: Computing is so new and so constantly changing that
in many senses, we don't know what we're doing. It's hard to put a label
on a concept that we can't put our finger on, so to speak. Context is a
major issue. We have layers of abstraction upon abstraction. The layering
effect gives rise to differentiation between terms like 'primitive type' and
'abstract data type' or 'type' and 'class'. A variable's 'value' can be
another 'variable'. Are 'type' and 'domain' the same thing? If so, how? I
could go on forever on this point, so I'll stop here.
CONCLUDING REMARKS:
I hope I have adequately explained the scope of the problem. It's bigger
than most people realize. I don't know if it will be conquered, but I know
it can be. After all, physicians didn't always have the precise
terminology they have today. There are many other details I could discuss,
but they're not necessary for an overview of the problem.
For those of you who find the task of restructuring English computing
terminology too daunting but still crave clarity, you can learn Lojban
(http://www.lojban.org/). I find Lojban a bit too computerish for my human
language needs, and nobody I know speaks it, so I'll stick with attempting
to improve English.
--Senny
Received on Thu Apr 22 2004 - 02:39:47 CDT