Re: Clean Object Class Design -- What is it?

From: Jim Melton <Jim.Melton_at_Technologist.com>
Date: Sun, 02 Sep 2001 07:40:17 GMT
Message-ID: <3B91E25D.32112D29_at_Technologist.com>



Bob Badour wrote:

> Jim Melton wrote in message <3B9085E5.A5547CC1_at_Technologist.com>...
> >
> >Bob Badour wrote:
> >
> >> >The only way to model the associations as value-based would be to
> >> >create synthetic IDs.
> >>
> >> Why? Do the entities you manipulate have no logical identity?
> >
> >One of the fundamental concepts of object technology is that objects have
> >intrinsic identity independent of any attribute values they may possess at
> >any point in time.
>
> Please use well-defined terms. Object variables (instances) have intrinsic
> identity as do all variables. However, this does not help users disambiguate
> similar yet separate variables.

An object is an object. A "thing" under consideration. This concept of object identity is independent of computers and databases. "The red-haired woman whom I saw cross the street last Tuesday" is an object. "The block of data residing in sector 2057, track 32 of disk B" is an object. "The complete financial record of Bob Badour" is an object.

Frequently, users *do* have a hard time disambiguating similar "variables". (Can you tell twins apart?). That's why query by example is a powerful user interface technique. When the user says, "That one" he is using a "pointer".

> >This notion of intrinsic identity is reinforced in object
> >databases by our pointers to objects.
>
> Yes, by pointers to variables. But this does not help users disambiguate
> similar values stored in separate variables. Are you saying that your
> entities have not logical identity? That users cannot disambiguate similar
> entities?

Often not. Logical identity is independent of the value of temporally changing attributes, yes? But if all I know about something is "The red-haired woman whom I saw cross the street last Tuesday", there is not sufficient information to represent logical identity, unless I make up an arbitrary identifier. There is no reason to assume that another red-haired woman crossing the street next Tuesday will necessarily be the same woman.

However, if I want to accumulate evidence in support of a hypothesis, all information, no matter how sketchy can be of value. A way to reference the data that recognizes its intrinsic identity is required.

> >In a relational database, the paradigm is
> >always to copy the data out of the database, perform some manipulations (as
> >required), then find the appropriate record(s) again and modify whatever
> >values are changed.
>
> There goes that word again. I am convinced that you confuse yourself with
> pretentious, nebulous terminology. Instead of calling everything a paradigm,
> try identifying exactly what you want to say. Instead of calling everything
> an object, try identifying exactly what you want to say.

This is why it's really hard to talk to you, Bob. I said exactly what I wanted to say.

From Merriam-Webster (http://www.m-w.com)
paradigm: 1 : EXAMPLE, PATTERN; especially : an outstandingly clear or typical example or archetype

object: 4 : a thing that forms an element of or constitutes the subject matter of an investigation or science

> You have it backward. ODBMSes require the above process, but relational
> databases do not. One can send a set-oriented command to the RDBMS that
> manipulates data entirely within the DBMS process.

Really. Can I send a set-oriented command to the RDBMS to find a least-squares path through a series of points? Can I send a set-oriented command to the RDBMS to find a statistical probability that two measurements (including their error distributions) represent the same event? Can I send a set-oriented command to the RDBMS to predict the likely next state of a Markov model?

Believe it or not, sometimes people want to apply algorithms to the result of a query. In that (extremely common) case, the relational (or SQL if you can prove otherwise) *paradigm* is to copy the data from a result table into a data structure the algorithm can use. Object databases do not require this extra step.
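To make the copy step concrete, here is a minimal sketch in Python using the standard sqlite3 module. The table name `track`, the sample points, and the helper `fit_line` are made up for illustration; the only point is that the rows leave the result table and land in an ordinary list before the algorithm can touch them.

```python
import sqlite3


def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical table of measured points (assumption, not from the thread).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE track (x REAL, y REAL)")
conn.executemany(
    "INSERT INTO track VALUES (?, ?)",
    [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)],
)

# The copy step: the result table is materialized as a Python list of tuples.
points = conn.execute("SELECT x, y FROM track ORDER BY x").fetchall()

# Only now can the application-side algorithm run.
slope, intercept = fit_line(points)
print(slope, intercept)
```

A straight-line fit is simple enough that it could also be pushed into SQL aggregates; the sketch stands in for algorithms that cannot.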

> >In the object database, this data copying step is eliminated.
>
> Actually, in the object database, this data copying step is required in
> order to make the data available to the application programming language for
> data manipulation. It is not required in an RDBMS because relational
> databases have their own data manipulation language.

You are just creating a different programming environment out of your (theoretical) RDBMS. If all processing occurs in the context of the DBMS, the system cannot scale well and the DBMS becomes a bottleneck. Unless your RDBMS data manipulation language can support all the kinds of algorithms that are coded in other languages, this is (at best) a red herring.

> >The
> >database becomes much less of an external entity (conceptually) and data is
> >manipulated (conceptually) directly.
>
> This is simply untrue. Conceptually, one must control persistence, and the
> term persistence, itself, implies a copy of data.

No, persistence merely implies that the data exists outside the scope of program execution. How this is accomplished is a *physical*, implementation detail. Remember bubble memory? It didn't lose state even after power was removed. Persistence without data copy.

> >I say conceptually, because obviously as data is moving to and from disk
> >there is copying going on. However, an object reference allows me to
> >manipulate a persistent object directly without regard to this copying.
>
> One cannot ignore the copying going on. At a conceptual level, the
> programmer must still specify which object variables get copied into and out
> of the application programme's memory. At a conceptual level, the programmer
> must still specify when and how to retrieve values from the database.

One certainly can. It is this point exactly that I was making above. Because the ODBMS makes a persistent object reference *look* exactly like any other programming language variable (pointer, if you wish), the application programmer has no concern for the copying of object variables into/out of memory.

Conceptually, the programmer must query the database for objects of interest, but this is not a concept unique to persistent data. Often transient collections of things will be queried to find the subset of interest.

Conceptually, the programmer must be cognizant of transaction boundaries and transaction semantics. I can't think of any way to avoid this unless you give up the concept of ACID transactions (including rollback). That loss would be too great to bear, so the burden for transaction semantics must continue to rest with the programmer.

> >(By the way, I consider this whole difference in paradigm with regard to
> >explicit copying into/out of the database as one of the key
> >philosophical/architectural differences between object databases and
> >relational/SQL databases)
>
> Paradigm: A set of assumptions, concepts, values, and practices that
> constitutes a way of viewing reality for the community that shares them,
> especially in an intellectual discipline.

My dictionary had a slightly different definition (see above). Or, if you prefer, try:

3 : a philosophical and theoretical framework of a scientific school or discipline within which theories, laws, and generalizations and the experiments performed in support of them are formulated

Your point?

> The object oriented community have false assumptions, nebulous concepts,
> warped values and arbitrary practices. The relational community have
> explicit assumptions, precisely defined concepts, principled values and
> reasoned practices.

My, you are painting with an awfully broad brush tonight. All of this because you chose to react to my choice of words instead of the point I was making?

> I don't think physical copying has much to do with the differences in the
> "paradigms".

Let me try to be more precise for you. Physical copying (or even logical copying?) is a fundamental difference between programming with a result-table (or cursor) database and an object database. When I say "SELECT A FROM FOO" I must bind the returned value(s) for A to application-space variables before I can use them. Furthermore, if my algorithm ends up changing the value of A, I must then issue an explicit "UPDATE FOO SET A = newvalue" to ensure the change is propagated to persistent memory. Note that before this update step, the changed value of A is visible only to other processing in application space, while the database still holds the old value, so my application does not have a coherent view of the data space.

In an ODBMS, the same "SELECT from FOO" will return me object reference(s) to FOO objects. If my algorithm needs the A value, it simply uses it [ print obj->a() ]. If it needs to update the value, it does it directly [ obj->a( newvalue ) ]. The data space is consistent within my transaction (a second SELECT statement will automatically see the updated value of A), but not propagated to other transactions until the commit boundary.

Thus, the programmer does not write any code to copy values into or out of application space. The PATTERN (paradigm) of table programming is copying data. The PATTERN (paradigm) of object programming is not.
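The two patterns can be put side by side in a short Python sketch. The first half uses the standard sqlite3 module; the second half uses `ToyStore` and `Foo`, which are hypothetical stand-ins written for this illustration and not the API of any real ODBMS. The toy store still copies under the hood; what changes is that the application code never writes the copy.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foo (id INTEGER PRIMARY KEY, a INTEGER)")
conn.execute("INSERT INTO foo VALUES (1, 10)")
conn.commit()

# Result-table style: bind the value, change it, then write it back by hand.
(a,) = conn.execute("SELECT a FROM foo WHERE id = 1").fetchone()
a = a + 5  # until the UPDATE runs, the database still holds 10
conn.execute("UPDATE foo SET a = ? WHERE id = 1", (a,))
conn.commit()


class Foo:
    """Hypothetical persistent object: writes to `a` mark it dirty."""

    def __init__(self, store, oid, a):
        self._store, self._oid, self._a = store, oid, a

    @property
    def a(self):
        return self._a

    @a.setter
    def a(self, value):
        self._a = value
        self._store.dirty.add(self._oid)


class ToyStore:
    """Toy identity map (assumption): one in-memory object per stored row."""

    def __init__(self, conn):
        self.conn, self.cache, self.dirty = conn, {}, set()

    def get(self, oid):
        if oid not in self.cache:
            (a,) = self.conn.execute(
                "SELECT a FROM foo WHERE id = ?", (oid,)
            ).fetchone()
            self.cache[oid] = Foo(self, oid, a)
        return self.cache[oid]

    def commit(self):
        # Changes reach the database only at the commit boundary.
        for oid in self.dirty:
            self.conn.execute(
                "UPDATE foo SET a = ? WHERE id = ?", (self.cache[oid].a, oid)
            )
        self.conn.commit()
        self.dirty.clear()


# Object-reference style: read and update through the reference itself.
store = ToyStore(conn)
obj = store.get(1)
obj.a = obj.a + 5
same = store.get(1)  # a second lookup sees the updated object
store.commit()
final_a = conn.execute("SELECT a FROM foo WHERE id = 1").fetchone()[0]
```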

> >This whole concept of intrinsic identity is extremely critical in my domain
> >because often we do NOT know what attribute value could be used to uniquely
> >identify an object. Sometimes, all we know is that there is an object
> >observed or inferred through some phenomenology. Over time, we hope to
> >discover more of the attribute values attributable to that object, but in
> >the mean time it must be distinct from all other objects under
> >consideration.
>
> How do the users of your system identify the distinct instances under
> consideration?

Different ways in different contexts. QBE is an extremely powerful UI technique.

> >Object databases handle this representation of uniqueness with object
> >references (commonly referred to as OIDs).
>
> Using pointers, yes, I know that. We already know what a disaster it is to
> expose pointers to users. If you do not expose OID to users, how do users
> identify unique instances?

See, I don't get your point. An OID is not a pointer. In the database system I use, an OID has a native representation (4 16-bit numbers) and a stringified representation ( #dd-cc-pp-ss ). Neither of these is a "pointer" any more than a rowID is a pointer. Yet, because of the operator overloading in OO languages, they can appear as a pointer to the programmer.
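A rough sketch of such an identifier in Python follows. The post gives only the shape (four 16-bit numbers, rendered as #dd-cc-pp-ss), so everything else here is an assumption: the field names simply reuse the letters from the string form, the fields are printed in decimal, and packing into a single 64-bit integer is one possible native layout, not necessarily the vendor's.

```python
from typing import NamedTuple


class Oid(NamedTuple):
    """Hypothetical object identifier made of four 16-bit fields."""

    dd: int
    cc: int
    pp: int
    ss: int

    def pack(self) -> int:
        """Pack the four fields into one 64-bit integer (assumed layout)."""
        value = 0
        for part in self:
            if not 0 <= part <= 0xFFFF:
                raise ValueError("each field must fit in 16 bits")
            value = (value << 16) | part
        return value

    @classmethod
    def unpack(cls, value: int) -> "Oid":
        """Inverse of pack()."""
        return cls(*((value >> shift) & 0xFFFF for shift in (48, 32, 16, 0)))

    def __str__(self) -> str:
        """Stringified form, assuming decimal fields joined by hyphens."""
        return "#" + "-".join(str(part) for part in self)

    @classmethod
    def parse(cls, text: str) -> "Oid":
        """Inverse of str()."""
        return cls(*(int(part) for part in text.lstrip("#").split("-")))


oid = Oid(3, 7, 120, 5)
print(str(oid), hex(oid.pack()))
```

Nothing in this value is a memory address; it is a plain value that compares and round-trips like a row ID would.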

Again, though, user interfaces are written to facilitate users doing their jobs. When a user of an on-line ordering system orders a new printer, he does NOT copy the SKU number into an order-entry text field. He clicks on a picture of the product. The user is POINTING to the data of interest. Why can't software do the same thing?

> >SQL databases can generate synthetic
> >IDs such as rowID (that are virtually the same as OIDs).
>
> Except that they are symmetric and do not require navigation.

Pardon me for not being fully cognizant of your vocabulary, but how are they "symmetric" (and how are OIDs not)?

While they do not *require* navigation, they are commonly used for precisely that by the database. But I guess I could argue that OIDs do not *require* navigation. Again, though, they are not of much use without it.

> >However, if there are
> >no attributes that can be used to create a distinct relation, how would a
> >relational database handle this concept of intrinsic identity?
>
> Identity is intrinsic to variables. Relation variables are uniquely
> identified by name. Tuple variables are uniquely identified by relation name
> and key value. Object variables are uniquely identified by relation name,
> key value and column name.

And if there is no unique key value, you manufacture one?

> >> >> >I find attribute joins problematic, especially where they force
> >> >> >synthetic IDs into the data model
> >> >>
> >> >> Do you mean you would prefer not to have any form of logical
> >> >> identifier? You will find such a lack much more problematic.
> >> >
> >> >I find that a normalized model does not usually consist of stand-alone
> >> >entities. For example (again), a contact database should have multiple
> >> >phone numbers for a contact.
> >>
> >> And the user should have some method for identifying each of these phone
> >> numbers. Home, work, fax, cell, emergency, alternate office on tuesdays
> >> and thursdays...
> >
> >But you are incomplete. You need a field (pardon me for not using the right
> >term) to join the phone number with the contact record.
> >Since "logically" the contact is uniquely identified by the sum of the
> >fields (name, address, title, company, etc.), in practice a synthetic ID is
> >created to represent the unique identity of the contact.
>
> Logically, a contact is identified by some number. The sum of the other
> fields need not be unique and the user must have some method to tell them
> apart. Organizations assign numbers to all kinds of things: employees,
> customers, license holders, benefit recipients, dependents, departments,
> accounts, bins, locations. They were doing that long before computers ever
> came along.

Why? Why is a contact "logically" identified by some number? Do you assign numbers to all the people in your address book? Do you refer to them by number? This is utter nonsense. Arbitrary numbers are implementation artifacts of systems that cannot properly represent intrinsic object identity.

For example, a telephone number is an arbitrary identifier (although more closely related to a pointer) for a specific end-point in the telephone network. In the early days of telephones, an operator was required to physically connect the incoming call with the end-point by plugging a cable into the appropriate hole. Yet, in many rural communities, the end point was not identified by a number, but "Bill Jones' house". As human operators were completely replaced by machines, machine-readable end-point identifiers were required, hence the phone number. But if I had a way to "gesture" to your entry in my "contact database" and pass a direct end-point (pointer) to my telephone (or to my e-mail program, or to my envelope printer), then arbitrary, synthetic IDs would phase out as archaic relics of an unenlightened past.

> >This synthetic ID is stored in each phone number so
> >that it can be joined back to the contact.
>
> Incorrect, both logically and physically. Logically: An association table
> might expose the relationship between contact id and phone number.
> Physically: An RDBMS might store the phone number with the contact fields
> using juxtaposition to identify the contact, but if it does so, it exposes
> the association to the user using the contact identifier and phone number.

If you wish to design your database such that all associations are through a distinct "association table", that's fine. Object modelling has "link classes" that perform the same purpose. But that is a heavyweight solution for simple associations that are commonly modelled by repeating foreign key information in the phone number table.

All you've done is require two distinct identifiers (and allowed phone number to be one) so they can be stored in your association table. Of course, each of these association entries will require a unique identifier...
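For readers who want to see the two relational layouts under discussion, here is a small Python/sqlite3 sketch. All table and column names (`contact`, `phone`, `phone_number`, `contact_phone`) and the sample rows are invented for illustration. Variant 1 repeats the contact's key in the phone table; variant 2 routes the association through a separate table. The area-code query is the one quoted elsewhere in this thread.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE contact (contact_id INTEGER PRIMARY KEY, name TEXT);

-- Variant 1: the contact's key is repeated in each phone row.
CREATE TABLE phone (
    contact_id INTEGER REFERENCES contact(contact_id),
    kind TEXT,
    number TEXT
);

-- Variant 2: phone numbers stand alone; a separate table holds the pairs.
CREATE TABLE phone_number (number TEXT PRIMARY KEY, kind TEXT);
CREATE TABLE contact_phone (
    contact_id INTEGER REFERENCES contact(contact_id),
    number TEXT REFERENCES phone_number(number),
    PRIMARY KEY (contact_id, number)  -- in this sketch the pair is the key
);
""")

contacts = [(1, "Contact A"), (2, "Contact B")]
phones = [
    (1, "home", "808 555 0100"),
    (1, "cell", "303 555 0101"),
    (2, "home", "210 555 1212"),
]
conn.executemany("INSERT INTO contact VALUES (?, ?)", contacts)
conn.executemany("INSERT INTO phone VALUES (?, ?, ?)", phones)
conn.executemany(
    "INSERT INTO phone_number VALUES (?, ?)", [(n, k) for _, k, n in phones]
)
conn.executemany(
    "INSERT INTO contact_phone VALUES (?, ?)", [(c, n) for c, _, n in phones]
)

# "Find all the contacts whose home phone is in area code 808", both ways.
via_foreign_key = conn.execute("""
    SELECT c.name FROM contact c
    JOIN phone p ON p.contact_id = c.contact_id
    WHERE p.kind = 'home' AND p.number LIKE '808 %'
""").fetchall()

via_association = conn.execute("""
    SELECT c.name FROM contact c
    JOIN contact_phone cp ON cp.contact_id = c.contact_id
    JOIN phone_number pn ON pn.number = cp.number
    WHERE pn.kind = 'home' AND pn.number LIKE '808 %'
""").fetchall()

# "How many home phone numbers do we have?" needs no contact at all.
home_count = conn.execute(
    "SELECT COUNT(*) FROM phone_number WHERE kind = 'home'"
).fetchone()[0]
```

Either way, `contact_id` is exactly the kind of made-up identifier being argued about: it exists only so the rows can find each other.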

> >> >Perhaps each number would include a "type" tag (home,
> >> >cell, etc.). In order to associate this phone information with the
> >> >contact info, either a synthetic ID must be generated or the primary key
> >> >values must be replicated.
> >>
> >> I am not sure I understand your complaint. Are you complaining about
> >> redundant information in the logical view of the data? Pointers are as
> >> redundant, if not more so.
> >
> >A pointer is a physical implementation of a logical concept.
>
> A pointer is a logical exposure of a physical concept (location).

Since the location of a {thing} is a physical concept, I hope we can agree that a pointer is a physical thing. But you have been (uncharacteristically) sloppy at equating object identifiers with pointers. Since the logical concept I was describing is the association between a contact and his phone number(s), using a pointer to implement this association is a physical implementation. I've already described the nature of OIDs in the database I use and they are not at all dissimilar from rowIDs. Yes, the database can use them to hash directly to a specific object, but they are no more pointers than a phone number is a pointer to your phone.

> >"Home phone: 210 555 1212" has no meaning unless it is associated with the
> >person whose phone it is. I believe that coupling is *logically* very tight
> >and that it is reasonable to implement it as a pointer rather than creating
> >synthetic fields upon which to join.
>
> If a user needs to answer the question of "How many home phone numbers do we
> have in our contact database?", the coupling is totally irrelevant.

Let me be more precise. The phone number above has no *semantic* meaning unless it is associated with the person whose phone it is.

> Since the contact has a logical identifier and the phone number has a
> logical identifier, it is reasonable to expose the relationship to the user
> by combining the identifiers.

And you would expose this to the "user" as "Contact ID A473B has (a) phone 210 555 1212". This is the combination of the (il)logical identifiers.

> >> Nothing prevents you from doing that. The relational model only requires
> >> that you allow the user to query the phone numbers as if they are
> >> independent of the contact. To the user, the DBMS must expose the
> >> association between the phone number and the department explicitly using
> >> values regardless of how the DBMS physically establishes the association.
> >
> >The first half I can accommodate. I can query against any object in my
> >object database. The fact that there may be an association (pointer, if you
> >wish) with another object is irrelevant. (To be fair, my particular vendor
> >does NOT support queries across relationships so a query of the form "Find
> >all the contacts whose home phone is in area code 808" would be difficult to
> >accomplish).
>
> And you complain about the logical interface of the relational model... ?

I (honestly) point out a real short-coming with the (real) commercial product with which I program. There is no fundamental reason why this should be so, but it is so and I refuse to play "what if" games. As you are so fond of saying, a failure of commercial products is not a failure of the model.

> >The second part, "the DBMS must *expose* (emphasis mine) the association ...
> >explicitly using values" I don't understand. If there is no *logical* value
> >that identifies the association, how should this exposure take place.
>
> The phone number must have a logical identifier, possibly the phone number
> itself. The contact must have a logical identifier or the users won't be
> able to easily identify contacts.

Synthetic IDs are evil because they carry no semantic content. How often have you mis-dialed a phone number? How many of your credit card or frequent flier numbers do you have memorized? A "logical" model that forces more of these into the interface is flawed.

Information is identified in context. If I were to go to the Washington, D.C. phone book and look up "Bob Badour", I would find 0 or more matches. There is no reason for me to assume that any of these people is the person with whom I have been having this conversation for these weeks. I identify you "uniquely" as the Bob Badour who has been posting in comp.databases.object. I do not create a number to represent you.

> >You seem to be mandating that synthetic IDs be created to be used in a
> >logical join that are not necessary in either the logical or the physical
> >level.
>
> Define synthetic. Unless you advocate a complete lack of logical identity,
> the user will need to have some means to identify contacts and some means to
> identify phone numbers. Use those means.

Logical identity is synonymous with what I called an object's intrinsic identity.

Quite often humans disambiguate by pointing. When you walk through the cafeteria line, you point to the lady with the scoop which jello salad you want, since it is the most precise "identifier". When I call my parents, I hit the speed dialer (I don't remember their phone number). Since users point, it seems that you are advocating pointers :-)

> >The English language has only a very few concepts: noun, verb, adjective,
> >adverb, preposition, conjunction (I may have missed one or two). Yet I don't
> >think anyone would argue that mastering it is simple.
>
> You have missed many concepts, and you have ignored the confounding
> complexity. Much as you ignore the confounding complexity of ODBMS.

Hmmm. If I'm missing all these concepts and ignoring complexity, perhaps it's not so complex after all. Otherwise, shouldn't this complexity be causing me untold grief?

> >Relations may be simple, but that does not mean that their usage may not be
> >exceedingly complex.
>
> Relations and the relational algebra are much simpler than the english
> language. It is true that one can model real-world systems to arbitrary
> levels of complexity with this simple interface. What I don't understand is
> any insistence on adding further needless complexity.
>
> >Cognitive modelling has shown that human beings can only
> >keep a finite number of concepts in active memory.
>
> All the more reason to suggest as simple an interface as possible -- the
> relational model.

You've missed the point. Why does FedEx assign a tracking number to your package? Because identifying "the package that Bob Badour sent to Jim Melton on Sept 1, 2001" is too complex (although it can easily be represented as a relation). People routinely create concepts that may "add complexity to the interface" in order to shield themselves from greater complexity.

> >In order to deal with more
> >complex things, we hide complexity behind abstractions.
>
> Relations are very simple abstractions.

One could represent all data as sequences of name-value pairs. Such data would be extremely simple, but exceedingly complex to work with, because the sequence would be devoid of semantic content.

> >Object classes have interfaces that reflect the complexity that is
> >already inherent in the data.
>
> Unfortunately, object classes often go beyond this and expose the complexity
> inherent in the physical representation of the data as well as that inherent
> in the data itself.

One must question if you understand object technology at all. Since it is completely possible to declare a class that is all interface and no implementation (no data members), it is ludicrous to assert that object classes expose implementation details (physical representation).
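A class that is all interface and no implementation can be sketched in Python with the standard abc module. The names `Contact`, `InMemoryContact`, and `area_codes` are hypothetical, chosen to match the contact example in this thread; the interface class carries no data members, and the calling code depends only on it.

```python
from abc import ABC, abstractmethod
from typing import List, Set


class Contact(ABC):
    """Pure interface: no data members, no implementation."""

    @abstractmethod
    def name(self) -> str:
        ...

    @abstractmethod
    def phone_numbers(self) -> List[str]:
        ...


class InMemoryContact(Contact):
    """One possible implementation; callers never see its fields."""

    def __init__(self, name: str, numbers: List[str]) -> None:
        self._name = name
        self._numbers = list(numbers)

    def name(self) -> str:
        return self._name

    def phone_numbers(self) -> List[str]:
        return list(self._numbers)


def area_codes(contact: Contact) -> Set[str]:
    """Written against the interface only; works for any implementation."""
    return {number.split()[0] for number in contact.phone_numbers()}


someone = InMemoryContact("Contact A", ["808 555 0100", "303 555 0101"])
print(sorted(area_codes(someone)))
```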

> >Sure, you can argue that a user must understand
> >some amount of the object model to become productive, but I don't see how
> >that is any different in any paradigm.
>
> There goes that word again. Why do you use it for almost everything? Are you
> not able to conceive of a meaningful word to use in its place?

Obviously not. Why don't you offer an alternative that won't push your hot button?

> Users understand relations with very little effort because all relations
> have an identical interface using identical operations.

Syntax is never particularly interesting. Knowing *what* I can do is a far cry from knowing *why* I would want to do it (and when I would NOT want to do it).

> >If I don't understand the way all the tables
> >are related and what fields join what tables in what context, how
> >productive will I be?
>
> Very productive. All you need to know is the way the system catalog tables
> are related.

Nonsense.

We have a diagram that depicts all the tables and relationships between tables in a particular database used by our customer. It is incomprehensible. There is NO hiding of complexity -- it is all out before us with no way to break it down into bite-sized pieces. I can tell exactly how each table is related, but I can't figure out what the tables MEAN, which means I can't use the data productively.

> >Object classes attempt to model what the user already has to figure
> >out anyway.
>
> I disagree that the user has to figure out a complex object interface for
> every possible relation, and I must point out that object classes handle the
> job very poorly.

See above.

> >Object databases use objects naturally to manage complex notions (and
> >relationships).
>
> I have yet to meet a casual database user who found objects natural. In
> fact, I have found many experienced, skillful application programmers who do
> not find them at all natural.

It all depends in what circles you move, I suppose. Here in comp.databases.object I think your findings would be somewhat different.

> >Yes, I understand the concept. I did not ask you to agree with me.
>
> You have yet to exhibit any understanding.

 ... to your satisfaction.

One of the difficulties in discussing things with you is that you cannot agree to disagree. You must be right and I must be wrong. I see your point. I do not agree with it. Statements such as the above exemplify the allegation I made a while ago about you being an intellectual snob (or something like that). It is quite a condescending remark.

I will readily admit that I do not have the "official Date & Pascal" vocabulary for describing purist relational theory internalized. I may not use your words with the precision that you would like. I would not even pretend to debate you on the finer points of relational theory. But I have used databases that are called relational and I have used databases that are called object-oriented. I have decades of experience in writing software for large, complex systems. And IN MY EXPERIENCE, complexity is best managed through the use of objects.

Once again, I don't ask you to agree with me.

> One cannot start with a simple interface and make it more simple by adding
> features.

Decomposing the works of Shakespeare into its component letters and storing each letter with a frequency count would be a simple interface. But I think it could be made simpler by adding features...

--
Jim Melton, novice guru             | So far as we know, our
e-mail: Jim.Melton_at_Technologist.com | computer has never had
v-mail: (303) 971-3846              | an undetected error.

Received on Sun Sep 02 2001 - 09:40:17 CEST
