Re: Sensible and NonsenSQL Aspects of the NoSQL Hoopla

From: James K. Lowden <jklowden_at_speakeasy.net>
Date: Sun, 1 Sep 2013 12:53:22 -0400
Message-Id: <20130901125322.25bb7fdb.jklowden_at_speakeasy.net>


On Sun, 1 Sep 2013 03:47:35 -0700 (PDT)
karl.scheurer_at_o2online.de wrote:

> On Saturday, 31 August 2013 19:59:52 UTC+2, James K. Lowden wrote:
> > On Sat, 31 Aug 2013 08:22:41 -0700 (PDT)
> > karl.scheurer_at_o2online.de wrote:
> > > Codd's model emerged out of the technology of the seventies and
> > > needs urgently a revision.
> >
> > That's a novel observation. I'm sure others would be interested to
> > know of any aspect of the relational model rooted in 1970s
> > technology.
> >
> My observation is based on Codd's paper of 1970 "A Relational Model
> of Data for Large Shared Data Banks"

I appreciate the effort you took to defend your idea, and I think I now understand what you mean. But you are confusing the incidental reference to contemporary technology with its influence on the relational model.

Yes, you can find references to 1970s technology in a 1970s paper and, yes, Codd was a product of his day just as anyone is. His work was motivated by commercial needs in a commercial firm, and he was applying his mathematical expertise to real-world problems. His paper relates his theoretical work to those problems in terms of the technology then extant.

What you will not find is any aspect of the theory that is *tied* to any technology, 1970s or other. The theory is all about relations and operations and constraints. Naturally both Codd and his contemporaries were interested in how that theory might be applied to real computers of the day; then as now many insisted that his pointy-headed math could never be implemented efficiently. That's quite different from saying the theory is somehow blinkered by the limitations of those computers.

> He addresses the following problems
> 1.2.1. Ordering Dependence.
> "Let us consider those existing systems which either require or
> permit data elements to be stored in at least one total ordering
> which is closely associated with the hardware-determined ordering of
> addresses. "

If you remove "hardware-determined" from the sentence, it's exactly as true now as then.

> 1.2.2. Indexing Dependence.
> "...destroy indices from time to time will probably be necessary. The
> question then arises: Can application programs and terminal
> activities remain invariant as indices come and go?..."
>
> In the seventies "bigdata" had to be stored on sequential data
> storages (tapes, cards). Querying data from sequential media cannot
> use indices ("indices go").

Hmm, no, I'm pretty sure VSAM and IMS were available in the 70s. Cullinet was selling IDMS.

Codd's "competition" wasn't 1890s Hollerith cards. It was (what were later called) hierarchical and network DBMSs that imposed great constraints and complexity on application programmers.

Funny how little has changed, right? You move to Hadoop City and build an entire application around a "known" application domain on a nonstandard filesystem. Then comes the day you'd like summaries by zip code instead of by customer account, and you have to write an application instead of a query. Santayana rides again!
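
To make the query-versus-application point concrete, here is a minimal sketch using Python's bundled sqlite3 module. The orders table, its columns and its rows are invented for illustration; nothing here comes from a real system. Switching the summary from customer account to zip code changes one column name in the query and nothing else.

```python
import sqlite3

# Hypothetical schema and data, invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (account TEXT, zip TEXT, amount INTEGER);
    INSERT INTO orders VALUES
        ('A-1', '10001', 50),
        ('A-1', '10001', 25),
        ('B-2', '10001', 10),
        ('C-3', '94105', 40);
""")

def totals_by(column):
    """Sum order amounts grouped by the given column."""
    # Only known column names are accepted, so interpolating is safe here.
    if column not in ("account", "zip"):
        raise ValueError(f"unknown grouping column: {column}")
    sql = (f"SELECT {column}, SUM(amount) FROM orders "
           f"GROUP BY {column} ORDER BY {column}")
    return conn.execute(sql).fetchall()

by_account = totals_by("account")  # the summary the system was built for
by_zip = totals_by("zip")          # the summary nobody planned for
print(by_account)
print(by_zip)
```

The storage layout and the loading code are untouched by the new question; only the declarative query changes.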

> 1.2.3. Access Path Dependence.
> "
> One solution to this is to adopt the policy that once a
> user access path is defined it will not be made obsolete until
> all application programs using that path have become
> obsolete. Such a policy is not practical, because the number
> of access paths in the total model for the community of
> users of a data bank would eventually become excessively
> large."
>
> That statement is based on the hardware of the seventies.

On the web I believe it's called "404".

> First normal form and normalization
> "
> So far, we have discussed examples of relations which are
> defined on simple domains-domains whose elements are
> atomic (nondecomposable) values. Nonatomic values can
> be discussed within the relational framework. Thus, some
> domains may have relations as elements. These relations
> may, in turn, be defined on nonsimple domains, and so on.
> "
> It is clear, Codd started 1970 with a design like the
> "document storages" in NOSQL or the N1F systems of the past.

Yes.

> For reasons not comprehensible any more (Codd's reference is
> out of print and not online available), he restricted his model

No mystery. Books in a library are hardly lost texts of Babylon. And he states his motivation plainly: "the possibility of eliminating nonsimple domains appears worth investigating!"

> "1.4. NORMAL FORM
> A relation whose domains are all simple can be represented
> in storage by a two-dimensional column-homogeneous
> array of the kind discussed above. Some more
> complicated data structure is necessary for a relation with
> one or more nonsimple domains. For this reason (and others
> to be cited below) the possibility of eliminating nonsimple
> domains appears worth investigating! There is, in fact, a
> very simple elimination procedure, which we shall call
> normalization.
> "

The model is not "restricted". It is *simplified*, which is a feature, not a bug. By showing -- more, *proving* -- that logical inferences could be drawn from data manipulated with a small number of operators closed over a domain, Codd released programmers from low-level complexity and man-centuries of work.
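
As a rough sketch of the elimination procedure quoted above, assuming a made-up employee relation with one relation-valued domain (the names and values are hypothetical), normalization moves the nested relation into a relation of its own and carries the parent's key along. Joining the two gives back every parent/child pairing, so no information is lost.

```python
# Hypothetical nested relation: "children" is a nonsimple (relation-valued) domain.
employees = [
    {"emp_no": 1, "name": "Jones",
     "children": [{"child": "Ann", "born": 1965},
                  {"child": "Bob", "born": 1968}]},
    {"emp_no": 2, "name": "Smith", "children": []},
]

def normalize(nested):
    """Split the nested relation into two relations over simple domains."""
    employee = {(e["emp_no"], e["name"]) for e in nested}
    # The parent's key (emp_no) is propagated into the child relation.
    children = {(e["emp_no"], c["child"], c["born"])
                for e in nested for c in e["children"]}
    return employee, children

def join(employee, children):
    """Natural join on emp_no, recovering each employee/child pairing."""
    return {(no, name, child, born)
            for (no, name) in employee
            for (parent_no, child, born) in children
            if no == parent_no}

employee, children = normalize(employees)
print(sorted(employee))
print(sorted(children))
print(sorted(join(employee, children)))
```

Both results are plain sets of tuples, so the same few operators apply to either one.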

> Meanwhile are complex dynamic data structures (trees, graphs,
> lists... ) part of standard libraries for common mainstream programming
> languages.

If you're programming a computer, graphs are your natural ally because they can be mapped directly onto the computer's memory. They're of no use, though, when you want to manage data logically. How, for example, do you define a subset of a cyclic graph?

You're right to say that graphs are more complex than relations. It's a mistake, though, to conclude therefore that they are more powerful. It's been proved mathematically that graphs and relations are interchangeable in the sense that they can represent the same information. The difference is that relational theory is much simpler. That's its advantage, not a handicap.
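
A small sketch of that interchangeability, again with Python's sqlite3 and an invented four-edge graph containing a cycle (a -> b -> c -> a). Stored as an edge relation, a "subset" of the graph is just a restriction, and reachability is a recursive query. The table name and data are assumptions for illustration; the recursive part relies on SQLite's WITH RECURSIVE, where UNION discards duplicate rows and so the recursion terminates despite the cycle.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical graph with a cycle a -> b -> c -> a, plus c -> d.
    CREATE TABLE edge (src TEXT, dst TEXT, PRIMARY KEY (src, dst));
    INSERT INTO edge VALUES ('a','b'), ('b','c'), ('c','a'), ('c','d');
""")

# A subset of the graph is an ordinary restriction of the edge relation.
subset = conn.execute(
    "SELECT src, dst FROM edge WHERE src <> 'c' ORDER BY src, dst"
).fetchall()

# Nodes reachable from 'a'. UNION (not UNION ALL) drops rows already seen,
# which is what stops the recursion from looping around the cycle forever.
reachable = conn.execute("""
    WITH RECURSIVE reach(node) AS (
        SELECT dst FROM edge WHERE src = 'a'
        UNION
        SELECT edge.dst FROM edge JOIN reach ON edge.src = reach.node
    )
    SELECT node FROM reach ORDER BY node
""").fetchall()

print(subset)
print(reachable)
```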

> "Future users of large data banks must be protected from
> having to know how the data is organized in the machine (the
> internal representation)"
>
> For me as a programmer this sounds like a textbook example for
> object design.

OK, but it's not.

Consider the UNIX filesystem, for instance, which you referred to earlier. Upon a time, when my mother wrote disk access routines for Univac, the programmer had to know all the particulars of the device, and read/write data in terms of the device's design. Unix revolutionized the field by abstracting all disk access into today's familiar stream of bytes. No addresses, no heads or sectors or tracks. A catalog to facilitate sharing that anyone (potentially) can update, not just the system programmers. Works pretty good for nonrotating media, too, and over the network. And not an object in sight.

On the other hand, you are in some sense right, if the DBMS is the object. What OO calls "data hiding" is analogous to what RM calls "data independence". In both cases, the goal is to isolate the application from details it doesn't need and that might change, to permit the application programmer to operate at a higher level. Both also have a notion of "consistent state". I have long thought that stored procedures are to databases what methods are to objects, and subscribe to the idea that applications should access the data only through views and procedures.
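
One way to picture access through views, sketched with sqlite3 and a hypothetical customer schema (all table, view and column names are invented): the application only ever runs APP_QUERY against the view. The base tables are then reorganized and the view redefined, and the application's query and its answer stay the same.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The only statement the hypothetical application knows about.
APP_QUERY = "SELECT name, city FROM customer_v ORDER BY name"

# Version 1: a single base table behind the view.
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, city TEXT);
    INSERT INTO customer VALUES (1, 'Ada', 'London'), (2, 'Bob', 'Paris');
    CREATE VIEW customer_v AS SELECT name, city FROM customer;
""")
v1 = conn.execute(APP_QUERY).fetchall()

# Version 2: city moves to its own table; only the view definition changes.
conn.executescript("""
    DROP VIEW customer_v;
    CREATE TABLE address (customer_id INTEGER PRIMARY KEY, city TEXT);
    INSERT INTO address SELECT id, city FROM customer;
    CREATE TABLE customer2 (id INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO customer2 SELECT id, name FROM customer;
    DROP TABLE customer;
    CREATE VIEW customer_v AS
        SELECT c.name, a.city
        FROM customer2 c JOIN address a ON a.customer_id = c.id;
""")
v2 = conn.execute(APP_QUERY).fetchall()

print(v1)
print(v2)
```

SQLite has no stored procedures, so this sketch covers only the view half of the idea.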

Part of your critique is actually of DBMSs that we have, not of RM. SQL DBMSs largely support only a few primitive types that the user may then further constrain or write functions for. One cannot, for example, define an aggregate type as a set of columns, and use that name in, say, FK declarations. Nor can we usually define types of blobs and comparison functions for them (although I'm unconvinced that's a good idea).

And of course SQL itself -- not RM! -- is deeply rooted in IBM's 1970s notion of an end-user query language, to let users write their own reports. You could talk all day to IT departments about math and logic, but you could close the deal by convincing them that their reports would write themselves. So we're saddled now with a language no one likes, and that no one thinks expresses relational algebra or calculus well. I wonder if we're going to have to live through the rediscovery of the purpose and benefits of the relational model before we see a re-implementation of it that provides a relational language in which to express our queries.

--jkl

P.S. Since you've read this far and we're debating the 70s, I hope you've seen http://www.masswerk.at/google60/.

Received on Sun Sep 01 2013 - 18:53:22 CEST
