Re: Sensible and NonsenSQL Aspects of the NoSQL Hoopla

From: Eric <eric_at_deptj.eu>
Date: Sun, 1 Sep 2013 18:17:56 +0100
Message-ID: <slrnl26tm4.sqh.eric_at_teckel.deptj.eu>


On 2013-09-01, karl.scheurer_at_o2online.de <karl.scheurer_at_o2online.de> wrote:
> Am Samstag, 31. August 2013 19:59:52 UTC+2 schrieb James K. Lowden:
>> On Sat, 31 Aug 2013 08:22:41 -0700 (PDT)
>> karl.scheurer_at_o2online.de wrote:
>>
>> > Codd's model emerged out of the technology of the seventies and needs
>> > urgently a revision.
>>
>> That's a novel observation. I'm sure others would be interested to
>> know of any aspect of the relational model rooted in 1970s technology.
>>
>> Do tell.
> My observation is based on Codd's paper of 1970 "A Relational Model of
> Data for Large Shared Data Banks"
>
> He addresses the following problems
> 1.2.1. Ordering Dependence.
> "Let us consider those existing systems which either require or permit
> data elements to be stored in at least one total ordering which is
> closely associated with the hardware-determined ordering of addresses."
>
> Without bypassing the operating system this is impossible nowadays.
> Before UNIX and other operating systems it was common practice to
> lay out files directly on the storage hardware.

Nowadays you cannot, in general, get that close to the hardware (there are exceptions). However, people still keep data in sequential files, thus perpetuating the problem.
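To make the point concrete, here is a small sketch (my own illustration, not from the original post) of why ordering dependence is still a problem with sequential files: a program that depends on the physical position of a record breaks as soon as the file is reorganised, while selection by attribute value does not.

```python
# Records of a "sequential file": (supplier_no, name, status).
records = [("S1", "Smith", 20), ("S2", "Jones", 10), ("S3", "Blake", 30)]

# Ordering-dependent access: "the second record" only means something
# as long as the file's physical order never changes.
def second_record(rows):
    return rows[1]

# Ordering-independent access: select by key, as the relational model does.
def by_key(rows, key):
    return next(r for r in rows if r[0] == key)

assert second_record(records) == ("S2", "Jones", 10)
assert by_key(records, "S2") == ("S2", "Jones", 10)

# Reorganise the "file" (e.g. re-sort it by status):
records.sort(key=lambda r: r[2])

assert second_record(records) != ("S2", "Jones", 10)  # positional access broke
assert by_key(records, "S2") == ("S2", "Jones", 10)   # key access still works
```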

> 1.2.2. Indexing Dependence.
> "...destroy indices from time to time will probably be necessary. The
> question then arises: Can application programs and terminal activities
> remain invariant as indices come and go?..."
>
> In the seventies "bigdata" had to be stored on sequential media
> (tapes, cards). Queries against sequential media cannot use
> indices ("indices go").

But Codd also said that indexing "tends to improve response to queries and updates and, at the same time, slow down response to insertions and deletions", so he wasn't thinking about sequential media. Random-access external storage existed before 1970 (though usually expensive and heavy).
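Indexing independence is easy to demonstrate on a modern system. A sketch using Python's stdlib sqlite3 (my choice of tooling, not anything from the post): the same query returns identical results whether an index exists or not, so application programs do "remain invariant as indices come and go".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE supplier (sno TEXT, city TEXT)")
conn.executemany("INSERT INTO supplier VALUES (?, ?)",
                 [("S1", "London"), ("S2", "Paris"), ("S3", "London")])

QUERY = "SELECT sno FROM supplier WHERE city = 'London' ORDER BY sno"

before = conn.execute(QUERY).fetchall()

conn.execute("CREATE INDEX idx_city ON supplier (city)")  # index "comes"
with_index = conn.execute(QUERY).fetchall()

conn.execute("DROP INDEX idx_city")                       # index "goes"
after = conn.execute(QUERY).fetchall()

# The program text and its results are unchanged throughout.
assert before == with_index == after == [("S1",), ("S3",)]
```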

> 1.2.3. Access Path Dependence.
> "One solution to this is to adopt the policy that once a user access path
> is defined it will not be made obsolete until all application programs
> using that path have become obsolete. Such a policy is not practical,
> because the number of access paths in the total model for the community
> of users of a data bank would eventually become excessively large."
>
> That statement is based on the hardware of the seventies.

This one definitely has nothing to do with hardware, but with tree-structured and network-structured storage layouts.
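A sketch of what I mean (my own construction): in a hierarchical layout the program bakes in one access path, owner to member, and asking the inverse question requires new navigation code. A flat relation supports either direction with the same kind of predicate.

```python
# Hierarchical layout: parts reachable only *through* their supplier.
hierarchy = {"S1": ["P1", "P2"], "S2": ["P1"]}

# The one hard-wired access path: supplier -> parts.
assert hierarchy["S1"] == ["P1", "P2"]

# The inverse question (part -> suppliers) needs a full traversal.
def suppliers_of(part):
    return sorted(s for s, parts in hierarchy.items() if part in parts)

# Relational layout: one flat relation, queried symmetrically.
shipments = {("S1", "P1"), ("S1", "P2"), ("S2", "P1")}
parts_of_s1 = sorted(p for s, p in shipments if s == "S1")
suppliers_of_p1 = sorted(s for s, p in shipments if p == "P1")

assert suppliers_of("P1") == ["S1", "S2"]
assert parts_of_s1 == ["P1", "P2"]
assert suppliers_of_p1 == ["S1", "S2"]
```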

> First normal form and normalization
(this is towards the end of 1.3)
> "So far, we have discussed examples of relations which are defined on
> simple domains-domains whose elements are atomic (nondecomposable)
> values. Nonatomic values can be discussed within the relational
> framework. Thus, some domains may have relations as elements. These
> relations may, in turn, be defined on nonsimple domains, and so on."
>
> It is clear that Codd started in 1970 with a design like the
> "document stores" of NoSQL or the N1F systems of the past.

I don't think that it is clear at all. You are looking at it with the "benefit" of hindsight starting from a particular point of view. You may be in danger of presenting a circular argument.

> For reasons no longer comprehensible (Codd's reference is
> out of print and not available online), he restricted his model

Which reference are you talking about here?

> 1.4. NORMAL FORM
> "A relation whose domains are all simple can be represented in storage
> by a two-dimensional column-homogeneous array of the kind discussed
> above. Some more complicated data structure is necessary for a relation
> with one or more nonsimple domains. For this reason (and others to be
> cited below) the possibility of eliminating nonsimple domains appears
> worth investigating! There is, in fact, a very simple elimination
> procedure, which we shall call normalization."
>
> Having read more than enough horror stories about program failures
> caused by "index out of bounds" errors, it was reasonable to keep
> the design simple and avoid complex dynamic data structures.
> Meanwhile, complex dynamic data structures (trees, graphs,
> lists...) are part of the standard libraries of common mainstream
> programming languages.

"Index out of bounds" has nothing to do with it. This is not only about simplicity of representation and storage or transfer of data between systems, but about avoiding storage structures that leave "read everything" as the only way to get to some piece of data, and, in fact, about ensuring that all data is created equal without making you look at it all at once.
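Codd's "very simple elimination procedure" is easy to show. A sketch (my own illustration, using invented field names): a relation with a nonsimple, relation-valued domain is replaced by flat relations, with the parent key repeated in the child rows.

```python
# Unnormalized: the "children" domain holds a relation, not an atomic value.
nested = [
    {"empno": 1, "name": "Smith",
     "children": [{"cname": "Ann"}, {"cname": "Bob"}]},
    {"empno": 2, "name": "Jones", "children": []},
]

# Normalization: two flat relations, linked by the parent key empno.
employee = [(r["empno"], r["name"]) for r in nested]
child = [(r["empno"], c["cname"]) for r in nested for c in r["children"]]

assert employee == [(1, "Smith"), (2, "Jones")]
assert child == [(1, "Ann"), (1, "Bob")]
```

Every value in the two resulting relations is atomic, so both can be stored as plain two-dimensional arrays, which is exactly the representational point Codd makes.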

> Last but not least
> "Future users of large data banks must be protected from having to know
> how the data is organized in the machine (the internal representation)"
>
> For me as a programmer this sounds like a textbook example of
> object design. If objects are the "best" way to implement
> relations, then doing it the relational way is like driving a car
> with 5 gears and only ever using 3 of them.

Two different things that both hide information. The relational model hides the implementation of storage, an object system hides processing and data, at as many levels as you like. Hiding data is only useful because the hidden processing uses it, and any higher level doesn't need to. As soon as the data is required outside, particularly if it is needed as part of retrieval criteria, the fact that it is hidden is bad. The trouble is that you can't predict when some new use will be found (_Shared_ Data Banks, remember?).

On the other hand the only time you need to unhide the storage implementation is when it needs to be changed for performance reasons and then, because of the relational model, you do not have to change any of your programs at all.

Eric

-- 
ms fnd in a lbry
