Re: Aggregates: Largest Groups

From: Sampo Syreeni <decoy_at_iki.fi>
Date: Fri, 9 Apr 2010 15:13:29 -0700 (PDT)
Message-ID: <de83ed0b-4529-4ed0-b812-ef3db723c560_at_c36g2000yqm.googlegroups.com>


On Apr 8, 6:10 pm, Bob Badour <bbad..._at_pei.sympatico.ca> wrote:

> Do you have any specific criticism of Tutorial D with respect to the above?

(Jesus H, this post bloated beyond any recognition... So not proofread either.)

Perhaps the most important thing missing there is an expressive array algebra, in APL style, or maybe the effortless arbitrary recursion capabilities of Prolog. A LISP nerd might also say: the lack of a comprehensive metaprogramming capability which would help you add the features yourself.

As two concrete examples of problems which I think would lead to clunky code under Tutorial D, take the basic convolution stencil calculations used in the numerical solution of partial differential equations, or A* search in, say, a traffic graph. In the first case you'd probably want to model your multidimensional arrays as relations. In SQL, and with 2D dense arrays as a prior constraint, it goes something like this:

select d.index1+k.index1, d.index2+k.index2, sum(d.value*k.value) value
from data d, kernel k
where d.index1+k.index1 between (select min(index1) from data)
                            and (select max(index1) from data)
  and d.index2+k.index2 between (select min(index2) from data)
                            and (select max(index2) from data)
group by d.index1+k.index1, d.index2+k.index2;

Undoubtedly you could do the same in Tutorial D, but I'm not sure the code would be any shorter or more elegant, and I know of no way to do it in arbitrary dimension in either that or SQL. By comparison, in APL it's a reasonably tractable one-liner, and on the optimization/implementation front, even in the fixed-dimension case, all of C, Fortran, Common Lisp/Haskell/O'Caml and APL will just kill any relational thingy out there that I know of, even MMDBs like SolidDB and the like. (I'm waiting for Stonebraker and the lot to come through with this one in some future version of SciDB.) I'd summarize this problem as the language a) not being able to express the relevant computational pattern at a high enough/declarative level to make expressing the solution easy/fast/convenient/concise enough, and b) not having a vocabulary highly developed/axiomatized/closed/tractable enough to permit heavy optimization based on the underlying formal semantics.

Then the second problem, optimal graph traversal under heuristic guidance. Even the most basic means of encoding arbitrary directed graphs under the relational model calls for an all-key, self-referring binary relation (i.e. "connection from what to what" within the same domain). Any extra data beyond node-to-node links would only make the problem harder after normalization. Now I can already be quite certain: neither older SQL nor Tutorial D can even *express* the constraint that there should be a connection from value a to b (in the same domain), where "connection" means the transitive closure over the binary relation that was so encoded. Much less can they encode the intent that the sum of the weights given in the linkage relation, taken over the whole path, should be minimized. Certainly they have no mechanism to guide the search process which inevitably results. So they're doomed to backtrack into the procedural part of the overall language (the last minor section in the Tutorial D grammar, the PL part of PL/SQL under Oracle, SQL's procedural extensions, and so on), and to manually reimplement pretty much the whole of the algorithm that is needed to yield the solution. (Btw, this also nicely illustrates how different the procedural and declarative domains really are: somehow you can't divide labor between them; you have to choose which one is to implement what and then interface. This point goes for both OR/M and Tutorial D-like "homoiconic" languages.)
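To make that encoding concrete, here's a minimal sketch, assuming a hypothetical all-key linkage relation link(src, dst, weight), start/end values 'a' and 'b', and the WITH RECURSIVE construct that only arrived with SQL:1999, i.e. not the older SQL I'm talking about above:

-- hypothetical all-key linkage relation: link(src, dst, weight)
with recursive reach (dst, total) as (
  select l.dst, l.weight              -- edges out of the start node a
  from link l
  where l.src = 'a'
  union all
  select l.dst, r.total + l.weight    -- extend each known path by one more edge
  from reach r
  join link l on l.src = r.dst
)
select min(total)                     -- cheapest total weight of any path into b
from reach
where dst = 'b';

Even that only terminates on acyclic data; on a real traffic graph you'd have to bolt cycle detection and depth limits on by hand, and there's still no declarative way to slip in the A* heuristic to prune the search. Which is exactly the fall-back-to-procedural-code problem I mean.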

Don't get me wrong here, I couldn't write a working program in Tutorial D if my life depended on it, so I'm probably biased or maybe even completely mistaken; if so, correct me. I'm also picky and an idealist in that I haven't yet found a language that truly feels comfortable and powerful enough to me, especially for all problems at the same time; certainly implementing the above kinds of algorithms in C++ or Java is at least as painful as in Tutorial D.

Still, my main point is that the original reason why Codd came up with the relational model was that too much attention to detail hampered programmers' productivity. I think that's a fair summary of why physical data independence came into being: it's just a bad idea to waste people's time on such minute/mundane details as chasing pointers in a loop/recursion, especially via low-level procedural interfaces. It makes much more sense to rise a notch in abstraction, and let the machine translate back down from there. That happens best when the higher level has a formal, regular, mathematical structure which lends itself to algebraic transformation and optimization. As a mathematician, Codd knew this (I've never seen him writing about homomorphisms or such, but he might/should have), and based his vision on a simple/regular/symmetric/well-understood/high-level model of data: the RM. And more to the point its abstract qualities: he made up an algebra very much based on the foundations of mathematics, and then started a tradition of axiomatic semantics on top of it; after all, he started dependency theory as well (*).

That sort of evolution should continue, not stop with the relational model. As I see it, the RM is not so much a fixed data model as it is about upping the level of abstraction and the productivity/ease that comes with it. If that's the real goal, Tutorial D is already moribund, since it follows the older relational thinking so closely. What I/we need is more power in a programming language, not just solutions to some OO-like programming fad which now incentivizes people to write inefficient storage code. The relational way of thinking about things, along with a few other chosen paradigms like array processing, logic programming and functional languages, could make a real difference, *while* interfacing well with secondary storage. It's just that the programming advantage seems bigger to me than the storage bottleneck.

*) Most people don't realize it, but Codd early on foreshadowed a few influential developments that were later attributed to others. The most important one was the idea of join dependencies. In "A Relational Model of Data for Large Shared Data Banks" one of his examples clearly describes normalization by a three-fold, indivisible projection. Of course Rissanen explicated an equivalent independence criterion at pretty much the same time, but the point still is that neither of them really became associated with the idea, even though they clearly understood it first. (I sort of like to think Codd thought it too obvious to mention, and that Rissanen thought his treatment closed the whole issue. Call that my search for personal heroes if you will.)
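To spell out what such a three-fold, indivisible projection amounts to, here's a minimal sketch, assuming a hypothetical all-key supplier-part-project relation spj(s, p, j) that happens to satisfy the corresponding join dependency:

-- reassemble spj from its three binary projections
select sp.s, pj.p, js.j
from (select distinct s, p from spj) sp
join (select distinct p, j from spj) pj on pj.p = sp.p
join (select distinct j, s from spj) js on js.j = pj.j and js.s = sp.s;

Under that dependency the three-way join reproduces spj exactly, even though joining any two of the projections alone can yield spurious tuples; the decomposition is indivisible in precisely that sense.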

Then, even if Codd didn't formalize his thoughts as well as many of his contemporaries did in their published pieces (**), his work was the best publicized. I think the primary reason why we have to take the RM so seriously now is that even that informal influence put through a message that was consistently mathematical in flavor. For example, just about nobody seems to really know, today, where the term "natural join" comes from. It really started with the Model paper: there *a* join was defined as an operation whose result projects losslessly back to a number of relations. If you define it that way, axiomatically and derivatively from projection, there are quite a number of operations which fit the prescription. Yet only one of them introduces no correlations between the attributes which are not part of "the join key" (a hazy notion when you start like this) but just Cartesian-joins them in absentia; that is *the* "natural" one, and the one which is now called an equijoin. (The idea that it has something special to do with modifying the relation header in the presence of equality predicates came later.)
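To unpack that definition a little: a sketch, assuming two hypothetical relations r(a, b) and s(b, c) and some candidate result jj(a, b, c). In this sense jj is *a* join of r and s whenever checks like the one below (and its (b, c)-against-s counterpart) come back empty:

-- does jj project losslessly back onto the attributes of r?
(select a, b from jj except select a, b from r)
union all
(select a, b from r except select a, b from jj);

The equijoin select r.a, r.b, s.c from r join s on r.b = s.b passes the test, but so do other relations: if r holds two a-values for some b and s holds two c-values for it, you can pair them off selectively instead of taking all four combinations, and the projections still come back intact. The natural join is the one that takes every combination, i.e. adds no correlation between a and c beyond what b forces.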

That original definition of a join is thoroughly algebraic and in a sense strictly derived from simpler operations. From a mathematician's aesthetics/praxis it's also pretty much a given that the equijoin should be the natural one, and that repeated attributes should be collapsed unless the grading of the algebra would suffer (here it doesn't, because of the explicit naming and "disorder" of the grades). I'm also reasonably certain that at least Tegiri could come up with a modern, universal-algebraic/category-theoretical buzzword for the distinction between "natural" and "unnatural" joins in the sense I just described.

I could give a couple of more examples, but let's leave this post at prolix, instead of spam. ;)

**) I've been dying to get a copy of Codd's IBM-internal work and early public papers. If you can link to them or pass a copy, I'd deeply appreciate it.
