Re: Aggregation (with GBY) is Relational Division

From: Marshall <marshall.spight_at_gmail.com>
Date: 4 Jun 2006 19:13:03 -0700
Message-ID: <1149473583.137221.224950_at_g10g2000cwb.googlegroups.com>


vadimtro_at_gmail.com wrote:
> Well, not quite, but surprisingly close.
>
> Consider a relation
>
> Dept
> ^^^^
> Name Pos# Sal
> ---- ---- ---
> Acct 1 100
> Acct 2 150
> Res 1 200
>
> In order to express
>
> select Name, sum(Sal), min(Sal) from Dept
> group by Name
>
> in relational terms we need an infinite relation
>
> Aggr
> ^^^^
>
> Bag# Pos# Sal sum min max
> ---- ---- --- --- --- ---
> 1 1 100 250 100 150
> 1 2 150 250 100 150
> 2 1 150 250 100 150
> 2 2 100 250 100 150
> 3 1 200 200 200 200
> ....
>
> where only the portion of the Aggr relation which is relevant to the
> Dept data is shown.

I have a hard time with this relation. For one thing, I don't see how the third and fourth rows are "relevant to the Dept data."

I think I see what you're trying to do. But I don't see how it's an interesting approach; the Aggr relation is too heavily bound to the current *value* of the Dept relation. (Unless I'm misunderstanding completely.)

Consider a simpler example, where we have only a single column to aggregate over, and we want to find the min. The value of the Aggr relation has every value of the join attribute in column one, and the min value in column two. But what the specific value for the second column is will depend on the current value of the other relation operand; in essence, the Aggr relation is a value parameterized by the relation we're aggregating over. This makes it hard to see how construction of the Aggr relation would proceed. (You'd need aggregates or at least some kind of fold to do so.)

I do not have the same objection to, say, using the infinite less-than relation to join with, because the less-than relation is not parameterized by anything. You always (modulo renaming) use the same relation value.

> The set join /\(=) returns all the tuples (Name, Bug#, sum, min, max)
> such that the set of (Pos#,Sal) pairs for each Name equals to the set
> of (Pos#,Sal) pairs matching with a certain Bag#. In our example the
> table can be rewritten in non 1NF as

I call this operation "apply" or "application" since it's the same as function application. Again, it's a natural join followed by inner union of a zero-element relation that has the symmetric difference of the sets of attributes of the two relation operands. It is a nice property of Apply that it can be expressed in only two RL operations, and the "synthetic" empty relation operand can be constructed statically; that is, without reference to the *value* of the operands to Apply, but only to their types.

An alternative way to do aggregates is to allow natural join with functions that are closures; that is, that have access to variables defined outside their scope. Closures are not pure functions in the mathematical sense; their output is not purely a function of their input, since they have access to those additional variables I mentioned.

This approach brings up a whole host of annoying questions, such as parametric polymorphism and renaming, but it's still interesting, if not altogether practical.

Consider sum.

We have some relation R(a, b) and we want to

SELECT a, sum(b) as result from R group by a

We define an intermediate relation Aggr(a,result)

We define a function f:

f(a,b): () = {
  localA := a;
  if not exists (select * from Aggr where a = localA) {     insert into Aggr values (0); -- the identity   }
  update Aggr set result = result + b where a = localA; }

Now we simply natural join R with f, and the result of the SELECT/GROUP BY above is in Aggr.

The above definition of f proceeds from the names of R and the desired fold function, in this case +. It can be defined statically, and parameterized statically. (Although, as I mentioned, that requires some fancy parametric polymorphism. Mathematically this is a non issues but for programming languages it's not so simple.)

(Amusingly, the above imperative code is likely to match fairly closely the *implementation* of group by. But it's not that great a mathematical model because of the impurity.)

In a completely different, and purely functional approach, for a long time I've been imagining a way of handling aggregates based on this paper:

http://labs.google.com/papers/mapreduce.html

But in order to make the model powerful enough to package up avg() the way you want it, it got unweildly.

Actually, I had kind of a jolt reading something Bob Badour wrote recently about defining aggregates as either a fold or an expression written in terms of folds. (He did not use the term "fold".) Thus, avg(x) can be defined as simply sum(x) / count(). And sum(x) can be defined as simply fold(x,+,0).

I like that much better. For one thing, it makes it much easier to eliminate duplicate folds in a complex group-by. (It's hard to explain why, so I won't bother. It's not clear if this post will interest anyone anyway, the OP excepted. :-)

>From a programming language standpoint, it raises the
question of handling the aggregate expressions. But that's not any theoretical difficulty, and not necessarily all that much design difficulty.

Marshall Received on Mon Jun 05 2006 - 04:13:03 CEST

Original text of this message