Re: Aggregation (with GBY) is Relational Division
Date: 4 Jun 2006 19:13:03 -0700
Message-ID: <1149473583.137221.224950_at_g10g2000cwb.googlegroups.com>
vadimtro_at_gmail.com wrote:
> Well, not quite, but surprisingly close.
>
> Consider a relation
>
> Dept
> ^^^^
> Name Pos# Sal
> ---- ---- ---
> Acct 1 100
> Acct 2 150
> Res 1 200
>
> In order to express
>
> select Name, sum(Sal), min(Sal) from Dept
> group by Name
>
> in relational terms we need an infinite relation
>
> Aggr
> ^^^^
>
> Bag# Pos# Sal sum min max
> ---- ---- --- --- --- ---
> 1 1 100 250 100 150
> 1 2 150 250 100 150
> 2 1 150 250 100 150
> 2 2 100 250 100 150
> 3 1 200 200 200 200
> ....
>
> where only the portion of the Aggr relation which is relevant to the
> Dept data is shown.
Consider a simpler example, where we have only a single column to aggregate over, and we want to find the min. The value of the Aggr relation has every value of the join attribute in column one, and the min value in column two. But what the specific value for the second column is will depend on the current value of the other relation operand; in essence, the Aggr relation is a value parameterized by the relation we're aggregating over. This makes it hard to see how construction of the Aggr relation would proceed. (You'd need aggregates or at least some kind of fold to do so.)
I do not have the same objection to, say, using the infinite less-than relation to join with, because the less-than relation is not parameterized by anything. You always (modulo renaming) use the same relation value.
> The set join /\(=) returns all the tuples (Name, Bug#, sum, min, max)
> such that the set of (Pos#,Sal) pairs for each Name equals to the set
> of (Pos#,Sal) pairs matching with a certain Bag#. In our example the
> table can be rewritten in non 1NF as
I call this operation "apply" or "application" since it's the same as function application. Again, it's a natural join followed by inner union of a zero-element relation that has the symmetric difference of the sets of attributes of the two relation operands. It is a nice property of Apply that it can be expressed in only two RL operations, and the "synthetic" empty relation operand can be constructed statically; that is, without reference to the *value* of the operands to Apply, but only to their types.
Now we simply natural join R with f, and the result of the SELECT/GROUP BY above is in Aggr.
The above definition of f proceeds from the names of R and the desired fold function, in this case +. It can be defined statically, and parameterized statically. (Although, as I mentioned, that requires some fancy parametric polymorphism. Mathematically this is a non issues but for programming languages it's not so simple.)
(Amusingly, the above imperative code is likely to match fairly closely the *implementation* of group by. But it's not that great a mathematical model because of the impurity.)
In a completely different, and purely functional approach, for a long time I've been imagining a way of handling aggregates based on this paper:
http://labs.google.com/papers/mapreduce.html
But in order to make the model powerful enough to package up avg() the way you want it, it got unweildly.
Actually, I had kind of a jolt reading something Bob Badour wrote recently about defining aggregates as either a fold or an expression written in terms of folds. (He did not use the term "fold".) Thus, avg(x) can be defined as simply sum(x) / count(). And sum(x) can be defined as simply fold(x,+,0).
I like that much better. For one thing, it makes it much easier to eliminate duplicate folds in a complex group-by. (It's hard to explain why, so I won't bother. It's not clear if this post will interest anyone anyway, the OP excepted. :-)
>From a programming language standpoint, it raises the
question of handling the aggregate expressions. But
that's not any theoretical difficulty, and not necessarily
all that much design difficulty.
Marshall Received on Mon Jun 05 2006 - 04:13:03 CEST