GROUP BY

From: Marshall <marshall.spight_at_gmail.com>
Date: 18 May 2007 11:32:46 -0700
Message-ID: <1179513166.858089.73940_at_k79g2000hse.googlegroups.com>



GROUP BY works on a single relation. The input relation's attributes are partitioned into one of three categories:
  1. those we ignore
  2. those we group by
  3. those we apply the function to

The result relation will have attributes from 2) above, and in addition:

4) those created as a result of the applied function.

The functions mentioned in 4) may only use as arguments the attributes listed in 3).

So for example given R(a, b, c):

    select b, sum(c) as d from R group by b

will ignore attribute(s) a, group by attribute(s) b, and apply a function to attribute(s) c, yielding the additional attribute d. The result will have attributes (b, d).

One of my favorite lessons from TTM was always to consider the 0-arity case.

It is not unusual to consider the case of 1) or 2) being empty. In the first case, we are either grouping by or aggregating over all attributes. In the second, we aren't grouping, but are just aggregating over the entire relation; the result will have a single row.

What if 3) is empty?

In that case, GROUP BY is equivalent to PROJECT!

SQL obscures this fact, due to the templated nature of a SQL query. The 3)-empty case is:

    select b from R group by b

which would just be simplified to:

    select b from R

which is how you project in SQL already.

What if 4) uses not an aggregate function, but just a regular function?

In that case, GROUP BY is equivalent to APPLY! (What TTM calls "COMPOSE".) Recall the example from TTM, from the section "Treating Operators as Relations" (ch. 4).

    TWO_AND_TWO <COMPOSE> PLUS

You may immediately object that this introduces the necessity of duplicate elimination, but <COMPOSE> already had this requirement. (If you have a set of x values, it is not the case that every f(x) will be distinct, unless f is one-to-one.)

I have been thinking about the "reverse" of aggregate functions, if you will: "functions" that take a single input but produce multiple outputs. They are called "generators" in some languages. (They are not strictly functions.) It has lately occurred to me that perhaps a single construct could contain all of project, apply, group by, and a mechanism for applying generators.

To my shock, reviewing ch. 4 for this note, I found at the bottom of page 53, TTM says "One implication of this example is that a relational language such as D might reasonably include an extended form of EXTEND that--unlike the traditional EXTEND--is not necessarily limited to producing aexactly one output tuple from each input tuple."

So it appears I am not the first to think something like this.

Discussion welcome.

Marshall Received on Fri May 18 2007 - 20:32:46 CEST

Original text of this message