Re: GROUP BY

From: Brian Selzer <brian_at_selzer-software.com>
Date: Sat, 19 May 2007 18:37:08 -0400
Message-ID: <p_K3i.1155$C96.245_at_newssvr23.news.prodigy.net>


"Marshall" <marshall.spight_at_gmail.com> wrote in message news:1179597932.006120.282290_at_q75g2000hsh.googlegroups.com...
> On May 19, 7:38 am, "Brian Selzer" <b..._at_selzer-software.com> wrote:
>> "Marshall" <marshall.spi..._at_gmail.com> wrote in message
>>
>> > So: what *about* that extended extend?
>>
>> > Extend is a particular form of join, yes? So perhaps we
>> > ought to apply generators with join?
>>
>> > I note that there is an issue with regards to the uniqueness
>> > of values produced by a generator. It's not clear to me if
>> > that ought to be required or not. If it's required, it's not clear
>> > to me if it's enforceable, since I don't see how to prove uniqueness.
>>
>> > If the outputs of the generator aren't unique, then we have
>> > a function from a tuple to a bag, which also matches the
>> > fact that aggregate functions are functions from a bag to
>> > a tuple.
>>
>> I don't agree with this. An aggregate function iterates over a relation,
>> taking a projection of each tuple to find the values required to
>> calculate
>> the result.
>
> Well, that's not really incompatible with what I wrote. You've given
> an operational description; mine is more ... mathematical? abstract?
>

Abstract, yes. Perhaps a bit too abstract. The way I see it, an aggregate function transforms an entire relation, not just a collection of values. To be specific, an aggregate function extends a relation with an additional attribute whose values summarize the values for one or more existing source attributes. The existing source attributes can then be projected away, if desired, using a separate operation.

> Yours is maybe a bit narrow, though, because by being so specific
> about the algorithm to implement aggregation, you've eliminated
> the possibility for parallelization.
>

Perhaps. Iteration is only one way to implement a function that involves quantification.

>
>> While it is convenient to visualize the collection of values
>> for an attribute in a relation as a bag, it should not be ignored that
>> any
>> particular combination of values in that bag depends solely on the
>> existence
>> of a particular *set* of tuples.
>
> Um, I don't see how. Anyway, we can certainly consider sum() in
> the abstract. It is a function taking a bag of numbers to a single
> number.
> In some languages it is considered to take a list, but that too is
> overspecifying, since order is not significant.
>

You're right, you don't need to know how the bag of numbers came into being in order to compute the sum; on the other hand, you do need to know how the bag of numbers came into being in order to invert the process.

>
>> > Ex: a function n_ones(n), which takes a natural number and
>> > generates the number 1 that many times. This is a subset of
>> > the inverse of sum().
>>
>> > n_ones(3) = {| 1, 1, 1 |}
>>
>> > sum( {| 1, 1, 1 |} = 3
>>
>> > So generators are the inverses of aggregates, and join is the inverse
>> > of group by, and generators are applied with join and aggregates
>> > are applied with group-by.
>>
>> Frankly, I don't see how there can even be an inverse for an aggregate.
>> When you aggregate many distinct values into one, information is lost.
>> How
>> can you possibly reverse the process without that information? For
>> example,
>> if you have 30 students in a class where 'C' is the average grade, how
>> can
>> you tell who aced the course or who failed?
>
> The situation is no different with proper functions. Some are
> invertible, some
> are not. Any bijection is invertible. The function f(x) = 5 is not.
>
> Some aggregates are invertible. n_ones() is invertible, don't you
> think?
>
As you define it, yes. The problem with n_ones() is that once the bag has been produced, how are the values applied to the original relation? I think you might end up with duplicates. Perhaps a better example than n_ones() would be the PACK/UNPACK operators that Date, Darwin and Lorentzos describe for relations with discrete intervals. In an unpacked relation, each tuple contains a unit interval, whereas in a packed relation, no two tuples have intervals that meet or overlap. The difference between PACK/UNPACK and aggregates in general is that a packed relation is equivalent to its corresponding unpacked relation.

>
> Marshall
>
Received on Sun May 20 2007 - 00:37:08 CEST

Original text of this message