Re: U_SUMMARIZE: how aggregate is calculated?

From: Jonathan Leffler <jleffler_at_earthlink.net>
Date: Fri, 03 Dec 2004 05:26:20 GMT
Message-ID: <0OSrd.10891$NU3.10596_at_newsread1.news.pas.earthlink.net>


mikharakiri_nospaum_at_yahoo.com wrote:
> Jonathan Leffler wrote:

>>Mikito Harakiri wrote:
>>> Suppose we have one column relation which is essentially a set
>>> of intervals:
>>>{[1,2], [3,5], [4,7]}
>>>
>>> and I want to project away the interval column and calculate 
>>> aggregated value of SUM(1). What the result there would be? 3?
>>> 2? 7?
>>
>> Now, your subject line mentions U_SUMMARIZE, but it is not clear
>> that your question involves it.

>
> Well, isn't SUMMARIZE is the only way to do aggregation in Date et al
> relational syntax?

Well, if you look at Tutorial D in TTM 2nd Edn, you find that there are <agg op inv> and <summary> and although the <agg op names> listed are similar to the <summary spec>, the book goes on to specify that they are different. (An <agg op inv> is a <scalar exp>, which in turn is an <exp>, and <exp> appears in many places in the grammar. <summary spec> only appears in <summary> which is a component of <summarize>.) So one answer is No.

Further, as I mentioned, U_SUMMARIZE is a specific extension of SUMMARIZE discussed in TDRM (and SUMMARIZE is just a special case of U_SUMMARIZE).

>> Suppose your relation is R and the interval attribute (of type 
>> INTERVAL_INTEGER, I'm assuming) is A - not that it matters much.
>> You want to project away the interval column -- OK, that's R { },
>> and the result is a table with one row containing an empty tuple,
>> aka TABLE_DEE. The aggregated value of SUM(1) is 1 because
>> there's one row.
>>   ...oh?...

>
> Nope. "Aggregate-groupby" is an [generalized] projection, right? For
> example,
>
> select empno, sum(1+sal) from emp
> group by empno
>
> is a [generalized] projection to empno. All the other columns of the
> original relation are projected away. After we removed extra column and
> merged duplicates, ordinary projection would have just one column. In
> the extended projection we just do some extra calculation during
> duplicates merge and calculate aggregate expressions.

My interpretation of the question - which I immediately said was probably not the intended interpretation - was that the projection was done first (it is mentioned first) and /then/ calculate the aggregated value. You are attempting a different operation.

>>I don't think that's the question you intended to ask. So, let's
>>rewrite the question to use a U_SUMMARIZE operation.
>>
>>USING ( A) <| SUMMARIZE R PER ?R2? ADD SUM(1) AS B |>
>>
>>What are we to use as 'PER ?R2?'?  Let's assume that we use TABLE_DEE

>
>
>>again (also achieved by omitting the PER clause altogether, I
>>believe), and therefore ACL' is empty.
>>
>>UNPACK TABLE_DEE ON () is just TABLE_DEE.
>>
>>So, the U_SUMMARIZE is a shorthand for...
>>
>>PACK(SUMMARIZE(UNPACK R ON A) PER TABLE_DEE ADD SUM(1) AS B) ON ()
>>
>>Any PACK R ON () operation also yields R, hence we have:
>>
>>SUMMARIZE ( UNPACK R ON A ) PER TABLE_DEE ADD SUM(1) AS B
>>
>>Given the sample data, UNPACK R ON A yields (using your notation with
>>commas in the middle of an interval rather than colons as in TDRM):
>>
>>RELATION { TUPLE { A [1,1] }, TUPLE { A [2,2] }, TUPLE { A [3,3] },
>>            TUPLE { A [4,4] }, TUPLE { A [5,5] }, TUPLE { A [6,6] },
>>            TUPLE { A [7,7] } }

Thus far, I think my answer was correct. Things went off the rails after that, though.

>>Summarizing that PER TABLE_DEE ADD SUM(1) AS B yields, I think:
>>
>>RELATION
>>{ TUPLE { A [1,1], B 1 },
>>   TUPLE { A [2,2], B 1 },
>>   TUPLE { A [3,3], B 1 },
>>   TUPLE { A [4,4], B 1 },
>>   TUPLE { A [5,5], B 1 },
>>   TUPLE { A [6,6], B 1 },
>>   TUPLE { A [7,7], B 1 }
>>}
>>
>>I suspect that this means either (1) I've boobed

I'm now sure that's a bogus expansion and I've boobed.

When you use the SUMMARIZE R1 PER R2, the set of values from R1 that match each tuple in R2 are treated as a unit, and the aggregate runs over that unit - for some ill-defined meaning of the term 'unit'. Now, all the typles in A share the same value as the only tuple in TABLE_DEE, so there is a single 'unit' with 7 members, and the SUM(1) over those members produces the answer 7.

The output relation from SUMMARIZE R1 PER R2 consists of the columns from R2 (in this case, no columns) plus the added <summary> - in this case the answer 7. There is one tuple in the output for each tuple in R2.

So, the RELATION expression above is wrong - flat out wrong. The answer should be a single TUPLE { B 7 }, where B is of type INTEGER or RATIONAL.

>>[..] or (2) I've not used the right U_SUMMARIZE expression.

>
> Why your output relation has 2 columns?

Becaue I got it wrong. I knew that last night a couple of hours after I'd posted it; I didn't have a chance to correct myself until now.

> We project away the interval
> column A, right? Therefore, if we were calculating ordinary projection
> then we would get a relation with no columns at all. The generalized
> projection would include extra aggregate expression column. Therefore
> the result should be a single column relation.

Agreed.

> In order to avoid further misinterpretation, let me write the query in
> SQL
>
> select sum(1) from R
>
> Wait a minute, isn't SQL supposed to be verbose compared to Tutorial D?

That's a different query because it hasn't done the unpacking on the intervals, but it is no worse than my original answer. Since Mikito asked the question again more precisely, I'll leave further discussion to that posting.

-- 
Jonathan Leffler                   #include <disclaimer.h>
Email: jleffler_at_earthlink.net, jleffler_at_us.ibm.com
Guardian of DBD::Informix v2003.04 -- http://dbi.perl.org/
Received on Fri Dec 03 2004 - 06:26:20 CET

Original text of this message