Re: U_SUMMARIZE: how aggregate is calculated?

From: Jonathan Leffler <jleffler_at_earthlink.net>
Date: Fri, 03 Dec 2004 05:54:27 GMT
Message-ID: <ncTrd.6477$u81.4836_at_newsread3.news.pas.earthlink.net>


Mikito Harakiri wrote:
> "Alfredo Novoa" <alfredo_at_ncs.es> wrote:

>>I that case Sum(1) would be equivalent to Count(R) and the result
>>would be 3.

>
> After reading "Temporal Data" my impression was that relation cardinality
> doesn't matter anymore. Although facts are still stored as tuples, they are
> easily merged with each other (pack) or partitioned into smaller ones
> (unpack).
>
> Let me try stating the problem formally. Given
>
> R: {A=[1,2], A= [3,5], A= [4,7]}
>
> what is the result of
>
> USING ( A ) <| SUMMARIZE R
> PER TABLE_DEE
> ADD SUM(1) AS THE_SUM_IN_QUESTION |>
Dear Mikito,

I'm glad to see you largely ignored my contribution, other than to re-express your query more precisely. Mikharakiri partially debunked my answer, and I've finished the job in my response to that posting. I was not thinking straight last night! (Maybe 75% was on target, being largely a regurgitation of the textbook, but the residue was awful.)

I believe the answer to this precisely stated question is

    RELATION { TUPLE { THE_SUM_IN_QUESTION 7 } } You are working with a degenerate case - your input relation only has the interval-valued column. Normally, the input relation would have other columns too.

The U_SUMMARIZE operation you've specified is a short-hand for:

PACK
    ( SUMMARIZE ( UNPACK R ON ( A ) )

      PER ( UNPACK TABLE_DEE ON ( ) )
      ADD SUM(1) AS THE_SUM_IN_QUESTION

    )
ON ( )

Unpacking any relation on the empty set of columns doesn't alter the relation, so UNPACK TABLE_DEE ON ( ) is the same as TABLE_DEE. Similarly, the outer PACK is a no-op. So, the overall expression is equivalent to:

SUMMARIZE ( UNPACK R ON ( A ) )

      PER TABLE_DEE
      ADD SUM(1) AS THE_SUM_IN_QUESTION

I'm sure we agree that UNPACK R ON (A) yields a relation with 7 tuples, each of which is a unit-interval.

As I tried to say in my other posting this evening (using horribly sloppy language, I now see):

When you use SUMMARIZE R1 PER R2, the attributes in R2 must all also appear in R1 (same name, same type). For each tuple in R2, the set of tuples from R1 with the same values as the tuple in R2 are treated as a group, and the aggregate operation runs over all the tuples in the group.

Also, the output relation from SUMMARIZE R1 PER R2 consists of the columns from R2 (in this case, no columns) plus the added <summary> - in this case the answer 7. There is one tuple in the output for each tuple in R2. Now, TABLE_DEE clearly has no attributes, and the relation with 7 tuples each containing a unit intervals shares the empty set of attributes with TABLE_DEE, and SUM(1) over those members produces the answer 7.

Further, the output relation from SUMMARIZE R1 PER R2 consists of the columns from R2 (in this case, no columns) plus the added <summary> - in this case the answer 7. There is one tuple in the output for each tuple in R2, and TABLE_DEE contains just one tuple, so the result relation also contains just one.

> Unlike ordinary SUMMARIZE there is a USING clause. Does it have any
> effect at all?

Yes; if you evaluated:

     SUMMARIZE R PER TABLE_DEE ADD SUM(1) AS THE_SUM_IN_QUESTION the answer would be:

    RELATION { TUPLE { THE_SUM_IN_QUESTION 3 } } because there are three tuples in R.

-- 
Jonathan Leffler                   #include <disclaimer.h>
Email: jleffler_at_earthlink.net, jleffler_at_us.ibm.com
Guardian of DBD::Informix v2003.04 -- http://dbi.perl.org/
Received on Fri Dec 03 2004 - 06:54:27 CET

Original text of this message