Re: U_SUMMARIZE: how aggregate is calculated?
Date: Fri, 03 Dec 2004 05:54:27 GMT
Message-ID: <ncTrd.6477$u81.4836_at_newsread3.news.pas.earthlink.net>
Mikito Harakiri wrote:
> "Alfredo Novoa" <alfredo_at_ncs.es> wrote:
>>I that case Sum(1) would be equivalent to Count(R) and the result >>would be 3.
>
> After reading "Temporal Data" my impression was that relation cardinality
> doesn't matter anymore. Although facts are still stored as tuples, they are
> easily merged with each other (pack) or partitioned into smaller ones
> (unpack).
>
> Let me try stating the problem formally. Given
>
> R: {A=[1,2], A= [3,5], A= [4,7]}
>
> what is the result of
>
> USING ( A ) <| SUMMARIZE R
> PER TABLE_DEE
> ADD SUM(1) AS THE_SUM_IN_QUESTION |>
Dear Mikito,
I'm glad to see you largely ignored my contribution, other than to re-express your query more precisely. Mikharakiri partially debunked my answer, and I've finished the job in my response to that posting. I was not thinking straight last night! (Maybe 75% was on target, being largely a regurgitation of the textbook, but the residue was awful.)
I believe the answer to this precisely stated question is
RELATION { TUPLE { THE_SUM_IN_QUESTION 7 } }
You are working with a degenerate case - your input relation only has
the interval-valued column. Normally, the input relation would have
other columns too.
The U_SUMMARIZE operation you've specified is a short-hand for:
PACK
Unpacking any relation on the empty set of columns doesn't alter the
relation, so UNPACK TABLE_DEE ON ( ) is the same as TABLE_DEE.
Similarly, the outer PACK is a no-op. So, the overall expression is
equivalent to:
SUMMARIZE ( UNPACK R ON ( A ) )
( SUMMARIZE ( UNPACK R ON ( A ) )
PER ( UNPACK TABLE_DEE ON ( ) )
ADD SUM(1) AS THE_SUM_IN_QUESTION
)
ON ( )
PER TABLE_DEE ADD SUM(1) AS THE_SUM_IN_QUESTION
I'm sure we agree that UNPACK R ON (A) yields a relation with 7 tuples, each of which is a unit-interval.
As I tried to say in my other posting this evening (using horribly sloppy language, I now see):
When you use SUMMARIZE R1 PER R2, the attributes in R2 must all also
appear in R1 (same name, same type). For each tuple in R2, the set of
tuples from R1 with the same values as the tuple in R2 are treated as
a group, and the aggregate operation runs over all the tuples in the
group.
Also, the output relation from SUMMARIZE R1 PER R2 consists of the
columns from R2 (in this case, no columns) plus the added <summary> -
in this case the answer 7. There is one tuple in the output for each
tuple in R2. Now, TABLE_DEE clearly has no attributes, and the
relation with 7 tuples each containing a unit intervals shares the
empty set of attributes with TABLE_DEE, and SUM(1) over those members
produces the answer 7.
Further, the output relation from SUMMARIZE R1 PER R2 consists of the
columns from R2 (in this case, no columns) plus the added <summary> -
in this case the answer 7. There is one tuple in the output for each
tuple in R2, and TABLE_DEE contains just one tuple, so the result
relation also contains just one.
> Unlike ordinary SUMMARIZE there is a USING clause. Does it have any
> effect at all?
Yes; if you evaluated:
SUMMARIZE R PER TABLE_DEE ADD SUM(1) AS THE_SUM_IN_QUESTION the answer would be:
RELATION { TUPLE { THE_SUM_IN_QUESTION 3 } } because there are three tuples in R.
-- Jonathan Leffler #include <disclaimer.h> Email: jleffler_at_earthlink.net, jleffler_at_us.ibm.com Guardian of DBD::Informix v2003.04 -- http://dbi.perl.org/Received on Fri Dec 03 2004 - 06:54:27 CET