Re: U_SUMMARIZE: how aggregate is calculated?

From: Mikito Harakiri <mikharakiri_at_iahu.com>
Date: Fri, 3 Dec 2004 09:36:06 -0800
Message-ID: <1F1sd.35$I%2.36_at_news.oracle.com>


"Jonathan Leffler" <jleffler_at_earthlink.net> wrote in message news:ncTrd.6477$u81.4836_at_newsread3.news.pas.earthlink.net...
> Mikito Harakiri wrote:
> > Given
> >
> > R: {A=[1,2], A= [3,5], A= [4,7]}
> >
> > what is the result of
> >
> > USING ( A ) <| SUMMARIZE R
> > PER TABLE_DEE
> > ADD SUM(1) AS THE_SUM_IN_QUESTION |>
>
> I believe the answer to this precisely stated question is
>
> RELATION { TUPLE { THE_SUM_IN_QUESTION 7 } }
OK. I see only one issue here. The result depends on the discretization granularity. In the problem description I never mentioned that the intervals have been built out of integers. What if they are rationals? Most of the queries are invariant to the discretization, however. In other words, the queries are supposed to produce the same answer no matter how fine discretization is.

> You are working with a degenerate case - your input relation only has
> the interval-valued column. Normally, the input relation would have
> other columns too.
>
> The U_SUMMARIZE operation you've specified is a short-hand for:
>
> PACK
> ( SUMMARIZE ( UNPACK R ON ( A ) )
> PER ( UNPACK TABLE_DEE ON ( ) )
> ADD SUM(1) AS THE_SUM_IN_QUESTION
> )
> ON ( )
>
> Unpacking any relation on the empty set of columns doesn't alter the
> relation, so UNPACK TABLE_DEE ON ( ) is the same as TABLE_DEE.
> Similarly, the outer PACK is a no-op. So, the overall expression is
> equivalent to:
>
> SUMMARIZE ( UNPACK R ON ( A ) )
> PER TABLE_DEE
> ADD SUM(1) AS THE_SUM_IN_QUESTION
>
> I'm sure we agree that UNPACK R ON (A) yields a relation with 7
> tuples, each of which is a unit-interval.

"Init interval"! Perhaps, this is the answer to my previous concern. Do Date et al always use unit discretization, or they allow to perform finer discretization (but, then, the result should be normalized to unit interval)? Note, that the aggregation on the interval domain is essentially Rieman integral definition.

> When you use SUMMARIZE R1 PER R2, the attributes in R2 must all also
> appear in R1 (same name, same type). For each tuple in R2, the set of
> tuples from R1 with the same values as the tuple in R2 are treated as
> a group, and the aggregate operation runs over all the tuples in the
> group.
>
> Also, the output relation from SUMMARIZE R1 PER R2 consists of the
> columns from R2 (in this case, no columns) plus the added <summary> -
> in this case the answer 7. There is one tuple in the output for each
> tuple in R2. Now, TABLE_DEE clearly has no attributes, and the
> relation with 7 tuples each containing a unit intervals shares the
> empty set of attributes with TABLE_DEE, and SUM(1) over those members
> produces the answer 7.
>
> Further, the output relation from SUMMARIZE R1 PER R2 consists of the
> columns from R2 (in this case, no columns) plus the added <summary> -
> in this case the answer 7. There is one tuple in the output for each
> tuple in R2, and TABLE_DEE contains just one tuple, so the result
> relation also contains just one.

Oh, the result set relation shape -- one column, one tuple -- was obvious from the very beginning. SUMMARIZE definition (by Hugh Darwen?), however, involves 2 relations. It's difficult to develop intuition with that kind of syntax.

> > Unlike ordinary SUMMARIZE there is a USING clause. Does it have any
> > effect at all?
>
> Yes; if you evaluated:
>
> SUMMARIZE R PER TABLE_DEE ADD SUM(1) AS THE_SUM_IN_QUESTION
>
> the answer would be:
>
> RELATION { TUPLE { THE_SUM_IN_QUESTION 3 } }
>
> because there are three tuples in R.

Thanks for clarifying this. It looks like significant part of your effort has been spent fighting syntax:-) Received on Fri Dec 03 2004 - 18:36:06 CET

Original text of this message