Re: Idempotence and "Replication Insensitivity" are equivalent ?

From: William Hughes <wpihughes_at_hotmail.com>
Date: 19 Sep 2006 21:18:26 -0700
Message-ID: <1158725906.139023.160010_at_m7g2000cwm.googlegroups.com>


Chris Smith wrote:
> Reordering to avoid repetition...
>
> William Hughes <wpihughes_at_hotmail.com> wrote:
> > Why do you insist on restricting the name "aggregate function"?
>
> The primary reason is that Pamela posted her question here because she
> was discussing this with Marshall in comp.databases.theory. The
> statement that Marshall made, which prompted this whole thing, was:
>
> "The term is idempotent. If the binary form of the aggregate
> function is idempotent, the aggregate will return the same value
> even if values are repeated arbitrarily. Since + is not idempotent,
> sum() is "sensitive" to repeated values. Since binary min *is*
> idempotent, aggregate min() is not "sensitive" to repeated
> values."
>

OK. You are following Marshall's nomenclature. Presumably Marshal has some reason for restricing aggregate functions. But you do not seem to know what this is.

> I was (am) trying to address that statement.

As has been made clear the statement is fine as long as we restrict the concept of aggregate function. So the qestion of why we do this becomes central.

> Since it's not intuitively
> obvious what is meant by "the binary form of an aggregate function" for
> some aggregate functions (like COUNT, for example), and because the
> ambiguous phrase isn't very amenable to considering the matter from the
> perspective of mathematics, I attempted to formalize what is meant by
> the binary form of an aggregate function.
>

And since not all functions on M(A) have binary forms, this will perforce mean restricing the definition of an aggregate function.

> This is not, incidentally, something I'm making up. I am *definitely*
> making up the details of the definition, because I've been unable to
> find a formal definition from a reputable source, but the general idea
> that an aggregate function is calculated as g(x1,g(x2,g(x3,...g(xn,e))))
> is not mine. It's fairly standard, and it's what Marshall was talking
> about in that statement. I was simply trying to make it more explicit.
>

That the study of functions can be writting in the form g(x1,g(x2,g(x3,...g(xn,e)) is standard is reasonable. To suggest that only these functions are interesting is not. This is what you seem to be doing by restricting the use of the term aggregate function.

> If anyone is attached to the definition of "aggregate function" provided
> by a vendor like Oracle, they are welcome to be so attached. It doesn't
> change the fact, though, that Marshall was talking about something more
> restricted in this thread.

The only thread I am familiar with is the one in sci.math. The argument that "replication insensitive functions and idempotent functions are the same because we are limiting ourselved to a class of functions where they are the same" is of course technically valid but carries little weight unless you give a reason for the restriction.

> It also doesn't make Oracle an authority of
> database theory, terminology or otherwise. They have their own goals,
> and their goals are related more toward helping people use their product
> to write applications than helping people understand how aggregate
> functions are defined at a theoretical level.
>

No, but this is a data point that suggests the standard usage of aggregate function is non-restrictive.

> > Moreover, consider the task, "find the five largest elements of this
> > list".
> > This can be done in a single pass, using only a small amount
> > of memory. Yet this is not an aggregate function under
> > your definition.
>
> Why not? Perhaps a mistake of mine is causing difficulties. Marshall's
> definition elsewhere in the thread is probably clearer. It
> distinguishes between two sets A (members of the multiset) and B of
> (results of the aggregate function), which makes this clearer. In that
> case, f: M(A) -> B and the corresponding binary function is
> g: B x A -> B. Also, my intermediate function would be h: A x M(A) -> B
> and x_0 would be in B. Does that resolve the difficulty you see here?
>

Here we have no restriction whatsoever on the output of an aggregate function. This seems much more reasonable than your very restrictive definition in which an aggregate function could only map into A. Still there is a second more fundamental issue. What could the binary form of a function that find the largest five elements be?

> > > The confusion here is that it was proposed that one could (under the
> > > definitions expressed so far) write an aggregate function that makes a
> > > copy of the data, and then does whatever calculation it likes after it's
> > > got all the data.
> >
> > No you can't (at least under your definition). An aggregate function
> > maps to A, a copy function maps to M(A).
>
> Again, perhaps Marshall's correction resolves the difficulty you see
> here. If not, then I'm confused.
>

Only partially. It does remove the difficulty I mentioned. However, even after
making the change (which involves changing the range of g) you still can not produce a copy function (look at the domain of g).

> > You have not aswered the question "What is an efficient aggregate
> > function?".
>
> Assuming that the above (an aggregate function that collects the result
> into a big auxiliary set) is possible, I suppose one definition of an
> efficient aggregate function would be one that doesn't do that. In
> particular, one with requires an amount of space less than some constant
> regardless of the size of its input.
>
> > >Yes, the functions are defined that way because
> > > it's an efficient way to do things. The comment here is that variance
> > > can be done that way (a surprise to me, but I'm no great student of
> > > statistics).
> >
> > You don't need to be. The fact that variance is a function
> > of the first couple of moments is elementary.
>
> Yep, I see that now. Sorry if it cause confusion earlier.
>
> > Yes, Median cannot be put into the form of your
> > aggregate function. Despite this median is very important
> > and can be computed efficiently.
>
> Can you clarify what you mean by computed efficiently? Perhaps I'm
> wrong again.
>

Just that it is practical to compute the median routinely. I did not have anything more formal in mind. My point is that the median function is clearly of practical importance, so defining aggregate function in such a way as to exclude the median needs justification.

My original contribution to this thread was that if you admit CountUnique to the class of functions you consider "replication insensitive", then "replication insensitive" does not mean the same thing as"idempotent". Your reply was that the OP's post was a misrepresentation of another thread in which a class of functions that did not include CountUnique was being discussed. Having not seen this thread I cannot comment on the validity of your claim. However, I do note that you seem unable to justify the exclusion.

  • William Hughes
Received on Wed Sep 20 2006 - 06:18:26 CEST

Original text of this message