Re: Idempotence and "Replication Insensitivity" are equivalent ?

From: Chris Smith <cdsmith_at_twu.net>
Date: Tue, 19 Sep 2006 20:38:52 -0600
Message-ID: <MPG.1f7a57acc83a3133989720_at_news.altopia.net>


Reordering to avoid repetition...

William Hughes <wpihughes_at_hotmail.com> wrote:
> Why do you insist on restricting the name "aggregate function"?

The primary reason is that Pamela posted her question here because she was discussing this with Marshall in comp.databases.theory. The statement that Marshall made, which prompted this whole thing, was:

    "The term is idempotent. If the binary form of the aggregate

     function is idempotent, the aggregate will return the same value
     even if values are repeated arbitrarily. Since + is not idempotent,
     sum() is "sensitive" to repeated values. Since binary min *is*
     idempotent, aggregate min() is not "sensitive" to repeated
     values."

I was (am) trying to address that statement. Since it's not intuitively obvious what is meant by "the binary form of an aggregate function" for some aggregate functions (like COUNT, for example), and because the ambiguous phrase isn't very amenable to considering the matter from the perspective of mathematics, I attempted to formalize what is meant by the binary form of an aggregate function.

This is not, incidentally, something I'm making up. I am *definitely* making up the details of the definition, because I've been unable to find a formal definition from a reputable source, but the general idea that an aggregate function is calculated as g(x1,g(x2,g(x3,...g(xn,e)))) is not mine. It's fairly standard, and it's what Marshall was talking about in that statement. I was simply trying to make it more explicit.

If anyone is attached to the definition of "aggregate function" provided by a vendor like Oracle, they are welcome to be so attached. It doesn't change the fact, though, that Marshall was talking about something more restricted in this thread. It also doesn't make Oracle an authority of database theory, terminology or otherwise. They have their own goals, and their goals are related more toward helping people use their product to write applications than helping people understand how aggregate functions are defined at a theoretical level.

> Moreover, consider the task, "find the five largest elements of this
> list".
> This can be done in a single pass, using only a small amount
> of memory. Yet this is not an aggregate function under
> your definition.

Why not? Perhaps a mistake of mine is causing difficulties. Marshall's definition elsewhere in the thread is probably clearer. It distinguishes between two sets A (members of the multiset) and B of (results of the aggregate function), which makes this clearer. In that case, f: M(A) -> B and the corresponding binary function is g: B x A -> B. Also, my intermediate function would be h: A x M(A) -> B and x_0 would be in B. Does that resolve the difficulty you see here?

> > The confusion here is that it was proposed that one could (under the
> > definitions expressed so far) write an aggregate function that makes a
> > copy of the data, and then does whatever calculation it likes after it's
> > got all the data.
>
> No you can't (at least under your definition). An aggregate function
> maps to A, a copy function maps to M(A).

Again, perhaps Marshall's correction resolves the difficulty you see here. If not, then I'm confused.

> You have not aswered the question "What is an efficient aggregate
> function?".

Assuming that the above (an aggregate function that collects the result into a big auxiliary set) is possible, I suppose one definition of an efficient aggregate function would be one that doesn't do that. In particular, one with requires an amount of space less than some constant regardless of the size of its input.

> >Yes, the functions are defined that way because
> > it's an efficient way to do things. The comment here is that variance
> > can be done that way (a surprise to me, but I'm no great student of
> > statistics).
>
> You don't need to be. The fact that variance is a function
> of the first couple of moments is elementary.

Yep, I see that now. Sorry if it cause confusion earlier.

> Yes, Median cannot be put into the form of your
> aggregate function. Despite this median is very important
> and can be computed efficiently.

Can you clarify what you mean by computed efficiently? Perhaps I'm wrong again.

-- 
Chris Smith
Received on Wed Sep 20 2006 - 04:38:52 CEST

Original text of this message