Re: Idempotence and "Replication Insensitivity" are equivalent ?

From: Marshall <marshall.spight_at_gmail.com>
Date: 20 Sep 2006 17:09:18 -0700
Message-ID: <1158797358.143842.254890_at_e3g2000cwe.googlegroups.com>


Chris Smith wrote:
> Marshall <marshall.spight_at_gmail.com> wrote:
> > You post was wonderful, as always, but it's a bit narrow in
> > terms of what it can define, and also I think a modestly
> > higher-order approach would simplify things a bit. Also
> > I'm not sure your introduction of the identity element was
> > done in exactly the right manner.
>
> > As to the higher-orderness, I think I would prefer to
> > build everything with generic folds, rather than define
> > a new inductive h() for every g on the way to defining f.
> >
> > fold : (f: (A, B -> B), S : M(A), e : B) -> B
> >
> > fold f {} e = e
> > fold f {a} e = f(a, e)
> > fold f {a union P(A)} e = f(a, fold2(f, P(A), e))
>
> Did you mean fold instead of fold2 there? If not what's fold2? If so,
> read on.

I meant "fold." The "2" was left over from an earlier edit, in which I was differentiating between two different ways to do the folds. Which I mention now, why not?

The version of fold that requires an e element is not the entire story, although it mostly is. There are cases where we do not have an identity nor a starting point, and so we'd be in trouble if we required one.

Consider min(). Min over unsigned 32 bit ints has an identity: 2^32. However, min over arbirtrary precision ints does not have an identity. Oops! So does it really matter? If n (the size of the input) is 2 or more, we can just do pairs of ints, and if n is 1, we can just return the one element. But what if n is 0? We can define useful results for the sum or the product of an empty set of 32 bit ints or arbitrary precision ints, but there's no useful result for min of an empty set of arbitrary precision ints.

So an alternative version of fold is called for.

Standard fold:
requires a starting point value
has type A, B -> B
if input is {}, returns e
in input is {a}, returns f(a, e)

Alternate fold:
does not take a starting point value
has type A, A -> A
if input is {}, diverges
if input is {a}, returns a

> Okay. So far as I can see, this is equivalent to what I described,
> except that you differentiate between the sets A and B. I admit to
> combining A and B for ulterior motives; I thought it would simplify the
> definition of idempotence of f (my g), and I didn't think we'd lose
> anything since I figured we could just do a union of your A and B, and
> call the result A. I believe I was wrong, because it doesn't allow for
> the case where M(A) is a subset of B, but that was the intent.
>
> > f is invoked a number of times = cardinality of S.
> > If S is empty, fold returns e.
>
> Yes, that was my first mistake.

I wouldn't call it a mistake, given what I said above.

> > f is not required to be commutative and associative,
> > but if it isn't, then it is likely that the result will
> > depend on the order of selection of elemens of
> > S
>
> Right. Given that f is associative and commutative, you are guaranteed
> that fold(f,...) is well-defined. (Proof is fairly simple by
> induction.) Also, in order for fold(f,...) to be well-defined, it is
> necessary that *either* f must be associative and commutative, *or* e
> cannot be an identity of f. (Proof: Let e be an identity of f. First
> consider possible evaluations of fold(f,{a,b},e). They must be equal to
> each other, so f must be commutative in order for fold(f,...) to be
> well-defined. Now consider fold(f,{a,b,c}). Using the commutative
> property as established above, f(f(a,b),c) must be equal to f(a,f(b,b)),
> so f is associative as well.) That is still not a sufficient condition
> for well-definedness in the case where e is not an identity, but I don't
> see an easy sufficient condition.

I'm not really following this. Your either/or above is supposed to be an exclusive or?

> > , and in such cases it would be better to
> > consider S : [A] rather than M(A).
>
> Whether it's better to consider that or not depends on what we're trying
> to do. If we are trying to characterize the results of an aggregate
> function on a relation attribute, then it's actually quite irrelevant, I
> would think.

Given that relation elements are unordered by definition, yes. But we may also wish to perform aggregates on ordered relations. (That is, on a (relation, total order) pair.)

> > If f is commutative and associative, as usual, and
> > also idempotent, then a fold over f will produce the
> > same result, regardless of any repeated values in S.
>
> Yes. That's what was shown earlier (with a different definition, but
> one that's equivalent under those conditions). Furthermore, if fold is
> definable as a well-defined function where e is an identity of f, then f
> is idempotent (and associative, and commutative).
>
> > Once we have fold, we need a mechanism whereby
> > the repeated evaluations of the aggregate function
> > on the scalars in a relation can be translated into a
> > fold over the multiset of all the scalars. I'll just wave
> > my hands for that part.
>
> What seems like "hand-waving" here? The act of extracting a column of a
> relation as a multiset? I don't see why that requires any additional
> clarification.

That's it. I'm glad it was obvious.

> > Note that this doesn't cover all aggregates. We can't
> > do avg() with this approach. We could do these kinds
> > of aggregates with an accumulator, but then we'd
> > need some mechanism to apply a function to the
> > components of the accumulator to get the desired
> > result at the end.
>
> Indeed. In fact, avg() needs two evaluations: one to count the
> elements, and another to add them up. I spoke a friend in the local CS
> department here last night, and he proposed the definition you are
> responding to (which defines primitive aggregate functions, and then
> defines aggregate functions as simple combinations of the values of
> primitive aggregate functions) mainly in order to solve this problem.

There are other useful results from this model than just solving the "problem of avg()" though!

> > Count distinct is a bit of a bastard.
>
> As Bob mentioned, the distinct operator is more of a transformation on a
> projection of the relation than a transformation on an aggregate
> function. Although technically possible, I don't think it does anyone
> any good to define it as an aggregate function.

It does seem perverse.

Marshall Received on Thu Sep 21 2006 - 02:09:18 CEST

Original text of this message