# Re: efficient compare

Date: Tue, 25 Apr 2006 15:23:42 +0200

Message-ID: <444E22DE.6040403_at_hotmail.com>

I am glad someone is actually commenting on my algorithm.

Bob Badour wrote:

*> >
*

> [snipped example]

*>
**> We are dealing with three numbers: C, N, and M. C is the cardinality of
**> the relation and is linearly proportional to the size of the relation. N
**> is the size of the hash table, which is arbitrarily chosen as a large
**> number. M is the number of changes since the last synch and is linearly
**> proportional to the size of the log file.
*

C is the cardinality of the sets right? The only relations I know are the tuples (which form a subset of SxS where S is whatever the components of the tuple are).

> We established earlier that C >> M.

Not necessary. I want a generic algorithm which works under all circumstances, where you pay a "cost proportional to the amount of asynchrony".

> Assuming a uniform distribution and C > N, one can expect to find C/N

*> tuples in each bucket. It makes no sense to choose N < M, because one
**> will expect to find M/N > 1 updates in each bucket meaning you will send
**> O(2^N) messages plus O(C) tuples. If we choose N > M and have a uniform
**> distribution of changes in buckets, we can expect O(M) <= M buckets to
**> have updates.
*

Still with you... also agree with assumption of N > M.

> In the example snipped, you give M = 1 and log(N) = 160. I simply

*> observe that 160 >> 1. For M << N, your example suggests you will need
**> to send O(M * log(N)) messages plus O(M * C/N) tuples, which is much
**> larger than the logfile O(M).
*

I agree to the cost of my algo being

O(M * log(N)) [for transfering M checksum path of the checksum tree) +
O(M * C/N) [cost of M buckets each containing C/N items]

Some questions that come to my mind:

The log would grow beyond C, right? As time goes, the size of the log
grows to inifinity, even with a fixed C? Then you have to find ways
around that with checkpoints etc?

If the log can be really big, how do you know how much of it to transfer? Lets say the checksum of the two logs on the two computers mismatch, how much of it should we send?

My main question: If we have several machines, and they are each making updates locally, and trying to synchronize all with eachother (running your pairwise log synch algo), the logs are not just this monotonically increasing log.

At time t nodes A and B and C have the log X. Some moment later, A appends some entry to its local log X. At the same time, B does some modification to its X. A third node also makes some updates. How do we go about making this work? Received on Tue Apr 25 2006 - 15:23:42 CEST