# Re: efficient compare

Date: Sun, 23 Apr 2006 22:02:30 +0200

Message-ID: <444BDD56.5000702_at_hotmail.com>

>> I.e, first send checksum of the value of all bins

*>*

*>*

> But you said N is a large number which means you have to send a large

*> number of checksums.*

Basic operation: Hashing a bin. Take all the tuples in the bin and use some SHA1 hash of each and add or xor them. This means you now have 1 hash value representing the whole bin.

Basic operation: Hashing several bins. Just recursively apply "hashing a bin". I.e., to get the hash of bins i, j, just hash each bin and add or xor their values. That means you now have 1 hash value representing two bins.

Assume computer X has set A and computer Y has set B.

Round1:

Computer A hash all bins 1..N and get a single hash value representing
all. Send to computer B.

Computer B hash all bins 1..N and compare with hash value sent by
computer A. If they match, we are done and have the same data, otherwise
go to round 2.

Round 2:

Computer A hash bins 1..N/2 and bins N/2..N, to get two hash values.
Computer B does the same and compares.

If only first hash mismatch: Round 3 does 2 hashes of bins 0N/4..1N/4, and 1N/4..2N/4.

If only second hash mismatch: Round 3 does 2 hashes of bins 2N/4..3N/4, and 3N/4..4N/4.

If both hashes mistmatch: Round 3 does 4 hashes of bins 0N/4..1N/4, and 1N/4..2N/4, and 2N/4..3N/4, and 3N/4..4N/4.

I think you get the idea. Lets say you change a single tuple... that means a single bin changes. That would require:

Round 1: 1 msg with 1 hash value, Round 2: 1 msg with 2 hash values, Round 3: 1 msg with 2 hash values,

...

Round log(N): 1 msg with 2 hash values.

I.e. log(N) msgs, with total 2*log(N) hashes exchanged.

Of course if N is huge, log(N) will be something like 160 msgs. But you could say that you stop after msg 30, and simple transfer whatever bins that have mismatches. That would mean that even if a single tuple changed, you'd be forced to transfer 2^(-30) fraction of the bins. Whether that would be a lot or not would depend on the number of msgs in each bin...

This is a naive algorithm. There must be smarter ones. Received on Sun Apr 23 2006 - 22:02:30 CEST