# Re: efficient compare

From: Andersen <andersen_800_at_hotmail.com>
Date: Sun, 23 Apr 2006 22:02:30 +0200
Message-ID: <444BDD56.5000702_at_hotmail.com>

```>> I.e, first send checksum of the value of all bins
```

>
>
> But you said N is a large number which means you have to send a large
> number of checksums.

Basic operation: Hashing a bin. Take all the tuples in the bin and use some SHA1 hash of each and add or xor them. This means you now have 1 hash value representing the whole bin.

Basic operation: Hashing several bins. Just recursively apply "hashing a bin". I.e., to get the hash of bins i, j, just hash each bin and add or xor their values. That means you now have 1 hash value representing two bins.

Assume computer X has set A and computer Y has set B.

Round1:
Computer A hash all bins 1..N and get a single hash value representing all. Send to computer B.
Computer B hash all bins 1..N and compare with hash value sent by computer A. If they match, we are done and have the same data, otherwise go to round 2.

Round 2:
Computer A hash bins 1..N/2 and bins N/2..N, to get two hash values. Computer B does the same and compares.

If only first hash mismatch: Round 3 does 2 hashes of bins 0N/4..1N/4, and 1N/4..2N/4.

If only second hash mismatch: Round 3 does 2 hashes of bins 2N/4..3N/4, and 3N/4..4N/4.

If both hashes mistmatch: Round 3 does 4 hashes of bins 0N/4..1N/4, and 1N/4..2N/4, and 2N/4..3N/4, and 3N/4..4N/4.

I think you get the idea. Lets say you change a single tuple... that means a single bin changes. That would require:

```Round 1: 1 msg with 1 hash value,
Round 2: 1 msg with 2 hash values,
Round 3: 1 msg with 2 hash values,
```

...
Round log(N): 1 msg with 2 hash values.
I.e. log(N) msgs, with total 2*log(N) hashes exchanged.

Of course if N is huge, log(N) will be something like 160 msgs. But you could say that you stop after msg 30, and simple transfer whatever bins that have mismatches. That would mean that even if a single tuple changed, you'd be forced to transfer 2^(-30) fraction of the bins. Whether that would be a lot or not would depend on the number of msgs in each bin...

This is a naive algorithm. There must be smarter ones. Received on Sun Apr 23 2006 - 22:02:30 CEST

Original text of this message