Re: efficient compare
Date: Sun, 23 Apr 2006 22:02:30 +0200
Message-ID: <444BDD56.5000702_at_hotmail.com>
>> I.e, first send checksum of the value of all bins
>
>
> But you said N is a large number which means you have to send a large
> number of checksums.
Basic operation: Hashing a bin. Take all the tuples in the bin and use some SHA1 hash of each and add or xor them. This means you now have 1 hash value representing the whole bin.
Basic operation: Hashing several bins. Just recursively apply "hashing a bin". I.e., to get the hash of bins i, j, just hash each bin and add or xor their values. That means you now have 1 hash value representing two bins.
Assume computer X has set A and computer Y has set B.
Round1:
Computer A hash all bins 1..N and get a single hash value representing
all. Send to computer B.
Computer B hash all bins 1..N and compare with hash value sent by
computer A. If they match, we are done and have the same data, otherwise
go to round 2.
Round 2:
Computer A hash bins 1..N/2 and bins N/2..N, to get two hash values.
Computer B does the same and compares.
If only first hash mismatch: Round 3 does 2 hashes of bins 0N/4..1N/4, and 1N/4..2N/4.
If only second hash mismatch: Round 3 does 2 hashes of bins 2N/4..3N/4, and 3N/4..4N/4.
If both hashes mistmatch: Round 3 does 4 hashes of bins 0N/4..1N/4, and 1N/4..2N/4, and 2N/4..3N/4, and 3N/4..4N/4.
I think you get the idea. Lets say you change a single tuple... that means a single bin changes. That would require:
Round 1: 1 msg with 1 hash value, Round 2: 1 msg with 2 hash values, Round 3: 1 msg with 2 hash values,
...
Round log(N): 1 msg with 2 hash values.
I.e. log(N) msgs, with total 2*log(N) hashes exchanged.
Of course if N is huge, log(N) will be something like 160 msgs. But you could say that you stop after msg 30, and simple transfer whatever bins that have mismatches. That would mean that even if a single tuple changed, you'd be forced to transfer 2^(-30) fraction of the bins. Whether that would be a lot or not would depend on the number of msgs in each bin...
This is a naive algorithm. There must be smarter ones. Received on Sun Apr 23 2006 - 22:02:30 CEST