| Oracle FAQ | Your Portal to the Oracle Knowledge Grid | |
Home -> Community -> Usenet -> comp.databases.theory -> Re: efficient compare
Bob Badour wrote:
>> I.e, first send checksum of the value of all bins
Eh you misunderstood. Here it is more detailed. Have N bins. Hash every tuple to get an integer between 1..N. So every tuple now belongs to a bin.
Basic operation: Hashing a bin. Take all the tuples in the bin and use some SHA1 hash of each and add or xor them. This means you now have 1 hash value representing the whole bin.
Basic operation: Hashing several bins. Just recursively apply "hashing a bin". I.e., to get the hash of bins i, j, just hash each bin and add or xor their values. That means you now have 1 hash value representing two bins.
Assume computer X has set A and computer Y has set B.
Round1:
Computer A hash all bins 1..N and get a single hash value representing
all. Send to computer B.
Computer B hash all bins 1..N and compare with hash value sent by
computer A. If they match, we are done and have the same data, otherwise
go to round 2.
Round 2:
Computer A hash bins 1..N/2 and bins N/2..N, to get two hash values.
Computer B does the same and compares.
If only first hash mismatch: Round 3 does 2 hashes of bins 0N/4..1N/4, and 1N/4..2N/4.
If only second hash mismatch: Round 3 does 2 hashes of bins 2N/4..3N/4, and 3N/4..4N/4.
If both hashes mistmatch: Round 3 does 4 hashes of bins 0N/4..1N/4, and 1N/4..2N/4, and 2N/4..3N/4, and 3N/4..4N/4.
I think you get the idea. Lets say you change a single tuple... that means a single bin changes. That would require:
Round 1: 1 msg with 1 hash value, Round 2: 1 msg with 2 hash values, Round 3: 1 msg with 2 hash values,
Of course if N is huge, log(N) will be something like 160 msgs. But you could say that you stop after msg 30, and simple transfer whatever bins that have mismatches. That would mean that even if a single tuple changed, you'd be forced to transfer 2^(-30) fraction of the bins. Whether that would be a lot or not would depend on the number of msgs in each bin...
This is a naive algorithm. There must be smarter ones. Received on Sun Apr 23 2006 - 15:02:30 CDT
![]() |
![]() |