Re: efficient compare

From: Bob Badour <>
Date: Sun, 23 Apr 2006 21:26:51 GMT
Message-ID: <viS2g.64779$>

Andersen wrote:

> Bob Badour wrote:

>>> I.e, first send checksum of the value of all bins 
>> But you said N is a large number which means you have to send a large 
>> number of checksums.

> Eh you misunderstood. Here it is more detailed.
> Have N bins. Hash every tuple to get an integer between 1..N. So every
> tuple now belongs to a bin.

[snipped example]

We are dealing with three numbers: C, N, and M. C is the cardinality of the relation and is linearly proportional to the size of the relation. N is the size of the hash table, which is arbitrarily chosen as a large number. M is the number of changes since the last synch and is linearly proportional to the size of the log file.

We established earlier that C >> M.

Assuming a uniform distribution and C > N, one can expect to find C/N tuples in each bucket. It makes no sense to choose N < M, because one will expect to find M/N > 1 updates in each bucket meaning you will send O(2^N) messages plus O(C) tuples. If we choose N > M and have a uniform distribution of changes in buckets, we can expect O(M) <= M buckets to have updates.

In the example snipped, you give M = 1 and log(N) = 160. I simply observe that 160 >> 1. For M << N, your example suggests you will need to send O(M * log(N)) messages plus O(M * C/N) tuples, which is much larger than the logfile O(M).

What did I misunderstand? Received on Sun Apr 23 2006 - 23:26:51 CEST

Original text of this message