Re: efficient compare

From: Bob Badour <bbadour_at_pei.sympatico.ca>
Date: Fri, 21 Apr 2006 16:51:18 GMT
Message-ID: <a482g.63735$VV4.1191747_at_ursa-nb00s0.nbnet.nb.ca>


Andersen wrote:
> Bob Badour wrote:

>>
>> The efficiency will depend on the physical structure representing the 
>> sets and on the costs of indexes or sorts.

>
> That's why I am asking which structure and indexes would be preferable.
> Assume A and B are on different computers; the goal is to minimize the
> amount of traffic needed to "unify" these two sets.

As an example of the importance of physical structure, where do you intend to evaluate this result? At the computer containing A, c(A), at the computer containing B, c(B), or at some other computer, c(C)?

As another example, what is the maximum size of a single datagram passed over the network, and how large is the representation of a tuple? I can think of an optimization that would improve performance if two or more tuples fit in a single datagram.
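
As a rough Python sketch of that optimization, and only a sketch: the datagram size limit and the tuple encoding below are assumptions I am making for illustration, not anything dictated by your problem.

import struct

MAX_DATAGRAM = 1400  # assumed payload budget in bytes, below a typical MTU

def serialize(tup):
    # Illustrative encoding: each field length-prefixed as UTF-8 text,
    # then the whole record length-prefixed as well.
    body = b""
    for field in tup:
        data = str(field).encode("utf-8")
        body += struct.pack("!H", len(data)) + data
    return struct.pack("!H", len(body)) + body

def pack_datagrams(tuples):
    # Greedily pack serialized tuples into datagram-sized payloads so
    # several small tuples travel in one datagram instead of one each.
    payload = b""
    for tup in tuples:
        rec = serialize(tup)
        if payload and len(payload) + len(rec) > MAX_DATAGRAM:
            yield payload
            payload = b""
        payload += rec
    if payload:
        yield payload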

If you are trying to minimize traffic between the computers, then presumably the cost of any sorts or index creation on the computers matters less. But then again, one would still have to weigh the expected savings against the costs.

Do you envision this as part of a distributed database? Some sort of replication architecture? To perform a merge-purge of mailing lists? Simply to reconcile two similar but independent databases?

Each of those scenarios will affect the opportunities for optimization. For example, in the case of some sort of replication scheme, presumably the databases were reconciled at some earlier time. One can also presume that the volume of updates is low relative to the total size of the database. Otherwise, each database would very quickly become entirely out of date with respect to the other.

Since the set of differences between the two databases will be relatively small and because the dbmses will have to reconcile updates temporally, it probably makes sense simply to share log files from the previous checkpoint forward. Compressing the log files would be your primary efficiency opportunity.
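
By way of illustration only, the log-shipping idea might look something like this; the file path, chunk size, and send() callback are placeholders, since the details depend entirely on the dbms at hand.

import zlib

def ship_log(path, send, chunk_size=64 * 1024):
    # Read the log from the previous checkpoint forward, compress it
    # incrementally, and hand each compressed chunk to send(), which
    # stands in for whatever network transport is actually used.
    compressor = zlib.compressobj()
    with open(path, "rb") as log:
        while True:
            chunk = log.read(chunk_size)
            if not chunk:
                break
            out = compressor.compress(chunk)
            if out:
                send(out)
    send(compressor.flush())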

>> Is this for a course you are taking?

>
> Of course not. I am not looking for a full-fledged solution. I want to
> know the "state of the art". I assumed the database community would have
> done lots of work on this. I know a trivial solution: hash each set's
> keys/values and compare; if they differ, split the sets into two pieces
> (sorted) and exchange... and repeat this until you track down the changes.
> This would require log(n) exchanges to get down to any particular bit.
> There must be lots of other ways...
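
For concreteness, the divide-and-compare approach you describe might be sketched roughly as below. This assumes both sides keep their tuples sorted on the same key; the hash, the split-key choice, and the in-process simulation are all simplifications of what a real exchange over the wire would involve.

import hashlib

def digest(items):
    # Hash a sorted run of (key, value) pairs; any collision-resistant
    # hash would do here.
    h = hashlib.sha1()
    for item in items:
        h.update(repr(item).encode("utf-8"))
    return h.digest()

def find_differences(a, b, diffs, stats):
    # Compare the two runs by hash; on a mismatch, split on a key and
    # recurse, so each differing tuple costs about log(n) comparisons.
    stats["exchanges"] += 1          # one hash comparison ~ one round trip
    if digest(a) == digest(b):
        return
    if len(a) <= 1 and len(b) <= 1:
        diffs.append((a, b))         # a leaf-level mismatch
        return
    # One side proposes a split key; both sides split their own run there.
    ref = a if len(a) >= len(b) else b
    split_key = ref[len(ref) // 2][0]
    find_differences([t for t in a if t[0] < split_key],
                     [t for t in b if t[0] < split_key], diffs, stats)
    find_differences([t for t in a if t[0] >= split_key],
                     [t for t in b if t[0] >= split_key], diffs, stats)

# A toy run: both sides sorted on the key, one value differs.
A = sorted([("k1", 1), ("k2", 2), ("k3", 3), ("k4", 4)])
B = sorted([("k1", 1), ("k2", 2), ("k3", 30), ("k4", 4)])
diffs, stats = [], {"exchanges": 0}
find_differences(A, B, diffs, stats)
print(diffs, stats["exchanges"])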

Pardon me for observing that it sounds like a question or essay topic for a course of some sort. People working for a dbms vendor implementing some sort of distributed database or replication feature would tend to keep abreast of the state of the art using much better sources than usenet.

Received on Fri Apr 21 2006 - 18:51:18 CEST
