Re: efficient compare

From: Andersen <andersen_800_at_hotmail.com>
Date: Sat, 22 Apr 2006 13:16:15 +0200
Message-ID: <444A107F.8060605_at_hotmail.com>


Bob Badour wrote:
>
> As an example of the importance of physical structure, where do you
> intend to evaluate this result? At the computer containing A, c(A), at
> the computer containing B, c(B), or at some other computer, c(C)?

What I really want is unification between A and B. I.e., after A and B exchange some information, they should each hold (A union B) locally. The only computers in the world are A and B, there is a network between them, and we want to minimize traffic. Local computation on A and B is essentially free (excluding solutions that use some fancy, extremely costly fractal compression or anything of that sort).
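To make this concrete, here is a rough sketch (in Python) of one way to trade cheap local hashing for network traffic; the names tuple_digest and missing_tuples are just made up for illustration. Each side advertises short digests of its tuples, and only the tuples the other side is actually missing cross the wire:

import hashlib

def tuple_digest(t):
    # 8-byte fingerprint of a tuple; collisions are possible in
    # principle, so a real system would need to handle them.
    return hashlib.sha256(repr(t).encode()).digest()[:8]

def missing_tuples(local, remote_digests):
    # Tuples we hold that the peer, judging by its advertised
    # digests, does not; the peer computes the mirror image.
    return [t for t in local if tuple_digest(t) not in remote_digests]

# One round: swap digest sets, then swap only the differences.
A = {("alice", 1), ("bob", 2)}
B = {("bob", 2), ("carol", 3)}
digests_A = {tuple_digest(t) for t in A}
digests_B = {tuple_digest(t) for t in B}
to_A = missing_tuples(B, digests_A)  # tuples B sends to A
to_B = missing_tuples(A, digests_B)  # tuples A sends to B
A |= set(to_A)
B |= set(to_B)
assert A == B == {("alice", 1), ("bob", 2), ("carol", 3)}

With 8-byte digests the announcement round costs far less than shipping whole tuples, and the second round carries only the symmetric difference.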

> As another example, what is the maximum size of a single datagram passed
> over the network and what is the size of the representation of the
> tuples? I can think of an optimization that would improve performance if
> two or more tuples fit in a single datagram.

MTU=1500?
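Just to spell out the batching idea for that MTU: with 1500-byte datagrams and, say, 100-byte tuple encodings, roughly 14 tuples fit per datagram once you subtract headers. A rough sketch (the header size and the pack_batches name are my own assumptions):

MTU = 1500
HEADER_OVERHEAD = 28          # assumed IPv4 + UDP header cost
PAYLOAD = MTU - HEADER_OVERHEAD

def pack_batches(encoded_tuples):
    # Greedily fill each datagram payload with whole tuples;
    # assumes every encoded tuple fits in one payload.
    batch, size, batches = [], 0, []
    for t in encoded_tuples:
        if size + len(t) > PAYLOAD and batch:
            batches.append(b"".join(batch))
            batch, size = [], 0
        batch.append(t)
        size += len(t)
    if batch:
        batches.append(b"".join(batch))
    return batches
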
> If you are trying to minimize traffic between the computers, then
> presumably the cost of any sorts or index creation on the computers
> matters less. But then again, one would still have to weigh the expected
> savings against the costs.

Sorry, my assumption was that index creation and things of that sort are definitely allowed (in fact I expected the solution to involve something like that).

> Do you envision this as part of a distributed database? Some sort of
> replication architecture? To perform a merge-purge of mailing lists?
> Simply to reconcile two similar but independent databases?

Some kind of replication architecture.

> Each of those scenarios will affect the opportunities for optimization.
> For example, in the case of some sort of replication scheme, presumably
> the databases were reconciled at some earlier time. One can also presume
> that the volume of updates is low relative to the total size of the
> database. Otherwise, each database would very quickly become entirely
> out of date with respect to the other.

Right, I will do this unification periodically, to try to keep the replicated DBMSs in sync.

> Since the size of the intersection will be relatively small and because
> the dbmses will have to reconcile updates temporally, it probably makes
> sense to just share log files from the previous checkpoint forward.
> Compressing the log files would be your primary efficiency opportunity.

I want to be able to use the same algorithm to sync nodes A and B even when one of them has never been synced at all, or has a very outdated database. But since the sync is done periodically, the computers should be in sync most of the time. So another requirement is that when A = B the algorithm should be very efficient.
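That "very efficient when A = B" requirement is why I keep thinking of something like a Merkle tree over the sorted tuples (just my own sketch, not something anyone proposed here): identical replicas confirm it by exchanging a single root hash, and divergent ones recurse only into the subtrees that differ, so traffic stays proportional to the size of the difference.

import hashlib

def node_hash(data):
    return hashlib.sha256(data).digest()

def merkle_levels(keys):
    # Leaves are digests of the sorted keys; each internal node
    # hashes its (up to two) children, so the single hash at the
    # top summarizes the whole set.
    level = [node_hash(repr(k).encode()) for k in sorted(keys)]
    levels = [level]
    while len(level) > 1:
        level = [node_hash(b"".join(level[i:i + 2]))
                 for i in range(0, len(level), 2)]
        levels.append(level)
    return levels  # levels[-1][0] is the root hash

A_tree = merkle_levels(range(1000))
B_tree = merkle_levels(range(1000))
# In-sync replicas: a single 32-byte exchange settles it. If the
# roots differed, the peers would compare the two child hashes at
# each level and descend only into the sides that disagree.
assert A_tree[-1][0] == B_tree[-1][0]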

> Pardon me for observing that it sounds like a question or essay topic
> for a course of some sort. People working for a dbms vendor implementing
> some sort of distributed database or replication feature would tend to
> keep abreast of the state of the art using much better sources than usenet.

I find asking on Usenet much better, as it might take me years to gain the experience that a particular person already has in this field. I have Ullman/Garcia-Molina's database book, but it would take me a while to dig through (I did take database courses many years ago during my undergraduate education).
