Re: Concurrency in an RDB

From: David <davidbl_at_iinet.net.au>
Date: 18 Dec 2006 17:48:32 -0800
Message-ID: <1166492912.152396.194270_at_80g2000cwy.googlegroups.com>


Sampo Syreeni wrote:
> On 2006-12-18, David wrote:
>
> > Consider that 50% of the individual transactions break the constraint.
> > Then batching won't help. Eg try a batch of size 2. Expectation
> > value of number of compiles per batch = 0.25*1 + 0.25*2 + 0.5*3 =
> > 2.25, which is actually worse than not batching at all.
>
> Under the rule I gave you, you'd test both transactions serially so you
> won't outright lose anything. But the example does prove my thinking
> wrong.
>
> I'd say the gain isn't related to batch size, but the amount of zeroth
> order statistical redundancy in the faulty-ok bitstream. It's also not
> going to scale past 1/2 probability of error, because an oracle for any
> one of a set of transactions being okay is not available. In practice
> the approach will still be usable: most transactions are correct. You'll
> just have to derive the rest of the throughput by throwing hardware at
> the problem; thankfully it's still embarrasingly parallel. Then latency
> becomes compute bound.
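
For what it's worth, that 2.25 figure is easy to check mechanically. Here is a throwaway Python sketch, under my own assumptions: each transaction independently violates the constraint with probability 0.5, a batch fails iff it contains at least one bad transaction, and a failed batch of two is then re-checked one transaction at a time.

  import random

  def checks_for_batch_of_two(p, rng):
      """Number of constraint checks ("compiles") needed for one batch of two."""
      tx1_ok = rng.random() >= p
      tx2_ok = rng.random() >= p
      checks = 1                      # check the whole batch first
      if tx1_ok and tx2_ok:
          return checks               # batch passed: 1 check
      checks += 1                     # batch failed: check tx1 on its own
      if tx1_ok:
          return checks               # tx1 is fine, so tx2 must be the culprit: 2 checks
      checks += 1                     # tx1 was bad: tx2 still has to be checked: 3 checks
      return checks

  rng = random.Random(0)
  trials = 100000
  print(sum(checks_for_batch_of_two(0.5, rng) for _ in range(trials)) / trials)
  # ~2.25 checks per 2 transactions, versus exactly 2 without batching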

But the whole idea doesn't seem right to me. For example, when a transaction in a batch is shown to be bad, are you simply going to throw away the user's edits? If you "compensate", to what extent are you preserving the intent of subsequent transactions that have a causal dependency?

An interesting feature of OT is that it never applies operations at remote sites in an order that breaks causality preservation.
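
Concretely, this is usually enforced with per-site counters ("state vectors"): a remote operation is held back until every operation it causally depends on has already been applied locally. A minimal sketch (the names are my own, but systems built on state vectors do essentially this check):

  def causally_ready(op_vector, op_site, local_vector):
      """op_vector[s] = number of ops from site s the sender had applied
      when it generated this operation."""
      for site, count in op_vector.items():
          if site == op_site:
              if count != local_vector.get(site, 0):
                  return False        # not the next op from its own site
          elif count > local_vector.get(site, 0):
              return False            # depends on ops we haven't seen yet
      return True

  # An op from B that depends on 3 of A's ops isn't applied while we've only seen 2:
  print(causally_ready({"A": 3, "B": 1}, "B", {"A": 2, "B": 1}))   # False
  print(causally_ready({"A": 2, "B": 1}, "B", {"A": 2, "B": 1}))   # True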

> > I thought in practice a good DB application would never keep
> > transactions open while dialogs are displayed for accepting user
> > input.
>
> They often do. That's really the so called long running transaction
> problem in a nutshell: if the DBMS doesn't understand application
> semantics, all it can do to preserve integrity is to note which data the
> application has seen, assume the worst case that the entire write set
> could be dependent on the entire read one, and so to enforce
> serializability within the whole transaction read-write set, no matter
> how large or long lived.
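
If I follow you, the bookkeeping amounts to something like the toy sketch below (purely illustrative, not any particular DBMS): record everything the transaction has read, assume every write may depend on every read, and reject the commit if anything in the read set has changed in the meantime.

  class Store:
      def __init__(self):
          self.data = {}       # key -> value
          self.version = {}    # key -> version number

      def begin(self):
          return {"reads": {}, "writes": {}}

      def read(self, tx, key):
          tx["reads"][key] = self.version.get(key, 0)    # note what the app has seen
          return self.data.get(key)

      def write(self, tx, key, value):
          tx["writes"][key] = value

      def commit(self, tx):
          # worst case: the entire write set may depend on the entire read set,
          # so any change to something we read invalidates the whole transaction
          for key, seen in tx["reads"].items():
              if self.version.get(key, 0) != seen:
                  return False
          for key, value in tx["writes"].items():
              self.data[key] = value
              self.version[key] = self.version.get(key, 0) + 1
          return True

  db = Store()
  t1 = db.begin()
  db.read(t1, "bed_7")                # dialog sits open while the user thinks...
  t2 = db.begin()
  db.write(t2, "bed_7", "occupied")
  db.commit(t2)                       # True
  db.write(t1, "bed_7", "occupied")
  print(db.commit(t1))                # False: t1's read set changed under it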

It seems a drastic performance hit for little gain. Can you provide a realistic example where it is important?

> > However, IMO if the patient information relates more to "data" than
> > "device" then it is better to use an asynchronous data entry model and
> > provide powerful data validation based on read only queries that pick
> > up duplicate data entry errors etc.
>
> Then you're actually not going with locking or even optimism but with
> open transactions: you can't be sure the data you just entered will
> *really* stick until after the validation has run, and succeeded.

I'm not sure if you should call it that. Take the case of collaboratively editing source code using OT. Compilation of the source code represents a (local) validation, but it only very indirectly has anything to do with whether the local or remote edits "really stick". In fact, a validation error often means that *more* needs to be typed in, rather than that what's already been done should be thrown away.
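
To be concrete about the kind of validation I had in mind in the quoted paragraph: a read-only pass that reports suspected duplicates for a human to resolve, without ever discarding the edits that triggered it. A toy sketch with a made-up patient record shape:

  def find_possible_duplicates(patients):
      """patients: iterable of dicts with "id", "name" and "date_of_birth"."""
      seen = {}
      warnings = []
      for p in patients:
          key = (p["name"].strip().lower(), p["date_of_birth"])
          if key in seen:
              warnings.append((seen[key], p["id"]))   # flag the pair, change nothing
          else:
              seen[key] = p["id"]
      return warnings   # the user decides whether to merge, correct or ignore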

This difference of opinion also seems to relate to your comments about annulling conflicting operations. In the case of collaborative text editing, annulling an insert operation is undesirable and unnecessary.
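
The classic insert-against-insert transformation makes the point: when two sites insert concurrently, each just shifts the other's position, and neither edit is annulled. A minimal sketch (the indices and tie-breaking are my own simplification of the usual rule):

  def transform_insert(op, against):
      """Adjust op = (pos, text) so it applies correctly after `against` has."""
      pos, text = op
      other_pos, other_text = against
      if other_pos < pos or (other_pos == pos and other_text <= text):
          return (pos + len(other_text), text)    # shift right; nothing is lost
      return op

  def apply_insert(doc, op):
      pos, text = op
      return doc[:pos] + text + doc[pos:]

  doc = "concurrency"
  a = (0, "on ")          # site A inserts at the front
  b = (11, " control")    # site B appends at the end, concurrently

  # Each site applies its own op first, then the other's op transformed against it:
  at_a = apply_insert(apply_insert(doc, a), transform_insert(b, a))
  at_b = apply_insert(apply_insert(doc, b), transform_insert(a, b))
  print(at_a == at_b == "on concurrency control")   # True: both inserts survive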

> > Branching, merging tagging and so on can be used to distinguish a
> > valid DB from a work in progress.
>
> In theory, yes, but in practice you're then committing to maintaining
> enough context (in the DB) and expertise (around it) to merge the data
> after the fact, instead of forcing the DB to go from one valid state to
> another and aiming for maximal concurrency achievable within that
> constraint. From some personal experience with Lotus Notes, I can tell
> you that the result can be painful.

I imagine that has more to do with Lotus Notes than the underlying principles :)

> Secondly, it's reasonable that you should be able to exploit semantics
> for concurrency and performance when known. But you seem to be
> advocating at least one protocol which *requires* the availability of
> semantic knowledge, so that merging can be automated. To me that seems
> more like ethics than engineering.

I'm not sure what you mean here. Merging in OT is fully automatic, but merely "syntactic". It can't possibly determine whether the merged result is semantically correct.

It seems to me that your examples requiring long-running transactions exist precisely because the DB is only storing information at a level where it can't validate the semantics.

I think the question in the end boils down to two rather different philosophies:

  1. Each user must lock the data for extended periods to make it atomically transition from one valid state to the next. Fine-grained locking is desirable to achieve concurrency.
  2. Users can apply fast, fine-grained updates (and collaborate in real time, or else branch and merge as necessary), and use various data validation tools to massage the "work in progress" into a valid state (which can then be tagged).

These are both entirely valid ways of thinking and maybe the answer comes down to personal preference.

Cheers,
David
