Re: Concurrency in an RDB

From: David <davidbl_at_iinet.net.au>
Date: 18 Dec 2006 04:54:25 -0800
Message-ID: <1166446464.885673.257380_at_f1g2000cwa.googlegroups.com>

Sampo Syreeni wrote:
> On 2006-12-17, David wrote:
>
> > In OT there is never a need to report failure. Transactions are
> > guaranteed to succeed! A user makes a local modification on local
> > state without any risk of a dirty read w.r.t the local database.
>
> Yes, but in the general case there is a risk of dirty reads to remote
> state. It's being hidden under a symmetry assumption, which effectively
> says that whatever the protocol can produce under such circumstances
> must be a valid state of the distributed DB, but the restriction is
> still there. It shows up in that OT isn't generally applicable, and in
> that (the corrected version of) dOPT annuls some transactions, i.e.
> reduces to an OCC protocol which can only guarantee durability after
> it's completed a kind of validation round.

Although OT behaves like the ultimate OCC protocol there is no direct concept of dirty reads, validation or conflict resolution. However it certainly behaves like that is the case.

> > I agree that with some effort it may be possible to increase
> > performance. Note that if errors are common then this approach may not
> > help much.
>
> By parallelization, adaptive choice of the coding algorithm you use to
> trace faulty transactions (i.e. you try independent sets of compatible
> updates so that the probability of success is approximately 1/2) and
> temporal interleaving, I think you can get pretty much arbitrary
> throughput, with latency dominated by batch buildup.

Consider that 50% of the individual transactions break the constraint. Then batching won't help. Eg try a batch of size 2. Expectation value of number of compiles per batch = 0.25*1 + 0.25*2 + 0.5*3 = 2.25, which is actually worse than not batching at all.

> > Your approach means that a transaction appears to commit even though
> > in actual fact it may be discarded. i.e. it is breaking the
> > durability assumption commonly expected on a DBMS. This will prevent
> > it being used for distributed transactions. Do "open transactions"
> > address this?
>
> Yes. The locking protocol we're discussing prevents any action which
> would be incompatible with the pending compensation from proceeding. At
> the same time compatible, commutative processes benefit from a higher
> level of concurrency because they proceed independently. The point of
> SCC is that ACID properties hold semantically, eventhough they're
> relaxed with respect to low level syntax, and this is used to allow more
> concurrency.

> > Is the flag for whether Bill has had surgery, a functional dependency
> > on other state, and therefore doesn't deserve to be part of the data
> > model?
>
> No. For example, a receptionist might want to (weakly) authenticate Bill
> based on some of the shared knowledge they have on Bill's history. After
> that, the knowledge that he's had surgery is previously unknown to the
> system and couldn't have been deduced from its contents. It's just that
> the new DB state after the update is jointly dependent on both novel
> input *and* the previous state, and the precise dependency is unknown to
> the DB because it's being implemented by an ad hoc human user.

It seems to me that as far as the DB is concerned the flag is completely independent and the system has no need to pay the price of locking out concurrent editing while the user makes the decision based on other state in the DB. I thought in practise a good DB application would never keep transactions open while dialogs are displayed for accepting user input.

> > Functionally dependent state can be very expensive to calculate, and
> > may need to persist in order to avoid unnecessary recalculation. But
> > I regard that as a caching issue.
>
> Certainly, but in those terms we're just talking cache coherency. That's
> basically the same issue as before: how to guarantee distributed
> atomicity and consistency.
>
> >> In your paradigm you'd probably want to make such updates idempotent.
> >> But that's risky, because it calls for precision that human record
> >> keepers simply don't have, and it limits the kind of ad hoc work that is
> >> often necessary in such business.
> >
> > I'm not sure what you mean here. Assignment (with the same value) is
> > idempotent.
>
> Assignment is, yes, but people are exceedingly bad at certain kinds of
> complicated assignment. For example, Bill's surgeries might be modelled
> as a set of timestamped events. Then you have a choice between set
> assignment and if-exists-then-nop-else-insert style inclusion. If you
> let people use such primitives, they're going to mess up the database
> because they're bad at entering precise facts. The only way you're going
> to have any hope of not duplicating some of those timestamps in the long
> run is if you let people check which dates are already there, and
> proceed from there. That calls for remote modifications and locking.

Yes, if we use pessimistic locking we hurt concurrency but improve correctness. However, IMO if the patient information relates more to "data" than "device" then it is better to use an asynchronous data entry model and provide powerful data validation based on read only queries that pick up duplicate data entry errors etc. Decoupling data validation from data entry is more flexible, efficient, modular and easier to think about.

> > 2. only employ simple, local integrity constraints compatible with
> > fast transactions that don't need to block on network I/O.
> >
> > In the second case the system is far more robust because
> > "inconsistencies" between geographically separated sites is explicitly
> > allowed for in the design. Furthermore, the parts of the system are
> > more autonomous (and asynchronous).
>
> Well, sure, but now you're once again assuming the problem away. Locking
> and integrity constraints are means of managing complexity in data and
> update patterns.

I don't like that philosophy. I believe complex CPU intensive integrity constraints should be based on read only data validation queries. Therefore they only help the user to correct mistakes that have already been committed. Branching, merging tagging and so on can be used to distinguish a valid DB from a work in progress.

> You can't really avoid them if your data and access
> patterns happen to require them. If you try, you'll be limiting the
> applicability of your system, and that makes the system far less
> valuable to the user. Often in O&M, the database and the network
> environment are there *precisely* so that one station can interface with
> another across the physical production line, and make decisions based on
> distributed input. The data processing side is there to manage the
> complexity, permit feedback control with large numbers of variables,
> provide possibilities for different synchronization policies so that
> lockstep operation and large buffers within the production line can be
> avoided simultaneously, to provide modularity, flexibility and
> configurability, and so on. That's what the production process needs, so
> that's what you'll have to provide. In O&M, it's the line designers
> who're making all the money, not the DB guy, so you don't really get to
> manage the DB environment any which way.

This discussion is in danger of getting a little nebulous!

I agree there will be times where pessimistic locking is useful - such as the aeroplane seat reservation example.

>
> > Note as well that a distributed integrity constraint may indicate a
> > denormalised design. Can the redundancy be avoided?
>
> What redundancy? The only part in my example that could possibly be
> redundant are the production statistics. They're neither essential to my
> argument, nor can you always avoid them even if redundant. I mean, we've
> already gone over the benefits of isolating long running read
> transactions from the update workload. You can't do that unless you have
> a redundant copy of the data somewhere for the reads to use.

Cheers,
David Received on Mon Dec 18 2006 - 13:54:25 CET

This message: [ Message body ]
Next message: Sampo Syreeni: "Re: Concurrency in an RDB"
Previous message: David: "Re: Concurrency in an RDB"
Maybe in reply to: David: "Concurrency in an RDB"
In reply to David: "Re: Concurrency in an RDB"
Next in thread: Sampo Syreeni: "Re: Concurrency in an RDB"
Reply: Sampo Syreeni: "Re: Concurrency in an RDB"
Reply: NENASHI, Tegiri: "Re: Concurrency in an RDB"
Reply: David: "Re: Concurrency in an RDB"
Reply: David: "Re: Concurrency in an RDB"

Contemporary messages sorted: [ by date ] [ by thread ] [ by subject ] [ by author ]

Original text of this message