Re: how to build a database from scratch

From: Sampo Syreeni <decoy_at_iki.fi>
Date: Fri, 8 Dec 2006 23:02:35 +0200
Message-ID: <Pine.SOL.4.62.0612081631470.28238_at_kruuna.helsinki.fi>


On 2006-12-08, paul c wrote:

> This is getting further and further away from the original question
> (at least the single answer "use a btree" to the original question)
> which I understood to be - does a BTREE algorithm need to change to
> allow concurrency.

That's precisely why I consider DBMS design a systems exercise. The effects of design decisions ripple from one subsystem to another, and managing the overall complexity is usually much more of an issue than getting one of the components to work right under a chosen load profile.

> The answer is no if tasks or processes independent from the btree ones
> are used, eg., a logical lock manager (one that locks values, not
> blocks) and a transaction's updates are flushed together.

I don't think so. See below.

> Apart from user caution, the obvious answer to adhoc processing on a
> big scale is a second db image which doesn't need to double update
> overhead if it relies on physical logs and which might be only several
> minutes behind the first image in terms of its freshness.

Then your accounting department is going to start complaining about the value proposition of two separate, specialized databases with two sets of specially trained DBAs, without full interoperability, and so on. In most cases you're going to have a hard time convincing them that a single Oracle instance wouldn't offer better bang for the buck. After all, mainstream commercial DBMSs are as complex as they are precisely because they (more or less successfully) try to offer everything to everyone at the same time.

I mean, if you think about it a bit and refer back to the history of multiversion concurrency control, it's there precisely to automate the maintenance of that second (or nth) queryable copy of the database, so that read-only (and read-mostly) transactions have a stable source to work on without you laying a finger on it.
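To make that concrete, here's a toy sketch of the multiversioning idea: writers append new versions instead of overwriting in place, so a read-only transaction keeps querying the snapshot it started with. All the names (VersionedStore, the commit clock) are made up for illustration and don't correspond to any particular DBMS's internals.

```python
# Toy MVCC sketch: writers append versions, readers pin a snapshot.
class VersionedStore:
    def __init__(self):
        self.clock = 0       # global commit timestamp
        self.versions = {}   # key -> list of (commit_ts, value)

    def write(self, key, value):
        """Commit a new version; older versions remain readable."""
        self.clock += 1
        self.versions.setdefault(key, []).append((self.clock, value))
        return self.clock

    def snapshot(self):
        """A read-only transaction remembers the clock at its start."""
        return self.clock

    def read(self, key, snap_ts):
        """Return the newest version visible at snapshot snap_ts."""
        visible = [v for ts, v in self.versions.get(key, []) if ts <= snap_ts]
        return visible[-1] if visible else None


store = VersionedStore()
store.write("balance", 100)
snap = store.snapshot()        # a long-running report starts here
store.write("balance", 250)    # a concurrent update commits later
print(store.read("balance", snap))              # the report still sees 100
print(store.read("balance", store.snapshot()))  # a fresh reader sees 250
```

The point is exactly the one above: the "second image" is maintained automatically as a side effect of versioned writes, and nobody has to lay a finger on the readers' copy.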

> A decent designer will know the workload a priori.

A decent DB designer must, yes. Whether a DBMS designer can be expected to know it depends on which camp you're in, one-size-fits-all or RISC.

> Fair enough, but I see no reason why a btree algorithm needs to
> recognize latency.

Because walking a non-memory-resident tree imposes one seek latency per tree level, and that latency needs to be hidden. You need to have as many tasks walking the tree at the same time as possible, with as little mutual interference as possible.
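The arithmetic is easy to run yourself. The numbers below are pure assumptions for illustration (a four-level tree whose nodes all miss the buffer pool, and roughly 8 ms per random seek), but Little's law then tells you how many walks must be in flight to sustain a given lookup rate:

```python
# Back-of-envelope: why tree walks must be overlapped.
# Assumed figures, for illustration only.
levels_on_disk = 4          # tree levels that each cost a seek
seek_ms = 8.0               # one random I/O on a spinning disk

lookup_latency_ms = levels_on_disk * seek_ms   # serial cost of one walk
target_lookups_per_sec = 1000.0

# Little's law: concurrency = throughput * latency.
walks_in_flight = target_lookups_per_sec * (lookup_latency_ms / 1000.0)

print(lookup_latency_ms)   # 32.0 ms per uncached lookup
print(walks_in_flight)     # 32.0 concurrent walkers needed
```

So even a modest thousand lookups per second forces dozens of concurrent walkers onto the same tree, which is why their interference matters so much.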

Some of the tasks must also be capable of walking the tree for eventual update, and that eventual update must be allowed to modify the topology of the tree, because that's what keeps B-trees balanced. In the presence of a naïve tree and concurrency, each update task must be assumed to want to reassign the whole tree at will: that is what a sequential computation is entitled to do once it's done with its modifications, and nothing restrains the designer of a sequential algorithm from doing just that. Thus, if you treat a sequential algorithm as a black box for interfacing purposes, you'll always have to treat it as something that can reassign everything it has access to at will, and as something that could arbitrarily compete for the right to do so.

Obviously, reassigning a whole tree falls terribly short of what we can do once we know more about what the many parallel tasks are actually trying to accomplish.

In reality you do not want arbitrary reassignment, even though the algorithm was originally built in an environment that would allow it. For external, largely arbitrary reasons, sequential algorithms invariably end up assuming much less. But in order to make the parallel algorithm do better than emulate the sequential one, you first need to understand precisely which assumptions the sequential algorithm is based on. Mostly these assumptions live at the higher, logical level of invariants, structure and convention. B-tree algorithms largely rely on the topological invariants of balanced tree structures and the order on the indexed domain. Those B-tree specifics allow you to decompose the reassignment into incremental, localized steps which can run efficiently in parallel without interfering with each other, while still producing a serializable outcome.
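One classic form of that decomposition is lock coupling ("crabbing"): give every node its own latch and, during a descent, hold at most a parent/child pair at a time. The sketch below illustrates the idea only for search; the node layout is invented for the example (a real B-tree holds many keys per node and would also split and merge on update), so treat it as a cartoon of the localized steps, not a full algorithm.

```python
import threading

class Node:
    """Minimal tree node with a per-node latch; layout is illustrative."""
    def __init__(self, keys, children=None):
        self.lock = threading.Lock()
        self.keys = keys                  # sorted keys in this node
        self.children = children or []    # children[i] covers keys < keys[i], etc.

def search(root, key):
    """Descend with lock coupling: latch the child, then release the
    parent, so concurrent walkers only collide one node at a time."""
    root.lock.acquire()
    node = root
    while node.children:
        # pick the child subtree whose key range contains key
        i = 0
        while i < len(node.keys) and key >= node.keys[i]:
            i += 1
        child = node.children[i]
        child.lock.acquire()     # latch the child before...
        node.lock.release()      # ...letting go of the parent
        node = child
    found = key in node.keys
    node.lock.release()
    return found

leaf1 = Node([1, 4])
leaf2 = Node([10, 15])
root = Node([10], [leaf1, leaf2])
print(search(root, 4))    # True
print(search(root, 15))   # True
print(search(root, 7))    # False
```

Note how the tree's own invariants do the work: because a node's key range bounds everything below it, a walker that has latched a child can safely forget the parent, which is exactly the kind of localized step a black-box sequential algorithm could never license.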

Now the problem is that those smaller pieces are tightly bound to the tree structure and the precise sequential update algorithm we started with. The locking protocol that makes such updates go really fast wouldn't work for any other data structure or algorithm.

That's why you eventually want to make the tree itself locking-aware: even if you could export all of the concurrency-relevant machinery to a generic locking component, the most efficient implementation would call for immense amounts of tree-specific code, and the coupling between the locker and the tree would become so tight that call overheads, cache performance and code path length just wouldn't add up. The complexity of the locking component would increase as much as, or more than, when you explicitly parallelize the tree itself. That's why the latter option is mostly chosen in late-generation DBMSs.

-- 
Sampo Syreeni, aka decoy - mailto:decoy_at_iki.fi, tel:+358-50-5756111
student/math+cs/helsinki university, http://www.iki.fi/~decoy/front
openpgp: 050985C2/025E D175 ABE5 027C 9494 EEB0 E090 8BA9 0509 85C2
Received on Fri Dec 08 2006 - 22:02:35 CET
