Re: how to build a database from scratch

From: Sampo Syreeni <decoy_at_iki.fi>
Date: Thu, 7 Dec 2006 03:29:32 +0200
Message-ID: <Pine.SOL.4.62.0612061158530.27539_at_kruuna.helsinki.fi>


On 2006-12-06, paul c wrote:

> [...] rather he will choose to add a storage/lock et cetera component
> and force the btree code to operate under that component's
> constraints, eg., not allow it to be aware of disks, et cetera. the
> problems you mention go away [...]

I don't think this is the case. Things like parallelism and recovery are orthogonal in the conceptual sense, but from the point of view of algorithms, data structures and system architecture they are heavily cross-cutting and interact with each other. Naïve B-trees plus indirection is just one example of a strategy that hasn't worked out in practice. Here the reason is mostly the need for fine-grained write parallelism in OLTP environments. The tree easily becomes a bottleneck because the interdependencies and interactions caused by page splits force you to lock too aggressively. When you hit that wall, you cannot circumvent it by adding levels of indirection, because the problem is in the tree maintenance code, and layering will not remove that from the code paths that are being blocked.
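To make the split problem concrete, here is a minimal Python sketch (mine, not from any particular DBMS) of lock-coupled descent for an insert: ancestors stay locked exactly as long as the child below them might split and propagate changes upward. The node layout and the ORDER constant are invented for illustration.

```python
import threading

ORDER = 4  # hypothetical fan-out: a node holding ORDER - 1 keys may split

class Node:
    def __init__(self, leaf=True):
        self.keys = []
        self.children = []
        self.leaf = leaf
        self.lock = threading.Lock()

    def is_safe(self):
        # A node with spare room cannot split, so its ancestors
        # no longer need to be held for this insert.
        return len(self.keys) < ORDER - 1

def descend_for_insert(root, key):
    """Lock-coupled descent: release ancestor locks only once the
    current child is known not to split."""
    held = []                    # locks still held, root-most first
    root.lock.acquire()
    held.append(root)
    node = root
    while not node.leaf:
        # pick the child subtree for this key
        i = 0
        while i < len(node.keys) and key >= node.keys[i]:
            i += 1
        child = node.children[i]
        child.lock.acquire()
        if child.is_safe():
            # child cannot split: every ancestor can be released
            for n in held:
                n.lock.release()
            held = []
        held.append(child)
        node = child
    return held  # caller inserts into the leaf, then releases these
```

The point is the `held` list: whenever a full node sits near the leaves, its whole chain of unsafe ancestors stays exclusively locked for the duration of the insert, which is exactly the contention described above, and no amount of layering above the tree makes it go away.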

When competitive pressures eventually force you to address the issue, you'll start considering alternatives like multigranularity, multimode locking, or optimism plus validation. At first you'll probably consider separating the concern and creating a totally generalized locking component. But the effects of such a decision will immediately ripple. Suddenly you'll have to deal with a whole new set of problems, like ghosting, high lock overhead and the increased susceptibility to deadlock that comes with a high level of concurrency. Maintaining the more fine-grained lock data outside the tree becomes difficult, because logical addressing of the locked parts would have to be adapted to the structure of the tree, and that makes the structure-aware part just as complex and tightly coupled as it would be if implemented as part of the tree itself. Predicate locks have too high a checking overhead. Generalized, fine-grained physical locking is difficult to implement in the presence of nonlinear space allocation and recovery. Cache hit rates become an issue. Or, say, you start suffering mysterious lockups under heavy load because you forgot that lock data now needs to be brought in before a transaction can be killed to free up memory. The list goes on and on.

In the end most DBMS designers have opted for reengineering the tree and isolating much of the complexity within it, instead of going with strict separation of concerns and the distributed complexity that seems to come with it.
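For the multigranularity, multimode option, the central ingredient is a lock-mode compatibility matrix over intent modes. A sketch of the standard IS/IX/S/SIX/X table, not anything specific to one system:

```python
# Standard multigranularity lock-mode compatibility: True means a
# request in the row's mode can be granted while another transaction
# holds the column's mode on the same node of the granularity tree.
MODES = ("IS", "IX", "S", "SIX", "X")
COMPAT = {
    "IS":  {"IS": True,  "IX": True,  "S": True,  "SIX": True,  "X": False},
    "IX":  {"IS": True,  "IX": True,  "S": False, "SIX": False, "X": False},
    "S":   {"IS": True,  "IX": False, "S": True,  "SIX": False, "X": False},
    "SIX": {"IS": True,  "IX": False, "S": False, "SIX": False, "X": False},
    "X":   {"IS": False, "IX": False, "S": False, "SIX": False, "X": False},
}

def can_grant(requested, held_modes):
    """Grant a lock only if the requested mode is compatible with
    every mode already held by other transactions on this node."""
    return all(COMPAT[requested][h] for h in held_modes)
```

For example, a row update takes IX on the table plus X on the row, so a table-level S request from a scanner is refused (`can_grant("S", ["IX"])` is False). Every data access now pays for these checks and for the intent locks up the hierarchy, which is where the lock overhead and the extra deadlock exposure mentioned above come from.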

IMO, building a DBMS is a systems exercise. To get it right, you'll have to strike the proper balance between many opposing concerns. Modularity and the maintainability that comes with it are two such concerns, but they are often at odds with e.g. performance. I don't have any experience implementing a DBMS, but after reading stuff like the ARIES papers and taking a look at a couple of open source DBMSs, I can appreciate the inherent complexity of the problem. For this and other reasons, I tend to agree with the Plumber.

-- 
Sampo Syreeni, aka decoy - mailto:decoy_at_iki.fi, tel:+358-50-5756111
student/math+cs/helsinki university, http://www.iki.fi/~decoy/front
openpgp: 050985C2/025E D175 ABE5 027C 9494 EEB0 E090 8BA9 0509 85C2
