Re: Implement my own in-memory database

From: paul c <toledobythesea_at_oohay.ac>
Date: Wed, 27 Jun 2007 11:24:20 GMT
Message-ID: <ENrgi.67352$xq1.49210_at_pd7urf1no>


Sune wrote:
> Hi all,
>
> I'm a programmer with 10 years of experience and I'm about to
> (try to) implement my own in-memory database. I have worked with
> databases on Solaris for the past 4 years or so.
>
> I want to implement my own in-memory database. However, I lack
> detailed knowledge about the inner workings of a database, such as
> page indexing, dirty page schemes, transaction log schemes, different
> ways of checkpointing and so on.
>
> I try to read white papers such as http://research.solidtech.com/publ/apl-awo-icde06.pdf
> but I'm not able to understand all of it since I don't have the proper
> background.
>
> What resources would you suggest I use (online books, books to buy,
> whitepapers, materials from universities...)?
>
> Thanks for your time
> /Sune
>

(Ironic that such papers have to spend so much time talking about disks!)

A quick answer is that before figuring out the "inner workings", it's likely better to first nail down most of the external "workings" and balance those requirements against the internal components needed and how much time you have.

I realize that you didn't ask for this answer, but if it were me, the first thing I'd do is try to eliminate as many internal components as possible. For example, you mention checkpointing, but do your update volumes, recovery needs and acceptable start-up times really require that feature? If not, you could save a lot of time by using only a synchronized logical redo log. Size does matter: even if a solo developer can produce tens of thousands of lines of code quickly, that doesn't mean the same person will have enough time to support them. So maybe 'solo development' is in fact one of the external requirements.
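To make that concrete, here is a minimal sketch of what "only a synchronized logical redo log" could look like in C: append each logical operation to a log file and force it to stable storage before acknowledging, then replay the whole log at start-up to rebuild the in-memory state. The function names and the plain-text record format are mine, purely for illustration; they don't come from any particular engine.

/* Minimal sketch of a synchronized logical redo log (no checkpointing).
 * Assumptions: single-threaded engine, one log file, operations recorded
 * as plain-text logical statements. All names are illustrative. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>   /* fsync */

static FILE *redo_log;

/* Open (or create) the redo log for appending. */
int redo_open(const char *path)
{
    redo_log = fopen(path, "a+");
    return redo_log ? 0 : -1;
}

/* Append one logical operation and force it to disk before the
 * caller acknowledges the transaction. */
int redo_append(const char *logical_op)
{
    if (fprintf(redo_log, "%s\n", logical_op) < 0) return -1;
    if (fflush(redo_log) != 0) return -1;
    return fsync(fileno(redo_log));    /* the "synchronized" part */
}

/* On start-up, replay every logged operation against the in-memory
 * tables; apply_op() is whatever the engine uses to execute one op. */
void redo_replay(const char *path, void (*apply_op)(const char *))
{
    FILE *f = fopen(path, "r");
    char line[4096];
    if (!f) return;                    /* no log yet: empty database */
    while (fgets(line, sizeof line, f)) {
        line[strcspn(line, "\n")] = '\0';
        apply_op(line);
    }
    fclose(f);
}

With acceptable start-up times, replaying the log is the whole recovery story; checkpointing only becomes necessary when replay gets too slow.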

I'd say a starting point would be a list of twenty or thirty conventional features / requirements, including stuff like: programming interface (if any); end-user interface (if any, e.g., are you talking about an embedded "sub-language" in the Codd sense?); various logical capacities such as number of rows / tables per unit of available memory; language complexity; logical indexing requirements; memory management / storage efficiency (not just utilization but latency, given the continuing gap between cpu and memory speeds); whether multi-threading is really necessary; network support; blah blah blah.

It is an interesting question to me because I think that if today's memory capacities had been available thirty years ago, many of the research directions and common implementation techniques would have turned out differently. For example, contrary to some of those developments, a single-threaded in-memory dbms might have become the canonical logical test for concurrency techniques, and for many OLTP apps it might have sufficed, avoiding a lock manager component entirely. I think you will need a memory manager, but if you can design storage structures that avoid a garbage collector while keeping the amount of wasted memory predictable, that seems worth considering. Sometimes I even wonder whether imaginative use of system tables could avoid built-in bits like session and security management components. I suspect that with enough effort put into storage structures (e.g., using 6/8 btree merge techniques) it might even be possible to index most columns of most tables using roughly double the memory needed to hold the raw ascii/utf-8 data without indexing. Maybe even get away with a single-process / single-thread implementation.
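On the memory-manager point, one way to get predictable waste without a garbage collector is a fixed-slot pool allocator: each table's rows live in equal-sized slots carved out of large arenas, deleted slots go on a free list, and the worst-case waste per row is bounded by the slot size. The sketch below is just one assumption about how that could look; none of the names come from any existing engine.

/* Sketch of a fixed-slot pool allocator: rows of one table live in
 * equal-sized slots carved from large arenas, so there is no garbage
 * collector and waste per row is bounded by the slot size. */
#include <stdlib.h>

#define ARENA_BYTES (1 << 20)          /* 1 MiB per arena */

struct pool {
    size_t slot_size;                  /* fixed size of every row slot */
    char  *arena;                      /* arena currently being filled */
    size_t used;                       /* bytes handed out from arena  */
    void  *free_list;                  /* slots returned by deletes    */
};

void pool_init(struct pool *p, size_t slot_size)
{
    if (slot_size < sizeof(void *)) slot_size = sizeof(void *);
    p->slot_size = slot_size;
    p->arena     = NULL;
    p->used      = ARENA_BYTES;        /* forces first arena allocation */
    p->free_list = NULL;
}

void *pool_alloc(struct pool *p)
{
    if (p->free_list) {                /* reuse a deleted row's slot */
        void *slot = p->free_list;
        p->free_list = *(void **)slot;
        return slot;
    }
    if (p->used + p->slot_size > ARENA_BYTES) {
        /* Start a new arena; earlier arenas stay alive because live
         * rows in them are still referenced by the table. */
        p->arena = malloc(ARENA_BYTES);
        if (!p->arena) return NULL;
        p->used = 0;
    }
    void *slot = p->arena + p->used;
    p->used += p->slot_size;
    return slot;
}

void pool_free(struct pool *p, void *slot)
{
    *(void **)slot = p->free_list;     /* push onto intrusive free list */
    p->free_list = slot;
}

Because every allocation and free is a couple of pointer operations, latency stays predictable too, which matters given the cpu/memory gap mentioned above.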

Gray and Reuter's book on Transaction Processing is one reference that many people admire for dealing precisely with many of the implementation problems, though it should be mentioned that Gray was not a logical db guy. Even before that book he was considered a walking bible on concurrency. Another reference is a researcher at Microsoft, I forget his name but it starts with "L", he used to be a mainframe guy; if you search research.microsoft.com or somesuch you'll probably see papers of his about latency. Then there are a number of language/compiler books that most university libraries should have, written by people like Aho, Ullman et al.

My overall advice is to start with the smallest requirements possible and expect many versions; otherwise you could spend years researching without the satisfaction of seeing a single line of engine code operate. Apart from other things, this suggests to me that SQL (if you have that in mind) is not a great starting point. The smaller the parser the better: if it were me, I'd aim for less than 15,000 lines if using C, and I'd consider something like mozilla xul for a development or even end-user interface, but that's just me. It is risky advice to give, but I'll say it anyway: absorbing much of the "background" that you say you lack might be avoided with sufficient imagination.
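As a toy illustration of how small a non-SQL front end can be, a whitespace tokenizer plus a dispatch table over a handful of verbs already gives you a usable command language. The verbs and handler names here are invented for the sketch, not a recommendation of any particular syntax.

/* Toy command parser: split a line on whitespace, dispatch on the
 * first token. Everything here is illustrative, not a real grammar. */
#include <stdio.h>
#include <string.h>

#define MAX_ARGS 8

typedef void (*handler_fn)(int argc, char **argv);

/* Hypothetical handlers that would call into the engine. */
static void cmd_insert(int argc, char **argv) { (void)argc; (void)argv; }
static void cmd_get(int argc, char **argv)    { (void)argc; (void)argv; }

static const struct { const char *verb; handler_fn fn; } commands[] = {
    { "insert", cmd_insert },
    { "get",    cmd_get    },
};

/* Tokenize one input line and run the matching command, if any. */
void parse_and_run(char *line)
{
    char *argv[MAX_ARGS];
    int argc = 0;
    for (char *tok = strtok(line, " \t\n"); tok && argc < MAX_ARGS;
         tok = strtok(NULL, " \t\n"))
        argv[argc++] = tok;
    if (argc == 0) return;
    for (size_t i = 0; i < sizeof commands / sizeof commands[0]; i++) {
        if (strcmp(argv[0], commands[i].verb) == 0) {
            commands[i].fn(argc - 1, argv + 1);
            return;
        }
    }
    fprintf(stderr, "unknown command: %s\n", argv[0]);
}

Something this size can always be replaced by a real grammar later, once the engine underneath is worth the effort.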

I hope others will comment on the general question as I think it is one of the areas that conventional dbms activity is not spending enough time on.

Good luck,
p