Somebody (on this group, I think) criticized TRM (or at least the few
published examples) for poor disk performance, which was apparently
disputed on the grounds that he had used the wrong examples to reach his
conclusion, if I got it right. But I think TRM is provocative in a
useful way as well.
Ignoring what happens on disk, the two messages I managed to extract
from looking at the TRM were that 1) the 'zig-zag' algorithms allow a
kind of indexing on all attributes, which should make many joins faster
and reduce the number of sorts that might be needed in general, and 2)
there are two basic physical data structures, at least early on in the
patents and in CJ Date's example: loosely, one with values and one
with pointers. So it seems that each 'column-row' intersection stores a
value somewhere and a pointer somewhere.
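Just to pin down what I think those two structures are doing (this is only my reading of Date's worked example, and the function names and construction details below are my own guesses, not anything from the patents), here's a throwaway Python sketch: one structure holds each column's values sorted independently, the other holds, for each cell, which row to visit in the next column, so that zig-zagging across the columns rebuilds an original record.

# Rough sketch of my reading of the two TRM structures: a 'field values'
# table with each column sorted independently, and a 'reconstruction'
# table of row pointers used to zig-zag back to the original records.
def build_trm(rows):
    ncols = len(rows[0])
    perm, inv, field_values = [], [], []
    for j in range(ncols):
        # perm[j][i] = original row whose column-j value is i-th smallest;
        # inv[j][r]  = where original row r's column-j value landed.
        order = sorted(range(len(rows)), key=lambda r: rows[r][j])
        perm.append(order)
        inv.append({r: i for i, r in enumerate(order)})
        field_values.append([rows[r][j] for r in order])
    # recon[i][j]: from cell (i, j) jump to this row in column (j+1) mod ncols.
    recon = [[inv[(j + 1) % ncols][perm[j][i]] for j in range(ncols)]
             for i in range(len(rows))]
    return field_values, recon

def reconstruct(field_values, recon, start_row):
    # Zig-zag from a cell in column 0 back into a full record.
    record, i = [], start_row
    for j in range(len(field_values)):
        record.append(field_values[j][i])
        i = recon[i][j]
    return record

rows = [("smith", 30), ("jones", 25), ("adams", 41)]
fv, rr = build_trm(rows)
print([reconstruct(fv, rr, i) for i in range(len(rows))])   # three full records

Note that every intersection costs one value slot in the first structure and one row pointer in the second, which is where my 'value somewhere and pointer somewhere' reading comes from.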
This started me on planning out a simple-minded 'table' manager. I say
'table' instead of relation because although it would have some of the
features of the relational model, it wouldn't necessarily have them all. It
seems to me that one can make quite a useful memory-based db engine
based on conventional indexing that in a steady-state would be just as
compact, i.e., the amount of memory needed would be roughly the same as
in the TRM scheme. Admittedly, the TRM scheme is over my head, but I
don't see that storing a table row-wise or 'image-wise' and having a
separate index for each column is as feeble-minded as one might conclude
from what's been written about TRM. After all, the row-wise scheme does
keep the 'relation' between values in a row or tuple, for free. If the
rows are all the same size, at least at some level, it also can make
frequent operations such as row inserts and deletes fairly
storage-efficient. Infrequent operations such as adding or removing
columns would need memory working space beyond the minimum needed for
any particular index and 'table' storage scheme. Regurgitating a
'table' would be quite easy, and joins, which are often such a bugaboo
because fear of their poor performance leads to bizarre schema designs,
would be relatively painless with all those indexes.
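To make that concrete, here's a toy sketch of what I mean by a row-wise store with a conventional index per column (the class and its methods are entirely my own invention, nothing from the TRM material). A real engine would want a b-tree per column so that range scans and ordered joins work; a plain hash map stands in for it here, so only equality lookups and equijoins are shown.

# Toy row-wise 'table' with one conventional index per column.
# A dict stands in for what would really be a b-tree per column.
class Table:
    def __init__(self, columns):
        self.columns = columns
        self.rows = {}                            # row id -> tuple of values
        self.indexes = {c: {} for c in columns}   # column -> value -> set of row ids
        self.next_id = 0

    def insert(self, values):
        rid = self.next_id
        self.next_id += 1
        self.rows[rid] = tuple(values)
        for col, val in zip(self.columns, values):
            self.indexes[col].setdefault(val, set()).add(rid)
        return rid

    def delete(self, rid):
        values = self.rows.pop(rid)
        for col, val in zip(self.columns, values):
            self.indexes[col][val].discard(rid)

    def lookup(self, col, val):
        return [self.rows[r] for r in self.indexes[col].get(val, ())]

def equijoin(t1, c1, t2, c2):
    # Index-to-index equijoin: no sort, no full scan of either table's rows.
    out = []
    for val, rids in t1.indexes[c1].items():
        for r2 in t2.indexes[c2].get(val, ()):
            for r1 in rids:
                out.append(t1.rows[r1] + t2.rows[r2])
    return out

The point is only that inserts and deletes touch each index once, and a join never has to sort anything because every column is already indexed.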
Taking a simple case, let all 'rows' be fixed-length, containing
integers or floats whose length is at least the length of the pointer
needed for each value. If you have a b-tree-style index that never
stores values, just pointers, then the optimum overhead for such a db
would be the number of 'intersection' values times the pointer size, say
4 bytes on a typical PC. Because memory latency is so insignificant
when compared to disk latency, such an index doesn't really need to
store values. Of course the blocks of memory would rarely be optimum,
so you might want to say that the average overhead could be double that,
say 8 bytes per value, or double the size of an integer on most PC's.
Still, for double floats, it would be the same size. There would be
some additional overhead for memory management, but if one is careful in
choosing memory chunk sizes and how they are linked, I think this would
amount to a fraction of a byte per intersection value. As for strings,
they could be stored in a fixed-length row 'slot', but they needn't be:
they could be 'heaped' and pointed to by a pointer in the fixed-length
'row' slot. The ratio of index and memory-structure overhead to the
average size of the strings in an app might well be lower than the
corresponding ratio for integers and floats.
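For what it's worth, here is how I picture the fixed-length slot idea (the layout, field names and string-heap scheme below are purely my own illustration): numerics live inline in the slot, and a string lives in a side heap with only a 4-byte offset stored in the slot, so rows stay fixed-length either way.

import struct

# Toy fixed-length 'row slot': a 4-byte int and an 8-byte double inline,
# plus a 4-byte offset into a shared string heap. A pointer-only index
# over such rows would add roughly one 4-byte pointer per intersection.
SLOT = struct.Struct("<idI")            # 16 bytes per row, no padding

string_heap = bytearray()

def heap_string(s):
    # Append a length-prefixed string to the heap, return its offset.
    off = len(string_heap)
    data = s.encode("utf-8")
    string_heap.extend(struct.pack("<I", len(data)))
    string_heap.extend(data)
    return off

def pack_row(qty, price, name):
    return SLOT.pack(qty, price, heap_string(name))

def unpack_row(blob):
    qty, price, off = SLOT.unpack(blob)
    (n,) = struct.unpack_from("<I", string_heap, off)
    return qty, price, string_heap[off + 4 : off + 4 + n].decode("utf-8")

row = pack_row(3, 9.99, "widget")
print(len(row), unpack_row(row))        # 16-byte slot, plus heap bytes for the name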
So, I'm imagining a db, say for some small application that has 100
tables, with an average of 10 attributes each, average values about 10
bytes long, and say 100,000 rows per table. That would be 10 million
rows altogether and, at 100 bytes of values per row, about 1GB of values.
This particular mess would have something like 100 million
intersection values, and at, say, 5 bytes of optimum pointer and memory-
management overhead per value, or double that, say 10 bytes on average,
we'd need about 2GB of memory. Such a db would almost fit in the
application memory of something like Windows 2000, but it seems a
stretch to me.
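Sanity-checking my own arithmetic with a quick throwaway calculation (the per-intersection overhead figure is just the guess from above):

def db_size(tables, attrs, rows_per_table, value_bytes, overhead_per_value):
    # Back-of-the-envelope in-memory size: raw values plus per-intersection
    # index and memory-management overhead.
    total_rows = tables * rows_per_table
    intersections = total_rows * attrs
    values = intersections * value_bytes
    return total_rows, values, values + intersections * overhead_per_value

# 100 tables x 10 attributes x 100,000 rows, 10-byte values,
# ~10 bytes of pointer + bookkeeping overhead per intersection:
print(db_size(100, 10, 100_000, 10, 10))   # 10M rows, ~1GB values, ~2GB total
print(db_size(100, 10, 50_000, 10, 10))    # half the rows: ~1GB total
print(db_size(100, 10, 10_000, 10, 10))    # 1 million rows: ~200MB total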
But what I've noticed over the years is that many useful db's are much
smaller than this, in terms of actual data, especially some of the
old-fashioned ones such as airline reservation systems, not to mention
the typical video or grocery store or even many web apps such as
password lookups. If I cut the above row count in half, I'm down to 1GB
or so. If I cut my database down to 1 million rows across all tables,
we're down to about 200 MB. Even if I add to this app some fairly big
table with a million rows all by itself, I'm using about 400 MB. That's
territory that many PC's have today. It's really the number of rows and
attributes that matters for such a db, not the number of 'tables'. Try
keeping 1 or 2 million rows in your Excel 'database'.
I still haven't said anything about concurrency, e.g., locking and the
interleaving of updates, but I think this can be dealt with by some
unconventional thinking about table and db constraints, allowing such an
engine to be single-threaded. As for persistence, such a db would read
everything in some form at start-up and write a redo-log during
operation. Its goal would be to keep the CPU 100% busy, given
sufficient user activity, minus whatever time the redo-logging kept it
idle. I think garbage collection can mostly be built into the basic
algorithms. Maybe this would hurt concurrency in an app that is mostly
updates, which is a good argument for developing it on a slow CPU!
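To show what I mean by reading everything at start-up and writing a redo-log during operation, here's a minimal single-threaded sketch (the file name, record format and fsync-per-operation policy are just placeholder choices, not a recommendation):

import json, os

LOG = "redo.log"
state = {}          # the whole 'database' lives in memory

def apply_op(op):
    if op["kind"] == "put":
        state[op["key"]] = op["value"]
    elif op["kind"] == "delete":
        state.pop(op["key"], None)

def replay():
    # Start-up: rebuild the in-memory state from the log.
    if os.path.exists(LOG):
        with open(LOG) as f:
            for line in f:
                apply_op(json.loads(line))

def execute(op):
    # Log first, then apply; the fsync is the only thing that idles the CPU.
    with open(LOG, "a") as f:
        f.write(json.dumps(op) + "\n")
        f.flush()
        os.fsync(f.fileno())
    apply_op(op)

replay()
execute({"kind": "put", "key": "x", "value": 1})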
I think the whole area of using memory in a different way is interesting,
and the idea of smaller applications doesn't bother me personally. In
fact, I often wonder whether a bunch of small apps isn't a better way,
project-management-wise, to get things done with fewer surprises.
Anyway, I know I didn't answer the question, but that's what looking at
TRM made me think. Don't know how far I'll get, but I think it's a
niche that's worth a try.
p