The TransRelational Model: Clarifications

From: Josh Hewitt <lajos.nagy_at_gmail.com>
Date: 13 Nov 2004 16:44:37 -0800
Message-ID: <1c92edeb.0411131644.16b0a9e6_at_posting.google.com>



There have been some serious concerns about the validity of the performance analysis of the TransRelational Model (TRM from now on). In this post I would like to clarify that analysis and address some of the criticisms.

Alfredo Novoa (alfredo_at_ncs.es):

> To only analyze single relation queries is more than enough to discard
> the analysis.

First of all, my analysis is just a preliminary, back-of-the-napkin calculation that I did in order to establish the feasibility (or lack thereof) of implementing the relational model using the TRM. My reasoning was that any new proposal for a storage organization should prove itself on some simple examples before being checked against more complex ones; hence the lack of queries involving JOIN, SUMMARIZE, etc. In my opinion, no performance analysis based on simple queries should be discarded on the grounds that it has not taken all other kinds of queries into consideration. If the analysis is valid, then the performance concerns are real, and one cannot proceed until those concerns are resolved in some satisfactory manner.

In fact, the two basic functions that I put to the test in my analysis (lookup and record reconstruction, RR from now on) are such fundamental elements that very few useful queries can be formulated without them. Indeed, I cannot think of a single query that does not involve lookup and/or RR.

The performance of lookup and RR cuts to the heart of any storage organization, so negative results in this area are not something that can be taken lightly. All in all, I think that even a preliminary analysis that does not cover every aspect of a new system can be of importance.

Now let's move on to the more debatable question of whether the analysis is valid or not.

Using B-Trees or Hashing in the TRM instead of binary search


Alfredo Novoa (alfredo_at_ncs.es):

> For instance he is assuming ... that the implementation of the TRM
> tables don't use BTrees or hashing techniques ...

Now this is a very good observation. In fact, closer examination of the question of indexing reveals an interesting feature (?) of the TRM: the Field-Values tables _cannot_ easily use B-Trees or Hashing to speed up searches. The core ideas of the TRM hinge on the positional-addressing capability of the FV and RR tables (meaning that they behave like arrays).

(In the following discussion I am going to focus on using B-Trees, because, as we will see, more or less the same arguments apply to Hashing as well.)

First, an important remark: the patent itself does not mention the usage of B-Trees. I thought that this might have been just an oversight, but if one sits down with a piece of paper and tries to work out how the columns of the Field-Values table could be indexed using B-Trees (or Hashing, which is even worse in this case, as we will see), one will be surprised to find that it cannot easily be done! The columns of the Field-Values table cannot be indexed using B-Trees without significantly modifying the TRM itself.

Why not? Well, to put it bluntly: because of the RR table. The links that point to the next attribute value of the same record are _positional_! That means attribute values cannot change their positions without breaking the links. But the whole idea of a B-Tree index is that keys in the leaf nodes move around: leaf nodes are split or merged, and two key values that were adjacent at one moment can be arbitrarily far apart the next.

In order to understand what is wrong with using B-Trees for indexing the columns of the FV table, let's take a moment, and try to answer the following question:

How can the links of the RR table be implemented?

The first possibility is to use physical pointers. (Such a physical pointer might point to a specific block on disk, along with a specific slot inside the block that stores the attribute value.) For example, in the RR table of the EMP relation, a link in the ENAME column points to the value of the HIREYEAR attribute of the same record.

This does not work. In a B-Tree index, attribute values move around, and whenever they move, all physical pointers that pointed to them become invalid. Since the pointers provide only one-way navigation, it cannot (easily) be known which links in the RR table have to be modified. It might be easier to understand the problem through an example:

  Original relation EMP:

        E#   ENAME    HIREYEAR  SALARY

        30   Johnson  1993      5000
        20   Smith    2000      6000
        40   Hewitt   1998      9000
        50   Johnson  1990      7000

  Field-Values (FV) Table (notice that each column is sorted):

   ROW   E#   ENAME    HIREYEAR  SALARY

    1    20   Hewitt   1990      5000
    2    30   Johnson  1993      6000
    3    40   Johnson  1998      7000
    4    50   Smith    2000      9000
    

  Record Reconstruction (RR) Table:     

        E#       ENAME     HIREYEAR    SALARY
         4        3           3     +--->  2  --+
    +->  3        1    +-->   1  ---+      1    |
    |    1       (2) --+      4            4    |
    |    2        4           2            3    |
    |                                           |
    +-------------------------------------------+
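
To make the link-following concrete, here is a minimal Python sketch of the FV and RR tables above. The data is taken straight from the example; the dictionary-of-arrays representation and the function name are mine, purely for illustration, and of course not how an actual TRM implementation would lay things out on disk:

  COLS = ['E#', 'ENAME', 'HIREYEAR', 'SALARY']

  # Field-Values table: every column independently sorted.
  FV = {'E#':       [20, 30, 40, 50],
        'ENAME':    ['Hewitt', 'Johnson', 'Johnson', 'Smith'],
        'HIREYEAR': [1990, 1993, 1998, 2000],
        'SALARY':   [5000, 6000, 7000, 9000]}

  # Record-Reconstruction table: cell i of a column holds the *position*
  # (1-based) of the same record's value in the next column's FV array.
  RR = {'E#':       [4, 3, 1, 2],
        'ENAME':    [3, 1, 2, 4],
        'HIREYEAR': [3, 1, 4, 2],
        'SALARY':   [2, 1, 4, 3]}

  def reconstruct(col, pos):
      """Rebuild a whole record starting from FV[col][pos] by walking
      the positional links once around the ring of columns."""
      record = {}
      start = COLS.index(col)
      for step in range(len(COLS)):
          c = COLS[(start + step) % len(COLS)]
          record[c] = FV[c][pos - 1]
          pos = RR[c][pos - 1]   # a bare array index: purely positional
      return record

  # Smith sits at position 4 of the sorted ENAME column:
  print(reconstruct('ENAME', 4))
  # -> {'ENAME': 'Smith', 'HIREYEAR': 2000, 'SALARY': 6000, 'E#': 20}

Note that every hop is nothing more than an array subscript. This is exactly the positional addressing that breaks down once an index is allowed to move values around.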
            
                
    

Now, let's suppose that we want to give a raise to employee Smith. In other words, we want to update his salary from 6000 to 8000:

UPDATE EMP SET SALARY = SALARY + 2000 WHERE ENAME = 'Smith'

This means that we have to look up Smith in the ENAME column of the FV table and then navigate to the SALARY column (using the links in the RR table). Once we know Smith's salary, we can give him a 2000 raise. Obviously, this means that we have to remove his current salary (6000) from the sorted order and then insert 8000 into the sorted order of the SALARY column. If we were to use a B-Tree to index the SALARY column, the actual physical positions of other salary values might have to change during the update (in the case of a node split or merge, for example). The salary value 7000 might migrate to a different node due to the removal of the salary value 6000, which means that its physical position has changed. But some link in the HIREYEAR column of the RR table pointed to this 7000! We have to update this link so it points to the new location of 7000, otherwise the RR table becomes inconsistent. One way to find this link is to reconstruct the record whose SALARY attribute is 7000: following the links, we can make a cycle, get back to the HIREYEAR column, and modify the link.
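
Continuing the sketch above (reusing COLS, FV, and RR), here is what that repair cycle looks like in code. To find the one RR cell that points at a moved value, we have to walk the ring of columns the long way around. The patent describes no such repair procedure; this is just my illustration of what the text above implies:

  def find_link_into(col, pos):
      """Find the RR cell that points at FV[col][pos] by following the
      links all the way around: ACNT - 1 link fetches per moved value."""
      start = COLS.index(col)
      for step in range(len(COLS) - 1):
          c = COLS[(start + step) % len(COLS)]
          pos = RR[c][pos - 1]
      prev_col = COLS[(start - 1) % len(COLS)]
      return prev_col, pos

  # Suppose a B-Tree split moved salary 7000 away from SALARY position 3:
  print(find_link_into('SALARY', 3))   # -> ('HIREYEAR', 1)
  # RR['HIREYEAR'][0] is the link that must now be patched.

A split that relocates k values therefore costs on the order of k * (ACNT - 1) extra link fetches, which is where the mass reconstructions described next come from.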

Whenever we insert a value into the B-Tree index of an FV column and a split or merge occurs, we have to reconstruct _all_ records whose attribute values have migrated. A merge or split operation in a B-Tree moves roughly half the attribute values of a node to a different node (disk block), which can easily mean hundreds of attribute values changing their positions at the same time. Which means, of course, that hundreds of records must be reconstructed to keep the links in the RR table up to date.

I think it should give us pause that the aforementioned series of unfortunate events can be triggered by updating the SALARY attribute of a _single_ employee ...

To summarize, if the columns of the FV tables are to be indexed using B-Trees, or any indexing technique where the keys might change their physical location during the lifetime of the index structure, then it will become quite costly to perform insertions, updates, or deletions on the relation. In fact, as we have seen, prohibitively costly.   

The second possibility is to use logical pointers as links in the RR table. But this means that these logical pointers must be mapped to physical pointers at some point during the RR process. (It also means that the size of the RR table practically doubles.) This is one additional level of indirection _per attribute_ of the relation. But if we have to go through three steps to get the next attribute value of the record (1st step: fetch the logical pointer; 2nd step: map the logical pointer to a physical pointer; 3rd step: fetch the actual attribute value), we are no better off than with the traditional inverted-table approach. More importantly, the _current_ description of the TRM (the one I used for the analysis) makes the tacit assumption of positional addressing, in other words: physical pointers. It might be the case that the TRM must be modified to accommodate the use of B-Trees, but in that case it will be a different storage organization from the one that was analyzed.
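
For completeness, here is what the logical-pointer variant would look like in the toy example. The id-to-position map is entirely hypothetical (nothing like it appears in the patent); the point is only to make the extra indirection on every hop visible:

  FV_SALARY = [5000, 6000, 7000, 9000]

  # The RR column stores stable logical ids instead of positions ...
  RR_HIREYEAR_LOGICAL = [203, 201, 204, 202]
  # ... so a per-column map must translate an id into the value's
  # *current* position.  Only this map changes when a B-Tree split
  # moves values around; the RR table itself stays intact.
  SALARY_POS_OF = {201: 1, 202: 2, 203: 3, 204: 4}

  def next_salary(hireyear_pos):
      lid = RR_HIREYEAR_LOGICAL[hireyear_pos - 1]  # step 1: fetch link
      pos = SALARY_POS_OF[lid]                     # step 2: map id -> slot
      return FV_SALARY[pos - 1]                    # step 3: fetch value

  print(next_salary(1))   # -> 7000, the salary of the 1990 hire, as before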

In my opinion, the reason for using a positional approach (ordered arrays) for the links in the TRM is that the alternatives (B-Trees) are even worse. If, as someone suggested, we leave 'holes' in the sorted arrays, we might be able to avoid shifting large numbers of attribute values (although this cannot be completely avoided), so insertions, updates, and deletions become (somewhat) feasible. In exchange, we cannot use anything _but_ binary search on a sorted array. Hence the assumption of binary search in the analysis.

About page sizes


Alfredo Novoa (alfredo_at_ncs.es):

> For instance he is assuming ... little page sizes, ...

Quote from the original analysis:

"Let ||EMP|| be quite large (say, around 100`000). Let's say we store 100 employee records per page on average. This means that |EMP| is 1000."

The analysis assumes 100 records per page. What page size does this imply? Well, there are eight attributes in the EMP relation: E#, ENAME, HIREYEAR, SALARY, DEPT#, EMAIL, ADDRESS and PHONENO. The following table gives (quite conservative) estimates of the _average stored sizes_ (not maximum sizes) of these attributes, measured in bytes:

E#            4
ENAME        16
HIREYEAR      2
SALARY        4
DEPT#         4
EMAIL        20
ADDRESS      40
PHONENO      10

Total       100 bytes


This means that we have approximately 100 * 100 = 10`000 bytes per page. That is, if we do not leave room for any insertions and updates; because that is rarely the case, the actual number is going to be closer to 12`000 bytes, or around 12KB. Since disk page sizes are powers of 2, the smallest disk page that can accommodate 100 records is 16KB. Although vague adjectives like 'little' are always open to subjective interpretation, I think it is safe to say that in the year 2004 a disk page size of 16KB cannot be considered little.
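
The same arithmetic in a few lines of Python (the 20% free-space allowance for future insertions and updates is my own rough figure, the one behind "closer to 12`000 bytes" above):

  sizes = {'E#': 4, 'ENAME': 16, 'HIREYEAR': 2, 'SALARY': 4,
           'DEPT#': 4, 'EMAIL': 20, 'ADDRESS': 40, 'PHONENO': 10}
  record = sum(sizes.values())       # 100 bytes per stored record
  raw = 100 * record                 # 10000 bytes for 100 records
  with_slack = int(raw * 1.2)        # ~12000 bytes with ~20% free space

  page = 1
  while page < with_slack:           # smallest power-of-2 page that fits
      page *= 2
  print(record, raw, with_slack, page)   # -> 100 10000 12000 16384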

Closing remark on the page size issue: I deliberately formulated my cost analysis using disk-block transfers as the performance measure, because the general consensus is that disk I/O makes up by far the bulk of the execution cost in traditional query processing environments.

Caching and the size of the relation


Alfredo Novoa (alfredo_at_ncs.es):

> For instance he is assuming that the page cache does not exist, that
> all the DBMSs are based on disk storage ... that a relation or a
> significant part of a 100000 tuples relation don't fit in main
> memory, he only considers the disk page loading costs, etc.

Before going into details, let's make one thing clear: in the analysis I considered using the TRM to implement a disk-based relational DBMS. Disk-based here means that either the size of the database is larger than the available memory, or updates must be written to disk to ensure the durability of transactions (the D in ACID). The latter is almost always the case in a business environment, especially when multiple DBMS instances use the same data on disk. In other words, my analysis does not apply to main-memory databases. (They have significantly different internal organizations and quite often discard even the notion of assigning records to disk blocks.)

From now on I will assume that we are talking about implementing a disk-based relational DBMS (for example, Oracle or SQL Server) using the TRM. Page (block) caching (or buffering) is a traditional way of increasing the performance of disk accesses; it has been used from the early days of computing, first in operating systems and later in DBMSs. Whenever one tries to analyze the features of a new storage organization for a disk-based DBMS, one must face the question of how to include the effects of caching in the analysis. There are two extreme positions on this question: (1) ignore caching altogether; (2) let everything fit in the cache. Obviously, (2) simply means that the analysis of the storage organization 'degrades' into the analysis of in-memory data structures. Although such an analysis can be very useful for main-memory DBMSs, it is not very useful for disk-based DBMSs, where the possibility that some disk blocks are not in the cache cannot be ignored. This is why I did not assume in my analysis that the entire EMP relation is in the cache. (Curiously, in case (2) the traditional approach is also quite efficient, because a full table scan, with its sequential access of memory chunks, is extremely cache-friendly.)

My analysis was based on position (1): ignore caching for a moment and see how the two contestants perform. (Surprisingly, this approach is quite often taken in the field, for the simple reason that if one approach proves better than another without caching, then turning on caching usually does not change the situation significantly.)

What if we enable caching? Would it significantly alter the results of the analysis? Not really. Let's see why:

First, if we let the whole relation be cached, then our analysis does not mean anything, since almost any disk organization will perform well if we let it fit entirely in memory. By the way, the whole idea of 'indexing' is based on the assumption that the data is not sitting conveniently in the cache of our DBMS.

That leaves us with the only useful assumption: that _some_ blocks of the relation _might_ be found in the cache. I am going to argue that this does not help the TRM much, because of the completely random-access nature of its lookup and record-reconstruction steps.

  Lookup

  Finding a record based on the value of one of its attributes is
  performed using binary search (as I have tried to show previously,
  the TRM cannot really use other search techniques). Binary search
  is a random-access search: there is absolutely no guarantee that a
  disk block fetched in a previous step of the search will be needed
  in subsequent steps. In other words, if 20% of the relation is
  cached, then the probability that the next block accessed by the
  binary search will be found in the cache is going to be close to
  20% on average.
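
  A quick way to see the non-locality is to trace which of the 125
  blocks of an FV column a binary search actually touches (the target
  block, 97, is an arbitrary pick):

    def probes(n_blocks, target):
        """Return the sequence of block numbers a binary search reads."""
        lo, hi, touched = 1, n_blocks, []
        while lo <= hi:
            mid = (lo + hi) // 2
            touched.append(mid)
            if mid < target:
                lo = mid + 1
            elif mid > target:
                hi = mid - 1
            else:
                break
        return touched

    print(probes(125, 97))   # -> [63, 94, 110, 102, 98, 96, 97]

  Seven scattered blocks, matching the log 125 = 7 figure used below;
  a search for a different value shares only the first probe or two,
  so blocks cached by one search rarely help the next one.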

  Record-Reconstruction

  This is a _crucial_ thing to understand about the analysis: to
  reconstruct a record, one has to follow the links from a field
  value in one column to the next field value in the next column.
  There is _absolutely no guarantee_ that (and I am quoting the
  original analysis here) "two records that follow each other when we
  sort the relation on one attribute will follow each other (or will
  be even close to each other) if we sort the records on some other
  attribute. The consequence is that when we follow the links to the
  next attribute in the RR phase it will be more or less a random
  access of FV tables." It might be easier to understand this through
  an example:

  relation EMP:

        E#   ENAME    HIREYEAR  SALARY

        30   Johnson  1993      5000
        20   Smith    2000      6000
        40   Hewitt   1998      9000
        50   Johnson  1990      7000


  Field-Values (FV) Table (notice that each column is sorted):

   ROW   E#   ENAME    HIREYEAR  SALARY

    1    20   Hewitt   1990      5000
    2    30   Johnson  1993      6000
    3    40   Johnson  1998      7000
    4    50   Smith    2000      9000
    

  Record Reconstruction (RR) Table:     

        E#       ENAME     HIREYEAR    SALARY
         4        3           3     +--->  2  --+
    +->  3        1    +-->   1  ---+      1    |
    |    1       (2) --+      4            4    |
    |    2        4           2            3    |
    |                                           |
    +-------------------------------------------+
            

  There are two employees now whose name is 'Johnson'. If we execute
  the query EMP WHERE ENAME = 'Johnson' we will have to reconstruct
  two records: {30, Johnson, 1993, 5000} and {50, Johnson, 1990, 7000}.

  The figure shows the reconstruction of one of the Johnsons. Notice
  that when we reconstruct the other Johnson, his HIREYEAR, SALARY,
  and E# attributes will very likely sit in _completely_ different
  positions than the previous Johnson's. This means that caching does
  not help much in this situation, because the data access shows no
  _locality_. If we fetch the block that contains the attribute value
  7000 for SALARY, we might _never_ need this block again during the
  reconstruction of the rest of the result set. To be more precise:
  if 20% of the relation is cached, then the probability that the
  next block accessed by record reconstruction will be found in the
  cache will be close to 20% on average.

To summarize, the TRM does not have the nice property of 'locality of access', which means that its performance improves only linearly as we increase the proportion of the relation that is kept in the cache.

To drive the point home, let's take an example from the previous analysis. Let's suppose that in the large EMP relation the query

EMP WHERE ENAME = 'Johnson'

results in a result set of size 20 records.

Lookup (using binary search) for the TRM:

   Cost: log |ENAME_FV| blocks = log 125 blocks = 7 blocks
   (each FV column occupies |EMP| / ACNT = 1000 / 8 = 125 blocks)

Record-Reconstruction:

   Cost: (ACNT - 1) * ||RS|| blocks = 7 * 20 blocks = 140 blocks

Total Cost: 147 blocks

If 20% of the relation is cached, that means that 20% of the block accesses will find the block in the cache. Because binary search and the record-reconstruction phase are practically random access, we cannot do better than 20%. There is no access locality: previous fetches do not help the TRM.

So the modified cost:

Total Cost (with caching): 147 * 0.8 = 117.6, i.e. about 118 blocks

Let's see what happens if we keep larger and larger portions of the relation in the cache:

  Cached          Cost
  Portion      (in blocks)
   20%            118 
   40%             89
   60%             59
   80%             30
   90%             15
   95%              8

Interestingly, even if 95% of the relation is cached, the TRM has to read 8 blocks to generate a result set of 20 records. (It is easier to understand these numbers if we consider that the TRM has to read 147 blocks out of a total of 1000, of which 950 can be found in the cache. Since those 147 blocks are selected essentially at random, about 5% of them (roughly 8 blocks) will fall among the 50 blocks that are not in the cache.)
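
The table above can be reproduced with a few lines of Python, assuming (as argued above) that the accessed blocks form an effectively uniform random sample, so the miss rate simply equals the uncached fraction:

  import math

  ACNT, result_size, column_blocks = 8, 20, 125
  lookup = math.ceil(math.log2(column_blocks))    # 7 blocks
  total = lookup + (ACNT - 1) * result_size       # 7 + 140 = 147 blocks

  for cached in (0.20, 0.40, 0.60, 0.80, 0.90, 0.95):
      misses = math.ceil(total * (1 - cached))
      print(f'{cached:4.0%}  {misses:4d} blocks')
  # -> 20% 118, 40% 89, 60% 59, 80% 30, 90% 15, 95% 8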

Again, the presence of caching _does not_ invalidate the results of the analysis. (As long as we talk about disk-based DBMSs.)

Inserts


Alfredo Novoa (alfredo_at_ncs.es):

> He also assumes that we need to make room for the inserts and the room
> is not already there.
  

Insertion in the TRM incurs two different kinds of overhead. In my original analysis I addressed only the first of the two: inserting every attribute value into its corresponding position in the Field-Values table. I did not even address the cost of making room for the newly inserted values, for the simple reason that it is just an _additional_ cost. In the example I gave, I analyzed the insertion of a new employee record into the EMP relation. Since an employee record has eight attributes, the TRM has to insert _each and every_ attribute value into its corresponding sorted set. The cost of performing the searches alone is prohibitive:

 ACNT * log (|EMP|/ACNT) = 8 * log (125) = 8 * 7 = 56 blocks

No matter what indexing technique the TRM chooses, it will have to perform a search followed by an insert for _each and every_ attribute value. So even if it used a B-Tree on each attribute (which, as we have seen, would be quite hard to do), and we assume 2 block accesses per B-Tree search, the total cost comes to 8 * 2 = 16 blocks. For inserting a single tuple. If we increase the number of attributes, the cost of insertion goes up proportionally. And let's not forget the cost of locking each separate index, and so on. So before moving on to the problem of finding room for the new values, let's address the overhead caused by eight (or however many attributes the relation has) index updates.
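
The per-attribute arithmetic, spelled out (the figure of 2 block accesses per B-Tree probe is the assumption stated above, not a measured number):

  import math

  ACNT, column_blocks = 8, 125            # |EMP| / ACNT = 1000 / 8 = 125

  # Binary search on every sorted FV column, one per attribute:
  binary_search_cost = ACNT * math.ceil(math.log2(column_blocks))
  print(binary_search_cost)               # -> 8 * 7 = 56 blocks

  # Hypothetical B-Tree on every column, ~2 block reads per probe:
  btree_cost = ACNT * 2
  print(btree_cost)                       # -> 16 blocks, still per tuple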

My goal with these posts is not to bash the TransRelational Model. What I would like to see instead is a healthy debate that might result in better storage organizations for relational DBMSs.

Josh Hewitt
Ph.D. Student
Florida Institute of Technology
