Re: Table driven, massively parallel RDBMS in the future

From: DBMS_Plumber <paul_geoffrey_brown_at_yahoo.com>
Date: 12 Feb 2007 13:56:09 -0800
Message-ID: <1171317369.697444.212400_at_v45g2000cwv.googlegroups.com>


On Feb 10, 10:03 am, "-CELKO-" <jcelko..._at_earthlink.net> wrote:
> Computations are much harder than joins, so the future belongs to
> simple theta operators that can be expressed in a few low level
> commands in a processor unit. The programming then moves "up the
> tree" until we have a final answer.

  In the early part of the post, what you are describing is a 'functional query language' - one where the only things the system supports are domains, relations, and relational operators. No domain-level operators at all. It's an idea that has been proposed before, in this fine forum in fact.

   In engineering terms, the second thing you're describing is an engine that possesses what we 'plumbers' know as intra-node parallelism, or query pipelining. Most modern DBMS engines already support this to some degree, and yes, the adoption of multi-core CPUs is likely to change some aspects of how DBMS internals work.
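   To make 'pipelining' concrete: in a plan tree like the one sketched below - a generic illustration over relations R, S and T, not any particular engine's plan format - each operator pulls rows from its child as soon as they are produced, so every level of the tree can be busy at once on different rows:

       Join (S.y = T.y)
        +- Join (R.x = S.x)
        |    +- Scan (R)
        |    +- Scan (S)
        +- Scan (T)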

   But, but, but ... there are serious practical problems with these approaches that are well illustrated by an exploration of some underlying theory.

   It is one thing to improve the efficiency with which a DBMS handles a query plan. It is quite another thing -- a harder thing -- to compile the syntactic form of a query expression into an (efficient) physical plan. Query planning is - famously - NP-hard in the number of relations (the space of plans grows in a combinatorial fashion wrt. the number of relations). So if you convert operations which are currently embedded within the algorithms implementing relational operators into relational operators themselves, you're going to significantly increase the number of plans the engine needs to consider (somehow).
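   (To put the combinatorics in closed form - this is the standard counting argument from the optimization literature, nothing engine-specific: N relations admit N! left-deep join orders, and (2(N-1))!/(N-1)! join trees once bushy plans are allowed. That's 12 trees at N = 3 and 1,680 at N = 5, before you even multiply in the choice of join algorithm and access path at each operator.)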

   Now, this is a problem even in the table-driven solutions you're proposing. Consider your 'date <-> string' mapping table.

   CREATE TABLE Emps (
       Eid     INTEGER      NOT NULL PRIMARY KEY,
       Dept    INTEGER      NOT NULL,
       EName   VARCHAR(64)  NOT NULL
   );

   CREATE TABLE Jobs (
       Jid     INTEGER      NOT NULL PRIMARY KEY,
       JName   VARCHAR(64)  NOT NULL
   );

   CREATE TABLE Worked_On (
       Emp        INTEGER  NOT NULL REFERENCES Emps ( Eid ),
       Job        INTEGER  NOT NULL REFERENCES Jobs ( Jid ),
       StartDate  DATE     NOT NULL,
       EndDate    DATE     NOT NULL
   );

   SELECT E.EName, J.JName
     FROM Emps E, Jobs J, Worked_On W
    WHERE W.Emp = E.Eid
      AND W.Job = J.Jid
      AND :DAY BETWEEN W.StartDate AND W.EndDate;

    Q1: Who was working on what on a given day?

   Now let us consider the same query done with a mapping table that, as your example suggests, maps strings to dates and vice versa.
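   Assume SD_Map looks something like this - one row per calendar day, keyed by its string rendering (the exact shape is a guess inferred from the query below; any keyed pair of columns will do):

   CREATE TABLE SD_Map (
       String  VARCHAR(10)  NOT NULL PRIMARY KEY,
       Day     DATE         NOT NULL UNIQUE
   );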

   SELECT E.EName, J.JName
     FROM Emps E, Jobs J, Worked_On W,
          SD_Map M
    WHERE W.Emp = E.Eid
      AND W.Job = J.Jid
      AND M.String = :Date_As_String
      AND M.Day BETWEEN W.StartDate AND W.EndDate;

   In its original form, Q1 has 3!, or 6, possible join orders. Add a single additional table to the mix and that goes up to 4!, or 24. Not a big problem in terms of 3 tables vs. 4 tables, but it gets very, very awkward when a 6-table query blows up into a 10-table query.
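   To put numbers on that last jump:

        6 tables:   6! =       720 join orders
       10 tables:  10! = 3,628,800 join orders

   and those are only the left-deep orderings.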

  Now, relational DBMSs of all flavors have a huge number of tricks for getting around this problem. Mostly they involve applying heuristic "guesses" designed to eliminate large parts of the plan space, but these guesses can (and often do) go awry. In general, adding tables to the mix increases the amount of time it takes the engine to figure out an efficient physical plan.
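  (One concrete illustration, assuming PostgreSQL here - other engines expose analogous knobs: past a configurable number of tables the planner simply stops searching the space exhaustively and falls back on a genetic-search heuristic.)

       -- Beyond this many FROM items, PostgreSQL abandons exhaustive
       -- join-order search in favor of its GEQO genetic optimizer.
       SHOW geqo_threshold;

       -- This one caps how many FROM items the planner will fold
       -- together and reorder when hunting for a join order.
       SHOW join_collapse_limit;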

  For similar reasons, beyond a certain point parallelism of the type you describe in the second part of your post ceases to buy you any speed-up or scale-up advantage, as the additional work involved in organizing the parallelism swamps the improvements you achieve. Multicore CPUs are really going to make a difference to DBMSs with high transactional throughput requirements - those where each of the queries operates in a (more or less) independent fashion. But the vision of computing 'up the tree' is limited by the way the nodes of the tree need to communicate their results.
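  The ceiling here is the usual Amdahl's-law arithmetic (the textbook formula, nothing specific to any engine): if a fraction p of the work parallelizes and the remainder - coordination, shipping intermediate results up the tree - stays serial, the best speed-up on n processors is

       speedup(n) = 1 / ((1 - p) + p/n)

  which can never exceed 1/(1 - p) no matter how many cores you add: even at p = 0.9, you top out at 10x.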
