Relational query with path expressions

From: David BL <davidbl_at_iinet.net.au>
Date: Thu, 2 Apr 2009 20:58:31 -0700 (PDT)
Message-ID: <edd73d6f-678e-4a1c-bd1f-ac59e06f9bb1_at_s12g2000prc.googlegroups.com>



I get the impression that a lot of the time, a query (or a large portion of a query) can be thought of as a navigation from attribute to attribute using a sequence of joins. This made me think there may be some merit in thinking of attribute names as global in nature and the user only has to specify the start and end attributes and the DBMS is able to deduce the joins that are required. The potential upshot would be that queries are simplified and the user is freed from the burden of understanding the way the data has been decomposed into separate relations. Just recently I read about the so called Universal Relation which takes this approach. However there are a number of criticisms - such as those described in a paper by Kent.

I think the biggest problem is that quite often there are multiple useful paths. For example, consider Chris Date's supplier example from his book Introduction to Database Systems. The relations are

S(S#, SNAME, STATUS, CITY)
    Supplier number S# has name SNAME, status STATUS     and is located in CITY

P(P#, PNAME, COLOR, WEIGHT, CITY)
    Part number P# has name PNAME, colour COLOR,     weight WEIGHT in pounds and is located in CITY

SP(S#, P#, QTY)
    S# shipped QTY of P#

There are two obvious ways to navigate from S# to P#. One can either go via the shipments relation SP (where S# and P# appear in the same relation), or one can join relations S and P on S.CITY = P.CITY.

Presumably to define the Universal Relation one would be forced to pick one of these paths as the default, and rename attributes accordingly. E.g. we could arbitrarily assume the default relationship between suppliers and parts is through the shipments, and rename CITY as SCITY in S, and CITY as PCITY in P, so they have distinct roles in the Universal Relation.

It seems that explicit joins are a better way to give the user precise control over the access paths. However I find it interesting that a 3- way natural join on S,P and SP imposes both of the above navigation paths at the same time, and that's rarely what's required.

After thinking about that problem for a while, it occurred to me that access paths should emphasise navigation from relation to relation, rather than from attribute to attribute (as suggested by the hypergraphs  in the literature on the Universal Relation).

The following describes some ideas for a query language on a relational database based on path expressions that allow relation names to be specified in the path. It is more closely related to the relational algebra than the calculus. It's assumed there are no NULLs and queries work on sets not bags.

Relvars


The name of a base relvar evaluates to its current relation value. For example, the path expression

    P

evaluates to the current value of the relvar named P.

Restriction


A restriction can be taken using a postfix operator which is a boolean expression enclosed in square brackets and parameterised on the attributes of the relation. For example, the path expression

    P[COLOR = 'red']

evaluates to the restriction on P where parts are red.

For convenience it's assumed it's not necessary to specify a selector as in COLOR('red').

Projection


The following path expression subsequently takes the projection onto attributes P# and WEIGHT:

    P[COLOR = 'red'].(P#,WEIGHT)

Note that since path expressions evaluate to sets, we know that duplicates are removed from the projection.

When projecting onto a single attribute the brackets aren't required. For example,

    S.S#

is the projection of relation S onto attribute S#.

Join with projection


Let R1 and R2 be relations, then the path expression R1.R2 denotes the natural join of R1 and R2 followed by a projection onto the attributes of R2. For example,

    P[COLOR = 'red'].SP

retrieves the tuples in SP associated with the shipment of a red part.

The idea that R1.R2 can drop attributes from R1 may seem strange. A benefit arises for example in path expressions like R1.R2.R3 because it can avoid joining on attributes which happen to have the same name in R1 and R3. This can help to control the joins along a path. For example,

    P[COLOR = 'red'].SP.S.SNAME

retrieves the names of the suppliers who ship at least one red part, whereas

    P[COLOR = 'red'].S.SNAME

returns the names of suppliers that are located in a city where there are red parts.

This is exactly the distinction discussed in the introduction.

A brief comparison to other query languages


From Date's book, the following SQL retrieves supplier names for suppliers who supply at least one red part:

    SELECT DISTINCT S.SNAME
    FROM S
    WHERE S.S# IN

         ( SELECT SP.S#
           FROM   SP
           WHERE  SP.P# IN
                ( SELECT P.P#
                  FROM P
                  WHERE P.COLOR = 'red'));

The following expression in the relational algebra is somewhat more concise:

    ( ( ( P WHERE COLOR = 'red' )

        JOIN SP ) { S# } JOIN S ) { SNAME } Note that the above needs to project the result of the join of P and SP onto { S# } before joining with S in order to avoid the join on P.CITY = S.CITY. For those interested in the comparison, Date also provides the same query in the relational calculus and the domain calculus.

Evidently the path expression

    P[COLOR = 'red'].SP.S.SNAME

is simpler than the above alternatives.

Join without projection


Given relations R1,...,Rn, (R1,...,Rn) denotes the natural join of R1,...,Rn. For example,

    (S,P)

gets all combinations of supplier and part information such that they are located in the same city.

Aggregate operators


Let R be a relation and every attribute of R is defined on a domain that supports summation. Then the path expression R._at_sum denotes a relation with the same attribute names as R that contains a single tuple which maps each attribute name to the sum of that attribute over tuples from R. If R is empty then R._at_sum contains a single tuple full of zeros. For example,

    SP[P# = 'P2'].QTY._at_sum

retrieves the total quantity of shipments for part P2.

Similarly we define operators _at_min, @max, @avg, except there is no tuple in the output if the input relation is empty (i.e. has no tuples).

R._at_count evaluates to a relation with a single tuple with a single attribute which gives the number of tuples in R. For example,

    S._at_count

retrieves the total number of suppliers.

R._at_exists evaluates to a relation with a single tuple with a single boolean attribute that is true iff R is non-empty.

Generalisation of projection


It was stated earlier that R.(x1,...,xn) denotes a projection of R onto attributes named x1,...,xn. This can be regarded as a special case of something much more general - where we allow path expressions instead of just attribute names for each xi.

Formally the result is defined as a union of joins where the union is taken over the tuples t of R, and the joins are taken over the relations {t}.xi for each i where {t} denotes the relation consisting of the single tuple t. Typically the {t}.xi have no attributes in common so the joins are cartesian products.

For example,

    P.(COLOR, SP) evaluates to the relation which is the natural join of P and SP, projected onto the attributes COLOR,S#,P#,QTY.

To help understand this it may be helpful to pretend that the dot operator distributes over the comma separated list. i.e. we think of P.(COLOR, SP) as (P.COLOR, P.SP). However we don't simply take the cartesian product P.COLOR x P.SP. Instead we account for the shared prefix P in the paths which "correlate" these two path expressions. In other words, for each tuple from P that we use to navigate to its COLOR, we note that the same tuple is involved in the navigation from P to SP by joining on P.P# = SP.P#.

It is convenient to support a syntax that adds extra attributes to an existing relation. The path expression R.(*,x1,...,xn) is equivalent to R.(y1,...,ym,x1,...,xn) where y1,...,ym are the attributes of R.

The aggregate operators can be used to aggregating over groups. For example,

    P.(P#, SP.QTY._at_sum)

returns a relation where for each part, it returns the part number and the total shipment quantity (which could be zero if there are no shipments of that part).

Note that

    P.(P#, SP.QTY._at_min)

retrieves the minimum quantity of shipments for each part number, but only for parts where there exists at least one shipment by some supplier.

We can group over more than one attribute. For example,

    P.(PNAME,CITY).(*,P.WEIGHT._at_sum)

retrieves the sum of the weights, grouped over PNAME and CITY.

As a more complicated example,

    P.(P#, SP.(QTY,S.STATUS).(_at_min,_at_max,_at_avg))

retrieves for each part, the part number and the minimum, maximum and average shipment QTY, and the minimum, maximum and average supplier status. However for this to make sense we would need some convention to assign unique attribute names in the output.

Generalisation of restriction


So far it has have assumed that for the restriction R[b(x1,...,xn)], the boolean expression b can only be parameterised by attribute names x1,...,xn from R. This is generalised to allow the xi to be path expressions.

R[b(x1,...,xn)] is formalised as a restriction on R where for each tuple t in R, b is evaluated as a boolean function parameterised on the {t}.xi, which most generally are relations. If however, a {t}.xi is a relation containing a single tuple defined on a single attribute, then it can be treated as a scalar value within the boolean expression.

For example,

    P[SP.S._at_count > 1].P#

retrieves the part numbers of the parts supplied by more than one supplier.

    P[SP.S# = {'S2'}].P#

retrieves part numbers for parts that are shipped only by supplier S2.

    S[not SP.P[P#='P2']._at_exists].SNAME

or equivalently,

    S[not 'P2' in SP.P.P#].SNAME

retrieves names of suppliers that don't supply part P2.

Renaming


Renaming of attributes is supported. For example

    S.(CITY as SCITY, SP.P.CITY as PCITY)

retrieves all pairs of city names (SCITY, PCITY) such that a supplier in city SCITY supplies a part to city PCITY

Calculated attributes


calculated attributes are supported. For example,

    P.(P#, WEIGHT*454 as W)[W > 10000]

retrieves part number P# and weight W in grams for each part with weight > 10000 grams

Path expressions with global scope


A path expression that begins with '..' is assumed to have a global scope. For example:

    S[STATUS < ..S.STATUS._at_max].S#

retrieves supplier numbers for suppliers with status less than the current maximum status.

As another example

    S[SP.P# = ..P.P#].SNAME retrieves supplier names for suppliers who ship all parts Received on Fri Apr 03 2009 - 05:58:31 CEST

Original text of this message