Re: XQuery question

From: Sebastian Schaffert <schaffer_at_informatik.uni-muenchen.de>
Date: Fri, 16 May 2003 13:01:07 +0200
Message-ID: <ba2gjk$meh$1_at_minotaurus.cip.informatik.uni-muenchen.de>


Costin Cozianu wrote:

>>> The fact that is declarative is obvious, and if you want me to explain
>>> that yo you, I'm at a loss. Either get over it, or else stop the
>>> discussion.
>>
>>
>>
>> The point is, the expression you provide above is not obvious without
>> digging deeper into it's semantics. The fact that you would need to
>> explain it to me shows that it is not declarative, since I am quite
>> familiar with SQL and the relational database model.
>
> Yep, I didn't claim that SQL is perfectly designed either, still the
> WITH RECURSIVE expressions are not *that* difficult to grasp.

Granted, if you are a programmer. On the other hand, query languages really should be suitable for non-programmers as well, especially in the Web context where there is always a reasonable query that has not yet been implemented by some programmer.

>>>> Whether XQuery is relationally complete or not is of no relevance,
>>>> since it is not intended to be so, as the underlying data is not
>>>> relational, it is tree-shaped.
>>>>
>>>
>>> That's non-sense. Data is data. It comes from the plural of the latin
>>> "datum" which means a "given", therefore data is a collection of (given)
>>> facts . "Given facts" are proposition that are true. Relational
>>> databases hold a collection of facts and those have no "relational"
>>> shape. That's just non-sense good only for marketing brochures.
>>
>>
>> What you call a proposition is really considered a n-ary predicate over
>> atomic terms in logics. In the case of relational databases, these terms
>> are always constants and thus flat, while in general (first order
>> logic), the predicates may also range over complex terms (i.e. "trees").
>>
>
> When you bind the variables (free terms) of a predicate you get a
> proposition. Let's stop quibbling over terms. Let me reiterate again,
> the predicate correspond to the "header" (by Date's terminology)/schema
> of a relation. Each individual tuple is a proposition.
>
> And I think you are definitely wrong here: the values in the columns can
> be as complex as you like.

You can certainly have anything you like in the columns. But the relational model is then not capable of doing anything more than simply returning this complex value. Example:

I can have r(f(a),g(b)) in the database table r. Obviously, it is possible to return both f(a) and f(b) with a simple query. However, in the relational model I do not know that the first term is actually an f and the second term is actually a g, while this information would be useful in many cases.

In datalog, I can e.g. express a query as follows: s(X) :- r(X,Y).
with the meaning "s is a view that gives me all first columns of t".

What I cannot express in datalog is
s(X) :- r(f(X),Y).
with the meaning "s is a view that gives me all such data from the first columns that is contained in an f".

Or take for example these two rules:
s(f(X)) :- r(X,Y)
s(f(X)) :- s(X)

which could describe something like the "ancestor" relation.

> Relational model can't care less of what kind of domains you use for the
>columns.

Nonetheless, the values in the columns are considered atomic.

>>> If you're willing to show me a useful "tree-shaped" logic, I'd be very
>>> curious to find out more about that.
>>
>>
>>
>> Ok, take first order logic. It's data is tree shaped (terms made up of
>> functors and constant symbols). In contrast, "datalog" (or the
>> relational theory) is merely a subset of FOL with no functors, only
>> constants.
>>
>> FOL:
>> forall X. person(X) => person(father(X))
>>
>> Obviously, father('John') is a (simple) tree shaped data item.
>>
>> You cannot express such terms conveniently in datalog.
>>
>
> Well, for what is worth I don't see how you express the above in XML
> either,

To express this, XML needs a formula language, which it currently does not have (but I'll refer you to an article on this issue on Monday, it is just not yet published and needs some proof-reading). You were denying the existance of tree-shaped logics, and I gave you one. My point was not on the logical connectives, but instead the nesting of terms.

> although you can express that in standard SQL 99 as a constraint
> to be verified against data rather than data. I know how to express that
> in Prolog, however efforts to make the DBMSes a general purpose
> inference engine either failed or went out of steam (funding ?).

I think the reason is a different one. Restricting relational databases to "database logic" (which is not as expressive as FOL) avoids the problem of non-termination.

>> I think that rather you are confusing the different levels of
>> abstraction. Obviously, I can represent a graph structure as a binary
>> relation. On the other hand, this is merely an encoding of the "picture"
>> (or "shape") of the graph, where the latter can immediately be grasped
>> while it is almost impossible to even partially imagine the shape of the
>> graph when encoded as a binary relation. That's what I mean with
>> "declarativity".
>
> Wait a second doc. A graph (directed and simple which is what most
> people use for practical purposes, but other cases can be ahndled
> relationally as well) is the pair G=(V, E) with E <= VxV, since V is
> already implied by E we can cut out the redundancy and just say E <= VxV
> which is a binary relation.

True. As a side remark: Calculating a transitive relation over this graph is probably very inefficient because it involves a serious number of joins.

>
> I don't know for the life of me what you mean by the "shape" of the
> graph. If you mean the picture you draw on a paper, only *that* is at a
> different level of abstraction.

This is what I mean.

>
> I don't want to operate on the "picture" or the "shape" or the physical
> encoding, I want to operate as much as possible close to the
> mathematical definition of a mathematical concept.

Depends. If you are interested in the structure (i.e. the edges), yes. If you are interested in the values that you have in the nodes of some subgraph, then probably no, because the subgraph is much easier specified on the picture.

>
>>>
>>> In any case people when you say XML think about the <xml> </xml> data
>>> encoding no matter what you'd like them to think.
>>>
>> Well, if you ask my students, they will tell you.:-)
>>
>
> Well, maybe you have a paper or present here a brief summary so that I
> can "get" it what you mean by "XML data model".
>

To just name a few:

Semistructured Data Model for XML:

_at_Book{abiteboul,

  author =       {Serge Abiteboul and Peter Buneman and Dan Suciu},
  title =        {Data on the Web. From Relations to Semistructured Data and
XML},
  publisher = {Morgan Kaufmann},
  year = {2000},
}

Interesting read, gives you some non-relational arguments against XQuery and the XPath-based approach:

_at_inproceedings{ grahne99difference,

    author = {Gosta Grahne and Laks V. S. Lakshmanan},     title = {{On the Difference between Navigating Semi-structured Data and Querying It}},

    booktitle = {Proceedings of Workshop on Database Programming Languages},     pages = {271--296},
    year = {1999},
    url = {http://www.dcs.gla.ac.uk/dbpl99/final/GrahneLaks.ps} }

>>> Your opinion does not count as long as it is not backed by arguments.
 

>>
>> One cannot back subjectivity by arguments, and when one talks about
>> "intuitivity", it is inherently subjective. Your answer again shows
>> that you try to be insulting when you don't have arguments left.
>>
>
> When solving the same use case takes a handful of lines of code using
> relational algebra extended with transitive closure, while it may take
> screens of code using XQuery / XPATH / DOM API/ SAX APi whatever, that's
> where I can justifiably say that XML as a data model is unintuitive to
> unworkable.

No. It depends on the query language. We are developing the experimental language Xcerpt, which is similar to "Prolog for XML" or "Datalog for XML", i.e. a logic programming language using "pattern queries" instead of "path queries". While the latter impose many constraints on the control flow (i.e. more or less exact navigation through the data tree), a pattern-based approach "gives" the complete pattern and thus leaves much room for automatic query optimization and such things. In a sense, SQL is "pattern queries" for relational databases.

If you are interested, you might want to read the following: http://www.pms.informatik.uni-muenchen.de/publikationen/#PMS-FB-2002-11 and
http://www.pms.informatik.uni-muenchen.de/publikationen/#PMS-FB-2002-7

or any of the other related articles on that site.

>> Recursive descent is per definition the way to do transitive closure
>> over a graph, isn't it (take your favorite book on graph theory and
>> you'll see)?
>
> No, that's just implementation details. Transitive closure is a concept
> with a very simple definition which I already posted in this thread. As
> far as I have been tought defining concepts in terms of how you compute
> them is not the best way to do mathematics, and definitely it is not the
> declarative way that you claim.

As always, there are several ways to define things in mathematics. Transitive closure over a graph may be defined as follows (in FOL):

forall X, Y. edge(X,Y) => connected(X,Y) forall X, Y. (exists Y such that edge(X,Y) /\ connected(Y,Z)) =>

                connected(X,Z)

Written in a more natural language, this could be defined as follows:

"Given a Graph G=(V,E), where V are the vertices and E subset VxV are the edges. Two nodes X and Y are called connected iff: - (X,Y) in E
- for some Z, (X,Z) in E and Z and Y are connected. "

> Delarative means the "what" not the "how". The standard definiton of
> transitive closure is the what it is not the "how you calculate it", I
> refer you to any elementary introduction to algebra.

For most problems there is also a recursive (inductive) definition.

Besides, what is more declarative than the keyword "descendant-or-self" of XPath? Why do you need recursive transition to work with it?

>> It's just a matter of how I encode the graph. Your example
>> XML document is not the appropriate representation of graph structures
>> in a graph language, it is trying to "port" the insufficiencies of
>> relational databases to XML. In XML, I don't need to encode the graph as
>> binary relation, I can encode it as graph (under the assumption of a
>> reasonable link mechanism).
>>
>
> Just post here a better encoding for a graph, together with the query.

Ok, I'll give you a sample database and a query in Xcerpt syntax (which is deliberately not XML, but very similar to it):

persons {
  john _at_ person {
    name {"John"},

    email {"john_at_foo.org"},
    email {"john_at_foobar.net"},
    knows { ^joe },
    knows { ^sue }

  },
  mary _at_ person {
    name {"Mary"},
    email {"mary_at_bar.com"},
    knows { ^john }
  },
  jim _at_ person {
    name {"Jim"},
    email {"jim_at_foobar.net"},
    knows { ^mary }
  },
  ...
}

Note that to encode this information you'll usually need at least three relations in the relational model if you want to avoid anormalies.  

Admittedly, to encode the graph structure in XML or alike, you need to have a reference mechanism (i.e. binary relation). However, in many cases (as above), the decision where to put the link and where to use the parent-child relationship is rather natural (as above).

Here is a query for all person elements that Jim knows, or someone knows who Jim knows, and so on (Read: "give me all person elements that are reachable from the person element that has "Jim" as the text value of its name element and construct a new element known-by-jim listing all those persons).

cons { known-by-jim { all X } }
WHERE
query { person {

          name {"Jim"}, 
          knows { desc X as person }
        }
      }

Since I don't want to explain too much about Xcerpt's syntax, I have used a simplified form of it. If you want more details, please have a look at the publications I listed above.

> You're not going to kid me that trees are a good encoding for a graph
> while relations are not.

They are a good encoding for those graphs that are actually trees.:-) They may be a better encoding for some graphs that are not trees, while binary relations are probably more suitable in the general case.

Whether the encoding in binary relations is efficient is a different matter.

>
>>> And you also refuse to address the issue whether XQuery is relationally
>>> complete and at what cost (viz. query complexity, and runtime
>>> performance).
>>
>>
>>
>> I refuse to speculate about relational completeness of XQuery because:

I have thought a little about the issue, and I think (I have not verified it) that XQuery is in fact relationally complete, since it is even Turing-complete (which SQL is not) and it is thus possible to encode relational algebra in it (whether this is anywhere near practical is a different matter).

> Maybe you can speculate on the utility of relational algebra (plus the
> transitive closure) in solving real use cases for real users. If your
> data model is going to support those use cases than it automatically
> means that it will be relationally complete. Or else, you can speculate
> what relational operators are unuseful and we don't need to support them.

My point is that the relational operators fit the relational world, but not the "tree world" of XML. Thus it is meaningless to argue about relational completeness.

Completeness is always a criteria of an algorithm (or a "procedure") with respect to a theory (or semantics). If the theory of XML is different than the one of relational databases, then it is in no way appropriate to claim that an XML query language is not relationally complete, since this is like comparing apples with peaches. In the same way I can claim that SQL or the relational calculus is completely useless because it is not a procedure which is complete in first order logic.

If you like the algebraic approach better than the logic approach, you might want to have a look at

_at_Article{unql,

  author =       {Peter Buneman and Mary Fernandez and Dan Suciu},
  title =        {{{UnQL}: A Query Language and Algebra for Semistructured
Data Based on Structural Recursion}},
  journal =      {{VLDB Journal}},
   pages =        {76-110},
  year =         {2000},
  volume =       {9},
  number =       {1},

}

>
> I thought it was elementary that any new data model/query language
> proposal ought to be reltaionally complete (and in auseful sense of the
> word, not in the sense that for example "COBOL is turing complete").
>
> Or if you want to set aside let's say relational division, you ought to
> justify why you want to do that. Don't you agree ?

In some sense, yes. However, the operators need to be redefined to fit better into the XML world. XML is not about relations or sets of facts, so set operators are probably not appropriate in the same manner as they are in relational algebra.

>
>> - I never claimed that XQuery was relationally complete
>> - nor is it relevant, because XML is not relational (as I already
>> stated), instead it is a set of
>> terms
>> - nor do I particularly like XQuery; in fact I am working on *replacing*
>> XQuery with something
>> better
>>
>> As to cost:
>> - XQuery is merely a language specification which says nothing about
>> implementation. There are
>> very inefficient implementations, and also some which are better
>> - XQuery and XML Querying is a very new issue, so optimization is
>> currently just beginning to
>> become an issue
>> - XQuery, as a rather control-flow oriented language, does not leave
>> much room for automatic
>> query optimization (which is why we are working on a different
>> language)
>>
>> 7
>> I repeat it, I am NOT claiming that XQuery is good. I am claiming that
>> an inherently tree or graph-shaped data representation language is often
>> superior to flat relational tuples.
>
> If you said "sometimes superior", I might give you a pass. And second,
> you have to come first with those data models so that we have something
> we can talk about and compare.

Ok, take the semistructured data model (see Abiteboul et al, above).

>
>> I do NOT say that XML is a
>> particularly good representation for tree- or graph-shaped data (Lisp
>> S-expressions would achieve the same). However, XML is well
>> standardized, it has a broad community and despite its deficiencies
>> (like DTD, processing instructions, unfortunate link mechanisms, etc) is
>> a reasonably good representation for such data.
>>
>
> Sure, COBOL was standardizzed and had a broad user community, it was the
> kid on the block in its days.

And COBOL definately was an improvement over most of the other languages then in existance (perhaps except Fortran). Maybe we will not be talking about XML in 20 years. But then we have learned from investigating the principles of the data model.  

Sebastian Received on Fri May 16 2003 - 13:01:07 CEST

Original text of this message