Re: Is mysql a RDBMS ?

From: Bob Badour <bbadour_at_golden.net>
Date: Mon, 25 Aug 2003 21:10:35 -0400
Message-ID: <5Az2b.699$gN.82067528_at_mantis.golden.net>


"Heikki Tuuri" <Heikki.Tuuri_at_innodb.com> wrote in message news:Rpx2b.4189$G37.2895_at_read3.inet.fi...
> Bob,
>
> "Bob Badour" <bbadour_at_golden.net> kirjoitti viestissä news:Nws2b.656
> ...
> >
> > http://www.acm.org/classics/nov95/s1p3.html
> >
> > A set of tuples. You may find it interesting to note the original paper
> > relied on column ordering; however, relational proponents have since
> > discarded that idea.
>
> "
> The term relation is used here in its accepted mathematical sense. Given
> sets S1, S1, ···, Sn, (not necessarily distinct), R is a relation on these
n
> sets if it is a set of n-tuples each of which has its first element from
S1,
> its second element from S1, and so on. We shall refer to Sj as the jth
> domain of R. As defined above, R is said to have degree n. Relations of
> degree 1 are often called unary, degree 2 binary, degree 3 ternary, and
> degree n n-ary.
> "
>
> ok, it is a rather common mathematical definition of a relation as a
subset
> of S1 x S2 x ... x Sn.
>
> > I think most queries in real applications include at least one candidate
> > key. That is, the cost of DISTINCT in most projections is zero.
>
> This requires some work from the optimizer to notice.

It requires some work from the optimizer to optimize. Do you have a point? Are you suggesting that first doing the work to notice the table is a relation and then doing the work to notice that distinct is a no op is somehow less expensive?

> Also note that SQL in
> these cases acts like a mathematical-relation language, and you cannot
> expect any speedup from using a mathematical-relation database.

Provided the table has a key declared, SQL can perform this same optimization. If the table is a multiset, SQL cannot.

> > > If you have evidence of this, you should publish it in an academic
> > refereed
> > > journal. This is a claim of some Codd-12-relationists, but I do not
> recall
> > > seeing refereed papers published about this. Have you got any
> references?
> >
> > Citeseer is temporarily down right now so I cannot search for past
papers.
> I
> > have serious doubts whether a referee would consider the topic
> sufficiently
> > novel and unapparent to warrant publication.
>
> It is unapparent. Date has provided some contrived examples, but it would
> require a careful analysis of a real-world application to show that a
> mathematical-relation language really can calculate typical queries faster
> than SQL. People who claim that, should do the study and get it published
in
> a refereed scientific journal. My guess is that SQL is not slower, and
those
> Codd-12-relationists simply are wrong.

You don't have a particularly credible reputation for guessing correctly.

> > Just off the top of my head, I can think of the following identities
that
> > improve performance for relations but not necessarily for multisets:
> >
> > COUNT(A UNION B) = COUNT(A) + COUNT(B) - COUNT(A JOIN B)
> >
> > A WHERE EXISTS ( B WHERE P(A,B) AND Q(B) )
> > = ( A JOIN ( B' WHERE P(A,B') AND Q(B') )) PROJECT { ATTRIB(A) }
> > (B' indicates the dbms may have to rename some attributes of B)
> >
> > ( A(w,u) JOIN B(w,v) ) PROJECT { w }
> > = ( A(w,u) PROJECT { w } ) JOIN ( B(w,v) PROJECT { w } )
> >
> > ( A UNION A ) = A
> >
> > ( A JOIN A ) = A
> >
> > ( A INTERSECT A ) = A
>
> The above formulas are not that often used in query optimization.

I have no doubt they are seldom used to optimize SQL, because in general none of them apply to SQL. They do, however, apply to relations, and are very useful for optimization--especially for optimizing queries referencing views.

> Usually we
> have joins (e.g., the EXISTS above can be seen as a special 'join' where
we
> stop calculation in B when we have found 1 row), and the problem is in
> determining the best order for joining several tables using a nested loop
> join.

I agree about the exists and shortcutting the logic. It is an important optimization that always applies to relations and only sometimes applies to bags. That makes the relational optimizer at least one "if" less complex.

The distribution of project over join reduces the number of rows to consider for the join operation.

Replacing a union with an intersection can dramatically improve performance if the dbms can perform an index merge.

> I do not see how a mathematical-relation language would make this join
> optimization easier. Looks like in real-world SQL queries duplicate rows
> usually do not play any role at all in the calculation process.

Hello! McFly! Is anybody home?

> > > > SQL seems to spend a lot of time generating meaningless duplicates
> that
> > > are
> > > > seldom used in any case.
> > >
> > > No, it does not spend time.
> >
> > What do you mean it does not spend time?
>
> I mean that if we calculate a projection of a table to some columns, then
it
> is faster just to output the multiset than use sorting to eliminate
> duplicates.

Why would you assume a sort is required?

Even with the sort, it would depend on the number of duplicates and the speed of the output device, but those are not the particular duplicates I was considering. Joining multisets can dramatically increase the degree of duplication. Remember the table with three rows of 1? What does the self-join of that table give?

> Of course, if we feed this output to another join, then those
> duplicates can cause a lot of extra work. If we feed the output to an
> aggregate function like SUM(), then SQL actually works faster than a
> mathematical-relation language, because the latter would need to keep the
> primary key in the output to make it a set of distinct tuples.

Not necessarily. If some later step makes the key unimportant (eg. if a later step projects away those attributes), the dbms does not need to propagate the key and the dbms can use internal physical location to keep the tuples distinct.

> > Any time it must continue
> > enumerating data looking for potential duplicates is time wasted.
Consider
> > the algorithm that SQL must use to evaluate the equivalent of ( A JOIN
A )
> > due to the possibility of duplicates. Relationally ( A JOIN A ) is a no
op
> > because the result is always A.
> >
> > What does it really mean when the 3 rows of 1 above become 9 rows of 1
or
> 1
> > row of 1, but cannot become 3 rows of 1? The relational cost is zero and
> the
> > SQL cost is guaranteed greater than zero.
>
> The example is not a real-world query.

What's not real about it? The fact that it is a self-join? That happens all the time--especially when joining two views that each join the same base table. Or the fact that it has duplicate rows? I think duplicates are a dumb idea in the first place; you are the one arguing for their use.

SQL must either perform work to generate 9 rows from 3 rows or it must perform work to generate 1 row from 3 rows. A relational dbms does not have to do any of this work. Duplicates impede optimization. Received on Tue Aug 26 2003 - 03:10:35 CEST

Original text of this message