Re: Is mysql a RDBMS ?

From: Heikki Tuuri <Heikki.Tuuri_at_innodb.com>
Date: Sun, 31 Aug 2003 08:53:00 GMT
Message-ID: <MHi4b.46$TR6.27_at_read3.inet.fi>


Lauri,

thank you! We get input to this discussion from a person who really knows practical database applications.

"Lauri Pietarinen" <lauri.pietarinen_at_atbusiness.com> kirjoitti viestissä news:bire7g$md8$1_at_nyytiset.pp.htv.fi...
> Hi Heikki,

...
> >of course self-joins occur in real-world applications. But we are still
> >waiting for you to provide a real-world example where the join size is
> >bloated with duplicate rows.
>
> I sympatize with you wanting to see a real world example demonstrating
> the usefulness of
> working with relations instead of bags regarding optimisation and it
> took me a while to find
> an (in my view) convincing example.
>
> I posted this previously and have pasted it here for easy reference:
>
>
http://groups.google.com/groups?q=g:thl3922164456d&dq=&hl=en&lr=&ie=UTF-8&selm=3E625162.4060203%40atbusiness.com&rnum=23
>
> Take the standard S, SP and P -tables. I want to hide the joins from
> the user so I create a view:
>
> CREATE VIEW V_S_SP_P
> ( S#, SNAME, QTY, P#, PNAME)
> AS
> SELECT S.S#, S.SNAME, SP.QTY, P.P#, P.PNAME
> FROM S, SP, P
> WHERE S.S# = SP.S# AND
> SP.P# = P.P#;
>
> There are lots of interesting questions that can be
> easily answered using this view.
>
> The question "give me all products that are supplied
> by somebody" is easy to state:
> SELECT P#, PNAME
> FROM V_S_SP_P;
>
> Oops! Got the same P#'s several times.
> Got to add DISTINCT.
>
> Ok, let's try
> SELECT DISTINCT P#, PNAME
> FROM V_S_SP_P;
>
> Correct answer, but why is performance
> so bad?
>
> Oh, the DBMS materialises the whole join
> (even reading S, which is not needed at all)
> and then groups by P# and PNAME!
>
> (Just tried it on SQLServer2000).
>
> Why can't it transform this query into
> SELECT P#, PNAME
> FROM P
> WHERE EXISTS
> (SELECT * FROM SP
> WHERE P.P# = SP.P#)?
>
> Is this a useful view? It sure is!

I assume you mean the big view which joins S, SP, P. It kind of denormalizes. That is why duplicate rows of P appear.

> Could I actually provide such a view
> to my end users or programmers?
>
> No I can't, because they have to
> 1) remember to use DISTINCT

Hmm... there is at least one additional reason not to provide such a view to programmers. Since it is denormalized, they might try to update the attributes of a supplier for just one row, receiving an obscure error message about the non-updateteability of the view. Or, they could try to insert into the view.

Another reason is that it is not an outer join. There may be parts which currently have no supplier. Users might be misled to thinking that the above query gives all parts, not just the parts supplied by somebody. Of course, these are well-known problems of a denormalized database.

> 2) worry about bad performance
> So they end up coding the joins and
> exists themselves.
>
> My (non scientific) claim is that because users and
> DBMS-builders are used to thinking in terms of
> (SQL-)bags, they don't even come to consider (e.g.) the usefulnes
> of hiding complexity in views, at least not to the
> extent that it is possible.

For the optimizer it is rather easy to see that the query on the view is equivalent to (I am omitting PNAME to simplify this):

SELECT DISTINCT P.P#
FROM S, SP, P
WHERE S.S# = SP.S# AND SP.P# = P.P#; I do not know why SQL Server 2000 fails to notice this.

The original question was whether multiset (= bag) languages, like SQL, are worse at optimizing real-world queries than mathematical relation languages, like the relational algebra. If SP is bigger than P, then the optimizer should notice that SELECT DISTINCT P# ... should be optimized as:

SELECT P#
   FROM P
   WHERE EXISTS

      (SELECT * FROM SP
          WHERE P.P# = SP.P#);

In a nested loop join of P and SP we can stop calculation in SP when we find the first row that matches the join condition, since we are not asking for any columns in SP except DISTINCT P#, which is already determined when we pick a row in P. And there is a foreign key constraint which tells us that every row in SP joins to a row in S. That is why we can drop S from the join altogether. We do not need to calculate the DISTINCT physically, because P# is unique in P.

Hmm... if the cardinality of SP were very small compared to P, then the best optimization would be:

SELECT DISTINCT P# FROM SP; There are foreign key constraints in SP which tell us that every row in SP joins to S and P. That is why we can omit S and P from the join.

I do not see how a mathematical relation language would make the optimization procedure above easier than a multiset language. This is a typical example of determining the join order of tables in a nested loop join. Looks like it makes no difference in the optimization procedure whether the user formulates his query in a multiset language or a mathematical relation language. That has been my impression since studying this a bit a couple of years ago. Chris Date and some others claim the opposite, but they should provide practical evidence.

> Note that a common complaint about SQL is that we
> "have to do all these stupid joins". Now I don't find
> joins that hard to deal with but had we better optimisers
> and "non-bag" query languages we (e.g. DBA) could hide the joins
> in views and expose these to programmers and end users
> much more than we do today hence RAISING THE LEVEL
> OF ABSTRACTION.
>
> I am reminded of the old story of the drunken man who
> seeks his lost car-keys under the lamp-post
> because the light is better there.
>
> regards,
> Lauri Pietarinen

Best regards,

Heikki Tuuri
Innobase Oy
http://www.innodb.com
Foreign keys, transactions, and row level locking for MySQL InnoDB Hot Backup - a hot backup tool for MySQL Download MySQL-4.0 from http://www.mysql.com Received on Sun Aug 31 2003 - 10:53:00 CEST

Original text of this message