Re: Is this a bug or a limitation of the SQL language

From: Joseph Hellerstein <hellers_at_wisc.edu>
Date: 26 Aug 92 00:13:05 GMT
Message-ID: <HELLERS.92Aug25181305_at_cleo.wisc.edu>


In article <BtJKIG.2K5_at_watdragon.uwaterloo.ca> gnpaulle_at_maytag.uwaterloo.ca (Glenn Paulley) writes:
> > [I wrote:]
> > |select * from emp
> > |where salary between
> > |(select salary from emp where ename = 'Larry')
> > |and
> > |(select salary from emp where ename = 'John')
 

> > You want to avoid subqueries whenever possible. To get what you need,
> > try the following:
>
> 2) Sahil's query, if optimized using subqueries for the where predicate,
> will probably execute much faster than doing a three-way join (the
> conjuncts in the where clause are single-value lookups; if ename is
> indexed, for example, the entire result can be found using two indexed
> retrievals, and a single table scan.... no joins, no sorts).

This is incorrect. Subquery processing is identical to *one kind* of join processing, namely nested-loop with the subquery as the inner. Hence joins will always do at least as well as subqueries, and often better, since the space of paths for the optimizer to choose from is larger. To argue that subquery processing is simpler than join processing is just wrong. It's a kind of join processing itself, with restricted output semantics.

> Avoiding
> subqueries altogether, or always attempting to transform subqueries to
> joins, is a dangerous heuristic.

A heuristic, yes, as noted below (duplicate elimination may be required). Dangerous? Not really. For exactly the reason I outlined above, subquery-to-join is a stable conversion, meaning that it will almost always do as well, and often better. To do duplicate semantics correctly, a sort or hash is sometimes required after transformation (SELECT DISTINCT instead of SELECT), and this may result in some slowdown. But the win from conversion can result in orders-of-magnitude improvement, while the loss of an extra sort (say) usually slows things down by much less (in the < 100% slowdown range).

> RDBMS optimizers should also determine
> if transforming a join into a subquery will result in a better execution
> plan.

This can be done, but as noted above, this logic is not essential since subquery-to-join conversion is quite stable. Since optimization time is already dangerously complex, adding this dimension is probably not worth it.

>
> Having the *user* perform query rewrite for "optimization" is unacceptable
> (it's what query optimizers in RDBMS are all about). Sahil's query is
> straightforward and specifies exactly what he wants. It's not his fault
> if today's query optimizers fail to execute this query in the most
> optimal way.

Absolutely 100% agreed. This is a serious failing in the existing SQL products, namely that different expressions of the same query may have wildly different performance. As a result, if you open up a vendor's manual, you will usually find such "tips" as: "convert subqueries to joins in the case where blah blah blah". Naturally these tips will stymie all but the most sophisticated users. And every shop needs such a user, or they're in trouble. The Starburst research project at IBM Almaden may result in IBM's products having such optimizations be automatic some years from now.

> It's also a dangerous practice
> to generalize (that is, transform subqueries to joins) because not all
> transformations result in equivalent queries (particularly when the
> database contains null values). See the paper by Negri et al.
> in ACM TODS, Vol 17, No. 3, Sept. 1991 for formal semantics of entry-level
> SQL and query equivalence classes.

And see the paper by Pirahesh, Hellerstein, and Hasan in SIGMOD '92 which demonstrates that subquery-to-join *can* be done correctly (and is done automatically in Starburst) for all existential subqueries that are Boolean Factors (i.e. IN, ANY, or EXISTS subqueries that are outer conjuncts of the WHERE clause, not nested under NOT or OR, etc.)

You have to be careful, but the rewrite I gave him will work correctly, may well speed up his query, and may even sneak by his limited parser.

Happy rewriting.

Joe Hellerstein Received on Wed Aug 26 1992 - 02:13:05 CEST

Original text of this message