Re: Self Joins and optimization

From: Kevin Kirkpatrick <kvnkrkptrck_at_gmail.com>
Date: 5 May 2007 10:30:12 -0700
Message-ID: <1178386212.605808.145550_at_n76g2000hsh.googlegroups.com>


David Cressey wrote:
> In the "interpolation" thread, Brian has been expounding on the idea that
> he can discover algorithms that he knows to be correct, and that outperform
> anything an optimizer can generate. He's mentioned "self joins" and the
> idea that he can get a job doen with only one read, where the automatic
> solution generated in response to declarative code generates multiple passes
> through the data.
>
> My experience leads me to the opposite conclusion from that.
>
> Here's the questions I'd like the newsgroup to consider:
>
> What, if any, algorithms that are demonstrably correct are going to be
> overlooked by the optimizer, although discoverable by a human coder? Why
> would an optimizer fail to discover the possibility of eliminating
> self-joins? Are there any undecidability issues here?
>
> Aside:
>
> I know a little about the DEC Rdb/VMS optimizer. Less than that about the
> Oracle CBO, although the Oracle CBO appears to have been conceptually
> derived from the Rdb/VMS optimizer. Much less than that about the Oracle
> RBO as of 1994. The RBO appears to have been an optimizer in name only.
>
> While I've got a lot of positive opinions about various features of Oracle,
> the RBO was not one of those features. Shortly after my cutover to Oracle,
> one of the tech types told me: "All the cool people use hints."
>
> After looking at it in as much depth as I cared to, my conclusion (which I
> kept to myself) was: "No. All the really cool people find a DBMS that
> doesn't require hints to do a good job."
>
> End of aside.

A simple case study: extend a relation by adding an attribute that gives the rank of each tuple based on the ordering of a subset of existing attributes of the relation. With primitive relational operators, the only way I'm aware of to express this is:

SELECT E1.NAME, E1.SAL, COUNT(*) RNK
FROM EMP E1 EMP E2
WHERE E2.SAL <= E1.SAL
GROUP BY E1.NAME, E1.SAL; I'm not aware of any rationale that would allow an optimizer to compute the above with anything faster than a O(n^2) algorithm.

However, the SQL "windowing function" version:

SELECT NAME, SAL, RANK() OVER (ORDER BY SAL) RNK FROM EMP; computes the relation in O(n log n) time (it does a sort, then a scan, of the SAL field, to compute the RANK() function for each tuple).

I'd argue that windowing functions allow for declarative specifications with which an optimizer will find algorithms at least as fast as anything a coder could do with a cursor. In fact, I challenge anyone to post any data definition and hypothetical query specification where there exists a coded algorithm with a more efficient big-O than what an optimizer could deduce given a declarative solution.

A side note: the "45-hour to 45-minute improvement" someone remarked on may refer to a program I rewrote a couple years ago and described on this board. The program required a table be updated using a slew of complex pro-rating, interpolating, missing-value defaulting, and more based on data scattered over 15 or so tables. It had been done in many thousands of lines of code using procedural "bucket variables", loading indexed arrays, etc.; yet I incorporated all of the logic into a single query using windowing functions + standard relational operators. The procedural code, having undergone at least a dozen tuning fixes over the 4 years the program had existed (some by very talented programmers using coding tricks that took me hours to figure out), was still several orders of magnitude slower (and far buggier!) than the single SQL query which, having now been around 3 years, has yet to require a single tuning effort.

I'm not so certain that simple set and join operators, without using window functions, can always be so competitive. Returning to the case study I posted above: I would love to see either a query which only uses relational operators to specify the "ranking" query in such a way that an optimizer would reasonably find a O(n log n) algorithm; or, a reasoned analysis which argues that an optimizer should find a O(n log n) algorithm using the specification *I* gave. Received on Sat May 05 2007 - 19:30:12 CEST

Original text of this message