Path: news.f.de.plusline.net!news-fra1.dfn.de!news.tele.dk!feed118.news.tele.dk!postnews.google.com!n76g2000hsh.googlegroups.com!not-for-mail
From: Kevin Kirkpatrick <kvnkrkptrck@gmail.com>
Newsgroups: comp.databases.theory
Subject: Re: Self Joins and optimization
Date: 5 May 2007 10:30:12 -0700
Organization: http://groups.google.com
Lines: 87
Message-ID: <1178386212.605808.145550@n76g2000hsh.googlegroups.com>
References: <ZE%_h.163$D52.68@trndny04>
NNTP-Posting-Host: 69.148.182.167
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
X-Trace: posting.google.com 1178386212 29658 127.0.0.1 (5 May 2007 17:30:12 GMT)
X-Complaints-To: groups-abuse@google.com
NNTP-Posting-Date: Sat, 5 May 2007 17:30:12 +0000 (UTC)
In-Reply-To: <ZE%_h.163$D52.68@trndny04>
User-Agent: G2/1.0
X-HTTP-UserAgent: Opera/9.20 (X11; Linux i686; U; en),gzip(gfe),gzip(gfe)
Complaints-To: groups-abuse@google.com
Injection-Info: n76g2000hsh.googlegroups.com; posting-host=69.148.182.167;
   posting-account=mW7hBA0AAABTqVYqprbLgSusB9AyVeEZ
Xref: news.f.de.plusline.net comp.databases.theory:43798


David Cressey wrote:
> In the "interpolation" thread,  Brian has been expounding on the idea that
> he can discover algorithms that he knows to be correct,  and that outperform
> anything an optimizer can generate. He's mentioned "self joins"  and the
> idea that he can get a job doen with only one read,  where the automatic
> solution generated in response to declarative code generates multiple passes
> through the data.
>
> My experience leads me to the opposite conclusion from that.
>
> Here's the questions I'd like the newsgroup to consider:
>
> What, if any, algorithms that are demonstrably correct are going to be
> overlooked by the optimizer,  although discoverable by a human coder?  Why
> would an optimizer fail to discover the possibility of eliminating
> self-joins?  Are there any undecidability issues here?
>
> Aside:
>
> I know a little about the DEC Rdb/VMS optimizer.  Less than that about the
> Oracle CBO,  although the Oracle CBO appears to have been conceptually
> derived from the Rdb/VMS optimizer.  Much less than that about the Oracle
> RBO as of 1994.  The RBO appears to have been an optimizer in name only.
>
> While I've got a lot of positive opinions about various features of Oracle,
> the RBO was not one of those features.  Shortly after my cutover to Oracle,
> one of the tech types told me:  "All the cool people use hints."
>
> After looking at it in as much depth as I cared to,  my conclusion (which I
> kept to myself) was:  "No.  All the really cool people find a DBMS that
> doesn't require hints to do a good job."
>
> End of aside.

A simple case study: extend a relation by adding an attribute that
gives the rank of each tuple based on the ordering of a subset of
existing attributes of the relation.  With primitive relational
operators, the only way I'm aware of to express this is:

SELECT E1.NAME, E1.SAL, COUNT(*) RNK
FROM EMP E1 EMP E2
WHERE E2.SAL <= E1.SAL
GROUP BY E1.NAME, E1.SAL;

I'm not aware of any rationale that would allow an optimizer to
compute the above with anything faster than a O(n^2) algorithm.

However, the SQL "windowing function" version:

SELECT NAME, SAL, RANK() OVER (ORDER BY SAL) RNK
FROM EMP;

computes the relation in O(n log n) time (it does a sort, then a scan,
of the SAL field, to compute the RANK() function for each tuple).

I'd argue that windowing functions allow for declarative
specifications with which an optimizer will find algorithms at least
as fast as anything a coder could do with a cursor.  In fact, I
challenge anyone to post any data definition and hypothetical query
specification where there exists a coded algorithm with a more
efficient big-O than what an optimizer could deduce given a
declarative solution.

A side note: the "45-hour to 45-minute improvement" someone remarked
on may refer to a program I rewrote a couple years ago and described
on this board.  The program required a table be updated using a slew
of complex pro-rating, interpolating, missing-value defaulting, and
more based on data scattered over 15 or so tables.  It had been done
in many thousands of lines of code using procedural "bucket
variables", loading indexed arrays, etc.; yet I incorporated all of
the logic into a single query using windowing functions + standard
relational operators.  The procedural code, having undergone at least
a dozen tuning fixes over the 4 years the program had existed (some by
very talented programmers using coding tricks that took me hours to
figure out), was still several orders of magnitude slower (and far
buggier!) than the single SQL query which, having now been around 3
years, has yet to require a single tuning effort.

I'm not so certain that simple set and join operators, without using
window functions, can always be so competitive.  Returning to the case
study I posted above: I would love to see either a query which only
uses relational operators to specify the "ranking" query in such a way
that an optimizer would reasonably find a O(n log n) algorithm; or, a
reasoned analysis which argues that an optimizer should find a O(n log
n) algorithm using the specification *I* gave.

