Re: Self Joins and optimization

From: Brian Selzer <brian_at_selzer-software.com>
Date: Sat, 05 May 2007 17:06:52 GMT
Message-ID: <MQ2%h.4456$H_.2052_at_newssvr21.news.prodigy.net>


"David Cressey" <cressey73_at_verizon.net> wrote in message news:ZE%_h.163$D52.68_at_trndny04...
> In the "interpolation" thread, Brian has been expounding on the idea that
> he can discover algorithms that he knows to be correct, and that
> outperform
> anything an optimizer can generate. He's mentioned "self joins" and the
> idea that he can get a job doen with only one read, where the automatic
> solution generated in response to declarative code generates multiple
> passes
> through the data.
>
> My experience leads me to the opposite conclusion from that.
>
> Here's the questions I'd like the newsgroup to consider:
>
> What, if any, algorithms that are demonstrably correct are going to be
> overlooked by the optimizer, although discoverable by a human coder? Why
> would an optimizer fail to discover the possibility of eliminating
> self-joins? Are there any undecidability issues here?
>
Here's a maybe not-so-simple example:

Given a table with the columns,

EMPLOYEE#, JOB#, STEP#, MACHINE#, TX_TIME, TX_TYPE, where TX_TYPE can be either 'ON' or 'OFF' and the key is the entire heading,

compute the total labor performed to date for each job step.

Notes: (1) an employee cannot be performing the same job step on the same machine more than once at the same time, but (2) intervals can overlap, and when they do, labor should be distributed evenly between them. For example, an employee may be monitoring two or more machines during the same period of time, so his labor during that period should be divided evenly between the job steps on each machine; but that same employee cannot be monitoring the same job step on the same machine more than once during the same period of time.
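
To make note (2) concrete, here is a small hypothetical illustration in Python (the employee, job, machine numbers and times are invented for this example):

    # Hypothetical ON/OFF events for one employee (EMPLOYEE# 1).
    # Each tuple is (EMPLOYEE#, JOB#, STEP#, MACHINE#, TX_TIME, TX_TYPE),
    # with TX_TIME expressed here as hours for simplicity.
    events = [
        (1, 100, 10, 'A',  8.0, 'ON'),   # starts step 10 on machine A at 08:00
        (1, 200, 20, 'B',  9.0, 'ON'),   # starts step 20 on machine B at 09:00
        (1, 100, 10, 'A', 10.0, 'OFF'),  # leaves machine A at 10:00
        (1, 200, 20, 'B', 10.0, 'OFF'),  # leaves machine B at 10:00
    ]
    # From 08:00 to 09:00 only step 10 is active: 1.0 hour goes to step 10.
    # From 09:00 to 10:00 both steps are active, so that hour is split evenly:
    # 0.5 hour to each. Expected totals: step 10 -> 1.5 hours, step 20 -> 0.5.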

A procedural approach would require two passes through the data: One to compute labor per job step per employee, and one to compute the aggregate, labor per job step.
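
A minimal procedural sketch of those two passes in Python (the tuple layout matches the hypothetical events above; the function name and the use of hours for TX_TIME are assumptions for illustration, and a real implementation would read the rows from the database ordered by EMPLOYEE# and TX_TIME):

    from collections import defaultdict
    from itertools import groupby

    def labor_per_job_step(rows):
        """rows: (emp, job, step, machine, tx_time, tx_type) tuples,
        sorted by emp then tx_time -- one ordered scan of the table."""
        per_emp_step = defaultdict(float)       # (emp, job, step) -> hours
        for emp, emp_rows in groupby(rows, key=lambda r: r[0]):
            active = set()                      # (job, step, machine) currently ON
            prev_time = None
            for _, job, step, machine, tx_time, tx_type in emp_rows:
                # Pass 1: split the elapsed time evenly over the active steps.
                if active and prev_time is not None:
                    share = (tx_time - prev_time) / len(active)
                    for j, s, _m in active:
                        per_emp_step[(emp, j, s)] += share
                if tx_type == 'ON':
                    active.add((job, step, machine))
                else:
                    active.discard((job, step, machine))
                prev_time = tx_time
        # Pass 2: aggregate labor per job step across employees.
        totals = defaultdict(float)             # (job, step) -> hours
        for (emp, job, step), hours in per_emp_step.items():
            totals[(job, step)] += hours
        return dict(totals)

With the sample events above it returns {(100, 10): 1.5, (200, 20): 0.5}.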

A set-based approach would require a minimum of six passes through the data: Determining the intervals during which each employee was performing a set of job steps requires a left outer self-join, or two passes through the data. Computing the cardinality of each of those sets so that labor can be distributed evenly requires an aggregation and then a join--three additional passes. Computing the aggregate, labor per job step, requires a final pass.
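
For comparison, here is a rough Python simulation of that multi-pass, set-based plan over the same hypothetical event tuples (the decomposition into list comprehensions, the segment construction, and all names are assumptions made to mirror the passes described above, not a literal query plan):

    from collections import defaultdict

    def labor_set_based(rows):
        ons  = [r for r in rows if r[5] == 'ON']
        offs = [r for r in rows if r[5] == 'OFF']

        # Self-join: pair each ON with the earliest later OFF for the same
        # employee/job/step/machine, yielding one interval per ON event.
        intervals = []
        for e, j, s, m, t_on, _ in ons:
            ends = [t for e2, j2, s2, m2, t, _ in offs
                    if (e2, j2, s2, m2) == (e, j, s, m) and t > t_on]
            if ends:
                intervals.append((e, j, s, m, t_on, min(ends)))

        # Break each employee's timeline into segments at interval boundaries.
        bounds = defaultdict(set)
        for e, _, _, _, t0, t1 in intervals:
            bounds[e].update((t0, t1))
        segments = [(e, a, b) for e, ts in bounds.items()
                    for a, b in zip(sorted(ts), sorted(ts)[1:])]

        # Aggregation: the cardinality of the set of job steps each employee
        # is performing during each segment (used to split labor evenly).
        load = {(e, a, b): sum(1 for e2, _, _, _, t0, t1 in intervals
                               if e2 == e and t0 <= a and b <= t1)
                for e, a, b in segments}

        # Join back and final aggregation: distribute each segment's duration
        # over its concurrent job steps and sum per job step.
        totals = defaultdict(float)
        for e, j, s, m, t0, t1 in intervals:
            for (e2, a, b), n in load.items():
                if e2 == e and t0 <= a and b <= t1:
                    totals[(j, s)] += (b - a) / n
        return dict(totals)

It produces the same totals as the procedural sketch, but only by materializing and rescanning intermediate results several times.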

Since the final pass is common to both approaches, the procedural approach effectively replaces five passes through the data with a single pass.

Why would the optimizer fail? Because I don't think it's possible for the optimizer to detect every intrarelational dependency in a table, nor do I think it would be advisable, unless it were possible to enumerate them in the schema.

[snip]
