Re: Self Joins and optimization

From: David Cressey <cressey73_at_verizon.net>
Date: Sat, 05 May 2007 19:53:25 GMT
Message-ID: <Vg5%h.166$83.81_at_trndny08>

"Brian Selzer" <brian_at_selzer-software.com> wrote in message news:MQ2%h.4456$H_.2052_at_newssvr21.news.prodigy.net...
>
> "David Cressey" <cressey73_at_verizon.net> wrote in message
> news:ZE%_h.163$D52.68_at_trndny04...
> > In the "interpolation" thread, Brian has been expounding on the idea
that
> > he can discover algorithms that he knows to be correct, and that
> > outperform
> > anything an optimizer can generate. He's mentioned "self joins" and the
> > idea that he can get a job doen with only one read, where the automatic
> > solution generated in response to declarative code generates multiple
> > passes
> > through the data.
> >
> > My experience leads me to the opposite conclusion from that.
> >
> > Here's the questions I'd like the newsgroup to consider:
> >
> > What, if any, algorithms that are demonstrably correct are going to be
> > overlooked by the optimizer, although discoverable by a human coder?
Why
> > would an optimizer fail to discover the possibility of eliminating
> > self-joins? Are there any undecidability issues here?
> >
> Here's a maybe not-so-simple example:
>
> Given a table with the columns,
>
> EMPLOYEE#, JOB#, STEP#, MACHINE#, TX_TIME, TX_TYPE
>
> where TX_TYPE can be either 'ON' or 'OFF' and the key is the entire
heading,
>

I hope I can ask a bunch of questions without derailing your example. I need to understand the example before I can come up with either a counter argument, or agreement.

First, a matter of simple syntax: What does "#" mean? is it part of the attribute name, or does it indicate something about the attribute? Is this SQL or some other language?

> compute the total labor performed to date for each job step.
>

What is a job step? How is it identified?

> Notes: (1) an employee cannot be perfoming the same job step on the same
> machine at the same time, but (2) intervals can overlap, and when they do,
> labor should be distributed between them evenly. For example, an employee
> may be monitoring two or more machines during the same period of time, so
> his labor during that period should be divided evenly between the job
steps
> on each machine, but that same employee cannot be monitoring the same job
> step on the same machine more than once during the same period of time.
>

How do you guarantee that the data will not violate the constraints declared in the above notes? Are there declared constraints to prevent the insertion of cotradictory data, or is it just "careful programming" on the part of those who write the SW that writes the data to the DB?

If the constraints were declared, wouldn't this information be available to the optimizer?

> A procedural approach would require two passes through the data: One to
> compute labor per job step per employee, and one to compute the aggregate,
> labor per job step.
>
> A set-based approach would require a minimum of six passes through the
data:
> Determining the intervals during which each employee was performing a set
of
> job steps requires a left outer self-join, or two passes through the data.
> Computing the cardinality of each of those sets so that labor can be
> distributed evenly requires an aggregation and then a join--three
additional
> passes. Computing the aggregate, labor per job step, requires a final
pass.
>

Could you provide SQL for a set based query that computes the correct result? This would help me to understand the example better.

> Since the final pass is common to both approaches, the procedural approach
> effectively replaces five passes through the data with a single pass.
>
> Why would the optimizer fail? Because I don't think it's possible for the
> optimizer to detect every intrarelational dependency in a table, nor do I
> think it would be advisable, unless it were possible to enumerate them in
> the schema.
>
Why would you want to have dependencies in the data that were NOT enumerated in the schema (assuming I understand what you mean by "enumerated")? Deosn't the schema describe the data?

I hope it's not too much trouble to clear these issues up first. Before we get to the meat of it. Received on Sat May 05 2007 - 21:53:25 CEST

This message: [ Message body ]
Next message: Cimode: "Re: A new proof of the superiority of set oriented approaches: numerical/time serie linear interpolation"
Previous message: David Cressey: "Re: more closed-world chatter"
Maybe in reply to: David Cressey: "Self Joins and optimization"
In reply to Brian Selzer: "Re: Self Joins and optimization"
Next in thread: David Cressey: "Re: Self Joins and optimization"

Contemporary messages sorted: [ by date ] [ by thread ] [ by subject ] [ by author ]

Original text of this message