Re: Self Joins and optimization

From: Brian Selzer <brian_at_selzer-software.com>
Date: Sun, 06 May 2007 06:30:01 GMT
Message-ID: <JBe%h.16839$YL5.304_at_newssvr29.news.prodigy.net>


"David Cressey" <cressey73_at_verizon.net> wrote in message news:Vg5%h.166$83.81_at_trndny08...
>
> "Brian Selzer" <brian_at_selzer-software.com> wrote in message
> news:MQ2%h.4456$H_.2052_at_newssvr21.news.prodigy.net...
>>
>> "David Cressey" <cressey73_at_verizon.net> wrote in message
>> news:ZE%_h.163$D52.68_at_trndny04...
>> > In the "interpolation" thread, Brian has been expounding on the idea
> that
>> > he can discover algorithms that he knows to be correct, and that
>> > outperform
>> > anything an optimizer can generate. He's mentioned "self joins" and
>> > the
>> > idea that he can get a job doen with only one read, where the
>> > automatic
>> > solution generated in response to declarative code generates multiple
>> > passes
>> > through the data.
>> >
>> > My experience leads me to the opposite conclusion from that.
>> >
>> > Here's the questions I'd like the newsgroup to consider:
>> >
>> > What, if any, algorithms that are demonstrably correct are going to be
>> > overlooked by the optimizer, although discoverable by a human coder?
> Why
>> > would an optimizer fail to discover the possibility of eliminating
>> > self-joins? Are there any undecidability issues here?
>> >
>> Here's a maybe not-so-simple example:
>>
>> Given a table with the columns,
>>
>> EMPLOYEE#, JOB#, STEP#, MACHINE#, TX_TIME, TX_TYPE
>>
>> where TX_TYPE can be either 'ON' or 'OFF' and the key is the entire
> heading,
>>
>
> I hope I can ask a bunch of questions without derailing your example. I
> need to understand the example before I can come up with either a counter
> argument, or agreement.
>
> First, a matter of simple syntax: What does "#" mean? is it part of the
> attribute name, or does it indicate something about the attribute? Is
> this
> SQL or some other language?
>
It's just part of the attribute name.
>
>> compute the total labor performed to date for each job step.
>>
>
> What is a job step? How is it identified?
>
A job step is one of the steps required to complete a job, or work order. The distinction between a job and a work order is not material to the discussion. A job step is identified by the conjunction of a value for each of the attributes JOB# and STEP#.
>
>> Notes: (1) an employee cannot be perfoming the same job step on the same
>> machine at the same time, but (2) intervals can overlap, and when they
>> do,
>> labor should be distributed between them evenly. For example, an
>> employee
>> may be monitoring two or more machines during the same period of time, so
>> his labor during that period should be divided evenly between the job
> steps
>> on each machine, but that same employee cannot be monitoring the same job
>> step on the same machine more than once during the same period of time.
>>
>
> How do you guarantee that the data will not violate the constraints
> declared
> in the above notes? Are there declared constraints to prevent the
> insertion
> of cotradictory data, or is it just "careful programming" on the part of
> those who write the SW that writes the data to the DB?
>

I would probably use a trigger. It would be nice to be able to declare a temporal constraint, but I expect it would have to be applied to a view because of the nature of the data in the table. There is one other constraint that I didn't mention: for every row with 'OFF,' there must exist a row with 'ON' such that EMPLOYEE#, JOB#, STEP#, MACHINE# are the same and that TX_TIME in the row containing 'ON' < TX_TIME in the row containing 'OFF'.
> If the constraints were declared, wouldn't this information be available
> to
> the optimizer?
>

I would assume so, I can't think of how either of the constraints could make a difference in this case.

>
>> A procedural approach would require two passes through the data: One to
>> compute labor per job step per employee, and one to compute the
>> aggregate,
>> labor per job step.
>>
>> A set-based approach would require a minimum of six passes through the
> data:
>> Determining the intervals during which each employee was performing a set
> of
>> job steps requires a left outer self-join, or two passes through the
>> data.
>> Computing the cardinality of each of those sets so that labor can be
>> distributed evenly requires an aggregation and then a join--three
> additional
>> passes. Computing the aggregate, labor per job step, requires a final
> pass.
>>
>
> Could you provide SQL for a set based query that computes the correct
> result? This would help me to understand the example better.
>

In order to make things easier to follow, instead of a single query, I split the query up by using a couple views.

The table definition:

CREATE TABLE TX
(
 EMPLOYEE# INT NOT NULL,
 JOB# INT NOT NULL,
 STEP# INT NOT NULL,
 MACHINE# INT NOT NULL,
 TX_TYPE CHAR(3) NOT NULL
  CHECK (TX_TYPE = 'ON' OR TX_TYPE = 'OFF'),  TX_TIME DATETIME NOT NULL,
  PRIMARY KEY (EMPLOYEE#, JOB#, STEP#, MACHINE#, TX_TYPE, TX_TIME) )

A little bit of test data:

INSERT TX VALUES (1,1,1,1,'ON', '2007-05-03 08:00:00')
INSERT TX VALUES (1,2,1,1,'ON', '2007-05-03 08:00:00')
INSERT TX VALUES (1,1,1,1,'OFF', '2007-05-03 09:00:00')
INSERT TX VALUES (1,2,1,1,'OFF', '2007-05-03 09:30:00')
INSERT TX VALUES (1,1,1,1,'ON', '2007-05-03 10:00:00')
INSERT TX VALUES (1,1,1,1,'OFF', '2007-05-03 11:00:00')
INSERT TX VALUES (1,1,2,1,'ON', '2007-05-03 11:30:00')
INSERT TX VALUES (1,1,2,1,'OFF', '2007-05-03 12:00:00')
INSERT TX VALUES (1,2,2,1,'ON', '2007-05-03 13:00:00')
INSERT TX VALUES (1,2,2,1,'OFF', '2007-05-03 14:00:00')

This view pairs up the rows in TX. For pairs without an 'OFF' row, substitute the current time for TIME_OFF:

CREATE VIEW PAIRED_TX AS

SELECT a.EMPLOYEE#,
       a.JOB#,
       a.STEP#,
       a.MACHINE#,
       TX_TIME AS TIME_ON,
       COALESCE(
         (SELECT MIN(TX_TIME)
            FROM TX
            WHERE EMPLOYEE# = a.EMPLOYEE#
              AND JOB# = a.JOB#
              AND STEP# = a.STEP#
              AND MACHINE# = a.MACHINE#
              AND TX_TIME > a.TX_TIME
              AND TX_TYPE = 'OFF'),
       CURRENT_TIMESTAMP) AS TIME_OFF

    FROM TX a
    WHERE TX_TYPE = 'ON' This view enumerates the intervals per employee. Note an additional row with STOP_TIME as the current time. This is needed to deal with unpaired rows in TX.

CREATE VIEW INTERVALS AS
SELECT a.EMPLOYEE#, a.START_TIME, a.STOP_TIME

    FROM

      (SELECT DISTINCT a.EMPLOYEE#,
          a.TX_TIME AS START_TIME,
          (SELECT MIN(TX_TIME)
            FROM TX
            WHERE EMPLOYEE# = a.EMPLOYEE#
              AND TX_TIME > a.TX_TIME) AS STOP_TIME
         FROM TX a
       UNION
       SELECT EMPLOYEE#,
           MAX(TX_TIME) AS START_TIME,
           CURRENT_TIMESTAMP AS STOP_TIME
       FROM TX
       GROUP BY EMPLOYEE#
       HAVING SUM(CASE WHEN TX_TYPE = 'ON' THEN 1 ELSE -1 END) > 0) a
  WHERE STOP_TIME IS NOT NULL Here's the query: Sorry about the formatting. If you paste it into your favorite fixed font editor, it will be more readable. You may need a substitute for DATEDIFF for your favorite DBMS. This works on Sql Server.

SELECT JOB#,

       STEP#,
       SUM(CAST(DATEDIFF(ss, d.START_TIME, d.STOP_TIME) AS FLOAT) / PAIRS) / 
3600.0 AS LABOR
  FROM
--this part of the query calculates the number of paired rows that occupy each interval

    (SELECT b.EMPLOYEE#, START_TIME, STOP_TIME, CAST(COUNT(*) AS FLOAT) AS PAIRS

      FROM PAIRED_TX a
        JOIN INTERVALS b
          ON (b.EMPLOYEE# = a.EMPLOYEE#
              AND b.START_TIME >= a.TIME_ON
              AND b.STOP_TIME <= a.TIME_OFF)
      GROUP BY b.EMPLOYEE#, b.START_TIME, b.STOP_TIME) c
    JOIN
--this part of the query splits each pair into individual intervals.

      (SELECT a.EMPLOYEE#, a.JOB#, a.STEP#, a.MACHINE#, START_TIME, STOP_TIME

          FROM PAIRED_TX a
            JOIN INTERVALS b
              ON (b.EMPLOYEE# = a.EMPLOYEE#
                  AND b.START_TIME >= a.TIME_ON
                  AND b.STOP_TIME <= a.TIME_OFF)) d
      ON (d.EMPLOYEE# = c.EMPLOYEE#
          AND d.START_TIME = c.START_TIME
          AND d.STOP_TIME = c.STOP_TIME)
  GROUP BY JOB#, STEP# The results from the supplied test data: JOB# STEP# LABOR
-------- -------- ----------
1           1           1.5
2           1           1.0
1           2           0.5
2           2           1.0


>> Since the final pass is common to both approaches, the procedural
>> approach
>> effectively replaces five passes through the data with a single pass.
>>
>> Why would the optimizer fail? Because I don't think it's possible for
>> the
>> optimizer to detect every intrarelational dependency in a table, nor do I
>> think it would be advisable, unless it were possible to enumerate them in
>> the schema.
>>
> Why would you want to have dependencies in the data that were NOT
> enumerated
> in the schema (assuming I understand what you mean by "enumerated")?
> Deosn't the schema describe the data?
>

There may be an implicit order to the data, such as is the case with temporal data.

>
> I hope it's not too much trouble to clear these issues up first. Before
> we
> get to the meat of it.
>
>
>
Received on Sun May 06 2007 - 08:30:01 CEST

Original text of this message