Re: Self Joins and optimization

From: Brian Selzer <brian_at_selzer-software.com>
Date: Wed, 09 May 2007 16:15:01 GMT
Message-ID: <9sm0i.8051$rO7.15_at_newssvr25.news.prodigy.net>

"David Cressey" <cressey73_at_verizon.net> wrote in message news:uBj0i.14554$rm.10970_at_trndny03... [snip]
>
> Brian,
>
> I'm not ignoring your example. I'm trying to digest it, a little bit at
> a
> time. As you said, it's not quite so simple. I have no response at this
> time, other than vague philosophical generalities. I think it would be
> insulting to offer such generalities in response to your input of code and
> data.
>
> So I'm hoping, before too long, to have something more specific to say.
>
> In the meantime, your example provides me much food for thought about
> temporal data, transition versus state, and simultaneity.
>

Temporal data can really make you tear what little hair you have left out. I bought Date, Darwin and Lorentzos' /Temporal Data and the Relational Model/ recently, and found that there really isn't much in it that I hadn't already figured out on my own. I also think some of their arguments are lacking. Their arguments against using NOW for present data are totally unfounded, and I find their solution, splitting current and historical data into
separate relvars abhorrent. First of all, an interval of the form [NOW, d16]
does not make any sense in a closed world. Secondly, why couldn't there be different kinds of intervals: definite, indefinite start, indefinite end, and
universal. (They certainly allow for different granularities and precision.)
In a closed world, an indefinite interval would have to refer to either positive
or negative infinity, and even given the discreteness assumption, the number of possible timestamps is countably infinite. The introduction of indefinite
intervals would eliminate the need for separate relvars, and would also allow
invariant information to be recorded in the same relvar. The presence of the interval (-infinity,+infinity) would clearly indicate that the information represented by the tuple applies for all time, and as such, must never change.
This could even be enforced by the engine. So their solution is the result of
a faulty argument and a poorly designed interval data type.

Sorry for diverging from the topic at hand. Below is a procedural solution. It took a bit of extra time to write, because there are so many conditions. Also, in order to make the set-based solution easier to understand, I'm reincluding it here further annotated for comparison with the procedural solution.

/*
Need to cache CURRENT_TIMESTAMP so that apples and apples can be compared */

DECLARE _at_CURRENT_TIME DATETIME SET @CURRENT_TIME = CURRENT_TIMESTAMP
DECLARE _at_NUMJOBS INT
DECLARE _at_PREV_EMPLOYEE INT,
  _at_NEXT_EMPLOYEE INT,
  _at_PREV_TX_TIME DATETIME,
  _at_NEXT_TX_TIME DATETIME,
  _at_PREV_TX_TYPE CHAR(3),
  _at_NEXT_TX_TYPE CHAR(3),
  _at_PREV_JOB INT,
  _at_NEXT_JOB INT,
  _at_PREV_STEP INT,
  _at_NEXT_STEP INT,

_at_PREV_MACHINE INT,
_at_NEXT_MACHINE INT
DECLARE _at_X CURSOR

/*
First build a set of intervals and calculate the number of jobs an employee is working on during each.
Use the cached _at_CURRENT_TIME for the final interval if the employee is currently working.
This takes one pass through the data in order to produce the table variable _at_I.
Caching the result into a table variable and then reading it out in the final join is similar to
a Table Spool/Eager Spool optimizer step. Which in my judgement would be required in
a fully-optimized set-based solution as well. */
EMPLOYEE#, TX_TIME, TX_TYPE

            |
           V

EMPLOYEE#, START_TIME, STOP_TIME, NUMJOBS */
DECLARE _at_I TABLE
(
EMPLOYEE# INT NOT NULL,
START_TIME DATETIME NOT NULL,
STOP_TIME DATETIME NOT NULL,
NUMJOBS INT NOT NULL,
PRIMARY KEY (EMPLOYEE#, START_TIME, STOP_TIME) )

SET _at_X = CURSOR LOCAL STATIC FORWARD_ONLY FOR SELECT EMPLOYEE#, TX_TIME, TX_TYPE
FROM TX
ORDER BY EMPLOYEE#, TX_TIME, TX_TYPE DESC OPEN _at_X
FETCH _at_X INTO @PREV_EMPLOYEE, @PREV_TX_TIME, @PREV_TX_TYPE IF _at__at_FETCH_STATUS = 0
BEGIN
IF _at_PREV_TX_TYPE = 'ON'
BEGIN
  SET _at_NUMJOBS = 1
  FETCH _at_X INTO @NEXT_EMPLOYEE, @NEXT_TX_TIME, @NEXT_TX_TYPE   WHILE _at__at_FETCH_STATUS = 0
  BEGIN
   IF _at_PREV_EMPLOYEE = @NEXT_EMPLOYEE    BEGIN
    IF _at_PREV_TX_TIME < @NEXT_TX_TIME

     IF _at_NUMJOBS > 0
      INSERT _at_I VALUES (@PREV_EMPLOYEE, @PREV_TX_TIME, @NEXT_TX_TIME,

_at_NUMJOBS)

    SET _at_NUMJOBS = @NUMJOBS + CASE WHEN @NEXT_TX_TYPE = 'ON' THEN 1 ELSE -1 END
   END ELSE BEGIN
    IF _at_NUMJOBS > 0
     INSERT _at_I VALUES (@PREV_EMPLOYEE, @PREV_TX_TIME, @CURRENT_TIME, _at_NUMJOBS)

SET _at_NUMJOBS = 0
END

   SELECT _at_PREV_EMPLOYEE = @NEXT_EMPLOYEE,
       _at_PREV_TX_TIME = @NEXT_TX_TIME,
       _at_PREV_TX_TYPE = @NEXT_TX_TYPE

   FETCH _at_X INTO @NEXT_EMPLOYEE, @NEXT_TX_TIME, @NEXT_TX_TYPE   END
  IF _at_NUMJOBS > 0
   INSERT _at_I VALUES (@PREV_EMPLOYEE, @PREV_TX_TIME, @CURRENT_TIME, @NUMJOBS) END
END
CLOSE _at_X
DEALLOCATE _at_X

/*
Now pair up the rows so that there is only one row per employee, job, step, machine, and interval.
Use the cached _at_CURRENT_TIME to complete the intervals if necessary. This takes one pass through the data to produce the table variable _at_P. Again, in my judgement, the Table Spool/Eager Spool optimizer step would exist in a fully optimized set-based solution. EMPLOYEE#, JOB#, STEP#, MACHINE#, TX_TIME, TX_TYPE

                        |
                       V

EMPLOYEE#, START_TIME, STOP_TIME, JOB#, STEP#, MACHINE# */
DECLARE _at_P TABLE
(
EMPLOYEE# INT NOT NULL,
START_TIME DATETIME NOT NULL,
STOP_TIME DATETIME NOT NULL,
JOB# INT NOT NULL,
STEP# INT NOT NULL,
MACHINE# INT NOT NULL,
PRIMARY KEY (EMPLOYEE#, START_TIME, STOP_TIME, JOB#, STEP#, MACHINE#) )

SET _at_X = CURSOR LOCAL FORWARD_ONLY FOR SELECT EMPLOYEE#, JOB#, STEP#, MACHINE#, TX_TIME, TX_TYPE FROM TX
ORDER BY EMPLOYEE#, JOB#, STEP#, MACHINE#, TX_TIME, TX_TYPE DESC OPEN _at_X
FETCH _at_X INTO @PREV_EMPLOYEE, @PREV_JOB, @PREV_STEP, @PREV_MACHINE, _at_PREV_TX_TIME, @PREV_TX_TYPE

WHILE _at__at_FETCH_STATUS = 0
BEGIN
FETCH _at_X INTO @NEXT_EMPLOYEE, @NEXT_JOB, @NEXT_STEP, @NEXT_MACHINE, _at_NEXT_TX_TIME, @NEXT_TX_TYPE
IF (_at__at_FETCH_STATUS != 0

   OR _at_PREV_EMPLOYEE != @NEXT_EMPLOYEE
   OR _at_PREV_JOB != @NEXT_JOB
   OR _at_PREV_STEP != @NEXT_STEP
   OR _at_PREV_MACHINE != @NEXT_MACHINE)
  AND _at_PREV_TX_TYPE = 'ON'

INSERT _at_P VALUES (@PREV_EMPLOYEE, @PREV_TX_TIME, @CURRENT_TIME, @PREV_JOB, _at_PREV_STEP, @PREV_MACHINE)
ELSE IF _at_PREV_EMPLOYEE = @NEXT_EMPLOYEE

    AND _at_PREV_JOB = @NEXT_JOB
    AND _at_PREV_STEP = @NEXT_STEP
    AND _at_PREV_MACHINE = @NEXT_MACHINE
    AND _at_PREV_TX_TYPE = 'ON'
    AND _at_NEXT_TX_TYPE = 'OFF'

INSERT _at_P VALUES (@PREV_EMPLOYEE, @PREV_TX_TIME, @NEXT_TX_TIME, @PREV_JOB,

_at_PREV_STEP, @PREV_MACHINE)
 SELECT _at_PREV_EMPLOYEE = @NEXT_EMPLOYEE,
     _at_PREV_JOB = @NEXT_JOB,
     _at_PREV_STEP = @NEXT_STEP,
     _at_PREV_MACHINE = @NEXT_MACHINE,
     _at_PREV_TX_TIME = @NEXT_TX_TIME,
     _at_PREV_TX_TYPE = @NEXT_TX_TYPE

END
CLOSE _at_X
DEALLOCATE _at_X

Finally join the two table variables and aggregate to produce the result. It should be noted that the other half of the Table Spool/Eager Spool optimizer steps referred to above would feed a similar join in the set-based solution.
It should also be noted that the table variables are already in the correct order
for the join, due to the primary key constraint on each. I think the set-based
solution might require an extra sort, though as is often the case, I may be wrong.

*/
SELECT JOB#, STEP#, SUM(CAST(DATEDIFF(ss, a.START_TIME, a.STOP_TIME) AS FLOAT)/(3600.0 * CAST(NUMJOBS AS FLOAT))) AS LABOR FROM _at_I a JOIN @P b
  ON (b.EMPLOYEE# = a.EMPLOYEE#
   AND b.START_TIME < a.STOP_TIME
   AND b.STOP_TIME > a.START_TIME)
GROUP BY JOB#, STEP# --Here's the further annotated set-based solution:

/*
This view does with a correlated subquery combined with aggregation exactly what the cursor that produces _at_P does in the set-based solution.

I acknowledge that there may be another, better way to do this that doesn't generate a Loop Join. I used this method because I feel that it is easier to read and understand.
*/
CREATE VIEW PAIRED_TX AS

SELECT a.EMPLOYEE#,
       a.JOB#,
       a.STEP#,
       a.MACHINE#,
       TX_TIME AS TIME_ON,
       COALESCE(
         (SELECT MIN(TX_TIME)
            FROM TX
            WHERE EMPLOYEE# = a.EMPLOYEE#
              AND JOB# = a.JOB#
              AND STEP# = a.STEP#
              AND MACHINE# = a.MACHINE#
              AND TX_TIME > a.TX_TIME
              AND TX_TYPE = 'OFF'),
       CURRENT_TIMESTAMP) AS TIME_OFF

FROM TX a
WHERE TX_TYPE = 'ON' /*
This view simply returns a set of intervals between successive *sets* of rows.
I highlight *sets* here because it is possible for there to be multiple rows for
an employee with the same TX_TIME. In other words, an employee can clock onto multiple jobs at the same instant.

Note: this was acutally a requirement in the original solution: if the same employee clocked onto or off of more than one job during the same 3 minute period, then the clock time of the first job applied to them all. */
CREATE VIEW INTERVALS AS
SELECT a.EMPLOYEE#, a.START_TIME, a.STOP_TIME

FROM

      (SELECT DISTINCT a.EMPLOYEE#,
          a.TX_TIME AS START_TIME,
          (SELECT MIN(TX_TIME)
            FROM TX
            WHERE EMPLOYEE# = a.EMPLOYEE#
              AND TX_TIME > a.TX_TIME) AS STOP_TIME
         FROM TX a
       UNION
       SELECT EMPLOYEE#,
           MAX(TX_TIME) AS START_TIME,
           CURRENT_TIMESTAMP AS STOP_TIME
       FROM TX
       GROUP BY EMPLOYEE#
       HAVING SUM(CASE WHEN TX_TYPE = 'ON' THEN 1 ELSE -1 END) > 0) a

WHERE STOP_TIME IS NOT NULL SELECT JOB#,

       STEP#,
       SUM(CAST(DATEDIFF(ss, d.START_TIME, d.STOP_TIME) AS FLOAT) / PAIRS) /

3600.0 AS LABOR
FROM
/*
The following self-join calculates the number of jobs an employee is on during any interval.
This section of the query is replaced by the cursor that produces table variable _at_I in the procedural solution. Note that procedural solution doesn't need an additional scan to deal with unpaired jobs
(jobs that an employee is currently working on), nor an additional two scans to determine the intervals
during which an employee is working on a set of jobs, nor an additional two scans to determine
'ON' - 'OFF' pairings for each job that an employee was or is working on. */

(SELECT b.EMPLOYEE#, START_TIME, STOP_TIME, CAST(COUNT(*) AS FLOAT) AS PAIRS

      FROM PAIRED_TX a
        JOIN INTERVALS b
          ON (b.EMPLOYEE# = a.EMPLOYEE#
              AND b.START_TIME >= a.TIME_ON
              AND b.STOP_TIME <= a.TIME_OFF)
      GROUP BY b.EMPLOYEE#, b.START_TIME, b.STOP_TIME) c

JOIN
/*
The following self-join splits the jobs up so that there is one row per employee, job, step, machine, and interval. This is replaced by the cursor that produces _at_P in the procedural solution. Some of the remarks from above apply here as well. Part of the difficulty, and the reason this is necessary
is due to the fact that an employee could be still working on some jobs at the current time.
*/

(SELECT a.EMPLOYEE#, a.JOB#, a.STEP#, a.MACHINE#, START_TIME, STOP_TIME

          FROM PAIRED_TX a
            JOIN INTERVALS b
              ON (b.EMPLOYEE# = a.EMPLOYEE#
                  AND b.START_TIME >= a.TIME_ON
                  AND b.STOP_TIME <= a.TIME_OFF)) d
      ON (d.EMPLOYEE# = c.EMPLOYEE#
          AND d.START_TIME = c.START_TIME
          AND d.STOP_TIME = c.STOP_TIME)

/*
Finally, the result is joined and aggregated to produce the result. Here's where the results of the above subqueries would need to be sorted, or hashes computed (but not so likely, due to the theta joins involved-- it all depends on what the optimizer produces) in order to complete the join. Then the result of the join would need to be sorted or hashed again in order to compute the aggregate. Of course, the final sort/hash step would be necessary in either solution. */
GROUP BY JOB#, STEP# Received on Wed May 09 2007 - 18:15:01 CEST

This message: [ Message body ]
Next message: Cimode: "Re: how different indexing techniques work as dimensionality increases?"
Previous message: ErkanY: "how different indexing techniques work as dimensionality increases?"
Maybe in reply to: David Cressey: "Self Joins and optimization"
In reply to David Cressey: "Re: Self Joins and optimization"
Next in thread: Brian Selzer: "Re: Self Joins and optimization"
Reply: Brian Selzer: "Re: Self Joins and optimization"

Contemporary messages sorted: [ by date ] [ by thread ] [ by subject ] [ by author ]

Original text of this message