Re: Self Joins and optimization
Date: Wed, 09 May 2007 16:15:01 GMT
Message-ID: <9sm0i.8051$rO7.15_at_newssvr25.news.prodigy.net>
"David Cressey" <cressey73_at_verizon.net> wrote in message
news:uBj0i.14554$rm.10970_at_trndny03...
[snip]
>
> Brian,
>
> I'm not ignoring your example. I'm trying to digest it, a little bit at
> a
> time. As you said, it's not quite so simple. I have no response at this
> time, other than vague philosophical generalities. I think it would be
> insulting to offer such generalities in response to your input of code and
> data.
>
> So I'm hoping, before too long, to have something more specific to say.
>
> In the meantime, your example provides me much food for thought about
> temporal data, transition versus state, and simultaneity.
>
Temporal data can really make you tear what little hair you have left out.
I bought Date, Darwin and Lorentzos' /Temporal Data and the Relational
Model/ recently, and found that there really isn't much in it that I hadn't
already figured out on my own. I also think some of their arguments
are lacking. Their arguments against using NOW for present data are totally
unfounded, and I find their solution, splitting current and historical data
into
separate relvars abhorrent. First of all, an interval of the form [NOW,
d16]
does not make any sense in a closed world. Secondly, why couldn't there
be different kinds of intervals: definite, indefinite start, indefinite end,
and
universal. (They certainly allow for different granularities and
precision.)
In a closed world, an indefinite interval would have to refer to either
positive
or negative infinity, and even given the discreteness assumption, the number
of possible timestamps is countably infinite. The introduction of
indefinite
intervals would eliminate the need for separate relvars, and would also
allow
invariant information to be recorded in the same relvar. The presence of the
interval (-infinity,+infinity) would clearly indicate that the information
represented by the tuple applies for all time, and as such, must never
change.
This could even be enforced by the engine. So their solution is the result
of
a faulty argument and a poorly designed interval data type.
Sorry for diverging from the topic at hand. Below is a procedural solution. It took a bit of extra time to write, because there are so many conditions. Also, in order to make the set-based solution easier to understand, I'm reincluding it here further annotated for comparison with the procedural solution.
/*
Need to cache CURRENT_TIMESTAMP so that apples and apples can be compared
*/
DECLARE _at_CURRENT_TIME DATETIME SET @CURRENT_TIME = CURRENT_TIMESTAMP DECLARE _at_NUMJOBS INT DECLARE _at_PREV_EMPLOYEE INT, _at_NEXT_EMPLOYEE INT, _at_PREV_TX_TIME DATETIME, _at_NEXT_TX_TIME DATETIME, _at_PREV_TX_TYPE CHAR(3), _at_NEXT_TX_TYPE CHAR(3), _at_PREV_JOB INT, _at_NEXT_JOB INT, _at_PREV_STEP INT, _at_NEXT_STEP INT,
_at_PREV_MACHINE INT,
_at_NEXT_MACHINE INT
DECLARE _at_X CURSOR
/*
First build a set of intervals and calculate the number of jobs an employee
is working on during each.
Use the cached _at_CURRENT_TIME for the final interval if the employee is
currently working.
This takes one pass through the data in order to produce the table variable
_at_I.
Caching the result into a table variable and then reading it out in the
final join is similar to
a Table Spool/Eager Spool optimizer step. Which in my judgement would be
required in
a fully-optimized set-based solution as well.
*/
EMPLOYEE#, TX_TIME, TX_TYPE
| V
EMPLOYEE#, START_TIME, STOP_TIME, NUMJOBS */
DECLARE _at_I TABLE
(
EMPLOYEE# INT NOT NULL,
START_TIME DATETIME NOT NULL,
STOP_TIME DATETIME NOT NULL,
NUMJOBS INT NOT NULL,
PRIMARY KEY (EMPLOYEE#, START_TIME, STOP_TIME) )
SET _at_X = CURSOR LOCAL STATIC FORWARD_ONLY FOR
SELECT EMPLOYEE#, TX_TIME, TX_TYPE
FROM TX
ORDER BY EMPLOYEE#, TX_TIME, TX_TYPE DESC
OPEN _at_X
FETCH _at_X INTO @PREV_EMPLOYEE, @PREV_TX_TIME, @PREV_TX_TYPE
IF _at__at_FETCH_STATUS = 0
BEGIN
IF _at_PREV_TX_TYPE = 'ON'
BEGIN
SET _at_NUMJOBS = 1
FETCH _at_X INTO @NEXT_EMPLOYEE, @NEXT_TX_TIME, @NEXT_TX_TYPE
WHILE _at__at_FETCH_STATUS = 0
BEGIN
IF _at_PREV_EMPLOYEE = @NEXT_EMPLOYEE
BEGIN
IF _at_PREV_TX_TIME < @NEXT_TX_TIME
IF _at_NUMJOBS > 0 INSERT _at_I VALUES (@PREV_EMPLOYEE, @PREV_TX_TIME, @NEXT_TX_TIME,_at_NUMJOBS)
SET _at_NUMJOBS = @NUMJOBS + CASE WHEN @NEXT_TX_TYPE = 'ON' THEN 1 ELSE -1
END
END ELSE BEGIN
IF _at_NUMJOBS > 0
INSERT _at_I VALUES (@PREV_EMPLOYEE, @PREV_TX_TIME, @CURRENT_TIME,
_at_NUMJOBS)
SET _at_NUMJOBS = 0
END
SELECT _at_PREV_EMPLOYEE = @NEXT_EMPLOYEE, _at_PREV_TX_TIME = @NEXT_TX_TIME, _at_PREV_TX_TYPE = @NEXT_TX_TYPE
FETCH _at_X INTO @NEXT_EMPLOYEE, @NEXT_TX_TIME, @NEXT_TX_TYPE END
IF _at_NUMJOBS > 0
INSERT _at_I VALUES (@PREV_EMPLOYEE, @PREV_TX_TIME, @CURRENT_TIME, @NUMJOBS) END
END
CLOSE _at_X
DEALLOCATE _at_X
/*
Now pair up the rows so that there is only one row per employee, job, step,
machine, and interval.
Use the cached _at_CURRENT_TIME to complete the intervals if necessary.
This takes one pass through the data to produce the table variable _at_P.
Again, in my judgement, the Table Spool/Eager Spool optimizer step
would exist in a fully optimized set-based solution.
EMPLOYEE#, JOB#, STEP#, MACHINE#, TX_TIME, TX_TYPE
| V
EMPLOYEE#, START_TIME, STOP_TIME, JOB#, STEP#, MACHINE# */
DECLARE _at_P TABLE
(
EMPLOYEE# INT NOT NULL,
START_TIME DATETIME NOT NULL,
STOP_TIME DATETIME NOT NULL,
JOB# INT NOT NULL,
STEP# INT NOT NULL,
MACHINE# INT NOT NULL,
PRIMARY KEY (EMPLOYEE#, START_TIME, STOP_TIME, JOB#, STEP#, MACHINE#) )
SET _at_X = CURSOR LOCAL FORWARD_ONLY FOR
SELECT EMPLOYEE#, JOB#, STEP#, MACHINE#, TX_TIME, TX_TYPE
FROM TX
ORDER BY EMPLOYEE#, JOB#, STEP#, MACHINE#, TX_TIME, TX_TYPE DESC
OPEN _at_X
FETCH _at_X INTO @PREV_EMPLOYEE, @PREV_JOB, @PREV_STEP, @PREV_MACHINE,
_at_PREV_TX_TIME, @PREV_TX_TYPE
WHILE _at__at_FETCH_STATUS = 0
BEGIN
FETCH _at_X INTO @NEXT_EMPLOYEE, @NEXT_JOB, @NEXT_STEP, @NEXT_MACHINE,
_at_NEXT_TX_TIME, @NEXT_TX_TYPE
IF (_at__at_FETCH_STATUS != 0
OR _at_PREV_EMPLOYEE != @NEXT_EMPLOYEE OR _at_PREV_JOB != @NEXT_JOB OR _at_PREV_STEP != @NEXT_STEP OR _at_PREV_MACHINE != @NEXT_MACHINE) AND _at_PREV_TX_TYPE = 'ON'
INSERT _at_P VALUES (@PREV_EMPLOYEE, @PREV_TX_TIME, @CURRENT_TIME, @PREV_JOB, _at_PREV_STEP, @PREV_MACHINE)
ELSE IF _at_PREV_EMPLOYEE = @NEXT_EMPLOYEE
AND _at_PREV_JOB = @NEXT_JOB AND _at_PREV_STEP = @NEXT_STEP AND _at_PREV_MACHINE = @NEXT_MACHINE AND _at_PREV_TX_TYPE = 'ON' AND _at_NEXT_TX_TYPE = 'OFF'
INSERT _at_P VALUES (@PREV_EMPLOYEE, @PREV_TX_TIME, @NEXT_TX_TIME, @PREV_JOB,
_at_PREV_STEP, @PREV_MACHINE) SELECT _at_PREV_EMPLOYEE = @NEXT_EMPLOYEE, _at_PREV_JOB = @NEXT_JOB, _at_PREV_STEP = @NEXT_STEP, _at_PREV_MACHINE = @NEXT_MACHINE, _at_PREV_TX_TIME = @NEXT_TX_TIME, _at_PREV_TX_TYPE = @NEXT_TX_TYPE
END
CLOSE _at_X
DEALLOCATE _at_X
/*
Finally join the two table variables and aggregate to produce the result.
It should be noted that the other half of the Table Spool/Eager Spool
optimizer steps referred to above would feed a similar join in the set-based
solution.
It should also be noted that the table variables are already in the correct
order
for the join, due to the primary key constraint on each. I think the
set-based
solution might require an extra sort, though as is often the case, I may be
wrong.
*/
SELECT JOB#, STEP#, SUM(CAST(DATEDIFF(ss, a.START_TIME, a.STOP_TIME) AS
FLOAT)/(3600.0 * CAST(NUMJOBS AS FLOAT))) AS LABOR
FROM _at_I a JOIN @P b
ON (b.EMPLOYEE# = a.EMPLOYEE#
AND b.START_TIME < a.STOP_TIME
AND b.STOP_TIME > a.START_TIME)
GROUP BY JOB#, STEP#
--Here's the further annotated set-based solution:
/*
This view does with a correlated subquery combined with aggregation
exactly what the cursor that produces _at_P does in the set-based solution.
I acknowledge that there may be another, better way to do this that doesn't
generate a Loop Join. I used this method because I feel that it is easier
to read and understand.
*/
CREATE VIEW PAIRED_TX AS
SELECT a.EMPLOYEE#, a.JOB#, a.STEP#, a.MACHINE#, TX_TIME AS TIME_ON, COALESCE( (SELECT MIN(TX_TIME) FROM TX WHERE EMPLOYEE# = a.EMPLOYEE# AND JOB# = a.JOB# AND STEP# = a.STEP# AND MACHINE# = a.MACHINE# AND TX_TIME > a.TX_TIME AND TX_TYPE = 'OFF'), CURRENT_TIMESTAMP) AS TIME_OFF
FROM TX a
WHERE TX_TYPE = 'ON' /*
This view simply returns a set of intervals between successive *sets* of rows.
I highlight *sets* here because it is possible for there to be multiple rows for
an employee with the same TX_TIME. In other words, an employee can clock onto multiple jobs at the same instant.
Note: this was acutally a requirement in the original solution: if the same
employee clocked onto or off of more than one job during the same
3 minute period, then the clock time of the first job applied to them all.
*/
CREATE VIEW INTERVALS AS
SELECT a.EMPLOYEE#, a.START_TIME, a.STOP_TIME
FROM
(SELECT DISTINCT a.EMPLOYEE#, a.TX_TIME AS START_TIME, (SELECT MIN(TX_TIME) FROM TX WHERE EMPLOYEE# = a.EMPLOYEE# AND TX_TIME > a.TX_TIME) AS STOP_TIME FROM TX a UNION SELECT EMPLOYEE#, MAX(TX_TIME) AS START_TIME, CURRENT_TIMESTAMP AS STOP_TIME FROM TX GROUP BY EMPLOYEE# HAVING SUM(CASE WHEN TX_TYPE = 'ON' THEN 1 ELSE -1 END) > 0) aWHERE STOP_TIME IS NOT NULL SELECT JOB#,
STEP#, SUM(CAST(DATEDIFF(ss, d.START_TIME, d.STOP_TIME) AS FLOAT) / PAIRS) /3600.0 AS LABOR
FROM
/*
The following self-join calculates the number of jobs an employee is on during any interval.
This section of the query is replaced by the cursor that produces table variable _at_I in the procedural solution. Note that procedural solution doesn't need an additional scan to deal with unpaired jobs
(jobs that an employee is currently working on), nor an additional two scans to determine the intervals
during which an employee is working on a set of jobs, nor an additional two scans to determine
'ON' - 'OFF' pairings for each job that an employee was or is working on. */
(SELECT b.EMPLOYEE#, START_TIME, STOP_TIME, CAST(COUNT(*) AS FLOAT) AS PAIRS
FROM PAIRED_TX a JOIN INTERVALS b ON (b.EMPLOYEE# = a.EMPLOYEE# AND b.START_TIME >= a.TIME_ON AND b.STOP_TIME <= a.TIME_OFF) GROUP BY b.EMPLOYEE#, b.START_TIME, b.STOP_TIME) cJOIN
/*
The following self-join splits the jobs up so that there is one row per employee, job, step, machine, and interval. This is replaced by the cursor that produces _at_P in the procedural solution. Some of the remarks from above apply here as well. Part of the difficulty, and the reason this is necessary
is due to the fact that an employee could be still working on some jobs at the current time.
*/
(SELECT a.EMPLOYEE#, a.JOB#, a.STEP#, a.MACHINE#, START_TIME, STOP_TIME
FROM PAIRED_TX a JOIN INTERVALS b ON (b.EMPLOYEE# = a.EMPLOYEE# AND b.START_TIME >= a.TIME_ON AND b.STOP_TIME <= a.TIME_OFF)) d ON (d.EMPLOYEE# = c.EMPLOYEE# AND d.START_TIME = c.START_TIME AND d.STOP_TIME = c.STOP_TIME)
/*
Finally, the result is joined and aggregated to produce the result.
Here's where the results of the above subqueries would need to be sorted,
or hashes computed (but not so likely, due to the theta joins involved--
it all depends on what the optimizer produces) in order to complete
the join. Then the result of the join would need to be sorted or hashed
again in order to compute the aggregate. Of course, the final sort/hash
step would be necessary in either solution.
*/
GROUP BY JOB#, STEP#
Received on Wed May 09 2007 - 18:15:01 CEST