Re: Self Joins and optimization

From: Brian Selzer <brian_at_selzer-software.com>
Date: Wed, 09 May 2007 20:56:02 GMT
Message-ID: <Czq0i.5322$H_.2613_at_newssvr21.news.prodigy.net>


Please ignore this last post. I hadn't thought it through completely. At
least one of the tables would need to be sorted, because one of the
intermediate results is *not* in the correct order for the join. Also,
after further consideration, I've concluded that the additional overhead
of maintaining the indexes should on average not exceed two writes per
insert: the table that would need sorting is already partially sorted, so
the worst case should never come up. It therefore doesn't make sense to
introduce three additional scans--not to mention a sort--just to avoid
that overhead.

"Brian Selzer" <brian_at_selzer-software.com> wrote in message news:z4n0i.20887$JZ3.12091_at_newssvr13.news.prodigy.net...
> I just wanted to add another thought. The table variables in the
> procedural solution have a primary key constraint, which in SQL Server
> creates a clustered index. Adding rows to a clustered index incurs
> overhead for index maintenance--minimal, since the data is inserted in
> order, but overhead nonetheless. This overhead can be eliminated
> entirely for very large n by omitting the primary key constraints and
> using additional cursors with the index hint INDEX(0) against the
> table variables. The data is inserted into each table variable in the
> correct order for joining, and INDEX(0) reads the data back out in the
> same order in which it was inserted. The additional cursors would add
> (I think) at most 5 additional scans through the data, but for very
> large n it might be beneficial, since this step can then be
> accomplished in linear time. I didn't include that solution here, for
> a number of reasons: brevity; clarity; the undocumented and
> implementation-specific nature of this use of the INDEX(0) hint; the
> fact that I'm not absolutely certain whether the overhead incurred
> will ever require more than two writes per insert for sorted input;
> and mainly my belief that the additional overhead is offset by a
> reduction in the number of sorts required to produce the result.
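>
> For what it's worth, here is roughly the shape of what I have in mind
> (a sketch only: the heap @H and its contents are hypothetical, and
> whether the hint is honored against a table variable is precisely the
> undocumented, implementation-specific behavior mentioned above):
>
> -- With no PRIMARY KEY, @H is a heap; rows are simply appended.
> DECLARE @H TABLE (EMPLOYEE# INT NOT NULL, START_TIME DATETIME NOT NULL)
> INSERT @H VALUES (1, '20070509 08:00')
> INSERT @H VALUES (1, '20070509 09:30')
>
> DECLARE @Y CURSOR
> SET @Y = CURSOR LOCAL FORWARD_ONLY FOR
>     -- INDEX(0) forces an unordered scan of the heap, which in
>     -- practice returns the rows in the order they were inserted.
>     SELECT EMPLOYEE#, START_TIME
>     FROM @H WITH (INDEX(0))
> OPEN @Y
> -- ... FETCH in a loop, exactly as in the cursors below ...
> CLOSE @Y
> DEALLOCATE @Y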
>
> "Brian Selzer" <brian_at_selzer-software.com> wrote in message
> news:9sm0i.8051$rO7.15_at_newssvr25.news.prodigy.net...
>>
>> "David Cressey" <cressey73_at_verizon.net> wrote in message
>> news:uBj0i.14554$rm.10970_at_trndny03...
>> [snip]
>>>
>>> Brian,
>>>
>>> I'm not ignoring your example. I'm trying to digest it, a little bit
>>> at a time. As you said, it's not quite so simple. I have no response
>>> at this time, other than vague philosophical generalities. I think it
>>> would be insulting to offer such generalities in response to your
>>> input of code and data.
>>>
>>> So I'm hoping, before too long, to have something more specific to say.
>>>
>>> In the meantime, your example provides me much food for thought about
>>> temporal data, transition versus state, and simultaneity.
>>>
>>
>> Temporal data can really make you tear out what little hair you have
>> left. I bought Date, Darwen and Lorentzos' /Temporal Data and the
>> Relational Model/ recently, and found that there really isn't much in
>> it that I hadn't already figured out on my own. I also think some of
>> their arguments are lacking. Their arguments against using NOW for
>> present data are totally unfounded, and I find their solution,
>> splitting current and historical data into separate relvars,
>> abhorrent. First of all, an interval of the form [NOW, d16] does not
>> make any sense in a closed world. Secondly, why couldn't there be
>> different kinds of intervals: definite, indefinite start, indefinite
>> end, and universal? (They certainly allow for different granularities
>> and precision.) In a closed world, an indefinite interval would have
>> to refer to either positive or negative infinity, and even given the
>> discreteness assumption, the number of possible timestamps is
>> countably infinite. The introduction of indefinite intervals would
>> eliminate the need for separate relvars, and would also allow
>> invariant information to be recorded in the same relvar. The presence
>> of the interval (-infinity, +infinity) would clearly indicate that the
>> information represented by the tuple applies for all time, and as
>> such, must never change. This could even be enforced by the engine. So
>> their solution is the result of a faulty argument and a poorly
>> designed interval data type.
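>>
>> To make that concrete, here is a minimal sketch of how a single relvar
>> could carry all four kinds of interval (the table, its columns, and
>> the sentinel values are my own hypothetical illustration, not anything
>> from the book):
>>
>> -- Sentinel endpoints stand in for -infinity and +infinity, so
>> -- definite, indefinite-start, indefinite-end, and universal intervals
>> -- all fit in one relvar.
>> CREATE TABLE ASSIGNMENT
>> (
>>     EMPLOYEE#  INT NOT NULL,
>>     JOB#       INT NOT NULL,
>>     START_TIME DATETIME NOT NULL DEFAULT '17530101', -- indefinite start
>>     STOP_TIME  DATETIME NOT NULL DEFAULT '99991231', -- indefinite end
>>     PRIMARY KEY (EMPLOYEE#, JOB#, START_TIME)
>> )
>>
>> A row whose interval is ('17530101', '99991231') plays the role of
>> (-infinity, +infinity): the fact it records holds for all time, and a
>> trigger (or, ideally, the engine itself) could reject any attempt to
>> change it.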
>>
>> Sorry for diverging from the topic at hand. Below is a procedural
>> solution. It took a bit of extra time to write, because there are so
>> many conditions to handle. Also, in order to make the comparison with
>> the set-based solution easier to follow, I'm reincluding that solution
>> here, further annotated.
>>
>> /*
>> Cache CURRENT_TIMESTAMP so that every step compares apples with apples.
>> */
>> DECLARE @CURRENT_TIME DATETIME SET @CURRENT_TIME = CURRENT_TIMESTAMP
>> DECLARE @NUMJOBS INT
>> DECLARE @PREV_EMPLOYEE INT,
>>         @NEXT_EMPLOYEE INT,
>>         @PREV_TX_TIME DATETIME,
>>         @NEXT_TX_TIME DATETIME,
>>         @PREV_TX_TYPE CHAR(3),
>>         @NEXT_TX_TYPE CHAR(3),
>>         @PREV_JOB INT,
>>         @NEXT_JOB INT,
>>         @PREV_STEP INT,
>>         @NEXT_STEP INT,
>>         @PREV_MACHINE INT,
>>         @NEXT_MACHINE INT
>> DECLARE @X CURSOR
>>
>> /*
>> First build a set of intervals and calculate the number of jobs an
>> employee is working on during each.
>> Use the cached @CURRENT_TIME for the final interval if the employee is
>> currently working.
>> This takes one pass through the data in order to produce the table
>> variable @I.
>> Caching the result into a table variable and then reading it out in
>> the final join is similar to a Table Spool/Eager Spool optimizer step,
>> which in my judgement would be required in a fully-optimized set-based
>> solution as well.
>>
>> EMPLOYEE#, TX_TIME, TX_TYPE
>>           |
>>           V
>> EMPLOYEE#, START_TIME, STOP_TIME, NUMJOBS
>> */
>> DECLARE @I TABLE
>> (
>>     EMPLOYEE#  INT NOT NULL,
>>     START_TIME DATETIME NOT NULL,
>>     STOP_TIME  DATETIME NOT NULL,
>>     NUMJOBS    INT NOT NULL,
>>     PRIMARY KEY (EMPLOYEE#, START_TIME, STOP_TIME)
>> )
>>
>> SET @X = CURSOR LOCAL STATIC FORWARD_ONLY FOR
>>     SELECT EMPLOYEE#, TX_TIME, TX_TYPE
>>     FROM TX
>>     ORDER BY EMPLOYEE#, TX_TIME, TX_TYPE DESC
>>
>> OPEN @X
>> FETCH @X INTO @PREV_EMPLOYEE, @PREV_TX_TIME, @PREV_TX_TYPE
>> IF @@FETCH_STATUS = 0
>> BEGIN
>>     IF @PREV_TX_TYPE = 'ON'
>>     BEGIN
>>         SET @NUMJOBS = 1
>>         FETCH @X INTO @NEXT_EMPLOYEE, @NEXT_TX_TIME, @NEXT_TX_TYPE
>>         WHILE @@FETCH_STATUS = 0
>>         BEGIN
>>             IF @PREV_EMPLOYEE = @NEXT_EMPLOYEE
>>             BEGIN
>>                 IF @PREV_TX_TIME < @NEXT_TX_TIME
>>                     IF @NUMJOBS > 0
>>                         INSERT @I VALUES (@PREV_EMPLOYEE,
>>                             @PREV_TX_TIME, @NEXT_TX_TIME, @NUMJOBS)
>>                 SET @NUMJOBS = @NUMJOBS
>>                     + CASE WHEN @NEXT_TX_TYPE = 'ON' THEN 1 ELSE -1 END
>>             END ELSE BEGIN
>>                 IF @NUMJOBS > 0
>>                     INSERT @I VALUES (@PREV_EMPLOYEE,
>>                         @PREV_TX_TIME, @CURRENT_TIME, @NUMJOBS)
>>                 -- reset the count for the new employee, counting this
>>                 -- first row if it is an 'ON'
>>                 SET @NUMJOBS =
>>                     CASE WHEN @NEXT_TX_TYPE = 'ON' THEN 1 ELSE 0 END
>>             END
>>             SELECT @PREV_EMPLOYEE = @NEXT_EMPLOYEE,
>>                    @PREV_TX_TIME = @NEXT_TX_TIME,
>>                    @PREV_TX_TYPE = @NEXT_TX_TYPE
>>             FETCH @X INTO @NEXT_EMPLOYEE, @NEXT_TX_TIME, @NEXT_TX_TYPE
>>         END
>>         IF @NUMJOBS > 0
>>             INSERT @I VALUES (@PREV_EMPLOYEE, @PREV_TX_TIME,
>>                 @CURRENT_TIME, @NUMJOBS)
>>     END
>> END
>> CLOSE @X
>> DEALLOCATE @X
>>
>> /*
>> Now pair up the rows so that there is only one row per employee, job,
>> step, machine, and interval.
>> Use the cached @CURRENT_TIME to complete the intervals if necessary.
>> This takes one pass through the data to produce the table variable @P.
>> Again, in my judgement, the Table Spool/Eager Spool optimizer step
>> would exist in a fully-optimized set-based solution.
>>
>> EMPLOYEE#, JOB#, STEP#, MACHINE#, TX_TIME, TX_TYPE
>>           |
>>           V
>> EMPLOYEE#, START_TIME, STOP_TIME, JOB#, STEP#, MACHINE#
>> */
>> DECLARE @P TABLE
>> (
>>     EMPLOYEE#  INT NOT NULL,
>>     START_TIME DATETIME NOT NULL,
>>     STOP_TIME  DATETIME NOT NULL,
>>     JOB#       INT NOT NULL,
>>     STEP#      INT NOT NULL,
>>     MACHINE#   INT NOT NULL,
>>     PRIMARY KEY (EMPLOYEE#, START_TIME, STOP_TIME, JOB#, STEP#, MACHINE#)
>> )
>>
>> SET @X = CURSOR LOCAL FORWARD_ONLY FOR
>>     SELECT EMPLOYEE#, JOB#, STEP#, MACHINE#, TX_TIME, TX_TYPE
>>     FROM TX
>>     ORDER BY EMPLOYEE#, JOB#, STEP#, MACHINE#, TX_TIME, TX_TYPE DESC
>>
>> OPEN @X
>> FETCH @X INTO @PREV_EMPLOYEE, @PREV_JOB, @PREV_STEP, @PREV_MACHINE,
>>     @PREV_TX_TIME, @PREV_TX_TYPE
>>
>> WHILE @@FETCH_STATUS = 0
>> BEGIN
>>     FETCH @X INTO @NEXT_EMPLOYEE, @NEXT_JOB, @NEXT_STEP, @NEXT_MACHINE,
>>         @NEXT_TX_TIME, @NEXT_TX_TYPE
>>     IF (@@FETCH_STATUS != 0
>>             OR @PREV_EMPLOYEE != @NEXT_EMPLOYEE
>>             OR @PREV_JOB != @NEXT_JOB
>>             OR @PREV_STEP != @NEXT_STEP
>>             OR @PREV_MACHINE != @NEXT_MACHINE)
>>         AND @PREV_TX_TYPE = 'ON'
>>         INSERT @P VALUES (@PREV_EMPLOYEE, @PREV_TX_TIME, @CURRENT_TIME,
>>             @PREV_JOB, @PREV_STEP, @PREV_MACHINE)
>>     ELSE IF @PREV_EMPLOYEE = @NEXT_EMPLOYEE
>>         AND @PREV_JOB = @NEXT_JOB
>>         AND @PREV_STEP = @NEXT_STEP
>>         AND @PREV_MACHINE = @NEXT_MACHINE
>>         AND @PREV_TX_TYPE = 'ON'
>>         AND @NEXT_TX_TYPE = 'OFF'
>>         INSERT @P VALUES (@PREV_EMPLOYEE, @PREV_TX_TIME, @NEXT_TX_TIME,
>>             @PREV_JOB, @PREV_STEP, @PREV_MACHINE)
>>     SELECT @PREV_EMPLOYEE = @NEXT_EMPLOYEE,
>>            @PREV_JOB = @NEXT_JOB,
>>            @PREV_STEP = @NEXT_STEP,
>>            @PREV_MACHINE = @NEXT_MACHINE,
>>            @PREV_TX_TIME = @NEXT_TX_TIME,
>>            @PREV_TX_TYPE = @NEXT_TX_TYPE
>> END
>> CLOSE @X
>> DEALLOCATE @X
>>
>> /*
>> Finally, join the two table variables and aggregate to produce the
>> result. Note that the other half of the Table Spool/Eager Spool
>> optimizer steps referred to above would feed a similar join in the
>> set-based solution. Note also that the table variables are already in
>> the correct order for the join, thanks to the primary key constraint
>> on each. I think the set-based solution might require an extra sort,
>> though as is often the case, I may be wrong.
>> */
>> SELECT JOB#, STEP#,
>>        SUM(CAST(DATEDIFF(ss, a.START_TIME, a.STOP_TIME) AS FLOAT)
>>            / (3600.0 * CAST(NUMJOBS AS FLOAT))) AS LABOR
>> FROM @I a JOIN @P b
>>     ON (b.EMPLOYEE# = a.EMPLOYEE#
>>         AND b.START_TIME < a.STOP_TIME
>>         AND b.STOP_TIME > a.START_TIME)
>> GROUP BY JOB#, STEP#
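>>
>> For concreteness, here is a minimal TX instance to test against (the
>> schema is inferred from the queries in this thread; the sample rows
>> and the expected figures are hypothetical):
>>
>> CREATE TABLE TX
>> (
>>     EMPLOYEE# INT NOT NULL,
>>     JOB#      INT NOT NULL,
>>     STEP#     INT NOT NULL,
>>     MACHINE#  INT NOT NULL,
>>     TX_TIME   DATETIME NOT NULL,
>>     TX_TYPE   CHAR(3) NOT NULL CHECK (TX_TYPE IN ('ON', 'OFF'))
>> )
>>
>> -- Employee 1 works two jobs at once from 08:00 to 10:00, then job 200
>> -- alone from 10:00 to 12:00.
>> INSERT TX VALUES (1, 100, 10, 5, '20070509 08:00', 'ON')
>> INSERT TX VALUES (1, 200, 20, 6, '20070509 08:00', 'ON')
>> INSERT TX VALUES (1, 100, 10, 5, '20070509 10:00', 'OFF')
>> INSERT TX VALUES (1, 200, 20, 6, '20070509 12:00', 'OFF')
>>
>> The two shared hours are split between the two jobs, so the result
>> should charge 1.0 hour of LABOR to job 100/step 10 and 3.0 hours to
>> job 200/step 20.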
>>
>> --Here's the further annotated set-based solution:
>>
>> /*
>> This view does, with a correlated subquery combined with aggregation,
>> exactly what the cursor that produces @P does in the procedural
>> solution.
>>
>> I acknowledge that there may be another, better way to do this that
>> doesn't generate a Loop Join. I used this method because I feel that
>> it is easier to read and understand.
>> */
>> CREATE VIEW PAIRED_TX AS
>> SELECT a.EMPLOYEE#,
>> a.JOB#,
>> a.STEP#,
>> a.MACHINE#,
>> TX_TIME AS TIME_ON,
>> COALESCE(
>> (SELECT MIN(TX_TIME)
>> FROM TX
>> WHERE EMPLOYEE# = a.EMPLOYEE#
>> AND JOB# = a.JOB#
>> AND STEP# = a.STEP#
>> AND MACHINE# = a.MACHINE#
>> AND TX_TIME > a.TX_TIME
>> AND TX_TYPE = 'OFF'),
>> CURRENT_TIMESTAMP) AS TIME_OFF
>> FROM TX a
>> WHERE TX_TYPE = 'ON'
>>
>> /*
>> This view simply returns a set of intervals between successive *sets*
>> of rows. I highlight *sets* here because it is possible for there to
>> be multiple rows for an employee with the same TX_TIME. In other
>> words, an employee can clock onto multiple jobs at the same instant.
>>
>> Note: this was actually a requirement in the original solution: if the
>> same employee clocked onto or off of more than one job during the same
>> 3-minute period, then the clock time of the first job applied to them
>> all.
>> */
>> CREATE VIEW INTERVALS AS
>> SELECT a.EMPLOYEE#, a.START_TIME, a.STOP_TIME
>> FROM
>> (SELECT DISTINCT a.EMPLOYEE#,
>> a.TX_TIME AS START_TIME,
>> (SELECT MIN(TX_TIME)
>> FROM TX
>> WHERE EMPLOYEE# = a.EMPLOYEE#
>> AND TX_TIME > a.TX_TIME) AS STOP_TIME
>> FROM TX a
>> UNION
>> SELECT EMPLOYEE#,
>> MAX(TX_TIME) AS START_TIME,
>> CURRENT_TIMESTAMP AS STOP_TIME
>> FROM TX
>> GROUP BY EMPLOYEE#
>> HAVING SUM(CASE WHEN TX_TYPE = 'ON' THEN 1 ELSE -1 END) > 0) a
>> WHERE STOP_TIME IS NOT NULL
>>
>>
>> SELECT JOB#,
>> STEP#,
>> SUM(CAST(DATEDIFF(ss, d.START_TIME, d.STOP_TIME) AS FLOAT) / PAIRS)
>> / 3600.0 AS LABOR
>> FROM
>> /*
>> The following self-join calculates the number of jobs an employee is
>> on during any interval. This section of the query is replaced by the
>> cursor that produces table variable @I in the procedural solution.
>> Note that the procedural solution doesn't need an additional scan to
>> deal with unpaired jobs (jobs that an employee is currently working
>> on), nor an additional two scans to determine the intervals during
>> which an employee is working on a set of jobs, nor an additional two
>> scans to determine 'ON'-'OFF' pairings for each job that an employee
>> was or is working on.
>> */
>> (SELECT b.EMPLOYEE#, START_TIME, STOP_TIME,
>>         CAST(COUNT(*) AS FLOAT) AS PAIRS
>> FROM PAIRED_TX a
>> JOIN INTERVALS b
>> ON (b.EMPLOYEE# = a.EMPLOYEE#
>> AND b.START_TIME >= a.TIME_ON
>> AND b.STOP_TIME <= a.TIME_OFF)
>> GROUP BY b.EMPLOYEE#, b.START_TIME, b.STOP_TIME) c
>> JOIN
>> /*
>> The following self-join splits the jobs up so that there is one row
>> per employee, job, step, machine, and interval. This is replaced by
>> the cursor that produces @P in the procedural solution. Some of the
>> remarks from above apply here as well. Part of the difficulty, and the
>> reason this step is necessary, is that an employee could still be
>> working on some jobs at the current time.
>> */
>> (SELECT a.EMPLOYEE#, a.JOB#, a.STEP#, a.MACHINE#,
>>         START_TIME, STOP_TIME
>> FROM PAIRED_TX a
>> JOIN INTERVALS b
>> ON (b.EMPLOYEE# = a.EMPLOYEE#
>> AND b.START_TIME >= a.TIME_ON
>> AND b.STOP_TIME <= a.TIME_OFF)) d
>> ON (d.EMPLOYEE# = c.EMPLOYEE#
>> AND d.START_TIME = c.START_TIME
>> AND d.STOP_TIME = c.STOP_TIME)
>>
>> /*
>> Finally, the two subquery results are joined and aggregated to produce
>> the final result. Here is where the results of the above subqueries
>> would need to be sorted, or hashes computed (less likely, due to the
>> theta joins involved--it all depends on what the optimizer produces),
>> in order to complete the join. Then the result of the join would need
>> to be sorted or hashed again in order to compute the aggregate. Of
>> course, the final sort/hash step would be necessary in either
>> solution.
>> */
>> GROUP BY JOB#, STEP#
>>
>>
>
>
Received on Wed May 09 2007 - 22:56:02 CEST
