Re: Self Joins and optimization

From: Brian Selzer <brian_at_selzer-software.com>
Date: Mon, 07 May 2007 03:27:48 GMT
Message-ID: <U0x%h.316$y_7.72_at_newssvr27.news.prodigy.net>


"Brian Selzer" <brian_at_selzer-software.com> wrote in message news:JBe%h.16839$YL5.304_at_newssvr29.news.prodigy.net...
>
> "David Cressey" <cressey73_at_verizon.net> wrote in message
> news:Vg5%h.166$83.81_at_trndny08...
>>
>> "Brian Selzer" <brian_at_selzer-software.com> wrote in message
>> news:MQ2%h.4456$H_.2052_at_newssvr21.news.prodigy.net...
>>>
>>> "David Cressey" <cressey73_at_verizon.net> wrote in message
>>> news:ZE%_h.163$D52.68_at_trndny04...
>>> > In the "interpolation" thread, Brian has been expounding on the idea
>>> > that
>>> > he can discover algorithms that he knows to be correct, and that
>>> > outperform
>>> > anything an optimizer can generate. He's mentioned "self joins" and
>>> > the
>>> > idea that he can get a job done with only one read, where the
>>> > automatic
>>> > solution generated in response to declarative code generates multiple
>>> > passes
>>> > through the data.
>>> >
>>> > My experience leads me to the opposite conclusion from that.
>>> >
>>> > Here's the questions I'd like the newsgroup to consider:
>>> >
>>> > What, if any, algorithms that are demonstrably correct are going to be
>>> > overlooked by the optimizer, although discoverable by a human coder?
>>> > Why
>>> > would an optimizer fail to discover the possibility of eliminating
>>> > self-joins? Are there any undecidability issues here?
>>> >
>>> Here's a maybe not-so-simple example:
>>>
>>> Given a table with the columns,
>>>
>>> EMPLOYEE#, JOB#, STEP#, MACHINE#, TX_TIME, TX_TYPE
>>>
>>> where TX_TYPE can be either 'ON' or 'OFF' and the key is the entire
>>> heading,
>>>
>>
>> I hope I can ask a bunch of questions without derailing your example. I
>> need to understand the example before I can come up with either a counter
>> argument, or agreement.
>>
>> First, a matter of simple syntax: What does "#" mean? Is it part of
>> the
>> attribute name, or does it indicate something about the attribute? Is
>> this
>> SQL or some other language?
>>
> It's just part of the attribute name.
>>
>>> compute the total labor performed to date for each job step.
>>>
>>
>> What is a job step? How is it identified?
>>
> A job step is one of the steps required to complete a job, or work order.
> The distinction between a job and a work order is not material to the
> discussion. A job step is identified by the conjunction of a value for
> each of the attributes JOB# and STEP#.
>>
>>> Notes: (1) an employee cannot be performing the same job step on the same
>>> machine at the same time, but (2) intervals can overlap, and when they
>>> do,
>>> labor should be distributed between them evenly. For example, an
>>> employee
>>> may be monitoring two or more machines during the same period of time,
>>> so
>>> his labor during that period should be divided evenly between the job
>>> steps
>>> on each machine, but that same employee cannot be monitoring the same
>>> job
>>> step on the same machine more than once during the same period of time.
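
For instance, to put a number on note (2): if an employee has job steps open on two machines from 8:00 to 9:00, that hour counts as 0.5 hours of labor against each of the two job steps; a job step he has open by itself is credited in full.
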
>>>
>>
>> How do you guarantee that the data will not violate the constraints
>> declared
>> in the above notes? Are there declared constraints to prevent the
>> insertion
>> of contradictory data, or is it just "careful programming" on the part
>> of
>> those who write the SW that writes the data to the DB?
>>
>
> I would probably use a trigger. It would be nice to be able to declare a
> temporal constraint, but I expect it would have to be applied to a view
> because of the nature of the data in the table. There is one other
> constraint that I didn't mention: for every row with 'OFF,' there must
> exist a row with 'ON' such that EMPLOYEE#, JOB#, STEP#, MACHINE# are the
> same and that TX_TIME in the row containing 'ON' < TX_TIME in the row
> containing 'OFF'.
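
Something along these lines would cover that last constraint on SQL Server (a rough, untested sketch; the trigger name and error text are just placeholders, and the interval-overlap rules from the notes above would need similar treatment):

CREATE TRIGGER TX_OFF_REQUIRES_ON ON TX
AFTER INSERT, UPDATE
AS
-- reject any 'OFF' row that is not preceded by a matching 'ON' row
IF EXISTS
   (SELECT *
      FROM inserted i
     WHERE i.TX_TYPE = 'OFF'
       AND NOT EXISTS
           (SELECT *
              FROM TX t
             WHERE t.EMPLOYEE# = i.EMPLOYEE#
               AND t.JOB# = i.JOB#
               AND t.STEP# = i.STEP#
               AND t.MACHINE# = i.MACHINE#
               AND t.TX_TYPE = 'ON'
               AND t.TX_TIME < i.TX_TIME))
BEGIN
    RAISERROR('OFF transaction without a preceding ON', 16, 1)
    ROLLBACK TRANSACTION
END
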
>
>> If the constraints were declared, wouldn't this information be available
>> to
>> the optimizer?
>>
>
> I would assume so, but I can't think of how either of the constraints
> could make a difference in this case.
>
>>
>>> A procedural approach would require two passes through the data: One to
>>> compute labor per job step per employee, and one to compute the
>>> aggregate,
>>> labor per job step.

I overlooked something. An employee can start or stop two or more job steps simultaneously, so performing the query requires more than a one-row lookahead to calculate the intervals. Computing labor per job step per employee *can* still be done in two passes, making the total three. To get more than a single row of lookahead, it's necessary either to process two firehose cursors simultaneously within a single fetch loop, or to use a scroll cursor, which can end up costing more than the equivalent of two scans.
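
Here is roughly what I mean, as an untested T-SQL sketch of just the interval piece (the cursor and variable names are mine, and attributing the split labor back to the individual job steps is left out -- that's the part that takes the extra pass). One firehose cursor supplies the distinct event times, which is the lookahead; the other supplies the individual 'ON'/'OFF' rows; both are drained inside one fetch loop:

-- Sketch only: report, per employee, each interval between consecutive
-- distinct event times together with the number of job steps open during
-- it.  Because several rows can share the same TX_TIME, every row at the
-- current boundary has to be consumed before the open count for the next
-- interval is known.
DECLARE @b_emp INT, @b_time DATETIME, @b_status INT                   -- boundary cursor
DECLARE @e_emp INT, @e_time DATETIME, @e_type CHAR(3), @e_status INT  -- event cursor
DECLARE @prev_emp INT, @prev_time DATETIME, @open INT

DECLARE boundaries CURSOR FAST_FORWARD FOR
    SELECT DISTINCT EMPLOYEE#, TX_TIME FROM TX ORDER BY EMPLOYEE#, TX_TIME
DECLARE events CURSOR FAST_FORWARD FOR
    SELECT EMPLOYEE#, TX_TIME, TX_TYPE FROM TX ORDER BY EMPLOYEE#, TX_TIME

OPEN boundaries
OPEN events
FETCH NEXT FROM events INTO @e_emp, @e_time, @e_type
SET @e_status = @@FETCH_STATUS
FETCH NEXT FROM boundaries INTO @b_emp, @b_time
SET @b_status = @@FETCH_STATUS

WHILE @b_status = 0
BEGIN
    -- close out the interval that ends at this boundary
    IF @prev_emp = @b_emp AND @open > 0
        PRINT CAST(@b_emp AS VARCHAR(10)) + ': '
            + CONVERT(VARCHAR(19), @prev_time, 120) + ' -> '
            + CONVERT(VARCHAR(19), @b_time, 120)
            + '  open job steps = ' + CAST(@open AS VARCHAR(10))
    IF @prev_emp IS NULL OR @prev_emp <> @b_emp
        SET @open = 0

    -- drain every ON/OFF row that falls on this boundary; a single-row
    -- lookahead isn't enough because there may be several of them
    WHILE @e_status = 0 AND @e_emp = @b_emp AND @e_time = @b_time
    BEGIN
        SET @open = @open + CASE WHEN @e_type = 'ON' THEN 1 ELSE -1 END
        FETCH NEXT FROM events INTO @e_emp, @e_time, @e_type
        SET @e_status = @@FETCH_STATUS
    END

    SELECT @prev_emp = @b_emp, @prev_time = @b_time
    FETCH NEXT FROM boundaries INTO @b_emp, @b_time
    SET @b_status = @@FETCH_STATUS
END
-- an employee still "on" at the end would need one more interval ending at
-- CURRENT_TIMESTAMP; omitted here for brevity

CLOSE boundaries
CLOSE events
DEALLOCATE boundaries
DEALLOCATE events

Against the test data below, this should report the 08:00-09:00 slice with two open job steps and every other occupied slice with one.
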

I also grossly underestimated the number of passes needed for a set-based solution. The execution plan generated for the supplied query includes ten steps that either scan or seek the table directly, seven joins, and three sorts. There's also a Table Spool/Lazy Spool step that stores intermediate data in a temp table.
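
For anyone who wants to look for themselves, SQL Server will dump the estimated plan as text if you wrap the query like this (or just turn on the graphical plan in Query Analyzer/Management Studio):

SET SHOWPLAN_ALL ON
GO
-- paste the query from below here
GO
SET SHOWPLAN_ALL OFF
GO
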

>>>
>>> A set-based approach would require a minimum of six passes through the
>>> data:
>>> Determining the intervals during which each employee was performing a
>>> set
>>> of
>>> job steps requires a left outer self-join, or two passes through the
>>> data.
>>> Computing the cardinality of each of those sets so that labor can be
>>> distributed evenly requires an aggregation and then a join--three
>>> additional
>>> passes. Computing the aggregate, labor per job step, requires a final
>>> pass.
>>>
>>
>> Could you provide SQL for a set based query that computes the correct
>> result? This would help me to understand the example better.
>>
>
> In order to make things easier to follow, instead of a single query, I
> split the query up by using a couple of views.
>
> The table definition:
>
> CREATE TABLE TX
> (
> EMPLOYEE# INT NOT NULL,
> JOB# INT NOT NULL,
> STEP# INT NOT NULL,
> MACHINE# INT NOT NULL,
> TX_TYPE CHAR(3) NOT NULL
> CHECK (TX_TYPE = 'ON' OR TX_TYPE = 'OFF'),
> TX_TIME DATETIME NOT NULL,
> PRIMARY KEY (EMPLOYEE#, JOB#, STEP#, MACHINE#, TX_TYPE, TX_TIME)
> )
>
> A little bit of test data:
>
> INSERT TX VALUES (1,1,1,1,'ON', '2007-05-03 08:00:00')
> INSERT TX VALUES (1,2,1,1,'ON', '2007-05-03 08:00:00')
> INSERT TX VALUES (1,1,1,1,'OFF', '2007-05-03 09:00:00')
> INSERT TX VALUES (1,2,1,1,'OFF', '2007-05-03 09:30:00')
> INSERT TX VALUES (1,1,1,1,'ON', '2007-05-03 10:00:00')
> INSERT TX VALUES (1,1,1,1,'OFF', '2007-05-03 11:00:00')
> INSERT TX VALUES (1,1,2,1,'ON', '2007-05-03 11:30:00')
> INSERT TX VALUES (1,1,2,1,'OFF', '2007-05-03 12:00:00')
> INSERT TX VALUES (1,2,2,1,'ON', '2007-05-03 13:00:00')
> INSERT TX VALUES (1,2,2,1,'OFF', '2007-05-03 14:00:00')
>
> This view pairs up the rows in TX. For 'ON' rows without a matching 'OFF'
> row, the current time is substituted for TIME_OFF:
>
> CREATE VIEW PAIRED_TX AS
> SELECT a.EMPLOYEE#,
>        a.JOB#,
>        a.STEP#,
>        a.MACHINE#,
>        a.TX_TIME AS TIME_ON,
>        COALESCE(
>          (SELECT MIN(TX_TIME)
>             FROM TX
>            WHERE EMPLOYEE# = a.EMPLOYEE#
>              AND JOB# = a.JOB#
>              AND STEP# = a.STEP#
>              AND MACHINE# = a.MACHINE#
>              AND TX_TIME > a.TX_TIME
>              AND TX_TYPE = 'OFF'),
>          CURRENT_TIMESTAMP) AS TIME_OFF
> FROM TX a
> WHERE TX_TYPE = 'ON'
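
If I've traced it correctly, for the test data above PAIRED_TX should come out to five rows, one per 'ON' row:

EMPLOYEE#  JOB#  STEP#  MACHINE#  TIME_ON              TIME_OFF
1          1     1      1         2007-05-03 08:00:00  2007-05-03 09:00:00
1          2     1      1         2007-05-03 08:00:00  2007-05-03 09:30:00
1          1     1      1         2007-05-03 10:00:00  2007-05-03 11:00:00
1          1     2      1         2007-05-03 11:30:00  2007-05-03 12:00:00
1          2     2      1         2007-05-03 13:00:00  2007-05-03 14:00:00
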
>
> This view enumerates the intervals per employee. Note the additional row
> (one per employee with an unpaired 'ON') whose STOP_TIME is the current
> time; it's needed to deal with unpaired rows in TX.
>
> CREATE VIEW INTERVALS AS
> SELECT a.EMPLOYEE#, a.START_TIME, a.STOP_TIME
> FROM
>   (SELECT DISTINCT a.EMPLOYEE#,
>           a.TX_TIME AS START_TIME,
>           (SELECT MIN(TX_TIME)
>              FROM TX
>             WHERE EMPLOYEE# = a.EMPLOYEE#
>               AND TX_TIME > a.TX_TIME) AS STOP_TIME
>      FROM TX a
>    UNION
>    SELECT EMPLOYEE#,
>           MAX(TX_TIME) AS START_TIME,
>           CURRENT_TIMESTAMP AS STOP_TIME
>      FROM TX
>     GROUP BY EMPLOYEE#
>    HAVING SUM(CASE WHEN TX_TYPE = 'ON' THEN 1 ELSE -1 END) > 0) a
> WHERE STOP_TIME IS NOT NULL
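
And INTERVALS should slice employee 1's day into eight consecutive pieces; the row pairing 14:00 with NULL is thrown out by the STOP_TIME IS NOT NULL test, and the UNION branch contributes nothing because every 'ON' in the test data has a matching 'OFF':

EMPLOYEE#  START_TIME           STOP_TIME
1          2007-05-03 08:00:00  2007-05-03 09:00:00
1          2007-05-03 09:00:00  2007-05-03 09:30:00
1          2007-05-03 09:30:00  2007-05-03 10:00:00
1          2007-05-03 10:00:00  2007-05-03 11:00:00
1          2007-05-03 11:00:00  2007-05-03 11:30:00
1          2007-05-03 11:30:00  2007-05-03 12:00:00
1          2007-05-03 12:00:00  2007-05-03 13:00:00
1          2007-05-03 13:00:00  2007-05-03 14:00:00
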
>
> Here's the query. Sorry about the formatting; if you paste it into your
> favorite fixed-font editor, it will be more readable. You may need a
> substitute for DATEDIFF in your favorite DBMS. This works on SQL Server.
>
> SELECT JOB#,
>        STEP#,
>        SUM(CAST(DATEDIFF(ss, d.START_TIME, d.STOP_TIME) AS FLOAT) / PAIRS)
>          / 3600.0 AS LABOR
> FROM
>   -- this part of the query calculates the number of paired rows
>   -- that occupy each interval
>   (SELECT b.EMPLOYEE#, START_TIME, STOP_TIME,
>           CAST(COUNT(*) AS FLOAT) AS PAIRS
>      FROM PAIRED_TX a
>      JOIN INTERVALS b
>        ON (b.EMPLOYEE# = a.EMPLOYEE#
>            AND b.START_TIME >= a.TIME_ON
>            AND b.STOP_TIME <= a.TIME_OFF)
>     GROUP BY b.EMPLOYEE#, b.START_TIME, b.STOP_TIME) c
> JOIN
>   -- this part of the query splits each pair into individual intervals
>   (SELECT a.EMPLOYEE#, a.JOB#, a.STEP#, a.MACHINE#,
>           START_TIME, STOP_TIME
>      FROM PAIRED_TX a
>      JOIN INTERVALS b
>        ON (b.EMPLOYEE# = a.EMPLOYEE#
>            AND b.START_TIME >= a.TIME_ON
>            AND b.STOP_TIME <= a.TIME_OFF)) d
>   ON (d.EMPLOYEE# = c.EMPLOYEE#
>       AND d.START_TIME = c.START_TIME
>       AND d.STOP_TIME = c.STOP_TIME)
> GROUP BY JOB#, STEP#
>
> The results from the supplied test data:
> JOB#     STEP#    LABOR
> -------- -------- ----------
> 1        1        1.5
> 2        1        1.0
> 1        2        0.5
> 2        2        1.0
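
Working those back by hand (LABOR is in hours): job 1 step 1 gets half of 08:00-09:00 plus all of 10:00-11:00 = 1.5; job 2 step 1 gets the other half of 08:00-09:00 plus all of 09:00-09:30 = 1.0; job 1 step 2 is just 11:30-12:00 = 0.5; and job 2 step 2 is 13:00-14:00 = 1.0. The gaps when nothing is on contribute nothing.
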
>
>>> Since the final pass is common to both approaches, the procedural
>>> approach
>>> effectively replaces five passes through the data with a single pass.
>>>
>>> Why would the optimizer fail? Because I don't think it's possible for
>>> the
>>> optimizer to detect every intrarelational dependency in a table, nor do
>>> I
>>> think it would be advisable, unless it were possible to enumerate them
>>> in
>>> the schema.
>>>
>> Why would you want to have dependencies in the data that were NOT
>> enumerated
>> in the schema (assuming I understand what you mean by "enumerated")?
>> Doesn't the schema describe the data?
>>
>
> There may be an implicit order to the data, such as is the case with
> temporal data.
>
>>
>> I hope it's not too much trouble to clear these issues up first, before
>> we get to the meat of it.
>>
>>
>>
>
>
Received on Mon May 07 2007 - 05:27:48 CEST
