Re: Self Joins and optimization
Date: Sun, 06 May 2007 06:30:01 GMT
Message-ID: <JBe%h.16839$YL5.304_at_newssvr29.news.prodigy.net>
"David Cressey" <cressey73_at_verizon.net> wrote in message
news:Vg5%h.166$83.81_at_trndny08...
>
> "Brian Selzer" <brian_at_selzer-software.com> wrote in message
> news:MQ2%h.4456$H_.2052_at_newssvr21.news.prodigy.net...
>>
>> "David Cressey" <cressey73_at_verizon.net> wrote in message
>> news:ZE%_h.163$D52.68_at_trndny04...
>> > In the "interpolation" thread, Brian has been expounding on the idea
> that
>> > he can discover algorithms that he knows to be correct, and that
>> > outperform
>> > anything an optimizer can generate. He's mentioned "self joins" and
>> > the
>> > idea that he can get a job doen with only one read, where the
>> > automatic
>> > solution generated in response to declarative code generates multiple
>> > passes
>> > through the data.
>> >
>> > My experience leads me to the opposite conclusion from that.
>> >
>> > Here's the questions I'd like the newsgroup to consider:
>> >
>> > What, if any, algorithms that are demonstrably correct are going to be
>> > overlooked by the optimizer, although discoverable by a human coder?
> Why
>> > would an optimizer fail to discover the possibility of eliminating
>> > self-joins? Are there any undecidability issues here?
>> >
>> Here's a maybe not-so-simple example:
>>
>> Given a table with the columns,
>>
>> EMPLOYEE#, JOB#, STEP#, MACHINE#, TX_TIME, TX_TYPE
>>
>> where TX_TYPE can be either 'ON' or 'OFF' and the key is the entire
> heading,
>>
>
> I hope I can ask a bunch of questions without derailing your example. I
> need to understand the example before I can come up with either a counter
> argument, or agreement.
>
> First, a matter of simple syntax: What does "#" mean? is it part of the
> attribute name, or does it indicate something about the attribute? Is
> this
> SQL or some other language?
>
It's just part of the attribute name.
>
>> compute the total labor performed to date for each job step.
>>
>
> What is a job step? How is it identified?
>
A job step is one of the steps required to complete a job, or work order.
The distinction between a job and a work order is not material to the
discussion. A job step is identified by the conjunction of a value for each
of the attributes JOB# and STEP#.
>
>> Notes: (1) an employee cannot be perfoming the same job step on the same
>> machine at the same time, but (2) intervals can overlap, and when they
>> do,
>> labor should be distributed between them evenly. For example, an
>> employee
>> may be monitoring two or more machines during the same period of time, so
>> his labor during that period should be divided evenly between the job
> steps
>> on each machine, but that same employee cannot be monitoring the same job
>> step on the same machine more than once during the same period of time.
>>
>
> How do you guarantee that the data will not violate the constraints
> declared
> in the above notes? Are there declared constraints to prevent the
> insertion
> of cotradictory data, or is it just "careful programming" on the part of
> those who write the SW that writes the data to the DB?
>
I would probably use a trigger. It would be nice to be able to declare a
temporal constraint, but I expect it would have to be applied to a view
because of the nature of the data in the table. There is one other
constraint that I didn't mention: for every row with 'OFF,' there must exist
a row with 'ON' such that EMPLOYEE#, JOB#, STEP#, MACHINE# are the same and
that TX_TIME in the row containing 'ON' < TX_TIME in the row containing
'OFF'.
> If the constraints were declared, wouldn't this information be available
> to
> the optimizer?
>
I would assume so, I can't think of how either of the constraints could
make a difference in this case.
>
>> A procedural approach would require two passes through the data: One to
>> compute labor per job step per employee, and one to compute the
>> aggregate,
>> labor per job step.
>>
>> A set-based approach would require a minimum of six passes through the
> data:
>> Determining the intervals during which each employee was performing a set
> of
>> job steps requires a left outer self-join, or two passes through the
>> data.
>> Computing the cardinality of each of those sets so that labor can be
>> distributed evenly requires an aggregation and then a join--three
> additional
>> passes. Computing the aggregate, labor per job step, requires a final
> pass.
>>
>
> Could you provide SQL for a set based query that computes the correct
> result? This would help me to understand the example better.
>
In order to make things easier to follow, instead of a single query, I split
the query up by using a couple views.
The table definition:
CREATE TABLE TX
A little bit of test data:
This view pairs up the rows in TX. For pairs without an 'OFF' row,
substitute the current time for TIME_OFF:
CREATE VIEW PAIRED_TX AS
CREATE VIEW INTERVALS AS
FROM
(
EMPLOYEE# INT NOT NULL,
JOB# INT NOT NULL,
STEP# INT NOT NULL,
MACHINE# INT NOT NULL,
TX_TYPE CHAR(3) NOT NULL
CHECK (TX_TYPE = 'ON' OR TX_TYPE = 'OFF'),
TX_TIME DATETIME NOT NULL,
PRIMARY KEY (EMPLOYEE#, JOB#, STEP#, MACHINE#, TX_TYPE, TX_TIME)
)
INSERT TX VALUES (1,1,1,1,'ON', '2007-05-03 08:00:00')
INSERT TX VALUES (1,2,1,1,'ON', '2007-05-03 08:00:00')
INSERT TX VALUES (1,1,1,1,'OFF', '2007-05-03 09:00:00')
INSERT TX VALUES (1,2,1,1,'OFF', '2007-05-03 09:30:00')
INSERT TX VALUES (1,1,1,1,'ON', '2007-05-03 10:00:00')
INSERT TX VALUES (1,1,1,1,'OFF', '2007-05-03 11:00:00')
INSERT TX VALUES (1,1,2,1,'ON', '2007-05-03 11:30:00')
INSERT TX VALUES (1,1,2,1,'OFF', '2007-05-03 12:00:00')
INSERT TX VALUES (1,2,2,1,'ON', '2007-05-03 13:00:00')
INSERT TX VALUES (1,2,2,1,'OFF', '2007-05-03 14:00:00')
SELECT a.EMPLOYEE#,
a.JOB#,
a.STEP#,
a.MACHINE#,
TX_TIME AS TIME_ON,
COALESCE(
(SELECT MIN(TX_TIME)
FROM TX
WHERE EMPLOYEE# = a.EMPLOYEE#
AND JOB# = a.JOB#
AND STEP# = a.STEP#
AND MACHINE# = a.MACHINE#
AND TX_TIME > a.TX_TIME
AND TX_TYPE = 'OFF'),
CURRENT_TIMESTAMP) AS TIME_OFF
FROM TX a
WHERE TX_TYPE = 'ON'
This view enumerates the intervals per employee. Note an additional row
with STOP_TIME as the current time. This is needed to deal with unpaired
rows in TX.
SELECT a.EMPLOYEE#, a.START_TIME, a.STOP_TIME
(SELECT DISTINCT a.EMPLOYEE#, a.TX_TIME AS START_TIME, (SELECT MIN(TX_TIME) FROM TX WHERE EMPLOYEE# = a.EMPLOYEE# AND TX_TIME > a.TX_TIME) AS STOP_TIME FROM TX a UNION SELECT EMPLOYEE#, MAX(TX_TIME) AS START_TIME, CURRENT_TIMESTAMP AS STOP_TIME FROM TX GROUP BY EMPLOYEE# HAVING SUM(CASE WHEN TX_TYPE = 'ON' THEN 1 ELSE -1 END) > 0) aWHERE STOP_TIME IS NOT NULL Here's the query: Sorry about the formatting. If you paste it into your favorite fixed font editor, it will be more readable. You may need a substitute for DATEDIFF for your favorite DBMS. This works on Sql Server.
SELECT JOB#,
STEP#, SUM(CAST(DATEDIFF(ss, d.START_TIME, d.STOP_TIME) AS FLOAT) / PAIRS) /3600.0 AS LABOR
FROM
--this part of the query calculates the number of paired rows that occupy each interval
(SELECT b.EMPLOYEE#, START_TIME, STOP_TIME, CAST(COUNT(*) AS FLOAT) AS PAIRS
FROM PAIRED_TX a JOIN INTERVALS b ON (b.EMPLOYEE# = a.EMPLOYEE# AND b.START_TIME >= a.TIME_ON AND b.STOP_TIME <= a.TIME_OFF) GROUP BY b.EMPLOYEE#, b.START_TIME, b.STOP_TIME) cJOIN
--this part of the query splits each pair into individual intervals.
(SELECT a.EMPLOYEE#, a.JOB#, a.STEP#, a.MACHINE#, START_TIME, STOP_TIME
FROM PAIRED_TX a JOIN INTERVALS b ON (b.EMPLOYEE# = a.EMPLOYEE# AND b.START_TIME >= a.TIME_ON AND b.STOP_TIME <= a.TIME_OFF)) d ON (d.EMPLOYEE# = c.EMPLOYEE# AND d.START_TIME = c.START_TIME AND d.STOP_TIME = c.STOP_TIME)GROUP BY JOB#, STEP# The results from the supplied test data: JOB# STEP# LABOR
-------- -------- ----------
1 1 1.5 2 1 1.0 1 2 0.5 2 2 1.0
>> Since the final pass is common to both approaches, the procedural
>> approach
>> effectively replaces five passes through the data with a single pass.
>>
>> Why would the optimizer fail? Because I don't think it's possible for
>> the
>> optimizer to detect every intrarelational dependency in a table, nor do I
>> think it would be advisable, unless it were possible to enumerate them in
>> the schema.
>>
> Why would you want to have dependencies in the data that were NOT
> enumerated
> in the schema (assuming I understand what you mean by "enumerated")?
> Deosn't the schema describe the data?
>
There may be an implicit order to the data, such as is the case with temporal data.
>
> I hope it's not too much trouble to clear these issues up first. Before
> we
> get to the meat of it.
>
>
>
Received on Sun May 06 2007 - 08:30:01 CEST