Re: Self Joins and optimization

From: David Cressey <cressey73_at_verizon.net>
Date: Fri, 11 May 2007 21:07:58 GMT
Message-ID: <OW41i.24$w51.5_at_trndny09>


"Brian Selzer" <brian_at_selzer-software.com> wrote in message news:JBe%h.16839$YL5.304_at_newssvr29.news.prodigy.net...
>
> "David Cressey" <cressey73_at_verizon.net> wrote in message
> news:Vg5%h.166$83.81_at_trndny08...
> >
> > "Brian Selzer" <brian_at_selzer-software.com> wrote in message
> > news:MQ2%h.4456$H_.2052_at_newssvr21.news.prodigy.net...
> >>
> >> "David Cressey" <cressey73_at_verizon.net> wrote in message
> >> news:ZE%_h.163$D52.68_at_trndny04...
> >> > In the "interpolation" thread, Brian has been expounding on the idea that
> >> > he can discover algorithms that he knows to be correct, and that outperform
> >> > anything an optimizer can generate. He's mentioned "self joins" and the
> >> > idea that he can get a job done with only one read, where the automatic
> >> > solution generated in response to declarative code generates multiple
> >> > passes through the data.
> >> >
> >> > My experience leads me to the opposite conclusion from that.
> >> >
> >> > Here's the questions I'd like the newsgroup to consider:
> >> >
> >> > What, if any, algorithms that are demonstrably correct are going to be
> >> > overlooked by the optimizer, although discoverable by a human coder? Why
> >> > would an optimizer fail to discover the possibility of eliminating
> >> > self-joins? Are there any undecidability issues here?
> >> >
> >> Here's a maybe not-so-simple example:
> >>
> >> Given a table with the columns,
> >>
> >> EMPLOYEE#, JOB#, STEP#, MACHINE#, TX_TIME, TX_TYPE
> >>
> >> where TX_TYPE can be either 'ON' or 'OFF' and the key is the entire heading,
> >>
> >
> > I hope I can ask a bunch of questions without derailing your example. I
> > need to understand the example before I can come up with either a
> > counterargument or agreement.
> >
> > First, a matter of simple syntax: What does "#" mean? Is it part of the
> > attribute name, or does it indicate something about the attribute? Is
> > this
> > SQL or some other language?
> >
> It's just part of the attribute name.
> >
> >> compute the total labor performed to date for each job step.
> >>
> >
> > What is a job step? How is it identified?
> >
> A job step is one of the steps required to complete a job, or work order.
> The distinction between a job and a work order is not material to the
> discussion. A job step is identified by the conjunction of a value for each
> of the attributes JOB# and STEP#.
> >
> >> Notes: (1) an employee cannot be performing the same job step on the same
> >> machine at the same time, but (2) intervals can overlap, and when they
> >> do,
> >> labor should be distributed between them evenly. For example, an
> >> employee
> >> may be monitoring two or more machines during the same period of time, so
> >> his labor during that period should be divided evenly between the job steps
> >> on each machine, but that same employee cannot be monitoring the same job
> >> step on the same machine more than once during the same period of time.
> >>
> >
> > How do you guarantee that the data will not violate the constraints
> > declared in the above notes? Are there declared constraints to prevent the
> > insertion of contradictory data, or is it just "careful programming" on
> > the part of those who write the SW that writes the data to the DB?
> >
>
> I would probably use a trigger. It would be nice to be able to declare a
> temporal constraint, but I expect it would have to be applied to a view
> because of the nature of the data in the table. There is one other
> constraint that I didn't mention: for every row with 'OFF,' there must exist
> a row with 'ON' such that EMPLOYEE#, JOB#, STEP#, MACHINE# are the same and
> that TX_TIME in the row containing 'ON' < TX_TIME in the row containing
> 'OFF'.
>
> > If the constraints were declared, wouldn't this information be available
> > to the optimizer?
> >
>
> I would assume so, but I can't think of how either of the constraints
> could make a difference in this case.
>
> >
> >> A procedural approach would require two passes through the data: one to
> >> compute labor per job step per employee, and one to compute the
> >> aggregate, labor per job step.
> >>
> >> A set-based approach would require a minimum of six passes through the data:
> >> Determining the intervals during which each employee was performing a set of
> >> job steps requires a left outer self-join, or two passes through the data.
> >> Computing the cardinality of each of those sets so that labor can be
> >> distributed evenly requires an aggregation and then a join--three additional
> >> passes. Computing the aggregate, labor per job step, requires a final pass.
> >>
> >
> > Could you provide SQL for a set-based query that computes the correct
> > result? This would help me to understand the example better.
> >
>
> In order to make things easier to follow, instead of a single query, I split
> the query up using a couple of views.
>
> The table definition:
>
> CREATE TABLE TX
> (
> EMPLOYEE# INT NOT NULL,
> JOB# INT NOT NULL,
> STEP# INT NOT NULL,
> MACHINE# INT NOT NULL,
> TX_TYPE CHAR(3) NOT NULL
> CHECK (TX_TYPE = 'ON' OR TX_TYPE = 'OFF'),
> TX_TIME DATETIME NOT NULL,
> PRIMARY KEY (EMPLOYEE#, JOB#, STEP#, MACHINE#, TX_TYPE, TX_TIME)
> )
>
> A little bit of test data:
>
> INSERT TX VALUES (1,1,1,1,'ON', '2007-05-03 08:00:00')
> INSERT TX VALUES (1,2,1,1,'ON', '2007-05-03 08:00:00')
> INSERT TX VALUES (1,1,1,1,'OFF', '2007-05-03 09:00:00')
> INSERT TX VALUES (1,2,1,1,'OFF', '2007-05-03 09:30:00')
> INSERT TX VALUES (1,1,1,1,'ON', '2007-05-03 10:00:00')
> INSERT TX VALUES (1,1,1,1,'OFF', '2007-05-03 11:00:00')
> INSERT TX VALUES (1,1,2,1,'ON', '2007-05-03 11:30:00')
> INSERT TX VALUES (1,1,2,1,'OFF', '2007-05-03 12:00:00')
> INSERT TX VALUES (1,2,2,1,'ON', '2007-05-03 13:00:00')
> INSERT TX VALUES (1,2,2,1,'OFF', '2007-05-03 14:00:00')
>
> This view pairs up the rows in TX. For pairs without an 'OFF' row,
> substitute the current time for TIME_OFF:
>
> CREATE VIEW PAIRED_TX AS
> SELECT a.EMPLOYEE#,
> a.JOB#,
> a.STEP#,
> a.MACHINE#,
> TX_TIME AS TIME_ON,
> COALESCE(
> (SELECT MIN(TX_TIME)
> FROM TX
> WHERE EMPLOYEE# = a.EMPLOYEE#
> AND JOB# = a.JOB#
> AND STEP# = a.STEP#
> AND MACHINE# = a.MACHINE#
> AND TX_TIME > a.TX_TIME
> AND TX_TYPE = 'OFF'),
> CURRENT_TIMESTAMP) AS TIME_OFF
> FROM TX a
> WHERE TX_TYPE = 'ON'
>
> This view enumerates the intervals per employee. Note an additional row
> with STOP_TIME as the current time. This is needed to deal with unpaired
> rows in TX.
>
> CREATE VIEW INTERVALS AS
> SELECT a.EMPLOYEE#, a.START_TIME, a.STOP_TIME
> FROM
> (SELECT DISTINCT a.EMPLOYEE#,
> a.TX_TIME AS START_TIME,
> (SELECT MIN(TX_TIME)
> FROM TX
> WHERE EMPLOYEE# = a.EMPLOYEE#
> AND TX_TIME > a.TX_TIME) AS STOP_TIME
> FROM TX a
> UNION
> SELECT EMPLOYEE#,
> MAX(TX_TIME) AS START_TIME,
> CURRENT_TIMESTAMP AS STOP_TIME
> FROM TX
> GROUP BY EMPLOYEE#
> HAVING SUM(CASE WHEN TX_TYPE = 'ON' THEN 1 ELSE -1 END) > 0) a
> WHERE STOP_TIME IS NOT NULL
>
> Here's the query. Sorry about the formatting: if you paste it into your
> favorite fixed-width font editor, it will be more readable. You may need a
> substitute for DATEDIFF for your favorite DBMS. This works on SQL Server.
>
> SELECT JOB#,
> STEP#,
> SUM(CAST(DATEDIFF(ss, d.START_TIME, d.STOP_TIME) AS FLOAT) / PAIRS) /
> 3600.0 AS LABOR
> FROM
> --this part of the query calculates the number of paired rows that occupy
> each interval
> (SELECT b.EMPLOYEE#, START_TIME, STOP_TIME, CAST(COUNT(*) AS FLOAT) AS
> PAIRS
> FROM PAIRED_TX a
> JOIN INTERVALS b
> ON (b.EMPLOYEE# = a.EMPLOYEE#
> AND b.START_TIME >= a.TIME_ON
> AND b.STOP_TIME <= a.TIME_OFF)
> GROUP BY b.EMPLOYEE#, b.START_TIME, b.STOP_TIME) c
> JOIN
> --this part of the query splits each pair into individual intervals.
> (SELECT a.EMPLOYEE#, a.JOB#, a.STEP#, a.MACHINE#, START_TIME,
> STOP_TIME
> FROM PAIRED_TX a
> JOIN INTERVALS b
> ON (b.EMPLOYEE# = a.EMPLOYEE#
> AND b.START_TIME >= a.TIME_ON
> AND b.STOP_TIME <= a.TIME_OFF)) d
> ON (d.EMPLOYEE# = c.EMPLOYEE#
> AND d.START_TIME = c.START_TIME
> AND d.STOP_TIME = c.STOP_TIME)
> GROUP BY JOB#, STEP#
>
> The results from the supplied test data:
> JOB# STEP# LABOR
> -------- -------- ----------
> 1 1 1.5
> 2 1 1.0
> 1 2 0.5
> 2 2 1.0
>
> >> Since the final pass is common to both approaches, the procedural
> >> approach
> >> effectively replaces five passes through the data with a single pass.
> >>
> >> Why would the optimizer fail? Because I don't think it's possible for
> >> the
> >> optimizer to detect every intrarelational dependency in a table, nor do I
> >> think it would be advisable, unless it were possible to enumerate them in
> >> the schema.
> >>
> > Why would you want to have dependencies in the data that were NOT
> > enumerated in the schema (assuming I understand what you mean by
> > "enumerated")? Doesn't the schema describe the data?
> >
>
> There may be an implicit order to the data, such as is the case with
> temporal data.
>
> >
> > I hope it's not too much trouble to clear these issues up first, before
> > we get to the meat of it.
> >
> >
> >
>
>

I decided to look into this example a little bit at a time. First, I tackled an easier problem: I built a table, called "Segments", that looks like this.

"Employee","Job","Step","Machine","Segment_Start","Segment_End","Pairing"

1,1,1,1,5/3/07 8:00:00,5/3/07 9:00:00,2
1,2,1,1,5/3/07 8:00:00,5/3/07 9:00:00,2
1,2,1,1,5/3/07 9:00:00,5/3/07 9:30:00,1
1,1,1,1,5/3/07 10:00:00,5/3/07 11:00:00,1
1,1,2,1,5/3/07 11:30:00,5/3/07 12:00:00,1
1,2,2,1,5/3/07 13:00:00,5/3/07 14:00:00,1

This is from MS Access. (Ok, snicker if you must, but it's what I have on this particular computer.) Note that I changed "Employee#" to "Employee" and so forth. Sorry, Brian.

Creating the labor query is a snap starting from here. Here's the SQL, again from MS Access.

SELECT Segments.Job, Segments.Step,
Sum(DateDiff("s",Segment_Start,Segment_End)/3600/Pairing) AS Labor FROM Segments
GROUP BY Segments.Job, Segments.Step;

Which gives me this:

Job    Step    Labor

1      1       1.5
1      2       0.5
2      1       1
2      2       1


Same as Brian's, except for order.
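For anyone who wants to sanity-check the arithmetic outside a DBMS, here's a minimal Python sketch (my own illustration, not part of the original exchange) of the same aggregation over the six Segments rows above: labor per (Job, Step) is the sum of each segment's duration in hours divided by its Pairing.

```python
from datetime import datetime
from collections import defaultdict

# The six Segments rows from the table above:
# (Employee, Job, Step, Machine, Segment_Start, Segment_End, Pairing)
segments = [
    (1, 1, 1, 1, datetime(2007, 5, 3, 8, 0),   datetime(2007, 5, 3, 9, 0),  2),
    (1, 2, 1, 1, datetime(2007, 5, 3, 8, 0),   datetime(2007, 5, 3, 9, 0),  2),
    (1, 2, 1, 1, datetime(2007, 5, 3, 9, 0),   datetime(2007, 5, 3, 9, 30), 1),
    (1, 1, 1, 1, datetime(2007, 5, 3, 10, 0),  datetime(2007, 5, 3, 11, 0), 1),
    (1, 1, 2, 1, datetime(2007, 5, 3, 11, 30), datetime(2007, 5, 3, 12, 0), 1),
    (1, 2, 2, 1, datetime(2007, 5, 3, 13, 0),  datetime(2007, 5, 3, 14, 0), 1),
]

# Equivalent of the GROUP BY query: sum hours / pairing per (job, step).
labor = defaultdict(float)
for emp, job, step, mach, start, end, pairing in segments:
    labor[(job, step)] += (end - start).total_seconds() / 3600.0 / pairing

print(sorted(labor.items()))
# -> [((1, 1), 1.5), ((1, 2), 0.5), ((2, 1), 1.0), ((2, 2), 1.0)]
```

One pass over the segments, no joins, which is the whole point of materializing the Segments table first.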

The worst I expect from this is a sweep through the data and a sort, in order to do the "GROUP BY". In the best of all possible worlds, there's an index that will obviate the need for the sort.

Now to compare this input with the data Brian gave me, in table TX.

The first thing to note is that, if you treat this data as a graph, the TXes are nodes and the Segments are arcs, subject to the following provisos:

  1. You would expect, since the TX data contained five pairs, that there would be five segments, but there are actually six. One of the pairs from the TX data has been split into two segments, in order to keep the Pairing data simple. In other words, whenever the fraction of an employee that's dedicated to a machine changes in the middle of a monitoring interval, the interval gets split into two segments (which may get further split down the line).
  2. The two TXes at 8:00 would actually project into a single node, with two arcs coming out of it. This is useful, because it makes computing the pairing easy.
  3. The TX at 9:00 is interesting, because it becomes two segment ending points, and one segment start point. This sort of thing will happen whenever an employee is reassigned from 2 machines to 3, or, as in this case, from 2 machines down to 1.
  4. I'm not dealing with dangling monitor assignments (missing the 'OFF' entry) yet. I expect to supply that data when the Segments table is constructed, so the Segments table would contain any necessary segments to complete any assignments that were still in progress when the data was posted.
  5. In order to connect all the activities for a single employee into a single graph, one might need arcs for "idle time", that is, time during which an employee was assigned to zero machines. An example from the data is from 9:30 to 10:00. For purposes of the labor query, this is unnecessary.
  6. Extending the problem to multiple employees seems straightforward. Each employee would have his own graph of assignments, but they could all be stored in the Segments table.
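The construction those provisos describe amounts to a single chronological sweep per employee: at each transaction time, close out the interval for every currently active (job, step, machine) assignment, recording how many assignments shared it (the Pairing), then apply that time's ON/OFF transactions. Here's a rough Python sketch of that sweep; this is my own illustration, not Brian's procedural code, and the function name and tuple layout are just assumptions. Per proviso 4, dangling 'ON' rows are simply left open here rather than closed out at posting time.

```python
from collections import defaultdict
from datetime import datetime

def tx_to_segments(tx_rows):
    """Hypothetical one-pass sweep: TX rows in, Segments rows out.
    Each output row is (employee, job, step, machine, start, end, pairing)."""
    by_emp = defaultdict(list)
    for emp, job, step, mach, tx_type, t in tx_rows:
        by_emp[emp].append((t, tx_type, (job, step, mach)))

    segments = []
    for emp, events in sorted(by_emp.items()):
        events.sort()                      # chronological sweep per employee
        active, prev = set(), None
        for t, tx_type, assignment in events:
            # Close the interval [prev, t] for every active assignment;
            # the pairing is simply the number of concurrent assignments.
            if active and prev is not None and t > prev:
                for job, step, mach in sorted(active):
                    segments.append((emp, job, step, mach, prev, t, len(active)))
            # Then apply this transaction.
            if tx_type == 'ON':
                active.add(assignment)
            else:
                active.discard(assignment)
            prev = t
    return segments

# Brian's test data:
d = lambda h, m=0: datetime(2007, 5, 3, h, m)
tx = [(1, 1, 1, 1, 'ON',  d(8)),      (1, 2, 1, 1, 'ON',  d(8)),
      (1, 1, 1, 1, 'OFF', d(9)),      (1, 2, 1, 1, 'OFF', d(9, 30)),
      (1, 1, 1, 1, 'ON',  d(10)),     (1, 1, 1, 1, 'OFF', d(11)),
      (1, 1, 2, 1, 'ON',  d(11, 30)), (1, 1, 2, 1, 'OFF', d(12)),
      (1, 2, 2, 1, 'ON',  d(13)),     (1, 2, 2, 1, 'OFF', d(14))]
for seg in tx_to_segments(tx):
    print(seg)   # yields the six Segments rows shown earlier
```

Note how the splitting in proviso 1 falls out for free: a segment boundary appears at every transaction time, so the 8:00-9:30 pair gets cut at 9:00 automatically.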

So, my next step is to look for some cool techniques for computing all the arcs of a graph, given all the nodes. I figure there's probably a write-up on that somewhere. With any luck, I'll end up with something that is logically correct, set-oriented, optimizer-friendly, coherent and legible, and hopefully flexible enough to revise and extend as the scope creeps.

I could end up having to cede the point to Brian, and adopt a cursor-based approach for converting TX into Segments. It seems eminently doable, although I haven't tried.

I have no idea how to find out what algorithms MS Access uses. I wouldn't care, anyway. A good optimizer would recognize the cardinalities in this test data, and realize that optimizing is a waste of time for such small numbers. I'll wait until I've got something worth trying, and then see if I can get data in the thousands or tens of thousands of rows, before I even ask an optimizer to strategize on the query.

Brian, I'm pretty sure this isn't what you were looking for, but it's a baby step, and I think in the right direction.
