Re: Self Joins and optimization

From: Brian Selzer <brian_at_selzer-software.com>
Date: Sat, 12 May 2007 00:58:50 GMT
Message-ID: <ej81i.8350$rO7.3542_at_newssvr25.news.prodigy.net>


"David Cressey" <cressey73_at_verizon.net> wrote in message news:OW41i.24$w51.5_at_trndny09...
>
> "Brian Selzer" <brian_at_selzer-software.com> wrote in message
> news:JBe%h.16839$YL5.304_at_newssvr29.news.prodigy.net...
>>
>> "David Cressey" <cressey73_at_verizon.net> wrote in message
>> news:Vg5%h.166$83.81_at_trndny08...
>> >
>> > "Brian Selzer" <brian_at_selzer-software.com> wrote in message
>> > news:MQ2%h.4456$H_.2052_at_newssvr21.news.prodigy.net...
>> >>
>> >> "David Cressey" <cressey73_at_verizon.net> wrote in message
>> >> news:ZE%_h.163$D52.68_at_trndny04...
>> >> > In the "interpolation" thread, Brian has been expounding on the idea
>> >> > that he can discover algorithms that he knows to be correct, and
>> >> > that outperform anything an optimizer can generate. He's mentioned
>> >> > "self joins" and the idea that he can get a job done with only one
>> >> > read, where the automatic solution generated in response to
>> >> > declarative code generates multiple passes through the data.
>> >> >
>> >> > My experience leads me to the opposite conclusion from that.
>> >> >
>> >> > Here are the questions I'd like the newsgroup to consider:
>> >> >
>> >> > What, if any, algorithms that are demonstrably correct are going to
>> >> > be overlooked by the optimizer, although discoverable by a human
>> >> > coder? Why would an optimizer fail to discover the possibility of
>> >> > eliminating self-joins? Are there any undecidability issues here?
>> >> >
>> >> Here's a maybe not-so-simple example:
>> >>
>> >> Given a table with the columns,
>> >>
>> >> EMPLOYEE#, JOB#, STEP#, MACHINE#, TX_TIME, TX_TYPE
>> >>
>> >> where TX_TYPE can be either 'ON' or 'OFF' and the key is the entire
>> >> heading,
>> >>
>> >
>> > I hope I can ask a bunch of questions without derailing your example.
>> > I need to understand the example before I can come up with either a
>> > counter-argument or agreement.
>> >
>> > First, a matter of simple syntax: what does "#" mean? Is it part of
>> > the attribute name, or does it indicate something about the attribute?
>> > Is this SQL or some other language?
>> >
>> It's just part of the attribute name.
>> >
>> >> compute the total labor performed to date for each job step.
>> >>
>> >
>> > What is a job step? How is it identified?
>> >
>> A job step is one of the steps required to complete a job, or work
>> order. The distinction between a job and a work order is not material
>> to the discussion. A job step is identified by the conjunction of a
>> value for each of the attributes JOB# and STEP#.
>> >
>> >> Notes: (1) an employee cannot be performing the same job step on the
>> >> same machine more than once at the same time, but (2) intervals can
>> >> overlap, and when they do, labor should be distributed between them
>> >> evenly. For example, an employee may be monitoring two or more
>> >> machines during the same period of time, so his labor during that
>> >> period should be divided evenly between the job steps on each
>> >> machine, but that same employee cannot be monitoring the same job
>> >> step on the same machine more than once during the same period of
>> >> time.
>> >>
>> >
>> > How do you guarantee that the data will not violate the constraints
>> > declared in the above notes? Are there declared constraints to
>> > prevent the insertion of contradictory data, or is it just "careful
>> > programming" on the part of those who write the SW that writes the
>> > data to the DB?
>> >
>>
>> I would probably use a trigger. It would be nice to be able to declare
>> a temporal constraint, but I expect it would have to be applied to a
>> view because of the nature of the data in the table. There is one other
>> constraint that I didn't mention: for every row with 'OFF', there must
>> exist a row with 'ON' such that EMPLOYEE#, JOB#, STEP#, MACHINE# are
>> the same and TX_TIME in the row containing 'ON' < TX_TIME in the row
>> containing 'OFF'.
>>
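For concreteness, here's a rough sketch of the kind of trigger I have in
mind -- untested, SQL Server syntax, and it checks only the ON-before-OFF
constraint, not the overlap rules:

CREATE TRIGGER TX_OFF_REQUIRES_ON ON TX
AFTER INSERT
AS
-- reject any 'OFF' row that has no earlier matching 'ON' row
IF EXISTS
   (SELECT *
    FROM inserted i
    WHERE i.TX_TYPE = 'OFF'
      AND NOT EXISTS
         (SELECT *
          FROM TX t
          WHERE t.EMPLOYEE# = i.EMPLOYEE#
            AND t.JOB# = i.JOB#
            AND t.STEP# = i.STEP#
            AND t.MACHINE# = i.MACHINE#
            AND t.TX_TYPE = 'ON'
            AND t.TX_TIME < i.TX_TIME))
BEGIN
   RAISERROR('Every ''OFF'' row must follow a matching ''ON'' row.', 16, 1)
   ROLLBACK TRANSACTION
END
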
>> > If the constraints were declared, wouldn't this information be
>> > available to the optimizer?
>> >
>>
>> I would assume so, but I can't think of how either of the constraints
>> could make a difference in this case.
>>
>> >
>> >> A procedural approach would require two passes through the data: one
>> >> to compute labor per job step per employee, and one to compute the
>> >> aggregate, labor per job step.
>> >>
>> >> A set-based approach would require a minimum of six passes through
>> >> the data: Determining the intervals during which each employee was
>> >> performing a set of job steps requires a left outer self-join, or two
>> >> passes through the data. Computing the cardinality of each of those
>> >> sets so that labor can be distributed evenly requires an aggregation
>> >> and then a join--three additional passes. Computing the aggregate,
>> >> labor per job step, requires a final pass.
>> >>
>> >
>> > Could you provide SQL for a set-based query that computes the correct
>> > result? This would help me to understand the example better.
>> >
>>
>> In order to make things easier to follow, instead of a single query, I
>> split the query up by using a couple of views.
>>
>> The table definition:
>>
>> CREATE TABLE TX
>> (
>>    EMPLOYEE# INT NOT NULL,
>>    JOB#      INT NOT NULL,
>>    STEP#     INT NOT NULL,
>>    MACHINE#  INT NOT NULL,
>>    TX_TYPE   CHAR(3) NOT NULL
>>       CHECK (TX_TYPE = 'ON' OR TX_TYPE = 'OFF'),
>>    TX_TIME   DATETIME NOT NULL,
>>    PRIMARY KEY (EMPLOYEE#, JOB#, STEP#, MACHINE#, TX_TYPE, TX_TIME)
>> )
>>
>> A little bit of test data:
>>
>> INSERT TX VALUES (1,1,1,1,'ON', '2007-05-03 08:00:00')
>> INSERT TX VALUES (1,2,1,1,'ON', '2007-05-03 08:00:00')
>> INSERT TX VALUES (1,1,1,1,'OFF', '2007-05-03 09:00:00')
>> INSERT TX VALUES (1,2,1,1,'OFF', '2007-05-03 09:30:00')
>> INSERT TX VALUES (1,1,1,1,'ON', '2007-05-03 10:00:00')
>> INSERT TX VALUES (1,1,1,1,'OFF', '2007-05-03 11:00:00')
>> INSERT TX VALUES (1,1,2,1,'ON', '2007-05-03 11:30:00')
>> INSERT TX VALUES (1,1,2,1,'OFF', '2007-05-03 12:00:00')
>> INSERT TX VALUES (1,2,2,1,'ON', '2007-05-03 13:00:00')
>> INSERT TX VALUES (1,2,2,1,'OFF', '2007-05-03 14:00:00')
>>
>> This view pairs up the rows in TX. For pairs without an 'OFF' row,
>> substitute the current time for TIME_OFF:
>>
>> CREATE VIEW PAIRED_TX AS
>> SELECT a.EMPLOYEE#,
>>        a.JOB#,
>>        a.STEP#,
>>        a.MACHINE#,
>>        a.TX_TIME AS TIME_ON,
>>        COALESCE(
>>           (SELECT MIN(TX_TIME)
>>            FROM TX
>>            WHERE EMPLOYEE# = a.EMPLOYEE#
>>              AND JOB# = a.JOB#
>>              AND STEP# = a.STEP#
>>              AND MACHINE# = a.MACHINE#
>>              AND TX_TIME > a.TX_TIME
>>              AND TX_TYPE = 'OFF'),
>>           CURRENT_TIMESTAMP) AS TIME_OFF
>> FROM TX a
>> WHERE a.TX_TYPE = 'ON'
>>
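Traced by hand against the test data above, PAIRED_TX should produce:

EMPLOYEE#  JOB#  STEP#  MACHINE#  TIME_ON              TIME_OFF
1          1     1      1         2007-05-03 08:00:00  2007-05-03 09:00:00
1          2     1      1         2007-05-03 08:00:00  2007-05-03 09:30:00
1          1     1      1         2007-05-03 10:00:00  2007-05-03 11:00:00
1          1     2      1         2007-05-03 11:30:00  2007-05-03 12:00:00
1          2     2      1         2007-05-03 13:00:00  2007-05-03 14:00:00
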
>> This view enumerates the intervals per employee. Note the additional
>> row (one per employee with an unmatched 'ON') whose STOP_TIME is the
>> current time; it's needed to deal with unpaired rows in TX.
>>
>> CREATE VIEW INTERVALS AS
>> SELECT a.EMPLOYEE#, a.START_TIME, a.STOP_TIME
>> FROM
>>    (SELECT DISTINCT a.EMPLOYEE#,
>>            a.TX_TIME AS START_TIME,
>>            (SELECT MIN(TX_TIME)
>>             FROM TX
>>             WHERE EMPLOYEE# = a.EMPLOYEE#
>>               AND TX_TIME > a.TX_TIME) AS STOP_TIME
>>     FROM TX a
>>     UNION
>>     SELECT EMPLOYEE#,
>>            MAX(TX_TIME) AS START_TIME,
>>            CURRENT_TIMESTAMP AS STOP_TIME
>>     FROM TX
>>     GROUP BY EMPLOYEE#
>>     HAVING SUM(CASE WHEN TX_TYPE = 'ON' THEN 1 ELSE -1 END) > 0) a
>> WHERE STOP_TIME IS NOT NULL
>>
>> Here's the query. Sorry about the formatting; if you paste it into your
>> favorite fixed-font editor, it will be more readable. You may need a
>> substitute for DATEDIFF for your favorite DBMS. This works on SQL
>> Server.
>>
>> SELECT JOB#,
>>        STEP#,
>>        SUM(CAST(DATEDIFF(ss, d.START_TIME, d.STOP_TIME) AS FLOAT)
>>            / PAIRS) / 3600.0 AS LABOR
>> FROM
>>    --this part of the query calculates the number of paired rows
>>    --that occupy each interval
>>    (SELECT b.EMPLOYEE#, START_TIME, STOP_TIME,
>>            CAST(COUNT(*) AS FLOAT) AS PAIRS
>>     FROM PAIRED_TX a
>>     JOIN INTERVALS b
>>       ON (b.EMPLOYEE# = a.EMPLOYEE#
>>           AND b.START_TIME >= a.TIME_ON
>>           AND b.STOP_TIME <= a.TIME_OFF)
>>     GROUP BY b.EMPLOYEE#, b.START_TIME, b.STOP_TIME) c
>> JOIN
>>    --this part of the query splits each pair into individual intervals
>>    (SELECT a.EMPLOYEE#, a.JOB#, a.STEP#, a.MACHINE#, START_TIME, STOP_TIME
>>     FROM PAIRED_TX a
>>     JOIN INTERVALS b
>>       ON (b.EMPLOYEE# = a.EMPLOYEE#
>>           AND b.START_TIME >= a.TIME_ON
>>           AND b.STOP_TIME <= a.TIME_OFF)) d
>>   ON (d.EMPLOYEE# = c.EMPLOYEE#
>>       AND d.START_TIME = c.START_TIME
>>       AND d.STOP_TIME = c.STOP_TIME)
>> GROUP BY JOB#, STEP#
>>
>> The results from the supplied test data:
>> JOB# STEP# LABOR
>> -------- -------- ----------
>> 1 1 1.5
>> 2 1 1.0
>> 1 2 0.5
>> 2 2 1.0
>>
>> >> Since the final pass is common to both approaches, the procedural
>> >> approach effectively replaces five passes through the data with a
>> >> single pass.
>> >>
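To make "a single pass" concrete, here is roughly what I mean -- an
untested, cursor-based T-SQL sketch (the cursor and table-variable names
are mine, for illustration only):

-- One ordered scan of TX. @open tracks which job steps are currently
-- 'ON' for the employee being scanned; each event charges the time
-- elapsed since the previous event evenly across the open steps.
DECLARE @emp INT, @job INT, @step INT, @mach INT,
        @type CHAR(3), @time DATETIME,
        @prev_emp INT, @prev_time DATETIME
DECLARE @open  TABLE (JOB# INT, STEP# INT, MACHINE# INT)
DECLARE @labor TABLE (JOB# INT, STEP# INT, LABOR FLOAT)

DECLARE c CURSOR FAST_FORWARD FOR
   SELECT EMPLOYEE#, JOB#, STEP#, MACHINE#, TX_TYPE, TX_TIME
   FROM TX
   ORDER BY EMPLOYEE#, TX_TIME

OPEN c
FETCH NEXT FROM c INTO @emp, @job, @step, @mach, @type, @time
WHILE @@FETCH_STATUS = 0
BEGIN
   IF @prev_emp = @emp AND @prev_time < @time
      -- charge the slice since the last event to every open step
      INSERT @labor (JOB#, STEP#, LABOR)
      SELECT JOB#, STEP#,
             CAST(DATEDIFF(ss, @prev_time, @time) AS FLOAT) / 3600.0
             / (SELECT COUNT(*) FROM @open)
      FROM @open
   ELSE IF @prev_emp IS NULL OR @prev_emp <> @emp
      DELETE @open   -- new employee: no steps are open yet

   IF @type = 'ON'
      INSERT @open VALUES (@job, @step, @mach)
   ELSE
      DELETE @open
      WHERE JOB# = @job AND STEP# = @step AND MACHINE# = @mach

   SELECT @prev_emp = @emp, @prev_time = @time
   FETCH NEXT FROM c INTO @emp, @job, @step, @mach, @type, @time
END
CLOSE c
DEALLOCATE c

-- Steps still 'ON' when an employee's rows run out would get one more
-- slice up to CURRENT_TIMESTAMP; omitted here for brevity.
SELECT JOB#, STEP#, SUM(LABOR) AS LABOR
FROM @labor
GROUP BY JOB#, STEP#

On the test data this should yield the same results as the set-based
query, but it reads TX exactly once.
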
>> >> Why would the optimizer fail? Because I don't think it's possible
>> >> for the optimizer to detect every intrarelational dependency in a
>> >> table, nor do I think it would be advisable, unless it were possible
>> >> to enumerate them in the schema.
>> >>
>> > Why would you want to have dependencies in the data that were NOT
>> > enumerated in the schema (assuming I understand what you mean by
>> > "enumerated")? Doesn't the schema describe the data?
>> >
>>
>> There may be an implicit order to the data, such as is the case with
>> temporal data.
>>
>> >
>> > I hope it's not too much trouble to clear these issues up first,
>> > before we get to the meat of it.
>> >
>> >
>> >
>>
>>
>
> I decided to look into this example a little bit at a time. First, I
> decided to tackle an easier problem. I built a table called "Segments"
> that looks like this:
>
> "Employee","Job","Step","Machine","Segment_Start","Segment_End","Pairing"
> 1,1,1,1,5/3/07 8:00:00,5/3/07 9:00:00,2
> 1,2,1,1,5/3/07 8:00:00,5/3/07 9:00:00,2
> 1,2,1,1,5/3/07 9:00:00,5/3/07 9:30:00,1
> 1,1,1,1,5/3/07 10:00:00,5/3/07 11:00:00,1
> 1,1,2,1,5/3/07 11:30:00,5/3/07 12:00:00,1
> 1,2,2,1,5/3/07 13:00:00,5/3/07 14:00:00,1
>
> This is from MS Access. (OK, snicker if you must, but it's what I have
> on this particular computer.) Note that I changed "Employee#" to
> "Employee" and so forth. Sorry, Brian.
>
> Creating the labor query is a snap starting from here. Here's the SQL,
> again from MS Access.
>
> SELECT Segments.Job, Segments.Step,
> Sum(DateDiff("s",Segment_Start,Segment_End)/3600/Pairing) AS Labor
> FROM Segments
> GROUP BY Segments.Job, Segments.Step;
>
> Which gives me this:
>
> Job Step Labor
> 1 1 1.5
> 1 2 0.5
> 2 1 1
> 2 2 1
>
>
> Same as Brian's, except for order.
>
> The worst I expect from this is a sweep through the data and a sort, in
> order to do the "GROUP BY". In the best of all possible worlds, there's
> an index that will obviate the need for the sort.
>
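As an aside, such an index might look something like this (hypothetical
name; whether the Access/Jet optimizer actually uses it for the GROUP BY
is another question):

CREATE INDEX IX_Segments_Job_Step ON Segments (Job, Step)
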
> Now to compare this input with the data Brian gave me, in table TX.
>
> The first thing to note is that, if you treat this data as a graph, the
> TXes are nodes and the Segments are arcs, subject to the following
> provisos:
>
> 1. You would expect, since the TX data contained 5 pairs, that there
> would be five segments, but there are actually six. One of the pairs
> from the TX data has been split into two segments, in order to keep the
> Pairing data simple. In other words, whenever the fraction of an
> employee that's dedicated to a machine changes in the middle of a
> monitoring interval, the interval gets split into two segments (which
> may get further split down the line).
>
> 2. The two TXes at 8:00 would actually project into a single node, with
> two arcs coming out of it. This is useful, because it makes computing
> the pairing easy.
>
> 3. The TX at 9:00 is interesting, because it becomes two segment ending
> points, and one segment start point. This sort of thing will happen
> whenever an employee is reassigned from 2 machines to 3, or, as in this
> case, from 2 machines down to 1.
>
> 4. I'm not dealing with dangling monitor assignments (missing the 'OFF'
> entry) yet. I expect to supply that data when the Segments table is
> constructed. So the Segments table would contain any necessary segments
> to complete any assignments that were still in progress when the data
> was posted.
>
> 5. In order to connect all the activities for a single employee into a
> single graph, one might need arcs for "idle time", that is, time during
> which an employee was assigned to zero machines. An example from the
> data is from 9:30 to 10:00. For purposes of the labor query, this is
> unnecessary.
>
> 6. Extending the problem to multiple employees seems straightforward.
> Each employee would have his own graph of assignments, but they could
> all be stored in the Segments table.
>
> So, my next step is to look for some cool techniques for computing all
> the arcs of a graph, given all the nodes. I figure there's probably a
> write-up on that somewhere. With any luck, I'll end up with something
> that is: logically correct, set-oriented, optimizer-friendly, coherent
> and legible, and hopefully flexible enough to revise and extend as the
> scope creeps.
>
> I could end up having to cede the point to Brian, and adopt a
> cursor-based approach for converting TX into Segments. It seems
> eminently doable, although I haven't tried.
>
> I have no idea how to find out what algorithms MS Access uses. I
> wouldn't care, anyway. A good optimizer would recognize the
> cardinalities in this test data, and realize that optimizing is a waste
> of time for such small numbers. I'll wait until I've got something
> worth trying, and then see if I can get data in the thousands or tens
> of thousands of rows, before I even ask an optimizer to strategize on
> the query.
>
> Brian, I'm pretty sure this isn't what you were looking for, but it's a
> baby step, and I think in the right direction.
>

Don't take this the wrong way: I understand that this is a first attempt at a very complicated problem (I spent several weeks on it back in 2002), but I really don't think the above solution is a move in the right direction. In fact, it is very similar to what I moved away from. It is essentially the same file layout that existed in the old DOS-based Btrieve database that I replaced, except that instead of a Pairing column, it had two fields, HOURS and CHGD_HRS, for elapsed hours and charged hours respectively, and it had an additional STATUS field, which indicated whether or not a split had occurred. (If you think about it, you'll see that the above schema doesn't map directly to TX, because it's possible to have an 'OFF' row and an 'ON' row for an employee, job, step and machine with the same TX_TIME. The STATUS field ensured that there was a 1:1 mapping between what was entered from the terminals and what was stored in the database.)

That solution suffered from a number of problems, not the least of which was that it originally used the special value '01/01/01' in the DATE_OFF field to deal with the missing 'OFF' entries. (They patched that for Y2K by using '01/01/50' instead!?!) From a theoretical standpoint, the main problem with it is redundancy. For example, if a row is deleted, then the Pairing for every other row for that employee during that interval must be updated. The situation is even worse for inserts: if a row is inserted that doesn't coincide exactly with any interval currently in existence for an employee, then the segments for each row for that employee during that interval must be split, with new values for Pairing to account for the additional row. I don't want to get into it here, but constraint checking is also much more difficult. Suffice it to say: the above solution is a perfect example of why you shouldn't denormalize a database.
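
To make the delete anomaly concrete: suppose the 8:00-9:00 pair for job 1, step 1 is retracted. Against the Segments table above, something like this has to run (untested, and in SQL Server syntax rather than Access) just to keep Pairing consistent:

-- every other segment of that employee's 8:00-9:00 interval loses a peer
UPDATE Segments
SET Pairing = Pairing - 1
WHERE Employee = 1
  AND Segment_Start >= '2007-05-03 08:00:00'
  AND Segment_End <= '2007-05-03 09:00:00'
  AND NOT (Job = 1 AND Step = 1 AND Machine = 1)

-- and the retracted pair's own segments must go
DELETE Segments
WHERE Employee = 1 AND Job = 1 AND Step = 1 AND Machine = 1
  AND Segment_Start >= '2007-05-03 08:00:00'
  AND Segment_End <= '2007-05-03 09:00:00'

And even that isn't the end of it: the two job 2 segments (8:00-9:00 and 9:00-9:30) now carry the same Pairing and ought to be merged back into a single row.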

An alternate, normalized solution, but one that introduces nulls, would be to pair up each 'ON' row with its corresponding 'OFF' row, similar to the view that I provided, PAIRED_TX, but substituting null for CURRENT_TIMESTAMP to signify that the interval is indefinite. The problem with that is that the Pairing still needs to be calculated for each employee, job, step, machine instance during each interval in order to answer the Labor query. This solution does, however, eliminate a couple of the self-joins, allowing for a simplified execution plan, but I still don't think it can compete with the procedural solution. For one thing, identifying the intervals requires a union, which usually involves a sort, and then that result must be theta-joined to itself. Once identified, the Pairing must still be calculated for each interval.
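
In sketch form, the base table I have in mind would look something like this (the name and types are illustrative only):

CREATE TABLE WORK_PERIOD
(
   EMPLOYEE# INT NOT NULL,
   JOB#      INT NOT NULL,
   STEP#     INT NOT NULL,
   MACHINE#  INT NOT NULL,
   TIME_ON   DATETIME NOT NULL,
   TIME_OFF  DATETIME NULL,   -- NULL signifies an indefinite interval
   PRIMARY KEY (EMPLOYEE#, JOB#, STEP#, MACHINE#, TIME_ON)
)

Each row replaces an 'ON'/'OFF' pair from TX, but as I said, the Pairing still has to be computed at query time.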

Another alternate, normalized solution, and one that doesn't introduce nulls, would be to split the table horizontally, having one table for 'ON' rows and one table for 'OFF' rows. The difficulty with this solution is that in order to calculate the intervals, a theta join of the union of both tables to the union of both tables would be required, not to mention the theta join required to pair up the rows. I don't see any benefit over the procedural solution here.
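
Sketched out (again untested), the split and the joins it forces would look something like:

CREATE TABLE TX_ON
(
   EMPLOYEE# INT NOT NULL,
   JOB#      INT NOT NULL,
   STEP#     INT NOT NULL,
   MACHINE#  INT NOT NULL,
   TX_TIME   DATETIME NOT NULL,
   PRIMARY KEY (EMPLOYEE#, JOB#, STEP#, MACHINE#, TX_TIME)
)
-- TX_OFF would be identical

-- pairing the rows requires a theta join between the two tables
SELECT n.EMPLOYEE#, n.JOB#, n.STEP#, n.MACHINE#,
       n.TX_TIME AS TIME_ON,
       (SELECT MIN(f.TX_TIME)
        FROM TX_OFF f
        WHERE f.EMPLOYEE# = n.EMPLOYEE#
          AND f.JOB# = n.JOB#
          AND f.STEP# = n.STEP#
          AND f.MACHINE# = n.MACHINE#
          AND f.TX_TIME > n.TX_TIME) AS TIME_OFF
FROM TX_ON n

-- and enumerating the intervals still requires the union of both tables
SELECT EMPLOYEE#, TX_TIME FROM TX_ON
UNION
SELECT EMPLOYEE#, TX_TIME FROM TX_OFF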
