Re: Self Joins and optimization

From: Brian Selzer <brian_at_selzer-software.com>
Date: Thu, 17 May 2007 18:23:15 -0400
Message-ID: <oB43i.3100$4Y.2062_at_newssvr19.news.prodigy.net>


"David Cressey" <cressey73_at_verizon.net> wrote in message news:34Z2i.21146$5Z6.19472_at_trndny05...
>
> "Brian Selzer" <brian_at_selzer-software.com> wrote in message
> news:R%t1i.17665$YL5.5318_at_newssvr29.news.prodigy.net...
>>
>> "David Cressey" <cressey73_at_verizon.net> wrote in message
>> news:zEk1i.1282$4a1.80_at_trndny07...
>
[snip]
>> People make mistakes. Sometimes they enter the wrong machine#;
>> sometimes they enter the wrong step#; sometimes they even screw up
>> their own employee#. Someone has to fix them.
>
>
> It took me a while, but I just now noticed that your two assertions
> above are mutually contradictory. "What is in the database is assumed
> to be true" and "People make mistakes ... Someone has to fix them."
> are opposing arguments.
>

I guess you could look at it that way, but if you lie to the database in a consistent way, it really doesn't know any better.

> BTW, I finally came up with a set-oriented solution that could be
> matched against some procedural (cursor-oriented?) solution you might
> have. It seems that MS Access 97 lacks the CASE construct and the
> COALESCE function, so I had to make do with IIf and isnull (I'm
> suitably embarrassed). I used an OUTER JOIN instead of a subquery.
>
> Note: I only needed 3 self-joins (4 references to TX) instead of the
> number you gave. I wonder what the reason is for the difference.
>

I think it is the correlated subquery. It's a real killer, requiring a
partial scan of TX for each row in TX. If an employee had 1000 rows in
TX, each scan would touch about 500 of them on average, so the number
of fetches required just to compute the SUMs for that one employee
would be roughly 1000 * 500 = 500,000.
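
For comparison, a procedural pass can carry a running counter through
one ordered scan. Here's a rough sketch of that shape (T-SQL flavored;
not my actual code):

/* One ordered pass over TX, keeping a running ON/OFF counter per
   employee, so each row is fetched once instead of being rescanned
   by a correlated subquery. */
DECLARE @emp INT, @type CHAR(3), @t DATETIME, @pairing INT, @prev INT

DECLARE c CURSOR FOR
    SELECT EMPLOYEE, TX_TYPE, TX_TIME FROM TX
    ORDER BY EMPLOYEE, TX_TIME

OPEN c
FETCH NEXT FROM c INTO @emp, @type, @t
WHILE @@FETCH_STATUS = 0
BEGIN
    IF @prev IS NULL OR @prev <> @emp
        SET @pairing = 0  /* new employee: reset the running counter */
    SET @pairing = @pairing + CASE WHEN @type = 'ON' THEN 1 ELSE -1 END
    /* ... use @emp, @t, @pairing here in place of the correlated SUM ... */
    SET @prev = @emp
    FETCH NEXT FROM c INTO @emp, @type, @t
END
CLOSE c
DEALLOCATE c

One fetch per row, so the cost grows linearly with the size of TX
instead of with the square of the rows per employee.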

> At the end of this post I've included an SQL script derived from the MS
> Access solution, hopefully without transcription errors.
>
> Now what I need is enough data to make MS Access run slow, so that I can
> see a noticeable delay. ;) (The smiley face is included for lurkers who
> can't recognize irony without it.)
>
> Brian, can you tell me approximately how much time you saved your
> enterprise by doing this procedurally, instead of set-oriented?
>

I can tell you that at the time, for around 400,000 rows, the response
time of the set-based solution was around 3 minutes, whereas the
procedural solution ran in about 12 seconds; for around 1,600,000 rows,
the set-based solution took around 24 minutes, whereas the procedural
solution ran in about a minute. Note the scaling: 4 times the rows made
the set-based query 8 times slower, but the procedural one only about
5 times slower.

> /*
> The SEGMENTS view transforms the transitions given in TX into
> segments of a job step. The ratio between elapsed time and
> labor-charged time is constant within a segment.
>
> For performance comparison to a procedural approach to the same
> problem.
>
> DGC 5-16-07
> */

There's something about this query that doesn't look right. I may be
reading it wrong, but wouldn't this inject extra rows if there are
multiple rows in TX for an EMPLOYEE with the same TX_Time? For example,
if there were 3 rows in TX for EMPLOYEE e and TX_Time t, wouldn't there
be 9 rows in the result of the join? Is that what you intended? I can't
remember the correct term for this phenomenon: 2 + 2 = 2 * 2, but
3 + 3 != 3 * 3!
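
To make that concrete, here's a stripped-down version of the self join
(my sketch, not your view), counting matches on EMPLOYEE and TX_TIME
alone:

SELECT Count(*) AS MATCHES
FROM TX AS X
INNER JOIN TX AS Y
ON Y.EMPLOYEE = X.EMPLOYEE
AND Y.TX_TIME = X.TX_TIME;

/* Two rows sharing an EMPLOYEE and TX_TIME match each other
   2 * 2 = 4 times, which happens to equal 2 + 2; three such rows
   match 3 * 3 = 9 times, not 3 + 3 = 6. */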

>
> CREATE VIEW SEGMENTS AS
> SELECT
> A.Employee,
> C.Job,
> C.Step,
> C.Machine,
> A.TX_TIME AS START_TIME,
> min (IIf (isnull(B.TX_Time), Now(), B.TX_TIME)) AS STOP_TIME,
> (SELECT SUM(IIf (D.TX_Type = 'ON', 1, -1)) FROM TX AS D WHERE
> D.EMPLOYEE = A.EMPLOYEE
> AND D.TX_TIME <= A.TX_TIME) AS PAIRING
> FROM
> (TX AS A
> LEFT OUTER JOIN TX AS B
> ON B.Employee = A.Employee
> AND B.TX_Time > A.TX_Time)
> INNER JOIN TX AS C
> ON C.Employee = A.Employee
> AND C.TX_Time <= A.TX_Time
> GROUP BY
> A.Employee,
> C.Job,
> C.Step,
> C.Machine,
> A.TX_Time
> HAVING
> sum (IIf (C.TX_Type = 'ON', 1, -1)) > 0;
>
> /* The LABOR_COST view calculates the amount of employee labor
> allocated to each job step. */
>
>
> CREATE VIEW LABOR_COST AS
> SELECT
> Segments.Job,
> Segments.Step,
> Sum(DateDiff("s",START_TIME,STOP_TIME)/3600/Pairing) AS Labor
> FROM Segments
> GROUP BY
> Segments.Job,
> Segments.Step;
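
If I'm reading PAIRING right: with your test data, employee 1 is ON
both job 1 and job 2 from 08:00 to 09:00, so those segments carry
PAIRING = 2, and each job step should be charged
3600 s / 3600 / 2 = 0.5 hours for that interval. That part looks
right to me.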
>
> /* The TX table stores the inputs as given.
> The actual TX table was built using MS Access design view,
> and it is supposed to mimic what's given below. */
> CREATE TABLE TX
> (
> EMPLOYEE INT NOT NULL,
> JOB INT NOT NULL,
> STEP INT NOT NULL,
> MACHINE INT NOT NULL,
> TX_TYPE CHAR(3) NOT NULL
> CHECK (TX_TYPE = 'ON' OR TX_TYPE = 'OFF'),
> TX_TIME DATETIME NOT NULL,
> PRIMARY KEY (EMPLOYEE, JOB, STEP, MACHINE, TX_TYPE, TX_TIME)
> )
>
> /* A little bit of test data: */
>
> INSERT INTO TX VALUES (1,1,1,1,'ON', '2007-05-03 08:00:00')
> INSERT INTO TX VALUES (1,2,1,1,'ON', '2007-05-03 08:00:00')
> INSERT INTO TX VALUES (1,1,1,1,'OFF', '2007-05-03 09:00:00')
> INSERT INTO TX VALUES (1,2,1,1,'OFF', '2007-05-03 09:30:00')
> INSERT INTO TX VALUES (1,1,1,1,'ON', '2007-05-03 10:00:00')
> INSERT INTO TX VALUES (1,1,1,1,'OFF', '2007-05-03 11:00:00')
> INSERT INTO TX VALUES (1,1,2,1,'ON', '2007-05-03 11:30:00')
> INSERT INTO TX VALUES (1,1,2,1,'OFF', '2007-05-03 12:00:00')
> INSERT INTO TX VALUES (1,2,2,1,'ON', '2007-05-03 13:00:00')
> INSERT INTO TX VALUES (1,2,2,1,'OFF', '2007-05-03 14:00:00')
>
>
>
> /* Notes:
>
> In my implementation, I used "IIf" and "isnull" instead of "case" and
> "coalesce". "Case" and "coalesce" are better, IMO, but they aren't
> available in MS Access 97, AFAIK.
>
> An additional entry in TX, to provide an unmatched "ON", gives a
> better test of the data. It's probably a good idea to make it about
> an hour before the present time. */
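
For instance, something like this would do it (an unmatched ON, stamped
about an hour before the time of this post):

INSERT INTO TX VALUES (1,1,1,1,'ON', '2007-05-17 17:30:00')

That leaves employee 1 clocked ON with nothing to pair against, so the
isnull branch that substitutes Now() for the missing stop time actually
gets exercised.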