Re: Self Joins and optimization

From: Brian Selzer <brian_at_selzer-software.com>
Date: Sat, 12 May 2007 22:30:00 GMT
Message-ID: <Idr1i.17652$YL5.14033_at_newssvr29.news.prodigy.net>


"David Cressey" <cressey73_at_verizon.net> wrote in message news:Lci1i.4920$1X1.767_at_trndny02...
>
> "Brian Selzer" <brian_at_selzer-software.com> wrote in message
> news:ej81i.8350$rO7.3542_at_newssvr25.news.prodigy.net...
>> [big snip]
>
>> Don't take this the wrong way, I understand that this is a first attempt
>> at a very complicated problem (I spent several weeks on it back in 2002.),
>> but I really don't think the above solution is a move in the right
>> direction. In fact, it is very similar to what I moved away from. It is
>> essentially the same file layout that existed in the old DOS-based Btrieve
>> database that I replaced, except that ....
>
> Brian, don't take this the wrong way, either.
>
> I understand that you solved this problem some time ago, and that your
> solution is satisfactory to you. That's not the question I'm addressing
> at all. What I'm interested in exploring is your assertion of the
> non-existence of a set oriented solution to the problem, where the set
> oriented solution runs at approximately the same speed as your sequential
> processing solution.
>
> I'm not convinced that multiple self joins are in fact required in order
> to produce a set oriented solution. In particular, I'm interested in
> finding a solution that bypasses one of the steps in the set oriented
> solution you posted at my request. It's the step where each TX ON is
> matched with its corresponding TX OFF. That step seems unnecessary to me,
> although I can't prove it, yet.
>

Unfortunately, it is necessary. How else could you tell which activities an employee was performing during an interval bounded by events adjacent in time? Each row in TX contains only half of the information required to determine that, and likewise only half of the information required to determine the intervals between events.

Consider:

       t1....t2....t3....t4....t5....t6

a1:    --------------------------------
a2:           --------------------
a3:                 --------

During the interval t3..t4, the employee is performing activities a1, a2, and a3, but the pair of rows that define the interval t3..t4 reference only a3, and the rows adjacent to them (at t2 and t5) reference only a2. The entire interval for each activity must therefore be computed before it is possible to determine which of the intervals t1..t2, t2..t3, t3..t4, t4..t5, and t5..t6 overlap it.
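
To make that matching step concrete, here is a rough sketch in SQL. The layout is hypothetical (assume TX(EmpID, Activity, EventType, EventTime), with EventType either 'ON' or 'OFF' and one OFF following each ON), since the actual DDL hasn't been posted in this thread:

-- Pair each ON event with the first OFF event that follows it
-- for the same employee and activity (hypothetical column names):
SELECT t_on.EmpID,
       t_on.Activity,
       t_on.EventTime       AS StartTime,
       MIN(t_off.EventTime) AS EndTime
  FROM TX t_on
  JOIN TX t_off
    ON t_off.EmpID     = t_on.EmpID
   AND t_off.Activity  = t_on.Activity
   AND t_off.EventType = 'OFF'
   AND t_off.EventTime > t_on.EventTime
 WHERE t_on.EventType = 'ON'
 GROUP BY t_on.EmpID, t_on.Activity, t_on.EventTime

Note that this step is itself a self-join. An activity interval (StartTime, EndTime) built this way overlaps an event interval ti..tj exactly when StartTime < tj and ti < EndTime, and that comparison has to be made for every activity against every interval bounded by adjacent events.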

> If there is any point at issue between you and me, it's your assertion,
> in the thread that gave rise to this thread, that Cimode would have been
> better off using cursors in his approach to interpolation. I doubt that
> assertion, and I don't think your example, even if it stands up, proves
> that assertion.
>

The proof is in the pudding. Cimode has yet to provide additional information about the frequency of nulls in the data, so that apples can be compared to apples. But after further reflection, I think his query can be answered with two simultaneous passes through the data, provided an index exists on ID. (It is the key, so it follows that there should be an index.) In SQL Server, a table variable would be needed to cache the results, requiring an additional write and read per row and bringing the total to four passes. I haven't played with them much, but in Oracle, a pipelined table function could be used to stream the results directly, eliminating the write/read overhead. Furthermore, since the second cursor follows closely behind the first (see below), with only null rows in between, it is highly likely that little if any additional I/O would be required for the second pass through the data. I challenge anybody to find a set-based solution to Cimode's problem that would consistently require fewer than two passes through the table without increasing overhead in other ways, such as the cost of maintaining an additional index or a materialized view.

Here's pseudocode for the meat of a pipelined table function:

fetch x
fetch y
if x%FOUND then
  -- skip over stuff that can't be interpolated (initial null rows)
  while x.ArrivalTime is null and x%FOUND loop
    fetch x
    fetch y
  end loop
end if
if y%FOUND then
  fetch y
  -- if the last y.ArrivalTime is null, then we're done (skip terminal null rows)
  while y%FOUND loop
    -- skip ahead while accumulating total distance
    while y.ArrivalTime is null and y%FOUND loop
      accumulate TotalDistance
      fetch y
    end loop
    if y%FOUND then
      -- catch up while interpolating ArrivalTime
      while x.ArrivalTime is null loop
        calculate ArrivalTime
        adjust TotalDistance
        pipe row
        fetch x
      end loop
      reset TotalDistance
      -- don't need to interpolate
      ArrivalTime := x.ArrivalTime
      pipe row
      fetch x
      fetch y
    end if
  end loop
end if

In SQL Server, each "pipe row" above would require an insert into the table variable. In either case, the entire solution should execute in linear time.
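
Roughly, the SQL Server plumbing would look like this (the names are mine, not from the thread):

-- Hypothetical T-SQL equivalent of the pipelined plumbing:
DECLARE @results TABLE (ID int PRIMARY KEY, ArrivalTime datetime);

-- ... inside the cursor loop, in place of each "pipe row":
INSERT INTO @results (ID, ArrivalTime) VALUES (@ID, @ArrivalTime);

-- ... and once the loop completes, the extra read:
SELECT ID, ArrivalTime FROM @results;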

The key here is that nothing within the fetch loop, apart from the fetches themselves, touches the database. Two streams are attached as input to the function, and the resulting stream is attached to the enclosing SELECT, which presents the results. This operates in much the same way as a merge join step in an execution plan, which expects its two input streams to already be sorted and produces a single output stream.
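
For what it's worth, here is a sketch of what the Oracle version could look like. Since Cimode's actual schema isn't reproduced in this thread, everything below is assumed for illustration: a table Trips(ID, Distance, ArrivalTime) ordered by ID, where Distance is the positive leg distance from the prior row and ArrivalTime is null for the rows to be interpolated:

-- One-off supporting types for the pipelined result:
CREATE TYPE trip_row AS OBJECT (ID NUMBER, ArrivalTime DATE);
/
CREATE TYPE trip_tab AS TABLE OF trip_row;
/

CREATE OR REPLACE FUNCTION interpolate_arrivals RETURN trip_tab PIPELINED IS
  -- Two cursors over the same ordered rows; cy runs ahead of cx.
  CURSOR cx IS SELECT ID, Distance, ArrivalTime FROM Trips ORDER BY ID;
  CURSOR cy IS SELECT ID, Distance, ArrivalTime FROM Trips ORDER BY ID;
  x cx%ROWTYPE;
  y cy%ROWTYPE;
  left_time DATE;    -- ArrivalTime of the last known row behind x
  span_dist NUMBER;  -- total distance between the two known rows
  span_days NUMBER;  -- elapsed time (in days) between the two known rows
  run_dist  NUMBER;  -- distance covered so far within the current span
BEGIN
  OPEN cx;
  OPEN cy;
  FETCH cx INTO x;
  FETCH cy INTO y;
  -- Skip initial rows that have no known row behind them.
  WHILE cx%FOUND AND x.ArrivalTime IS NULL LOOP
    FETCH cx INTO x;
    FETCH cy INTO y;
  END LOOP;
  IF cx%FOUND THEN
    -- x sits on a known row: emit it as-is and remember it.
    PIPE ROW (trip_row(x.ID, x.ArrivalTime));
    left_time := x.ArrivalTime;
    FETCH cx INTO x;
    FETCH cy INTO y;
    WHILE cy%FOUND LOOP
      -- Run y ahead to the next known row, totalling the distance.
      span_dist := 0;
      WHILE cy%FOUND AND y.ArrivalTime IS NULL LOOP
        span_dist := span_dist + y.Distance;
        FETCH cy INTO y;
      END LOOP;
      EXIT WHEN NOT cy%FOUND;  -- terminal nulls can't be interpolated
      span_dist := span_dist + y.Distance;
      span_days := y.ArrivalTime - left_time;
      -- Let x catch up, interpolating by fraction of distance covered.
      run_dist := 0;
      WHILE x.ArrivalTime IS NULL LOOP
        run_dist := run_dist + x.Distance;
        PIPE ROW (trip_row(x.ID,
                           left_time + span_days * run_dist / span_dist));
        FETCH cx INTO x;
      END LOOP;
      -- x has reached y's known row: no interpolation needed.
      PIPE ROW (trip_row(x.ID, x.ArrivalTime));
      left_time := x.ArrivalTime;
      FETCH cx INTO x;
      FETCH cy INTO y;
    END LOOP;
  END IF;
  CLOSE cx;
  CLOSE cy;
  RETURN;
END;
/

-- Used as a streamed row source:
-- SELECT ID, ArrivalTime FROM TABLE(interpolate_arrivals()) ORDER BY ID;

Note that cy never runs more than one gap ahead of cx, so the second pass should find its blocks already in cache, which is the point about the second cursor following closely behind the first.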

> Your assertion has enormous implications, both practical and theoretical,
> if it turns out to be true. I can comment better on the practical
> implications than on the theoretical ones.
>
> In practice, many database consultants and/or developers have made a
> minor industry out of weeding cursor oriented code out of database
> applications and substituting well engineered set oriented solutions. If
> your assertion is correct, nearly all of these improvements were
> substantially less than optimal, at least from a speed point of view.
> For example, on December 28, 2006, Kevin Kirkpatrick (I think) outlined a
> case where he wrote about 300 lines of SQL that did, in 45 minutes, what
> would have taken about 300 hours using the procedural solution he
> replaced. Kevin's example is more startling than yours. And, IMO, more
> convincing.
>

I don't dispute the benefits of well-engineered set-oriented code, and I am not asserting that procedure-oriented solutions are always better. I want to make that absolutely clear. But there *are* occasions when a procedure-oriented solution will outperform a set-oriented one. I have now provided two examples, and the key to both is the elimination of self-joins. What I object to is the unsupported blanket claim that procedure-oriented solutions are always slower, always reduce concurrency, and always reduce scalability when compared to equivalent set-based solutions.

> I hasten to add that there ARE cases where I would adopt a procedural
> approach, but I would NOT make the performance claims that you made. If
> you're a practical person, you have more tools than just a hammer in your
> tool kit.
>
> I'll leave the theoretical implications of your assertion that cursors are
> better to the theoreticians in this ng.
>
Received on Sun May 13 2007 - 00:30:00 CEST
