Re: Self Joins and optimization

From: David Cressey <cressey73_at_verizon.net>
Date: Sat, 12 May 2007 15:00:47 GMT
Message-ID: <zEk1i.1282$4a1.80_at_trndny07>


"Brian Selzer" <brian_at_selzer-software.com> wrote in message news:ej81i.8350$rO7.3542_at_newssvr25.news.prodigy.net...

[big snip]

> from the terminals and what was stored in the database.) That solution
> suffered from a number of problems (not the least of which was that it
> originally used the special value '01/01/01' in the DATE_OFF field to deal
> with the missing 'OFF' entries. They patched that for Y2K by using
> '01/01/50' instead!?!).

Irrelevant to our discussion.

> From a theoretical standpoint, the main problem
> with it is redundancy. For example, if a row is deleted, then the Pairing
> for every other row for an employee during that interval must be updated.
> The situation is even worse for inserts. If a row is inserted that doesn't
> coincide exactly with any interval currently in existence for an employee,
> then the segments for each row for that employee during that interval must
> be split with new values for Pairing to account for the additional row.

You have the same kind of problems with the TX table. If you delete an "OFF" entry, it affects the Pairing for every interval between the deleted entry and the current time. If you insert a new "ON" entry, you have to recompute the Pairing for every interval between the new "ON" and its corresponding "OFF".

With all due respect, if this data is being captured in real time, and never altered afterwards, then the anomalies associated with random inserts, updates, and deletes of partially normalized data are irrelevant to the process at hand (both data collection and query processing).

> I
> don't want to get into it here, but constraint checking is also much more
> difficult. Suffice it to say: the above solution is a perfect example of
> why you shouldn't denormalize a database.
>

What do you mean by "difficult"? Difficult to express, or resource hogging when the DBMS goes to enforce it?

> An alternate, normalized solution, but one that introduces nulls, would be
> to pair up each 'ON' row with its corresponding 'OFF' row, similar to the
> view that I provided, PAIRED_TX, but substituting null for CURRENT_TIMESTAMP
> to signify that the interval is indefinite. The problem with that is that
> the Pairing still needs to be calculated for each employee, job, step,
> machine instance during each interval in order to answer the Labor query.
> This solution does, however, eliminate a couple of the self-joins, allowing
> for a simplified execution plan, but I still don't think it can compete with
> the procedural solution. For one thing, identifying the intervals requires
> a union, which usually involves a sort, and then that result must be
> theta-joined to itself. Once identified, the Pairing must still be
> calculated for each interval.
>
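To make the null-based pairing concrete, here's a quick sketch in Python with SQLite. The schema is a guess (the full TX definition isn't shown in this thread); I'm assuming TX(emp, ts, kind) with kind being 'ON' or 'OFF'. Each 'ON' row pairs with the earliest later 'OFF' for the same employee, with null marking the indefinite interval:

```python
import sqlite3

# Hypothetical schema: the original TX table is not shown in the thread,
# so this assumes TX(emp, ts, kind) where kind is 'ON' or 'OFF'.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE TX (emp TEXT, ts INTEGER, kind TEXT);
INSERT INTO TX VALUES
  ('e1', 100, 'ON'), ('e1', 180, 'OFF'),
  ('e1', 200, 'ON');  -- still clocked in: no matching 'OFF' yet
""")

# Pair each 'ON' with its next 'OFF' for the same employee; NULL marks
# an indefinite (still-open) interval, in the spirit of PAIRED_TX.
rows = conn.execute("""
SELECT t_on.emp, t_on.ts AS date_on,
       (SELECT MIN(t_off.ts) FROM TX t_off
         WHERE t_off.emp = t_on.emp
           AND t_off.kind = 'OFF'
           AND t_off.ts > t_on.ts) AS date_off
  FROM TX t_on
 WHERE t_on.kind = 'ON'
 ORDER BY t_on.ts
""").fetchall()
print(rows)  # [('e1', 100, 180), ('e1', 200, None)]
```

Note that the pairing here is derived on demand rather than stored, so a deleted or inserted transition never leaves stale Pairing values behind; the cost moves entirely into query processing.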
> Another alternate, normalized solution, and one that doesn't introduce
> nulls, would be to split the table horizontally, having one table for 'ON'
> rows and one table for 'OFF' rows. The difficulty with this solution is
> that in order to calculate the intervals, a theta join of the union of both
> tables to the union of both tables would be required. Not to mention the
> theta join required to pair up the rows. I don't see any benefit over the
> procedural solution here.
>
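For comparison, here's a sketch of that horizontally split design, again in SQLite with guessed table names (TX_ON and TX_OFF are my invention; the thread doesn't give them). There are no nulls, but pairing really does force you through the union of both tables:

```python
import sqlite3

# Sketch of the horizontally split design. TX_ON and TX_OFF are
# hypothetical names; the thread does not give the actual schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE TX_ON  (emp TEXT, ts INTEGER);
CREATE TABLE TX_OFF (emp TEXT, ts INTEGER);
INSERT INTO TX_ON  VALUES ('e1', 100), ('e1', 200);
INSERT INTO TX_OFF VALUES ('e1', 180), ('e1', 260);
""")

# Each 'ON' pairs with the earliest boundary after it, drawn from the
# union of both tables -- in well-formed alternating data that boundary
# is always the matching 'OFF'.
rows = conn.execute("""
SELECT a.emp, a.ts AS date_on,
       (SELECT MIN(u.ts)
          FROM (SELECT emp, ts FROM TX_ON
                UNION
                SELECT emp, ts FROM TX_OFF) u
         WHERE u.emp = a.emp AND u.ts > a.ts) AS date_off
  FROM TX_ON a
 ORDER BY a.ts
""").fetchall()
print(rows)  # [('e1', 100, 180), ('e1', 200, 260)]
```

The union shows up once per 'ON' row here, which is exactly the sort-heavy work Brian is pointing at.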

All of the above discussion flows from the one single beginning point: the data that's captured consists of transitions, and the data that's required consists of events with a duration. When there's a fundamental mismatch between the entities that the capture process records data about and the entities that the query processes inquire about, pain happens. One way to deal with that pain is to adopt a procedural approach. There are other ways.
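The procedural approach, for what it's worth, can be as small as a single ordered pass over the transitions, emitting durations as it goes. A minimal sketch (field names are assumptions, as above):

```python
# A minimal sketch of the procedural alternative: one ordered pass over
# the captured transitions, emitting (emp, start, end) durations.
# The (emp, ts, kind) shape is an assumption; the thread does not give
# the full schema.
def transitions_to_intervals(transitions):
    """transitions: iterable of (emp, ts, kind) tuples sorted by ts."""
    open_since = {}  # emp -> ts of the as-yet-unmatched 'ON'
    intervals = []
    for emp, ts, kind in transitions:
        if kind == 'ON':
            open_since[emp] = ts
        elif emp in open_since:  # an 'OFF' closing an open interval
            intervals.append((emp, open_since.pop(emp), ts))
    # Anything left in open_since is an indefinite (still-open) interval.
    intervals.extend((emp, ts, None) for emp, ts in open_since.items())
    return intervals

tx = [('e1', 100, 'ON'), ('e1', 180, 'OFF'), ('e1', 200, 'ON')]
print(transitions_to_intervals(tx))  # [('e1', 100, 180), ('e1', 200, None)]
```

One pass, no self-joins, no stored Pairing to invalidate, which is why the procedural route is tempting despite its theoretical untidiness.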
