Re: Self Joins and optimization

From: Brian Selzer <brian_at_selzer-software.com>
Date: Sun, 13 May 2007 01:40:01 GMT
Message-ID: <R%t1i.17665$YL5.5318_at_newssvr29.news.prodigy.net>


"David Cressey" <cressey73_at_verizon.net> wrote in message news:zEk1i.1282$4a1.80_at_trndny07...
>
> "Brian Selzer" <brian_at_selzer-software.com> wrote in message
> news:ej81i.8350$rO7.3542_at_newssvr25.news.prodigy.net...
>
> [big snip]
>
>> from the terminals and what was stored in the database.) That solution
>> suffered from a number of problems (not the least of which was that it
>> originally used the special value '01/01/01' in the DATE_OFF field to
>> deal
>> with the missing 'OFF' entries. They patched that for Y2K by using
>> '01/01/50' instead!?!).
>
> Irrelevant to our discussion.
>
That's why it's in parenthesis.
>
>> From a theoretical standpoint, the main problem
>> with it is redundancy. For example, if a row is deleted, then the
>> Pairing
>> for every other row for an employee during that interval must be updated.
>> The situation is even worse for inserts. If a row is inserted that
> doesn't
>> coincide exactly with any interval currently in existence for an
>> employee,
>> then the segments for each row for that employee during that interval
>> must
>> be split with new values for Pairing to account for the additional row.
>
> You have the same kind of problems with the TX table. If you delete an
> "OFF" entry, then it affects the Pairing for every interval between the
> entry deleted and the current time. If you insert a new "ON" entry, you
> have to recompute the pairing for every interval between the new "ON" and
> its corresponding "OFF".
>

I wouldn't consider them the same kind of problems. No recomputation is necessary until it is queried! What is recorded in the database is assumed to be true, so the delete and insert anomalies you're describing to are not anomalies at all--just the current state of affairs. Of course, if you try to delete an 'ON' row where there also exists a corresponding 'OFF' row, then that would leave the table in an inconsistent state, so therefore, that delete should be rejected.

> With all due respect, if this data is being captured in real time, and
> never altered afterwards, then the anomalies associated with random
> inserts, updates, and deletes of partially normalized data are irrelevant
> to
> the process at hand (both data collection and query processing).
>
>
People make mistakes. Sometimes they enter the wrong machine#; sometimes they enter the wrong step#, sometimes they even screw up their own employee#. Someone has to fix them.
>
>
>> I
>> don't want to get into it here, but constraint checking is also much more
>> difficult. Suffice it to say: the above solution is a perfect example of
>> why you shouldn't denormalize a database.
>>
>
> What do you mean by "difficult"? Difficult to express, or resource
> hogging
> when the DBMS goes to enforce it?
>

Both. First of all, each activity (employee, job, step, machine, time_on, time_off) is now a linked list of segments. So a constraint should be written to prevent deleting a segment from the middle of the list, or from the end of the list without also updating its predecessor. Second, there can be no overlapping intervals for an employee that do not completely coincide: if intervals overlap, they must be identical. Third, the Pairing values must be the same for each element in a set of overlapping intervals for an employee, Fourth (and it probably implies the third), the cardinality of each set of overlapping intervals must be equal to the Pairing value in each element.

Each of these constraints is difficult to express, and in every commercial system I'm aware of, must be enforced using triggers.

>
>> An alternate, normalized solution, but one that introduces nulls, would
>> be
>> to pair up each 'ON' row with its corresponding 'OFF' row, similar to the
>> view that I provided, PAIRED_TX, but substituting null for
> CURRENT_TIMESTAMP
>> to signify that the interval is indefinite. The problem with that is
>> that
>> the Pairing still needs to be calculated for each employee, job, step,
>> machine instance during each interval in order to answer the Labor query.
>> This solution does, however, eliminate a couple of the self-joins,
> allowing
>> for a simplified execution plan, but I still don't think it can compete
> with
>> the procedural solution. For one thing, identifying the intervals
> requires
>> a union, which usually involves a sort, and then that result must be
>> theta-joined to itself. Once identified, the Pairing must still be
>> calculated for each interval.
>>
>> Another alternate, normalized solution, and one that doesn't introduce
>> nulls, would be to split the table horizontally, having one table for
>> 'ON'
>> rows and one table for 'OFF' rows. The difficulty with this solution is
>> that in order to calculate the intervals, a theta join of the union of
> both
>> tables to the union of both tables would be required. Not to mention the
>> theta join required to pair up the rows. I don't see any benefit over
>> the
>> procedural solution here.
>>
>
> All of the above discussion flows from the one single beginning point:
> the
> data that's captured consists of transitions, and the data that's
> required
> consists of events with a duration. When there's a fundamental mismatch
> between the entities that the capture process records data about and the
> entities that the query processes inquire about, pain happens. One way
> to
> deal with that pain is to adopt a procedural approach. There are other
> ways.
>

I didn't say that there weren't. Received on Sun May 13 2007 - 03:40:01 CEST

Original text of this message