A new proof of the superiority of set oriented approaches: numerical/time serie linear interpolation

From: Cimode <cimode_at_hotmail.com>
Date: 29 Apr 2007 03:04:03 -0700
Message-ID: <1177841043.623407.63590_at_y5g2000hsa.googlegroups.com>



Among the least explored opportunities RM has to offer comes the issue of numerical / time series linear interpolation of values. Recently, I have solved the following problem on a community board which I encountered this problem a few years ago with different facts while working for the Frech railroad company. I thought about using the recent example (better for pedagogical purposes) to illustrate the work done a few years ago.

To make sure the heavier traffic lines were weel maintained and fit for high speed service, each 2 weeks, the company launched trains equipped with receptors that ran across lines and recorded some data such as passage time and distance. The data was later pourred in a statistical database for analysis and decision were made that some line required additional maintaince (replacement of line segment).

Without getting into too much detail, the process consists of the following segment of reality captured: A train goes from a start train station at a specific start time and ends its ittinerary in another station at a end time. Between these stations, the train crosses several stations where some data was *not* collected for some reason(mainly the receptor was not working). As a consequence, on direct image systems (SQL Server, ORACLE or DB2), NULL pointers were placed instead of the values of passage time and a row was to be added at each station passage. So, we have a *train_passage* table like this (I supposed a SK ID)

ID      Distance  ArrivalTime
24      0.289      2000-01-01 06:29:00.000
25      0.193      NULL --> 1st value to be interpolated
26      0.299      NULL --> 2nd value to be interpolated
27      0.131      NULL --> 3rd value to be interpolated
28      0.444      NULL --> 4th value to be interpolated
29      0.16      NULL -->  5th value to be interpolated
30      0.665      NULL --> 6th value to be interpolated
31      0.186      2000-01-01 06:33:00.000

The point of the problem was to interpolate all NULL ArrivalTime values, based on the following formula.(A static linear formula is used for the moment and there are probably better formulas but the point of the problem is not about the efficiency of the formula):

ArrivalTime for ID = 25:
(Distance for 25 / Sum Distance for IDs 25-30) * (ArrivalTime for ID
31-ArrivalTime for ID 24)
ArrivalTime for ID = 26:
(Distance for 26 / Sum Distance for IDs 25-30) * (ArrivalTime for ID
31-ArrivalTime for ID 24)

and so forth....

In most cases I have encountered, practionners and developpers use a procedural approach (namely cursors) to solve this problem. For each row with unknown datetime values, they executed a function. As a consequence on several million records and because each row performs heavy calculation, the resource consumtion simply grows exponentially. (Mainly because of non separation between the physical and logical layer in direct image systems). So, every two weeks, they had to run a 10 hour batch process to interpolate the missing data. In pseudo code the initial procedural looked like this:

FOR ALL ROWS
START

              IF ROW HAD ALL DATA
                      STORE NEW START TIME
                      STORE NEW END TIME
                     WRITE ALL COMPUTED INTERPOLATED VALUE FROM
PREVIOUS SEGMENT OF LINE INTO FINAL INTERPOLATED TABLE
                      WRITE NEW ROW IN FINAL INTERPOLATED TABLE
              END IF

              IF ROW HAD MISSING DATA
                     COMPUTE INTERPOLATED DATA
                      STORE COMPUTED VALUE IN TEMP
              END IF

              GO TO NEXT ROW DATA

END SHOW ALL VALUES IN FINAL INTERPOLATED TABLE Because of the procedural approach,the process locked up the database in exclusive mode and prevented any concurrent process and had to be launched several time because it sometime simply timed out due to a lack of resources. This is an example that demonstrate that set operations are light years from procedural approaches. I reexpressed the interpolation as a complex UNION operation between two relations and ended up with following SQL statement....(note: this is T-SQL, dateadd is a function that adds a certain numer of time perdiod to a specific datetime)

(

select E.id, E.Distance, dateadd(s, (E.Distance/G.distance) * G.gen, E.start) arrivaltime
from yourtable F inner join
(

select A.ID, A.Distance, B.arrivaltime start, datediff(s,B.arrivaltime, C.arrivaltime) as gen from yourtable A

      left outer join
            (
            select id, distance, arrivaltime from yourtable
            where ArrivalTime is not null
            ) B
      on B.id  < A.id
      left outer join
            (
            select id, distance, arrivaltime from yourtable
            where ArrivalTime is not null
            ) C
      on C.id  > A.id

where A.arrivaltime is null
) E
on E.ID = F.ID

left outer join
(

select sum(A.Distance) distance, datediff(s,B.arrivaltime, C.arrivaltime) as gen
from yourtable A

      left outer join
            (
            select id, distance, arrivaltime from yourtable
            where ArrivalTime is not null
            ) B
      on B.id  < A.id
      left outer join
            (
            select id, distance, arrivaltime from yourtable
            where ArrivalTime is not null
            ) C
      on C.id  > A.id

where A.arrivaltime is null
group by
datediff(s,B.arrivaltime, C.arrivaltime) ) G
on E.gen = G.gen
)
union
(

select ID, Distance, Arrivaltime from yourtable where ArrivalTime is not null
)

As a result, I obtained the following interpolation...

ID      Distance  ArrivalTime
24      0.289      2000-01-01 06:29:00.000
25      0.193      2000-01-01 06:29:24.000
26      0.299      2000-01-01 06:29:37.000
27      0.131      2000-01-01 06:29:16.000
28      0.444      2000-01-01 06:29:56.000
29      0.160      2000-01-01 06:29:20.000
30      0.665      2000-01-01 06:30:24.000
31      0.186      2000-01-01 06:33:00.000

The execution time went from 10 hours to 30 minutes and the database
(in fact the table) was not in exclusive mode anymore (conccurent
queries could be ran during the execution of the interlpolation)...Most of all, no more time outs were to be noted and we had the assurance that the process would not stop at the 9th hour.

This example establishes once more the superiority of set oriented operations over procedural approaches but it also brings some questions...If we can interpolate values from relations, would not we imagine a TRDBMS with a systematic method (based on interpolation) to deal with missing numeric/datetime data. Would not that be a possible systematic method to handle missing data in the spirit of what Codd imagined (I would give a lot to be able to ask him this question if he was alive)...

I would be grateful for your comments, ideas on this subject.... Received on Sun Apr 29 2007 - 12:04:03 CEST

Original text of this message