A new proof of the superiority of set oriented approaches: numerical/time serie linear interpolation
Date: 29 Apr 2007 03:04:03 -0700
Message-ID: <1177841043.623407.63590_at_y5g2000hsa.googlegroups.com>
Among the least explored opportunities RM has to offer comes the issue of numerical / time series linear interpolation of values. Recently, I have solved the following problem on a community board which I encountered this problem a few years ago with different facts while working for the Frech railroad company. I thought about using the recent example (better for pedagogical purposes) to illustrate the work done a few years ago.
To make sure the heavier traffic lines were weel maintained and fit for high speed service, each 2 weeks, the company launched trains equipped with receptors that ran across lines and recorded some data such as passage time and distance. The data was later pourred in a statistical database for analysis and decision were made that some line required additional maintaince (replacement of line segment).
Without getting into too much detail, the process consists of the
following segment of reality captured: A train goes from a start train
station at a specific start time and ends its ittinerary in another
station at a end time. Between these stations, the train crosses
several stations where some data was *not* collected for some
reason(mainly the receptor was not working). As a consequence, on
direct image systems (SQL Server, ORACLE or DB2), NULL pointers were
placed instead of the values of passage time and a row was to be added
at each station passage. So, we have a *train_passage* table like
this (I supposed a SK ID)
The point of the problem was to interpolate all NULL ArrivalTime
values, based on the following formula.(A static linear formula is
ID Distance ArrivalTime
24 0.289 2000-01-01 06:29:00.000
25 0.193 NULL --> 1st value to be interpolated
26 0.299 NULL --> 2nd value to be interpolated
27 0.131 NULL --> 3rd value to be interpolated
28 0.444 NULL --> 4th value to be interpolated
29 0.16 NULL --> 5th value to be interpolated
30 0.665 NULL --> 6th value to be interpolated
31 0.186 2000-01-01 06:33:00.000
ArrivalTime for ID = 25:
(Distance for 25 / Sum Distance for IDs 25-30) * (ArrivalTime for ID
31-ArrivalTime for ID 24)
ArrivalTime for ID = 26:
(Distance for 26 / Sum Distance for IDs 25-30) * (ArrivalTime for ID
31-ArrivalTime for ID 24)
and so forth....
In most cases I have encountered, practionners and developpers use a procedural approach (namely cursors) to solve this problem. For each row with unknown datetime values, they executed a function. As a consequence on several million records and because each row performs heavy calculation, the resource consumtion simply grows exponentially. (Mainly because of non separation between the physical and logical layer in direct image systems). So, every two weeks, they had to run a 10 hour batch process to interpolate the missing data. In pseudo code the initial procedural looked like this:
FOR ALL ROWS
START
IF ROW HAD ALL DATA STORE NEW START TIME STORE NEW END TIME WRITE ALL COMPUTED INTERPOLATED VALUE FROM PREVIOUS SEGMENT OF LINE INTO FINAL INTERPOLATED TABLE WRITE NEW ROW IN FINAL INTERPOLATED TABLE END IF IF ROW HAD MISSING DATA COMPUTE INTERPOLATED DATA STORE COMPUTED VALUE IN TEMP END IF GO TO NEXT ROW DATA
END SHOW ALL VALUES IN FINAL INTERPOLATED TABLE Because of the procedural approach,the process locked up the database in exclusive mode and prevented any concurrent process and had to be launched several time because it sometime simply timed out due to a lack of resources. This is an example that demonstrate that set operations are light years from procedural approaches. I reexpressed the interpolation as a complex UNION operation between two relations and ended up with following SQL statement....(note: this is T-SQL, dateadd is a function that adds a certain numer of time perdiod to a specific datetime)
(
select E.id, E.Distance, dateadd(s, (E.Distance/G.distance) * G.gen,
E.start) arrivaltime
from yourtable F inner join
(
select A.ID, A.Distance, B.arrivaltime start,
datediff(s,B.arrivaltime, C.arrivaltime) as gen
from yourtable A
left outer join ( select id, distance, arrivaltime from yourtable where ArrivalTime is not null ) B on B.id < A.id left outer join ( select id, distance, arrivaltime from yourtable where ArrivalTime is not null ) C on C.id > A.id
where A.arrivaltime is null
) E
on E.ID = F.ID
left outer join
(
select sum(A.Distance) distance, datediff(s,B.arrivaltime,
C.arrivaltime) as gen
from yourtable A
left outer join ( select id, distance, arrivaltime from yourtable where ArrivalTime is not null ) B on B.id < A.id left outer join ( select id, distance, arrivaltime from yourtable where ArrivalTime is not null ) C on C.id > A.id
where A.arrivaltime is null
group by
datediff(s,B.arrivaltime, C.arrivaltime) ) G
on E.gen = G.gen
)
union
(
select ID, Distance, Arrivaltime from yourtable where ArrivalTime is not null
)
As a result, I obtained the following interpolation...
ID Distance ArrivalTime 24 0.289 2000-01-01 06:29:00.000 25 0.193 2000-01-01 06:29:24.000 26 0.299 2000-01-01 06:29:37.000 27 0.131 2000-01-01 06:29:16.000 28 0.444 2000-01-01 06:29:56.000 29 0.160 2000-01-01 06:29:20.000 30 0.665 2000-01-01 06:30:24.000 31 0.186 2000-01-01 06:33:00.000
The execution time went from 10 hours to 30 minutes and the database
(in fact the table) was not in exclusive mode anymore (conccurent
queries could be ran during the execution of the
interlpolation)...Most of all, no more time outs were to be noted and
we had the assurance that the process would not stop at the 9th hour.
This example establishes once more the superiority of set oriented operations over procedural approaches but it also brings some questions...If we can interpolate values from relations, would not we imagine a TRDBMS with a systematic method (based on interpolation) to deal with missing numeric/datetime data. Would not that be a possible systematic method to handle missing data in the spirit of what Codd imagined (I would give a lot to be able to ask him this question if he was alive)...
I would be grateful for your comments, ideas on this subject.... Received on Sun Apr 29 2007 - 12:04:03 CEST