Re: A new proof of the superiority of set-oriented approaches: numerical/time series linear interpolation

From: Alvin Ryder <alvin321_at_telstra.com>
Date: 3 May 2007 20:40:20 -0700
Message-ID: <1178250020.400561.182170_at_h2g2000hsg.googlegroups.com>


On Apr 29, 8:04 pm, Cimode <cim..._at_hotmail.com> wrote:
> Among the least explored opportunities the RM has to offer is the
> issue of numerical/time-series linear interpolation of values. I
> recently solved the following problem on a community board; I had
> encountered the same problem a few years ago, with different facts,
> while working for the French railroad company. I use the recent
> example here (better for pedagogical purposes) to illustrate the work
> done a few years ago.
>
> To make sure the heavier-traffic lines were well maintained and fit
> for high-speed service, every two weeks the company ran trains
> equipped with sensors across the lines, recording data such as
> passage time and distance. The data was later poured into a
> statistical database for analysis, and decisions were made that some
> lines required additional maintenance (replacement of line segments).
>
> Without getting into too much detail, the captured segment of reality
> is as follows: a train leaves a start station at a specific start
> time and ends its itinerary at another station at an end time.
> Between these stations, the train passes through several stations
> where, for some reason (mainly a malfunctioning sensor), no data was
> collected. As a consequence, on direct-image systems (SQL Server,
> Oracle or DB2), NULLs were stored instead of the passage-time values,
> with one row added per station passage. So we have a *train_passage*
> table like this (I assumed a surrogate key ID):
>
> ID Distance ArrivalTime
> 24 0.289 2000-01-01 06:29:00.000
> 25 0.193 NULL --> 1st value to be interpolated
> 26 0.299 NULL --> 2nd value to be interpolated
> 27 0.131 NULL --> 3rd value to be interpolated
> 28 0.444 NULL --> 4th value to be interpolated
> 29 0.16 NULL --> 5th value to be interpolated
> 30 0.665 NULL --> 6th value to be interpolated
> 31 0.186 2000-01-01 06:33:00.000
>
> The point of the problem was to interpolate all NULL ArrivalTime
> values based on the following formula. (A static linear formula is
> used for the moment; there are probably better formulas, but the
> point of the problem is not the quality of the formula.)
>
> ArrivalTime for ID 25 = ArrivalTime for ID 24 +
>   (Distance for 25 / Sum of Distances for IDs 25-30)
>   * (ArrivalTime for ID 31 - ArrivalTime for ID 24)
>
> ArrivalTime for ID 26 = ArrivalTime for ID 24 +
>   (Distance for 26 / Sum of Distances for IDs 25-30)
>   * (ArrivalTime for ID 31 - ArrivalTime for ID 24)
>
> and so forth....
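
[Editorial note: as a quick check of the formula above, here is a minimal Python sketch. The row data is taken from the post; the truncation to whole seconds is my assumption, chosen because it reproduces the arrival times reported in the result table further down.]

```python
from datetime import datetime, timedelta

# The train_passage rows from the post: (id, distance, arrival_time),
# with None standing in for the NULLs to be interpolated.
rows = [
    (24, 0.289, datetime(2000, 1, 1, 6, 29, 0)),
    (25, 0.193, None),
    (26, 0.299, None),
    (27, 0.131, None),
    (28, 0.444, None),
    (29, 0.160, None),
    (30, 0.665, None),
    (31, 0.186, datetime(2000, 1, 1, 6, 33, 0)),
]

start = rows[0][2]                      # last known arrival (ID 24)
end = rows[-1][2]                       # next known arrival (ID 31)
gap = (end - start).total_seconds()     # 240 seconds
missing = [(rid, d) for rid, d, t in rows if t is None]
total = sum(d for _, d in missing)      # 1.892

# Each missing row gets start + (own distance / total distance) * gap,
# truncated to whole seconds.
interpolated = {
    rid: start + timedelta(seconds=int(d / total * gap))
    for rid, d in missing
}
print(interpolated[25])   # 2000-01-01 06:29:24
```

Note that the formula weights each row only by its own distance, not by the cumulative distance travelled, which is why the interpolated times in the result table are not monotonically increasing.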
>
> In most cases I have encountered, practitioners and developers use a
> procedural approach (namely cursors) to solve this problem: for each
> row with an unknown datetime value, they execute a function. As a
> consequence, on several million records, and because each row
> performs a heavy calculation, resource consumption simply explodes
> (mainly because of the non-separation between the physical and
> logical layers in direct-image systems). So, every two weeks, they
> had to run a ten-hour batch process to interpolate the missing data.
> In pseudocode, the initial procedural version looked like this:
>
> FOR ALL ROWS
> START
>     IF ROW HAS ALL DATA
>         STORE NEW START TIME
>         STORE NEW END TIME
>         WRITE ALL COMPUTED INTERPOLATED VALUES FROM
>             PREVIOUS SEGMENT OF LINE INTO FINAL INTERPOLATED TABLE
>         WRITE NEW ROW IN FINAL INTERPOLATED TABLE
>     END IF
>
>     IF ROW HAS MISSING DATA
>         COMPUTE INTERPOLATED DATA
>         STORE COMPUTED VALUE IN TEMP
>     END IF
>
>     GO TO NEXT ROW
> END
>
> SHOW ALL VALUES IN FINAL INTERPOLATED TABLE
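
[Editorial note: the pseudocode above can be sketched as a single cursor-style pass in Python. This is a hypothetical illustration, not the original code; the TEMP store becomes a plain list.]

```python
from datetime import datetime, timedelta

def interpolate_pass(rows):
    """One cursor-style pass over (id, distance, arrival) rows: rows with
    missing times are buffered (the TEMP store) until the next complete
    row closes the segment and lets us compute their offsets."""
    final, pending, seg_start = [], [], None
    for rid, dist, arrival in rows:
        if arrival is not None:                 # row has all data
            if pending and seg_start is not None:
                gap = (arrival - seg_start).total_seconds()
                total = sum(d for _, d in pending)
                for pid, pdist in pending:      # flush previous segment
                    offset = int(pdist / total * gap)
                    final.append((pid, pdist,
                                  seg_start + timedelta(seconds=offset)))
                pending = []
            seg_start = arrival                 # new segment start time
            final.append((rid, dist, arrival))
        else:                                   # row has missing data
            pending.append((rid, dist))         # store in TEMP
    return final

rows = [
    (24, 0.289, datetime(2000, 1, 1, 6, 29, 0)),
    (25, 0.193, None),
    (26, 0.299, None),
    (27, 0.131, None),
    (28, 0.444, None),
    (29, 0.160, None),
    (30, 0.665, None),
    (31, 0.186, datetime(2000, 1, 1, 6, 33, 0)),
]
out = interpolate_pass(rows)
print(out[1])   # (25, 0.193, datetime.datetime(2000, 1, 1, 6, 29, 24))
```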
>
> Because of the procedural approach, the process locked the database
> in exclusive mode, preventing any concurrent processing, and had to
> be relaunched several times because it sometimes simply timed out for
> lack of resources. This is an example that demonstrates that set
> operations are light-years ahead of procedural approaches. I
> re-expressed the interpolation as a complex UNION between two
> relations and ended up with the following SQL statement. (Note: this
> is T-SQL; dateadd is a function that adds a number of time periods to
> a specific datetime.)
>
> (
> select E.id, E.Distance, dateadd(s, (E.Distance/G.distance) * G.gen,
> E.start) arrivaltime
> from yourtable F inner join
> (
> select A.ID, A.Distance, B.arrivaltime start,
> datediff(s,B.arrivaltime, C.arrivaltime) as gen
> from yourtable A
> left outer join
> (
> select id, distance, arrivaltime from yourtable
> where ArrivalTime is not null
> ) B
> on B.id < A.id
> left outer join
> (
> select id, distance, arrivaltime from yourtable
> where ArrivalTime is not null
> ) C
> on C.id > A.id
> where A.arrivaltime is null
> ) E
> on E.ID = F.ID
>
> left outer join
> (
> select sum(A.Distance) distance, datediff(s,B.arrivaltime,
> C.arrivaltime) as gen
> from yourtable A
> left outer join
> (
> select id, distance, arrivaltime from yourtable
> where ArrivalTime is not null
> ) B
> on B.id < A.id
> left outer join
> (
> select id, distance, arrivaltime from yourtable
> where ArrivalTime is not null
> ) C
> on C.id > A.id
> where A.arrivaltime is null
> group by
> datediff(s,B.arrivaltime, C.arrivaltime)
> ) G
> on E.gen = G.gen
> )
> union
> (
> select ID, Distance, Arrivaltime from yourtable where ArrivalTime is
> not null
> )
>
> As a result, I obtained the following interpolation...
>
> ID Distance ArrivalTime
> 24 0.289 2000-01-01 06:29:00.000
> 25 0.193 2000-01-01 06:29:24.000
> 26 0.299 2000-01-01 06:29:37.000
> 27 0.131 2000-01-01 06:29:16.000
> 28 0.444 2000-01-01 06:29:56.000
> 29 0.160 2000-01-01 06:29:20.000
> 30 0.665 2000-01-01 06:30:24.000
> 31 0.186 2000-01-01 06:33:00.000
>
> The execution time went from 10 hours to 30 minutes, and the
> database (in fact, the table) was no longer held in exclusive mode
> (concurrent queries could be run during the interpolation). Best of
> all, no more timeouts occurred, and we had the assurance that the
> process would not die in its ninth hour.
>
> This example establishes once more the superiority of set-oriented
> operations over procedural approaches, but it also raises some
> questions. If we can interpolate values from relations, could we not
> imagine a TRDBMS with a systematic method (based on interpolation)
> for dealing with missing numeric/datetime data? Would that not be a
> possible systematic way to handle missing data in the spirit of what
> Codd imagined? (I would give a lot to be able to ask him this
> question if he were alive.)
>
> I would be grateful for your comments and ideas on this subject.

Cimode, you rewrote some queries on one particular database and gained a performance improvement. I'm sure that feels good, but you cannot conclude a proof from it.

Your "proof" would have to hold across different queries, different databases, different locking mechanisms, languages, developers, schemas ...

I also have plenty of cases where a procedural approach yielded a 200-fold improvement, but at other times a set-based one did the same. I cannot prove apples are a better fruit than oranges.

BTW, locking is not driven by procedural versus set-based processing, and each database vendor implements different mechanisms. For instance, Oracle often does not have to lock anything at all and still gets correct results, while other vendors must lock.

As for missing data, that depends on each domain/type. The RM allows domain/type-specific operators that make sense for a given domain, so you could have interpolation operators, and they could even be implemented inside the DB. Would your solution be faster then? Maybe, maybe not.
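
To make the idea concrete, here is a toy sketch (all names hypothetical) of what such a domain-specific interpolation operator might look like if it were exposed for a datetime type, applied to the 240-second gap and the six missing distances from Cimode's example:

```python
from datetime import datetime, timedelta

def interpolate(known_start, known_end, weights):
    """A hypothetical domain operator: given the bracketing known values
    and one weight per missing row, return the linearly interpolated
    values (truncated to whole seconds)."""
    gap = (known_end - known_start).total_seconds()
    total = sum(weights)
    return [known_start + timedelta(seconds=int(w / total * gap))
            for w in weights]

vals = interpolate(datetime(2000, 1, 1, 6, 29, 0),
                   datetime(2000, 1, 1, 6, 33, 0),
                   [0.193, 0.299, 0.131, 0.444, 0.160, 0.665])
print(vals[0])   # 2000-01-01 06:29:24
```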

Besides, what does "superior" mean: performance, developer time, maintainability, higher concurrency ...?

Cheers.

Received on Fri May 04 2007 - 05:40:20 CEST
