One-to-many relationship, non-equality join criteria. Can't come up with a decent solution

From: nippur <dberge_at_mx2.redestb.es>
Date: 3 Mar 2003 10:09:30 -0800
Message-ID: <ee93444d.0303031009.12e94730_at_posting.google.com>



Hi,

   I can't seem to come up with an efficient solution for the following problem:

BACKGROUND


   I have data taken from a driving monitor and data from a journey log, it consists of information such as vehicle and driver ids, date and time of start and end of a trip, and other things. The goal is to relate them both, but there are the following complications:

 1.- start/end times in the driving monitor and journey log don't match exactly.

 2.- there isn't a one-to-one correspondence between driving monitor and journey log records, as the monitor usually resets itself after a short period (e.g., a rest stop in the middle of a journey), and then starts a new record, but that still comes up as one entry in the journey log (we'll ignore the case where one driving monitor record comes under two log entries, for simplicity).

so the idea is to match a driving monitor record with the latest journey started just before the DM's start time (ignoring for the moment the fact that the log timestamp could be later than the monitor's)

   Graphically, a typical case could be represented a bit like this:

Journey Log:         |-----------------------------------------------|
                                            (A)

Driving Monitor: |------------| |------| |------| |-------------|

                        (1)            (2)       (3)          (4)

   The horizontal axis represents time, the lines represent JL or DM entries, the letters/numbers in parentheses would be their respective IDs. In this case the idea is to end a with a relation where (A) links to (1), (2), (3), & (4), and nothing else.

THE DATA


   Here's a sketch of my schema:

dr_monitor


vehicle_id (pk)
odometer   (pk)
dt_start   (idx)

...

j_log



journey_id (pk)
vehicle_id
dt_start (idx)
...

relation


vehicle_id (pk / fk ref. dr_monitor.vehicle_id)
odometer   (pk / fk ref. dr_monitor.odometer)
journey_id (pk / fk ref. j_log.journey_id)

I tried something like:

INSERT INTO relation (vehicle_id, odometer, journey_id) SELECT m.vehicle_id, odometer, journey_id FROM dr_monitor m, j_log l
WHERE m.vehicle_id = l.vehicle_id

      AND l.dt_start = (SELECT MAX(dt_start) FROM j_log WHERE j_log.dt_start <= dr_monitor.dt_start);

but a) the RDBMS I will be using (MySQL) doesn't support subqueries, and b) from testing in MS-Access with a reduced data set, it took painfully long.

then I also tried a procedural approach:

  • START PSEUDO-CODE ------
$dataset1 = do_query("SELECT vehicle_id, odometer, dt_start
                      FROM dr_monitor;");

foreach($row in $dataset1)
{
  $dataset2 = do_query("SELECT journey_id

                        FROM j_log
                        WHERE vehicle_id = $row['vehicle_id']
                              AND dt_start <= $row['dt_start']
                        ORDER BY dt_start DESC

  $results[] = fetch_row($dataset2);
}

  // Run INSERT query with the data from $results

  • END PSEUDO-CODE ------
   However, it didn't help much execution time-wise (there 26000 journey logs and over 200000 driving monitor records).

   Has anybody come across a similar case before? I've run out of ideas for the moment, so I would be interested in any ideas, suggestions, comments from you people.

   TIA. Regards,
Diego Berge.

-- 
NOTE: e-mail address is not valid.  Please reply to newsgroup.
(a Google search will come up with valid e-mail addresses if you need
to contact me privately)
Received on Mon Mar 03 2003 - 19:09:30 CET

Original text of this message