Re: One-to-many relationship, non-equality join criteria. Can't come up with a decent solution

From: nippur <dberge_at_mx2.redestb.es>
Date: 4 Mar 2003 05:29:15 -0800
Message-ID: <ee93444d.0303040529.5328ac7d_at_posting.google.com>


71062.1056_at_compuserve.com (--CELKO--) wrote in message news:<c0d87ec0.0303031718.2216694e_at_posting.google.com>...

> Some real DDL would have been better, easier to read and testable (if
> you also gave sample data.

   Apologies for that, I've uploaded some sample data and the schema to <http://www.nippur.net/dberge/db-example.tar.gz>. Here's the schema:

CREATE TABLE trips (

    vehicle_id INTEGER NOT NULL,
    odometer DECIMAL(8,1) NOT NULL,
    driver_id INTEGER NOT NULL,
    distance DECIMAL(8,1) NOT NULL,
    dt_start TIMESTAMP NOT NULL,
    dt_end TIMESTAMP NOT NULL,
    time_mobile INTEGER NOT NULL, -- seconds when speed <> 0     top_speed DECIMAL(8,1) NOT NULL,
    overspeed INTEGER NOT NULL, -- in seconds     accel INTEGER, -- # of sudden accelerations     decel INTEGER, -- # of sudden brakings     PRIMARY KEY(vehicle_id, odometer)
);

CREATE TABLE journeys (

    journey_id INTEGER NOT NULL,
    dt_start TIMESTAMP NOT NULL,
    dt_end TIMESTAMP -- NULL = in progress     vehicle_id INTEGER NOT NULL,
    driver_id INTEGER NOT NULL, -- not redundant, see below     pax INTEGER NOT NULL,
    origin VARCHAR(64) NOT NULL,
    destination VARCHAR(64) NOT NULL,
    eta TIMESTAMP NOT NULL,
    remarks VARCHAR(128),
    user VARCHAR(32) NOT NULL, -- journey log custodian     PRIMARY KEY(journey_id)
);

  • in table "trips", driver_id is data entered in the
  • driving monitor by the driver. In "journeys", driver_id
  • is entered by an operator that confirms the identity of
  • the driver. We need both to find mismatches.

CREATE TABLE relation (

    vehicle_id INTEGER NOT NULL,
    odometer INTEGER NOT NULL,
    journey_id INTEGER NOT NULL,
    PRIMARY KEY(vehicle_id, odometer, journey_id) );

[....]
> UPDATE MonitorReports
> SET log_nbr
> = (SELECT DISTINCT log_nbr
> FROM Log AS L1
> WHERE MonitorReports.start_time
> BETWEEN L1.start_time AND L1.end_time
> OR MonitorReports.end_time
> BETWEEN L1.start_time AND L1.end_time);
>
> Problems:
>
> Journey Log:
> |--------------------------------------------| |-----|
> (A) (B)
> Driving Monitor:
> |------------| |------| |------| |-------------|
> (1) (2) (3) (4)
>
> 0) We can see (2) and (3) are clearly part of (A).
>
> 1) Can trip (4) be in both log (A) and (B)?

   Yes, it can (does) happen. If driver_id is the same for trips (A) and (B), then (4) should be related to both, otherwise it's ignored. But this is a minor concern--we may as well ignore (4) althogether for the time being.

> 2) Can trip (1) start before (A) and still be counted as part of (A)?

   Yes. That could be solved by replacing journeys.dt_start with (journeys.dt_start - _at_epsilon), where @epsilon is a suitable value (in our case, could be 10-15 minutes)

> 3) Or can a trip belong to two or more log entries?

   Ideally, one trip would belong to exactly one log entry. A log entry, OTOH, will normally have many related trips.

 4) There is also the case where:

Journey Log:

 |---------------|                            |---------------|
       (A)                                           (B)

Driving Monitor:
  |--------| |------| |--------| |-----| |------| |---|

      (1) (2) (3) (4) (5) (6)

   In this case, journeys (1), (2) belong to (A), (5), (6) belong to (B). What about (3) and (4)? They belong to (A), because that's the latest log entry for that vehicle.

So the general condition would be something like

   journeys.vehicle_id = trips.vehicle_id AND journeys.dt_start = (SELECT MAX(dt_start) FROM journeys WHERE dt_start <= trips.dt_start)

which, in a previous post, John Gilson expressed as

>
> SELECT M.vehicle_id, M.odometer, L1.journey_id
> FROM DrivingMonitor AS M
> INNER JOIN
> JourneyLog AS L1
> ON M.vehicle_id = L1.vehicle_id AND
> L1.dt_start <= M.dt_start
> LEFT OUTER JOIN
> JourneyLog AS L2
> ON L2.vehicle_id = L1.vehicle_id AND
> L2.dt_start > L1.dt_start AND
> L2.dt_start <= M.dt_start
> WHERE L2.vehicle_id IS NULL
>
>

   However, the above query has been running for just over two hours now on a dataset of ~26000 logs and ~4000 monitor records. I suppose I could have limited the logs to the (shorter) time range covered by the monitor records, but it's still incredibly slow.

   This problem has been bugging me for well over a year now, but I still cannot seem to find a good way of doing it. Maybe it's time to find me another job :)

   Seriously, though, thanks to those who have replied so far, and I appreciate any and all comments and ideas on this matter.

Regards,
Diego Berge. Received on Tue Mar 04 2003 - 14:29:15 CET

Original text of this message