Re: Analytics question maybe - combine over lapping date rows

From: Charles Hooper <hooperc2000_at_yahoo.com>
Date: Fri, 27 Jun 2008 18:50:32 -0700 (PDT)
Message-ID: <1b2d7ee1-607e-4d0a-892e-8b70d89dda5b@e53g2000hsa.googlegroups.com>


On Jun 27, 12:01 pm, "stephen O'D" <stephen.odonn..._at_gmail.com> wrote:
> On Oracle 10G, lets say I have a table that looks like this:
>
> create table t1 (key varchar2(10), start_dtm date, end_dtm date);
>
> and I have data some of which over laps:
>
> insert into t1 values ('1234', to_date('01/01/2001', 'DD/MM/YYYY'),
> to_date('30/03/2001', 'DD/MM/YYYY'));
> insert into t1 values ('1234', to_date('01/02/2001', 'DD/MM/YYYY'),
> to_date('30/03/2001', 'DD/MM/YYYY'));
> insert into t1 values ('1234', to_date('15/02/2001', 'DD/MM/YYYY'),
> to_date('15/04/2001', 'DD/MM/YYYY'));
> insert into t1 values ('1234', to_date('01/07/2001', 'DD/MM/YYYY'),
> to_date('30/12/2001', 'DD/MM/YYYY'));
>
> Ie
> 1st Jan - 30th March
>     2nd Feb - 30th March
>        15th Feb - 15th April
>
> 1st July - 30th Dec
>
> I want to return only two rows from the 4 that look like
>
> 1st jan - 15th April
> 1st July - 30th Dec
>
> ie - if the dates are continuous, squash all the continuous rows into
> 1 with the min start and max end date
> but if there is a gap it has to become a second row.  There can be any
> number of rows and over laps etc.
>
> I reckon this can be done with some analytics trickery, but I just
> cannot seem to figure out how.
>
> Any ideas?
>
> Thanks,
>
> Stephen.

Thanks for posting the DDL and DML to create the test setup. I see that Maxim provided a solution for you in the 90 minutes that I experimented with the problem (I initially read the requirement wrong). Maxim's solution appears to be more elegant than what follows (and there may still be a couple issues), but it might be interesting to look at how a solution might be worked out.

First, we experiment with LAG so that we are able to examine the previous row when sorted by START_DTM and END_DTM: SELECT

  T1.KEY,
  T1.START_DTM,
  T1.END_DTM,

  LEAD(T1.START_DTM) OVER (PARTITION BY T1.KEY ORDER BY T1.START_DTM,T1.END_DTM) M_START_DTM,
  LEAD(T1.END_DTM) OVER (PARTITION BY T1.KEY ORDER BY T1.START_DTM,T1.END_DTM) M_END_DTM
FROM
  T1
ORDER BY
  T1.START_DTM,
  T1.END_DTM; KEY START_DTM END_DTM M_START_D M_END_DTM
---------- --------- --------- --------- ---------
1234       01-JAN-01 30-MAR-01 01-FEB-01 30-MAR-01
1234       01-FEB-01 30-MAR-01 15-FEB-01 15-APR-01
1234       15-FEB-01 15-APR-01 01-JUL-01 30-DEC-01
1234       01-JUL-01 30-DEC-01

If we now slide the above into an inline view and see if the START_DTM is greater than M_END_DTM (the previous row's END_DTM), we may have found a new series of dates (note, I just noticed that this may have problems if there are two rows with discrete date ranges that fall entirely within a third, larger date range), we will output 1 if true, or 0 if false:
SELECT
  KEY,
  START_DTM,
  END_DTM,
  M_START_DTM,
  M_END_DTM,
  DECODE(SIGN(START_DTM-NVL(M_END_DTM,START_DTM)),1,1,0) C FROM
  (SELECT

    T1.KEY,
    T1.START_DTM,
    T1.END_DTM,

    LAG(T1.START_DTM) OVER (PARTITION BY T1.KEY ORDER BY T1.START_DTM,T1.END_DTM) M_START_DTM,
    LAG(T1.END_DTM) OVER (PARTITION BY T1.KEY ORDER BY T1.START_DTM,T1.END_DTM) M_END_DTM
  FROM
    T1
  ORDER BY
    T1.START_DTM,
    T1.END_DTM); KEY START_DTM END_DTM M_START_D M_END_DTM C
---------- --------- --------- --------- --------- ----------
1234       01-JAN-01 30-MAR-01                              0
1234       01-FEB-01 30-MAR-01 01-JAN-01 30-MAR-01          0
1234       15-FEB-01 15-APR-01 01-FEB-01 30-MAR-01          0
1234       01-JUL-01 30-DEC-01 15-FEB-01 15-APR-01          1

In the above, a 1 is output whenever there is a jump in the date range - if we use the COUNT analytical function to create a running count of the 1s and slide the above into an inline view, we actually create a column that may be used for grouping:
SELECT
  KEY,
  START_DTM START_DTM,
  END_DTM END_DTM,
  SUM(C) OVER (PARTITION BY KEY ORDER BY START_DTM,END_DTM) G FROM
(SELECT
  KEY,
  START_DTM,
  END_DTM,
  M_START_DTM,
  M_END_DTM,
  DECODE(SIGN(START_DTM-NVL(M_END_DTM,START_DTM)),1,1,0) C FROM
  (SELECT

    T1.KEY,
    T1.START_DTM,
    T1.END_DTM,

    LAG(T1.START_DTM) OVER (PARTITION BY T1.KEY ORDER BY T1.START_DTM,T1.END_DTM) M_START_DTM,
    LAG(T1.END_DTM) OVER (PARTITION BY T1.KEY ORDER BY T1.START_DTM,T1.END_DTM) M_END_DTM
  FROM
    T1
  ORDER BY
    T1.START_DTM,
    T1.END_DTM)); KEY START_DTM END_DTM G
---------- --------- --------- ----------
1234       01-JAN-01 30-MAR-01          0
1234       01-FEB-01 30-MAR-01          0
1234       15-FEB-01 15-APR-01          0
1234       01-JUL-01 30-DEC-01          1

We are now able to group on the column G by again sliding the above into an inline view:
SELECT
  KEY,
  MIN(START_DTM) START_DTM,
  MAX(END_DTM) END_DTM
FROM
(SELECT
  KEY,
  START_DTM START_DTM,
  END_DTM END_DTM,
  SUM(C) OVER (PARTITION BY KEY ORDER BY START_DTM,END_DTM) G FROM
(SELECT
  KEY,
  START_DTM,
  END_DTM,
  M_START_DTM,
  M_END_DTM,
  DECODE(SIGN(START_DTM-NVL(M_END_DTM,START_DTM)),1,1,0) C FROM
  (SELECT

    T1.KEY,
    T1.START_DTM,
    T1.END_DTM,

    LAG(T1.START_DTM) OVER (PARTITION BY T1.KEY ORDER BY T1.START_DTM,T1.END_DTM) M_START_DTM,
    LAG(T1.END_DTM) OVER (PARTITION BY T1.KEY ORDER BY T1.START_DTM,T1.END_DTM) M_END_DTM
  FROM
    T1
  ORDER BY
    T1.START_DTM,
    T1.END_DTM)))
GROUP BY
  KEY,
  G;

KEY START_DTM END_DTM

---------- --------- ---------
1234       01-JAN-01 15-APR-01
1234       01-JUL-01 30-DEC-01

Let's add a little more data to see what happens: insert into t1 values ('1234', to_date('10/10/2001', 'DD/MM/YYYY'), to_date('29/12/2001', 'DD/MM/YYYY'));
insert into t1 values ('1234', to_date('31/12/2001', 'DD/MM/YYYY'), to_date('15/01/2002', 'DD/MM/YYYY'));

KEY START_DTM END_DTM

---------- --------- ---------
1234       01-JAN-01 15-APR-01
1234       01-JUL-01 30-DEC-01
1234       31-DEC-01 15-JAN-02

Note the addition of the last row in the output - you may want to determine if that should be part of the previous group of date. You should be able to fix this by adding or subtracting a date in the SIGN( ) function.

I originally thought that you were trying to eliminate this row from the output:
KEY START_DTM END_DTM
---------- --------- ---------
1234 01-FEB-01 30-MAR-01 For the above, there are several methods: SELECT

  T1.KEY,
  T1.START_DTM,
  T1.END_DTM

FROM
  T1
WHERE
  (T1.KEY,
  T1.START_DTM,
  T1.END_DTM) NOT IN (
    SELECT
      T1.KEY,
      T1.START_DTM,
      T1.END_DTM
    FROM
      T1,
      T1 T2
    WHERE
      T1.KEY=T2.KEY
      AND T1.START_DTM BETWEEN T2.START_DTM AND T2.END_DTM
      AND T1.END_DTM BETWEEN  T2.START_DTM AND T2.END_DTM
      AND (
        T1.START_DTM<>T2.START_DTM
        OR T1.END_DTM<>T2.END_DTM));

KEY        START_DTM END_DTM
---------- --------- ---------
1234       01-JAN-01 30-MAR-01
1234       15-FEB-01 15-APR-01
1234       01-JUL-01 30-DEC-01


SELECT

  T1.KEY,
  T1.START_DTM,
  T1.END_DTM

FROM
  T1,
  (SELECT
    T1.KEY,
    T1.START_DTM,
    T1.END_DTM

  FROM
    T1,
    T1 T2
  WHERE
    T1.KEY=T2.KEY
    AND T1.START_DTM BETWEEN T2.START_DTM AND T2.END_DTM     AND T1.END_DTM BETWEEN T2.START_DTM AND T2.END_DTM     AND (
      T1.START_DTM<>T2.START_DTM
      OR T1.END_DTM<>T2.END_DTM)) T3

WHERE
  T1.KEY=T3.KEY(+)
  AND T1.START_DTM=T3.START_DTM(+)
  AND T1.END_DTM=T3.END_DTM(+)
  AND T3.KEY IS NULL;

KEY        START_DTM END_DTM
---------- --------- ---------
1234       01-JAN-01 30-MAR-01
1234       15-FEB-01 15-APR-01
1234       01-JUL-01 30-DEC-01


SELECT

  T1.KEY,
  T1.START_DTM,
  T1.END_DTM

FROM
  T1
MINUS
SELECT
  T1.KEY,
  T1.START_DTM,
  T1.END_DTM

FROM
  T1,
  T1 T2
WHERE
  T1.KEY=T2.KEY
  AND T1.START_DTM BETWEEN T2.START_DTM AND T2.END_DTM   AND T1.END_DTM BETWEEN T2.START_DTM AND T2.END_DTM   AND (
    T1.START_DTM<>T2.START_DTM
    OR T1.END_DTM<>T2.END_DTM);

KEY START_DTM END_DTM

---------- --------- ---------
1234       01-JAN-01 30-MAR-01
1234       15-FEB-01 15-APR-01
1234       01-JUL-01 30-DEC-01

SELECT
  KEY,
  START_DTM,
  END_DTM
FROM
  (SELECT

    T1.KEY,
    T1.START_DTM,
    T1.END_DTM,

    MIN(T1.START_DTM) OVER (PARTITION BY T1.KEY, T1.END_DTM) M_START_DTM,
    MAX(T1.END_DTM) OVER (PARTITION BY T1.KEY, T1.START_DTM) M_END_DTM
  FROM
    T1)
WHERE
  START_DTM=M_START_DTM
  AND END_DTM=M_END_DTM; KEY START_DTM END_DTM
---------- --------- ---------
1234       01-JAN-01 30-MAR-01
1234       15-FEB-01 15-APR-01
1234       01-JUL-01 30-DEC-01

Charles Hooper
IT Manager/Oracle DBA
K&M Machine-Fabricating, Inc. Received on Fri Jun 27 2008 - 20:50:32 CDT

Original text of this message