Re: Analytics question maybe - combine over lapping date rows
Date: Fri, 27 Jun 2008 18:50:32 -0700 (PDT)
Message-ID: <1b2d7ee1-607e-4d0a-892e-8b70d89dda5b@e53g2000hsa.googlegroups.com>
On Jun 27, 12:01 pm, "stephen O'D" <stephen.odonn..._at_gmail.com> wrote:
> On Oracle 10G, lets say I have a table that looks like this:
>
> create table t1 (key varchar2(10), start_dtm date, end_dtm date);
>
> and I have data some of which over laps:
>
> insert into t1 values ('1234', to_date('01/01/2001', 'DD/MM/YYYY'),
> to_date('30/03/2001', 'DD/MM/YYYY'));
> insert into t1 values ('1234', to_date('01/02/2001', 'DD/MM/YYYY'),
> to_date('30/03/2001', 'DD/MM/YYYY'));
> insert into t1 values ('1234', to_date('15/02/2001', 'DD/MM/YYYY'),
> to_date('15/04/2001', 'DD/MM/YYYY'));
> insert into t1 values ('1234', to_date('01/07/2001', 'DD/MM/YYYY'),
> to_date('30/12/2001', 'DD/MM/YYYY'));
>
> Ie
> 1st Jan - 30th March
> 2nd Feb - 30th March
> 15th Feb - 15th April
>
> 1st July - 30th Dec
>
> I want to return only two rows from the 4 that look like
>
> 1st jan - 15th April
> 1st July - 30th Dec
>
> ie - if the dates are continuous, squash all the continuous rows into
> 1 with the min start and max end date
> but if there is a gap it has to become a second row. There can be any
> number of rows and over laps etc.
>
> I reckon this can be done with some analytics trickery, but I just
> cannot seem to figure out how.
>
> Any ideas?
>
> Thanks,
>
> Stephen.
Thanks for posting the DDL and DML to create the test setup. I see that Maxim provided a solution for you in the 90 minutes that I experimented with the problem (I initially read the requirement wrong). Maxim's solution appears to be more elegant than what follows (and there may still be a couple issues), but it might be interesting to look at how a solution might be worked out.
First, we experiment with LAG so that we are able to examine the previous row when sorted by START_DTM and END_DTM: SELECT
T1.KEY, T1.START_DTM, T1.END_DTM,
LEAD(T1.START_DTM) OVER (PARTITION BY T1.KEY ORDER BY T1.START_DTM,T1.END_DTM) M_START_DTM,
LEAD(T1.END_DTM) OVER (PARTITION BY T1.KEY ORDER BY T1.START_DTM,T1.END_DTM) M_END_DTM
FROM
T1
ORDER BY
T1.START_DTM,
T1.END_DTM; KEY START_DTM END_DTM M_START_D M_END_DTM
---------- --------- --------- --------- --------- 1234 01-JAN-01 30-MAR-01 01-FEB-01 30-MAR-01 1234 01-FEB-01 30-MAR-01 15-FEB-01 15-APR-01 1234 15-FEB-01 15-APR-01 01-JUL-01 30-DEC-01 1234 01-JUL-01 30-DEC-01
If we now slide the above into an inline view and see if the START_DTM
is greater than M_END_DTM (the previous row's END_DTM), we may have
found a new series of dates (note, I just noticed that this may have
problems if there are two rows with discrete date ranges that fall
entirely within a third, larger date range), we will output 1 if true,
or 0 if false:
SELECT
KEY,
START_DTM,
END_DTM,
M_START_DTM,
M_END_DTM,
DECODE(SIGN(START_DTM-NVL(M_END_DTM,START_DTM)),1,1,0) C
FROM
(SELECT
T1.KEY, T1.START_DTM, T1.END_DTM,
LAG(T1.START_DTM) OVER (PARTITION BY T1.KEY ORDER BY T1.START_DTM,T1.END_DTM) M_START_DTM,
LAG(T1.END_DTM) OVER (PARTITION BY T1.KEY ORDER BY T1.START_DTM,T1.END_DTM) M_END_DTM
FROM
T1
ORDER BY
T1.START_DTM,
T1.END_DTM); KEY START_DTM END_DTM M_START_D M_END_DTM C
---------- --------- --------- --------- --------- ---------- 1234 01-JAN-01 30-MAR-01 0 1234 01-FEB-01 30-MAR-01 01-JAN-01 30-MAR-01 0 1234 15-FEB-01 15-APR-01 01-FEB-01 30-MAR-01 0 1234 01-JUL-01 30-DEC-01 15-FEB-01 15-APR-01 1
In the above, a 1 is output whenever there is a jump in the date range
- if we use the COUNT analytical function to create a running count of
the 1s and slide the above into an inline view, we actually create a
column that may be used for grouping:
SELECT
KEY,
START_DTM START_DTM,
END_DTM END_DTM,
SUM(C) OVER (PARTITION BY KEY ORDER BY START_DTM,END_DTM) G
FROM
(SELECT
KEY,
START_DTM,
END_DTM,
M_START_DTM,
M_END_DTM,
DECODE(SIGN(START_DTM-NVL(M_END_DTM,START_DTM)),1,1,0) C
FROM
(SELECT
T1.KEY, T1.START_DTM, T1.END_DTM,
LAG(T1.START_DTM) OVER (PARTITION BY T1.KEY ORDER BY T1.START_DTM,T1.END_DTM) M_START_DTM,
LAG(T1.END_DTM) OVER (PARTITION BY T1.KEY ORDER BY T1.START_DTM,T1.END_DTM) M_END_DTM
FROM
T1
ORDER BY
T1.START_DTM,
T1.END_DTM)); KEY START_DTM END_DTM G
---------- --------- --------- ---------- 1234 01-JAN-01 30-MAR-01 0 1234 01-FEB-01 30-MAR-01 0 1234 15-FEB-01 15-APR-01 0 1234 01-JUL-01 30-DEC-01 1
We are now able to group on the column G by again sliding the above
into an inline view:
SELECT
KEY,
MIN(START_DTM) START_DTM,
MAX(END_DTM) END_DTM
FROM
(SELECT
KEY,
START_DTM START_DTM,
END_DTM END_DTM,
SUM(C) OVER (PARTITION BY KEY ORDER BY START_DTM,END_DTM) G
FROM
(SELECT
KEY,
START_DTM,
END_DTM,
M_START_DTM,
M_END_DTM,
DECODE(SIGN(START_DTM-NVL(M_END_DTM,START_DTM)),1,1,0) C
FROM
(SELECT
T1.KEY, T1.START_DTM, T1.END_DTM,
LAG(T1.START_DTM) OVER (PARTITION BY T1.KEY ORDER BY T1.START_DTM,T1.END_DTM) M_START_DTM,
LAG(T1.END_DTM) OVER (PARTITION BY T1.KEY ORDER BY T1.START_DTM,T1.END_DTM) M_END_DTM
FROM
T1
ORDER BY
T1.START_DTM,
T1.END_DTM)))
GROUP BY
KEY,
G;
KEY START_DTM END_DTM
---------- --------- --------- 1234 01-JAN-01 15-APR-01 1234 01-JUL-01 30-DEC-01
Let's add a little more data to see what happens:
insert into t1 values ('1234', to_date('10/10/2001', 'DD/MM/YYYY'),
to_date('29/12/2001', 'DD/MM/YYYY'));
insert into t1 values ('1234', to_date('31/12/2001', 'DD/MM/YYYY'),
to_date('15/01/2002', 'DD/MM/YYYY'));
KEY START_DTM END_DTM
---------- --------- --------- 1234 01-JAN-01 15-APR-01 1234 01-JUL-01 30-DEC-01 1234 31-DEC-01 15-JAN-02
Note the addition of the last row in the output - you may want to determine if that should be part of the previous group of date. You should be able to fix this by adding or subtracting a date in the SIGN( ) function.
I originally thought that you were trying to eliminate this row from
the output:
KEY START_DTM END_DTM
---------- --------- ---------
1234 01-FEB-01 30-MAR-01
For the above, there are several methods:
SELECT
T1.KEY, T1.START_DTM, T1.END_DTM
FROM
T1
WHERE
(T1.KEY,
T1.START_DTM,
T1.END_DTM) NOT IN (
SELECT
T1.KEY, T1.START_DTM, T1.END_DTM FROM T1, T1 T2 WHERE T1.KEY=T2.KEY AND T1.START_DTM BETWEEN T2.START_DTM AND T2.END_DTM AND T1.END_DTM BETWEEN T2.START_DTM AND T2.END_DTM AND ( T1.START_DTM<>T2.START_DTM OR T1.END_DTM<>T2.END_DTM)); KEY START_DTM END_DTM ---------- --------- --------- 1234 01-JAN-01 30-MAR-01 1234 15-FEB-01 15-APR-01 1234 01-JUL-01 30-DEC-01
SELECT
T1.KEY, T1.START_DTM, T1.END_DTM
FROM
T1,
(SELECT
T1.KEY, T1.START_DTM, T1.END_DTM
FROM
T1,
T1 T2
WHERE
T1.KEY=T2.KEY
AND T1.START_DTM BETWEEN T2.START_DTM AND T2.END_DTM AND T1.END_DTM BETWEEN T2.START_DTM AND T2.END_DTM AND (
T1.START_DTM<>T2.START_DTM OR T1.END_DTM<>T2.END_DTM)) T3
WHERE
T1.KEY=T3.KEY(+)
AND T1.START_DTM=T3.START_DTM(+) AND T1.END_DTM=T3.END_DTM(+) AND T3.KEY IS NULL; KEY START_DTM END_DTM ---------- --------- --------- 1234 01-JAN-01 30-MAR-01 1234 15-FEB-01 15-APR-01 1234 01-JUL-01 30-DEC-01
SELECT
T1.KEY, T1.START_DTM, T1.END_DTM
FROM
T1
MINUS
SELECT
T1.KEY, T1.START_DTM, T1.END_DTM
FROM
T1,
T1 T2
WHERE
T1.KEY=T2.KEY
AND T1.START_DTM BETWEEN T2.START_DTM AND T2.END_DTM AND T1.END_DTM BETWEEN T2.START_DTM AND T2.END_DTM AND (
T1.START_DTM<>T2.START_DTM
OR T1.END_DTM<>T2.END_DTM);
KEY START_DTM END_DTM
---------- --------- --------- 1234 01-JAN-01 30-MAR-01 1234 15-FEB-01 15-APR-01 1234 01-JUL-01 30-DEC-01
SELECT
KEY,
START_DTM,
END_DTM
FROM
(SELECT
T1.KEY, T1.START_DTM, T1.END_DTM,
MIN(T1.START_DTM) OVER (PARTITION BY T1.KEY, T1.END_DTM) M_START_DTM,
MAX(T1.END_DTM) OVER (PARTITION BY T1.KEY, T1.START_DTM) M_END_DTM
FROM
T1)
WHERE
START_DTM=M_START_DTM
AND END_DTM=M_END_DTM; KEY START_DTM END_DTM
---------- --------- --------- 1234 01-JAN-01 30-MAR-01 1234 15-FEB-01 15-APR-01 1234 01-JUL-01 30-DEC-01
Charles Hooper
IT Manager/Oracle DBA
K&M Machine-Fabricating, Inc.
Received on Fri Jun 27 2008 - 20:50:32 CDT