Re: Analytics question maybe - combine over lapping date rows

From: stephen O'D <stephen.odonnell_at_gmail.com>
Date: Tue, 1 Jul 2008 08:30:09 -0700 (PDT)
Message-ID: <a8287b19-cefd-4b37-9c69-700dc890a2a5@k30g2000hse.googlegroups.com>


On Jun 28, 2:50 am, Charles Hooper <hooperc2..._at_yahoo.com> wrote:
> On Jun 27, 12:01 pm, "stephen O'D" <stephen.odonn..._at_gmail.com> wrote:
>
>
>
> > On Oracle 10G, lets say I have a table that looks like this:
>
> > create table t1 (key varchar2(10), start_dtm date, end_dtm date);
>
> > and I have data some of which over laps:
>
> > insert into t1 values ('1234', to_date('01/01/2001', 'DD/MM/YYYY'),
> > to_date('30/03/2001', 'DD/MM/YYYY'));
> > insert into t1 values ('1234', to_date('01/02/2001', 'DD/MM/YYYY'),
> > to_date('30/03/2001', 'DD/MM/YYYY'));
> > insert into t1 values ('1234', to_date('15/02/2001', 'DD/MM/YYYY'),
> > to_date('15/04/2001', 'DD/MM/YYYY'));
> > insert into t1 values ('1234', to_date('01/07/2001', 'DD/MM/YYYY'),
> > to_date('30/12/2001', 'DD/MM/YYYY'));
>
> > Ie
> > 1st Jan - 30th March
> > 2nd Feb - 30th March
> > 15th Feb - 15th April
>
> > 1st July - 30th Dec
>
> > I want to return only two rows from the 4 that look like
>
> > 1st jan - 15th April
> > 1st July - 30th Dec
>
> > ie - if the dates are continuous, squash all the continuous rows into
> > 1 with the min start and max end date
> > but if there is a gap it has to become a second row. There can be any
> > number of rows and over laps etc.
>
> > I reckon this can be done with some analytics trickery, but I just
> > cannot seem to figure out how.
>
> > Any ideas?
>
> > Thanks,
>
> > Stephen.
>
> Thanks for posting the DDL and DML to create the test setup. I see
> that Maxim provided a solution for you in the 90 minutes that I
> experimented with the problem (I initially read the requirement
> wrong). Maxim's solution appears to be more elegant than what follows
> (and there may still be a couple issues), but it might be interesting
> to look at how a solution might be worked out.
>
> First, we experiment with LAG so that we are able to examine the
> previous row when sorted by START_DTM and END_DTM:
> SELECT
> T1.KEY,
> T1.START_DTM,
> T1.END_DTM,
> LEAD(T1.START_DTM) OVER (PARTITION BY T1.KEY ORDER BY
> T1.START_DTM,T1.END_DTM) M_START_DTM,
> LEAD(T1.END_DTM) OVER (PARTITION BY T1.KEY ORDER BY
> T1.START_DTM,T1.END_DTM) M_END_DTM
> FROM
> T1
> ORDER BY
> T1.START_DTM,
> T1.END_DTM;
>
> KEY START_DTM END_DTM M_START_D M_END_DTM
> ---------- --------- --------- --------- ---------
> 1234 01-JAN-01 30-MAR-01 01-FEB-01 30-MAR-01
> 1234 01-FEB-01 30-MAR-01 15-FEB-01 15-APR-01
> 1234 15-FEB-01 15-APR-01 01-JUL-01 30-DEC-01
> 1234 01-JUL-01 30-DEC-01
>
> If we now slide the above into an inline view and see if the START_DTM
> is greater than M_END_DTM (the previous row's END_DTM), we may have
> found a new series of dates (note, I just noticed that this may have
> problems if there are two rows with discrete date ranges that fall
> entirely within a third, larger date range), we will output 1 if true,
> or 0 if false:
> SELECT
> KEY,
> START_DTM,
> END_DTM,
> M_START_DTM,
> M_END_DTM,
> DECODE(SIGN(START_DTM-NVL(M_END_DTM,START_DTM)),1,1,0) C
> FROM
> (SELECT
> T1.KEY,
> T1.START_DTM,
> T1.END_DTM,
> LAG(T1.START_DTM) OVER (PARTITION BY T1.KEY ORDER BY
> T1.START_DTM,T1.END_DTM) M_START_DTM,
> LAG(T1.END_DTM) OVER (PARTITION BY T1.KEY ORDER BY
> T1.START_DTM,T1.END_DTM) M_END_DTM
> FROM
> T1
> ORDER BY
> T1.START_DTM,
> T1.END_DTM);
>
> KEY START_DTM END_DTM M_START_D M_END_DTM C
> ---------- --------- --------- --------- --------- ----------
> 1234 01-JAN-01 30-MAR-01 0
> 1234 01-FEB-01 30-MAR-01 01-JAN-01 30-MAR-01 0
> 1234 15-FEB-01 15-APR-01 01-FEB-01 30-MAR-01 0
> 1234 01-JUL-01 30-DEC-01 15-FEB-01 15-APR-01 1
>
> In the above, a 1 is output whenever there is a jump in the date range
> - if we use the COUNT analytical function to create a running count of
> the 1s and slide the above into an inline view, we actually create a
> column that may be used for grouping:
> SELECT
> KEY,
> START_DTM START_DTM,
> END_DTM END_DTM,
> SUM(C) OVER (PARTITION BY KEY ORDER BY START_DTM,END_DTM) G
> FROM
> (SELECT
> KEY,
> START_DTM,
> END_DTM,
> M_START_DTM,
> M_END_DTM,
> DECODE(SIGN(START_DTM-NVL(M_END_DTM,START_DTM)),1,1,0) C
> FROM
> (SELECT
> T1.KEY,
> T1.START_DTM,
> T1.END_DTM,
> LAG(T1.START_DTM) OVER (PARTITION BY T1.KEY ORDER BY
> T1.START_DTM,T1.END_DTM) M_START_DTM,
> LAG(T1.END_DTM) OVER (PARTITION BY T1.KEY ORDER BY
> T1.START_DTM,T1.END_DTM) M_END_DTM
> FROM
> T1
> ORDER BY
> T1.START_DTM,
> T1.END_DTM));
>
> KEY START_DTM END_DTM G
> ---------- --------- --------- ----------
> 1234 01-JAN-01 30-MAR-01 0
> 1234 01-FEB-01 30-MAR-01 0
> 1234 15-FEB-01 15-APR-01 0
> 1234 01-JUL-01 30-DEC-01 1
>
> We are now able to group on the column G by again sliding the above
> into an inline view:
> SELECT
> KEY,
> MIN(START_DTM) START_DTM,
> MAX(END_DTM) END_DTM
> FROM
> (SELECT
> KEY,
> START_DTM START_DTM,
> END_DTM END_DTM,
> SUM(C) OVER (PARTITION BY KEY ORDER BY START_DTM,END_DTM) G
> FROM
> (SELECT
> KEY,
> START_DTM,
> END_DTM,
> M_START_DTM,
> M_END_DTM,
> DECODE(SIGN(START_DTM-NVL(M_END_DTM,START_DTM)),1,1,0) C
> FROM
> (SELECT
> T1.KEY,
> T1.START_DTM,
> T1.END_DTM,
> LAG(T1.START_DTM) OVER (PARTITION BY T1.KEY ORDER BY
> T1.START_DTM,T1.END_DTM) M_START_DTM,
> LAG(T1.END_DTM) OVER (PARTITION BY T1.KEY ORDER BY
> T1.START_DTM,T1.END_DTM) M_END_DTM
> FROM
> T1
> ORDER BY
> T1.START_DTM,
> T1.END_DTM)))
> GROUP BY
> KEY,
> G;
>
> KEY START_DTM END_DTM
> ---------- --------- ---------
> 1234 01-JAN-01 15-APR-01
> 1234 01-JUL-01 30-DEC-01
>
> Let's add a little more data to see what happens:
> insert into t1 values ('1234', to_date('10/10/2001', 'DD/MM/YYYY'),
> to_date('29/12/2001', 'DD/MM/YYYY'));
> insert into t1 values ('1234', to_date('31/12/2001', 'DD/MM/YYYY'),
> to_date('15/01/2002', 'DD/MM/YYYY'));
>
> KEY START_DTM END_DTM
> ---------- --------- ---------
> 1234 01-JAN-01 15-APR-01
> 1234 01-JUL-01 30-DEC-01
> 1234 31-DEC-01 15-JAN-02
>
> Note the addition of the last row in the output - you may want to
> determine if that should be part of the previous group of date. You
> should be able to fix this by adding or subtracting a date in the
> SIGN( ) function.
>
> I originally thought that you were trying to eliminate this row from
> the output:
> KEY START_DTM END_DTM
> ---------- --------- ---------
> 1234 01-FEB-01 30-MAR-01
>
> For the above, there are several methods:
> SELECT
> T1.KEY,
> T1.START_DTM,
> T1.END_DTM
> FROM
> T1
> WHERE
> (T1.KEY,
> T1.START_DTM,
> T1.END_DTM) NOT IN (
> SELECT
> T1.KEY,
> T1.START_DTM,
> T1.END_DTM
> FROM
> T1,
> T1 T2
> WHERE
> T1.KEY=T2.KEY
> AND T1.START_DTM BETWEEN T2.START_DTM AND T2.END_DTM
> AND T1.END_DTM BETWEEN T2.START_DTM AND T2.END_DTM
> AND (
> T1.START_DTM<>T2.START_DTM
> OR T1.END_DTM<>T2.END_DTM));
>
> KEY START_DTM END_DTM
> ---------- --------- ---------
> 1234 01-JAN-01 30-MAR-01
> 1234 15-FEB-01 15-APR-01
> 1234 01-JUL-01 30-DEC-01
>
> SELECT
> T1.KEY,
> T1.START_DTM,
> T1.END_DTM
> FROM
> T1,
> (SELECT
> T1.KEY,
> T1.START_DTM,
> T1.END_DTM
> FROM
> T1,
> T1 T2
> WHERE
> T1.KEY=T2.KEY
> AND T1.START_DTM BETWEEN T2.START_DTM AND T2.END_DTM
> AND T1.END_DTM BETWEEN T2.START_DTM AND T2.END_DTM
> AND (
> T1.START_DTM<>T2.START_DTM
> OR T1.END_DTM<>T2.END_DTM)) T3
> WHERE
> T1.KEY=T3.KEY(+)
> AND T1.START_DTM=T3.START_DTM(+)
> AND T1.END_DTM=T3.END_DTM(+)
> AND T3.KEY IS NULL;
>
> KEY START_DTM END_DTM
> ---------- --------- ---------
> 1234 01-JAN-01 30-MAR-01
> 1234 15-FEB-01 15-APR-01
> 1234 01-JUL-01 30-DEC-01
>
> SELECT
> T1.KEY,
> T1.START_DTM,
> T1.END_DTM
> FROM
> T1
> MINUS
> SELECT
> T1.KEY,
> T1.START_DTM,
> T1.END_DTM
> FROM
> T1,
> T1 T2
> WHERE
> T1.KEY=T2.KEY
> AND T1.START_DTM BETWEEN T2.START_DTM AND T2.END_DTM
> AND T1.END_DTM BETWEEN T2.START_DTM AND T2.END_DTM
> AND (
> T1.START_DTM<>T2.START_DTM
> OR T1.END_DTM<>T2.END_DTM);
>
> KEY START_DTM END_DTM
> ---------- --------- ---------
> 1234 01-JAN-01 30-MAR-01
> 1234 15-FEB-01 15-APR-01
> 1234 01-JUL-01 30-DEC-01
>
> SELECT
> KEY,
> START_DTM,
> END_DTM
> FROM
> (SELECT
> T1.KEY,
> T1.START_DTM,
> T1.END_DTM,
> MIN(T1.START_DTM) OVER (PARTITION BY T1.KEY, T1.END_DTM)
> M_START_DTM,
> MAX(T1.END_DTM) OVER (PARTITION BY T1.KEY, T1.START_DTM)
> M_END_DTM
> FROM
> T1)
> WHERE
> START_DTM=M_START_DTM
> AND END_DTM=M_END_DTM;
>
> KEY START_DTM END_DTM
> ---------- --------- ---------
> 1234 01-JAN-01 30-MAR-01
> 1234 15-FEB-01 15-APR-01
> 1234 01-JUL-01 30-DEC-01
>
> Charles Hooper
> IT Manager/Oracle DBA
> K&M Machine-Fabricating, Inc.

Guys - thanks for the replies on this. I eventually worked it out too with a third solution after dredging AskTom and the replies here ... Here is my solution to complete the picture:

Using the same table as above and this data:

create table t1 (key varchar2(10), start_dtm date, end_dtm date);

insert into t1 values(
9999, to_date('01/01/2008', 'DD/MM/YYYY'), to_date('01/06/2008', 'DD/MM/YYYY'));

insert into t1 values(
9999, to_date('01/02/2008', 'DD/MM/YYYY'), to_date('01/06/2008', 'DD/MM/YYYY'));

insert into t1 values(
9999, to_date('01/02/2008', 'DD/MM/YYYY'), to_date('01/05/2008', 'DD/MM/YYYY'));

insert into t1 values(
9999, to_date('01/02/2008', 'DD/MM/YYYY'), to_date('01/07/2008', 'DD/MM/YYYY'));

This gives a set of rows like:

|------------------------------------------|
   |---------------------------------------|
   |---------------------|
      |------------------------------------------|

And we expect this to turn into a single row like

01/01/08 |----------------------------------------------| 01/07/2008

insert into t1 values(
9999, to_date('01/08/2008', 'DD/MM/YYYY'), null);

insert into t1 values(
9999, to_date('01/09/2008', 'DD/MM/YYYY'), to_date('01/10/2008', 'DD/MM/YYYY'));

|--------------------------------------->

   |------------|

And we want an row that looks like

01/08/2008 |----------------------------> (null end date)

I think this query does it:

select key, min(start_dtm), max(end_dtm) from
(
  select key, start_dtm, end_dtm, max (grp) over (partition by key order by start_dtm asc) grp2
  from
  (
    SELECT key,

           start_dtm,
           NVL (end_dtm, '31-dec-4712') end_dtm,
           case
             when
               (start_dtm not between lag(start_dtm) over (partition
by key order by start_dtm asc)
                   and nvl( lag(end_dtm) over (partition by key order
by start_dtm asc), '31-dec-4712') )
                or lag(start_dtm) over (partition by key order by
start_dtm asc) is null
                then
                  row_number() over (partition by key order by
start_dtm asc)
             end grp

    FROM t1
    order by key, start_dtm asc
  ) a
) b
group by key, grp2
GRP                              MIN(START MAX(END_D
--------------------------------- --------- ---------
09999                             01-JAN-08 01-JUL-08
09999                             01-AUG-08 31-DEC-12

2 rows selected.

Basically I said, if you order the rows by increasing start date, then if rows overlap, the current start_dtm must be between the previous start_dtm and end_dtm - if there is no previous start_dtm or they don't overlap, then its a new group.

The innermost select outputs something like:

KEY                               START_DTM END_DTM          GRP
--------------------------------- --------- --------- ----------
09999                             01-JAN-08 01-JUN-08          1
09999                             01-FEB-08 01-JUL-08
09999                             01-FEB-08 01-JUN-08
09999                             01-FEB-08 01-MAY-08
09999                             01-AUG-08 31-DEC-12          5
09999                             01-SEP-08 01-OCT-08

Then we use analytics again to fill in the blanks in the enclosing query (b):

KEY                               START_DTM END_DTM         GRP2
--------------------------------- --------- --------- ----------
09999                             01-JAN-08 01-JUN-08          1
09999                             01-FEB-08 01-JUL-08          1
09999                             01-FEB-08 01-JUN-08          1
09999                             01-FEB-08 01-MAY-08          1
09999                             01-AUG-08 31-DEC-12          5
09999                             01-SEP-08 01-OCT-08          5


 and then simply group by KEY, GRP in the outer query to get the result.

I am glad I went through the pain of figuring this out (took me a good 90 minutes) as I really feel like I get analytics now. It also allowed me to replace a temporary table and several 100 lines of PLSQL that was performing rather poorly and written by someone who doesn't get analytics!

Thanks again,

Stephen. Received on Tue Jul 01 2008 - 10:30:09 CDT

Original text of this message