Re: Working with huge tables of chronological data

From: Jonathan Lewis <jonathan_at_jlcomp.demon.co.uk>
Date: Mon, 9 Apr 2007 11:23:43 +0100
Message-ID: <FN2dnelHbc44iYfbnZ2dnUVZ8tmhnZ2d@bt.com>

"John" <acide.ascorbique_at_gmail.com> wrote in message news:1175871313.425881.201500_at_n76g2000hsh.googlegroups.com...
>
> Chronology
> (03:50pm, A1)
> (03:51pm, B1)
> (03:55pm, A2)
> (03:58pm, A3)
> (03:59pm, A4)
> (04:00pm, B2)
>
> I'm looking for all the sequences of events A in the chronology with
> no B event in the middle. The results would be:
> (A2, A3) ; (A3, A4) and also (A2, A4) even if A2 and A4 are not
> consecutive.
>

I may have missed know if anyone has come

From the above, it looks as if you need the following query (in principle). The hints assume an index on ta(timed) and tb(timed)

select

         /*+ ordered use_nl(ta2) index(ta1) index(ta2) */
         ta1.timed, ta1.event,
         ta2.timed, ta2.event
from
         ta ta1,
         ta ta2
where
         ta2.timed > ta1.timed
and ta2.timed < (
           select /*+ index(tb) no_unnest */
            min(tb.timed)
           from tb
           where tb.timed > ta1.timed

)
order by

ta1.timed
;

You need to join TA to itself because a single row in TA can produce a result set of unknown size.
The condition for generating a row is that another row in TA should have a larger time stamp, which does not exceed the next occurrence from TB.
I haven't yet considered problems relating to non-unique values for timestamp, and timestamps in TA matching timestamps in TB

If you can create indexes on the (timed) column, then the plan for the query above gets as close to the simple COBOL merge (described by other posters) in action and performance as a non-procedural statement could.

If you can't create the necessary indexes, then the code probably has to sort TA twice, and TB once which means that for large data sets it probably won't be able to match the non-procedural option.

Execution plan (9i and 10g)

| Id  | Operation                       |  Name       | Rows  | Bytes |

Cost |

|   0 | SELECT STATEMENT                |             | 10090 |   216K| 
204K|
|   1 |  TABLE ACCESS BY INDEX ROWID    | TA          |     5 |    55 | 
101 |
|   2 |   NESTED LOOPS                  |             | 10090 |   216K| 
204K|
|   3 |    TABLE ACCESS BY INDEX ROWID  | TA          |  2009 | 22099 | 
2007 |
|   4 |     INDEX FULL SCAN             | TA_U1       |  2009 |       | 
7 |
|*  5 |    INDEX RANGE SCAN             | TA_U1       |   100 |       | 
1 |
|   6 |     SORT AGGREGATE              |             |     1 |     7 | 
|
|   7 |      FIRST ROW                  |             |     1 |     7 | 
1 |
|*  8 |       INDEX RANGE SCAN (MIN/MAX)| TB_U1       |     3 |       |

1 |

Predicate Information (identified by operation id):

   5 - access("TA2"."TIMED">"SYS_ALIAS_1"."TIMED" AND "TA2"."TIMED"< (SELECT
              /*+ NO_UNNEST INDEX("TB") */ MIN("TB"."TIMED") FROM "TB" "TB" WHERE
              "TB"."TIMED">:B1))
   8 - access("TB"."TIMED">:B1)

The plan walks the index on TA in order fetching rows.

For each row it gets index entries from TA where the timestamp exceeds the current timestamp - but because the subquery to TB is used as part of the access predicate you know that the range scan on the index is the minimum range scan. Because the subquery on the index on TB uses range scan (min/max) this too will be minimal.

The costs appear as they do simply because of the amount of data I happened to create in the table over and above the sample set shown.

-- 
Regards

Jonathan Lewis
http://jonathanlewis.wordpress.com

Author: Cost Based Oracle: Fundamentals
http://www.jlcomp.demon.co.uk/cbo_book/ind_book.html

The Co-operative Oracle Users' FAQ
http://www.jlcomp.demon.co.uk/faq/ind_faq.html

Received on Mon Apr 09 2007 - 05:23:43 CDT