Home -> Community -> Usenet -> c.d.o.server -> Re: SQL: Working with huge tables of chronological data
On Apr 8, 4:08 am, "charely" <nos..._at_skynet.be> wrote:
> "John" <acide.ascorbi..._at_gmail.com> schreef in berichtnews:1175788930.059725.19220_at_e65g2000hsc.googlegroups.com...
> > Here are the technical details and the query I've been using so far.
>
> > TableA is ~100 million rows and contains (timestamp, evtA)
> > TableB is ~30 million rows and contains (timestamp, evtB)
>
> > The following query took ~60h (on a private but quite slow server) to
> > compute. ~1h is what I'm aiming for.
>
> > select TA1_B.evtA, TA2.evtA
> > from
> > (
> > select TA1.evtA, TA1.timestamp timeA1, TB.evtB, min(TB.timestamp)
> > min_timeB
> > from tableA TA1 left outer join tableB TB on (TA1.timestamp <
> > TB.timestamp)
> > group by TA1.evtA, TA1.timestamp, TB.evtB
> > ) TA1_B,
> > tableA TA2
> > where
> > TA1_B.timeA1 < TA2.timestamp
> > and (TA2.timestamp < TA1_B.min_timeB or TA1_B.min_timeB is null)
> > and TA1_B.evtA <> TA2.evtA;
>
> > Thanks!
>
> > John
>
> What about
>
> select ta1.timestamp , ta2.timestamp from tablea ta1 , tablea ta2
> where ta2.timestamp > ta1.timestamp and ta2.timestamp <=
> nvl((select min(timestamp) from tableb b where b.timestamp >
> ta1.timestamp) ,
> (select max(timestamp) from tablea))
>
> The nvl function is only needed to also catch events where no later event
> in b exists.
>
> I have not tested this for performance , but assuming you have indexes on
> the timestamp
> columns ( or using IOTs for the tables) , the optimizer will probably use
> range scans on
> those indexes.
I don't think that the SQL statement you provided will return the data that the OP was wanting. Assume the following:
TABLEA (T1)
05-APR-2007 00:07:00
05-APR-2007 00:10:00
05-APR-2007 00:12:00
05-APR-2007 00:15:00
05-APR-2007 00:17:00
05-APR-2007 00:20:00
TABLEB (T2)
05-APR-2007 00:09:00
05-APR-2007 00:18:00
The OP wanted to retrieve pairs of rows from TABLEA (T1) where no
value from TABLEB (T2) falls between the members of the pair. In this
case, we need to report all pairs of values among the time values
05-APR-2007 00:10:00, 05-APR-2007 00:12:00, 05-APR-2007 00:15:00, and
05-APR-2007 00:17:00, since those all fall between the two values from
TABLEB (T2). That should yield the following list:
00:10-00:12
00:12-00:15
00:10-00:15
00:15-00:17
00:12-00:17
00:10-00:17
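To make the pairing rule concrete, here is a small sketch of the logic in Python (not part of the original discussion, just a brute-force restatement of the requirement) applied to the sample data above:

```python
# Sample data from the discussion above (all times on 05-APR-2007);
# "HH:MM" strings compare lexicographically, which here matches
# chronological order, so no date parsing is needed.
table_a = ["00:07", "00:10", "00:12", "00:15", "00:17", "00:20"]
table_b = ["00:09", "00:18"]

# Report every pair (a1, a2) from TABLEA with a1 < a2 such that no
# TABLEB value falls strictly between them.
pairs = []
for i, a1 in enumerate(table_a):
    for a2 in table_a[i + 1:]:
        if not any(a1 < b < a2 for b in table_b):
            pairs.append((a1, a2))
```

Running this yields exactly the six pairs listed above: 00:07 can pair with nothing (00:09 always intervenes), and 00:20 can pair with nothing (00:18 always intervenes).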
Let's reformat the query that you provided so that it can use the
sample tables and indexes that I provided previously, in order to see
if it is an efficient starting point for the OP:
SELECT /*+ GATHER_PLAN_STATISTICS */
TA1.V1,
TA2.V1
FROM
T1 TA1,
T2 TA2
WHERE
TA2.V1 > TA1.V1
AND TA2.V1 <= NVL(
(SELECT
MIN(V1)
FROM
T2 B
WHERE
B.V1 > TA1.V1),
(SELECT
MAX(V1)
FROM
T1));
SELECT
*
FROM
TABLE (DBMS_XPLAN.DISPLAY_CURSOR(NULL,NULL,'ALLSTATS LAST'));
-------------------------------------------------------------------------------------------------------------------------------
| Id | Operation                     | Name    | Starts | E-Rows | A-Rows | A-Time      | Buffers |  OMem |  1Mem | Used-Mem |
-------------------------------------------------------------------------------------------------------------------------------
|* 1 |  FILTER                       |         |      1 |        |   1000 | 00:00:00.47 |      79 |       |       |          |
|  2 |   MERGE JOIN                  |         |      1 |  15000 |   167K | 00:00:01.17 |       8 |       |       |          |
|  3 |    SORT JOIN                  |         |      1 |    300 |    300 | 00:00:00.01 |       1 | 11264 | 11264 |10240 (0)|
|  4 |     INDEX FULL SCAN           | T2_IND1 |      1 |    300 |    300 | 00:00:00.01 |       1 |       |       |          |
|* 5 |    SORT JOIN                  |         |    300 |   1000 |   167K | 00:00:00.50 |       7 | 36864 | 36864 |32768 (0)|
|  6 |     INDEX FAST FULL SCAN      | T1_IND1 |      1 |   1000 |   1000 | 00:00:00.01 |       7 |       |       |          |
|  7 |   SORT AGGREGATE              |         |  71341 |      1 |  71341 | 00:00:02.09 |      69 |       |       |          |
|  8 |    FIRST ROW                  |         |  71341 |     15 |  71341 | 00:00:01.21 |      69 |       |       |          |
|* 9 |     INDEX RANGE SCAN (MIN/MAX)| T2_IND1 |  71341 |     15 |  71341 | 00:00:00.47 |      69 |       |       |          |
| 10 |   SORT AGGREGATE              |         |      1 |      1 |      1 | 00:00:00.01 |       2 |       |       |          |
| 11 |    INDEX FULL SCAN (MIN/MAX)  | T1_IND1 |      1 |   1000 |      1 | 00:00:00.01 |       2 |       |       |          |
-------------------------------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
1 - filter("TA2"."V1"<=NVL(,))
   5 - access(INTERNAL_FUNCTION("TA2"."V1")>INTERNAL_FUNCTION("TA1"."V1"))
       filter(INTERNAL_FUNCTION("TA2"."V1")>INTERNAL_FUNCTION("TA1"."V1"))
9 - access("B"."V1">:B1)
Notice the Starts column in the above - that is the number of times that
portion of the plan was executed. Also note the significant difference
between the estimated and actual row counts, as well as what is reported
in the actual time column. Where did the bind variable in the predicate
information for Id 9 come from? It is a product of Oracle's optimization
of the correlated subquery.
Here is the plan for the final SQL statement that I provided, just for comparison:
---------------------------------------------------------------------------------------------------------------------------------------
| Id | Operation                      | Name    | Starts | E-Rows | A-Rows | A-Time      | Buffers | Reads |  OMem |  1Mem | Used-Mem |
---------------------------------------------------------------------------------------------------------------------------------------
|* 1 |  CONNECT BY WITHOUT FILTERING  |         |      1 |        |   1652 | 00:00:00.06 |      23 |    12 |       |       |          |
|  2 |   VIEW                         |         |      1 |   1035 |   1194 | 00:00:00.07 |      23 |    12 |       |       |          |
|  3 |    WINDOW NOSORT               |         |      1 |   1035 |   1194 | 00:00:00.06 |      23 |    12 | 73728 | 73728 |          |
|  4 |     VIEW                       |         |      1 |   1035 |   1194 | 00:00:00.06 |      23 |    12 |       |       |          |
|  5 |      WINDOW SORT               |         |      1 |   1035 |   1194 | 00:00:00.05 |      23 |    12 | 43008 | 43008 |38912 (0)|
|  6 |       VIEW                     |         |      1 |   1035 |   1194 | 00:00:00.05 |      23 |    12 |       |       |          |
|  7 |        SORT ORDER BY           |         |      1 |   1035 |   1194 | 00:00:00.05 |      23 |    12 | 57344 | 57344 |51200 (0)|
|  8 |         VIEW                   |         |      1 |   1035 |   1194 | 00:00:00.07 |      23 |    12 |       |       |          |
|  9 |          UNION-ALL             |         |      1 |        |   1194 | 00:00:00.06 |      23 |    12 |       |       |          |
|*10 |           HASH JOIN RIGHT OUTER|         |      1 |   1000 |   1000 | 00:00:00.05 |      14 |    12 |  898K |  898K |1132K (0)|
| 11 |            TABLE ACCESS FULL   | T2      |      1 |    300 |    300 | 00:00:00.04 |       7 |     6 |       |       |          |
| 12 |            TABLE ACCESS FULL   | T1      |      1 |   1000 |   1000 | 00:00:00.01 |       7 |     6 |       |       |          |
| 13 |           MERGE JOIN ANTI      |         |      1 |     35 |    194 | 00:00:00.01 |       9 |     0 |       |       |          |
| 14 |            TABLE ACCESS BY INDEX ROWID| T2 |    1 |    300 |    300 | 00:00:00.01 |       2 |     0 |       |       |          |
| 15 |             INDEX FULL SCAN    | T2_IND1 |      1 |    300 |    300 | 00:00:00.01 |       1 |     0 |       |       |          |
|*16 |            SORT UNIQUE         |         |    300 |   1000 |    106 | 00:00:00.01 |       7 |     0 | 36864 | 36864 |32768 (0)|
| 17 |             INDEX FAST FULL SCAN| T1_IND1 |     1 |   1000 |   1000 | 00:00:00.01 |       7 |     0 |       |       |          |
---------------------------------------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
   1 - access("TS"+1=PRIOR NULL)
  10 - access("T1"."V1"="T2"."V1")
  16 - access("T1"."V1"="T2"."V1")
       filter("T1"."V1"="T2"."V1")
If nothing else, this shows how efficient analytic functions can be when compared to other methods.
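The core idea behind the analytic approach can be restated outside of SQL: bucket the TABLEA rows by how many TABLEB values precede them, since two rows in the same bucket by definition have no TABLEB value between them, and then pair up rows only within each bucket. A hypothetical Python sketch of that bucketing idea (using the sample data from earlier in this post):

```python
import bisect

# Sample data from earlier in the post (times on 05-APR-2007).
t1 = ["00:07", "00:10", "00:12", "00:15", "00:17", "00:20"]
t2 = sorted(["00:09", "00:18"])

# Bucket each T1 row by the count of T2 values that precede it; rows
# that share a bucket have no T2 value between them.
buckets = {}
for v in sorted(t1):
    buckets.setdefault(bisect.bisect_left(t2, v), []).append(v)

# Generate all ordered pairs within each bucket.
pairs = [(grp[i], grp[j])
         for grp in buckets.values()
         for i in range(len(grp))
         for j in range(i + 1, len(grp))]
```

This does one pass to classify the rows and then pairs only within small groups, rather than comparing every row against every other row and probing TABLEB for each candidate pair - the same reason the analytic plan starts each operation so few times.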
Regarding Mladen Gogala's suggestion to use LAG(V1,1), LAG(V1,2), LAG(V1,3), ... LAG(V1, n) - I looked at using that method initially. The problems that I encountered: #1, what should the value of n be so that I do not miss any potential matches; #2, how do I create a new result row for each of the matches, since all of the matches will be returned on the same result row. In other words, you will only be able to return one set of matches per result row when using LAG, unless you can find some way to uncoil all of the resulting matches into new result rows. That may be the reason that Jonathan Lewis stated "the option using the analytic lag() would only give you the adjacent pairs."
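The LAG limitation described above is easy to see in miniature. The sketch below (hypothetical, not from the thread) mimics the analytic LAG(value, n) - each row sees only the value exactly n rows back, so a single LAG(V1, 1) column can ever produce only adjacent pairs, one per result row:

```python
values = ["00:07", "00:10", "00:12", "00:15", "00:17", "00:20"]

def lag(seq, n):
    """Mimic the analytic LAG(value, n): the value n rows back, else None."""
    return [seq[i - n] if i >= n else None for i in range(len(seq))]

# LAG(v, 1) yields only adjacent pairs - one candidate pair per row.
adjacent_pairs = [(prev, cur)
                  for prev, cur in zip(lag(values, 1), values)
                  if prev is not None]
# Reaching pairs further apart requires adding LAG(v, 2), LAG(v, 3), ...
# up to some n that cannot be known in advance, with every offset landing
# as an extra column on the SAME result row - which must then somehow be
# uncoiled into separate result rows.
```

Only five adjacent pairs come out of the single-offset pass, exactly the adjacency restriction that Jonathan Lewis pointed out.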
I would also suggest not retrieving all rows from the two tables (TableA is ~100 million rows, TableB is ~30 million rows) and processing the data client-side, since you would then also need to account for the latency caused by network traffic as the 130 million rows are retrieved from the database and sent to the client for processing.
Charles Hooper
PC Support Specialist
K&M Machine-Fabricating, Inc.
Received on Sun Apr 08 2007 - 10:03:51 CDT