Oracle FAQ Your Portal to the Oracle Knowledge Grid
HOME | ASK QUESTION | ADD INFO | SEARCH | E-MAIL US
 


Re: SQL: Working with huge tables of chronological data

From: charely <nospam_at_skynet.be>
Date: Sun, 8 Apr 2007 22:45:42 +0200
Message-ID: <46195468$0$13860$ba620e4c@news.skynet.be>

"Charles Hooper" <hooperc2000_at_yahoo.com> wrote in message news:1176044631.189804.86380_at_y66g2000hsf.googlegroups.com...
> On Apr 8, 4:08 am, "charely" <nos..._at_skynet.be> wrote:

>> "John" <acide.ascorbi..._at_gmail.com> wrote in message
>> news:1175788930.059725.19220_at_e65g2000hsc.googlegroups.com...
>> > Here are the technical details and the query I've been using so far.
>>
>> > TableA is ~100 million rows and contains (timestamp, evtA)
>> > TableB is ~30 million rows and contains (timestamp, evtB)
>>
>> > The following query took ~60h to compute (on a private but quite
>> > slow server). ~1h is what I'm aiming for.
>>
>> > select TA1_B.evtA, TA2.evtA
>> > from
>> >   (select TA1.evtA, TA1.timestamp timeA1, TB.evtB,
>> >           min(TB.timestamp) min_timeB
>> >    from tableA TA1 left outer join tableB TB
>> >         on (TA1.timestamp < TB.timestamp)
>> >    group by TA1.evtA, TA1.timestamp, TB.evtB) TA1_B,
>> >   tableA TA2
>> > where TA1_B.timeA1 < TA2.timestamp
>> >   and (TA2.timestamp < TA1_B.min_timeB or TA1_B.min_timeB is null)
>> >   and TA1_B.evtA <> TA2.evtA;
>>
>> > Thanks!
>>
>> > John
>>
>> What about
>>
>>   select ta1.timestamp, ta2.timestamp
>>   from tablea ta1, tablea ta2
>>   where ta2.timestamp > ta1.timestamp
>>     and ta2.timestamp <= nvl((select min(timestamp) from tableb b
>>                               where b.timestamp > ta1.timestamp),
>>                              (select max(timestamp) from tablea))
>>
>> The nvl function is only needed to also catch events where no later
>> event in b exists.
>>
>> I have not tested this for performance, but assuming you have indexes
>> on the timestamp columns (or using IOTs for the tables), the optimizer
>> will probably use range scans on those indexes.
>
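The nvl fallback described above can be illustrated with a short Python sketch (hypothetical toy data, not from the thread): without the fallback, a tablea event with no later tableb event would never be paired with anything.

```python
# Python sketch of the suggested query's logic, on assumed toy data.
# For each ta1 timestamp, pair it with every later ta2 timestamp up to
# the first tableb event after ta1; when no later tableb event exists,
# the nvl fallback substitutes the largest timestamp in tablea so that
# trailing events are still paired.
table_a = [1, 4, 6, 9]
table_b = [5]

max_a = max(table_a)
pairs = []
for ta1 in table_a:
    later_b = [b for b in table_b if b > ta1]
    cutoff = min(later_b) if later_b else max_a   # the nvl fallback
    pairs.extend((ta1, ta2) for ta2 in table_a if ta1 < ta2 <= cutoff)

print(pairs)
# → [(1, 4), (6, 9)]
```

The pair (6, 9) is produced only because of the fallback: there is no tableb event after 6, so the cutoff falls back to max(table_a).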

> I don't think that the SQL statement you provided will return the data
> that the OP wanted. Assume the following:
> TABLEA (T1)
> 05-APR-2007 00:07:00
> 05-APR-2007 00:10:00
> 05-APR-2007 00:12:00
> 05-APR-2007 00:15:00
> 05-APR-2007 00:17:00
> 05-APR-2007 00:20:00
>

> TABLEB (T2)
> 05-APR-2007 00:09:00
> 05-APR-2007 00:18:00
>

> The OP wanted to retrieve pairs of data from two rows of TABLEA (T1)
> where no value from TABLEB (T2) falls between the pair of values from
> TABLEA (T1). In this case, we need to report all pairs of values among
> the time values 05-APR-2007 00:10:00, 00:12:00, 00:15:00, and
> 00:17:00, since those fall between the two values from TABLEB (T2).
> That should yield the following list:
> 00:10-00:12
> 00:12-00:15
> 00:10-00:15
> 00:15-00:17
> 00:12-00:17
> 00:10-00:17
>
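The expected output above can be checked with a small Python sketch (a toy re-creation using minutes past midnight; not part of the original post). A pair (a1, a2) from TABLEA qualifies when a1 < a2 and a2 occurs before the first TABLEB event that follows a1.

```python
# Toy re-creation of the worked example above.
table_a = [7, 10, 12, 15, 17, 20]   # minutes past 00:00 on 05-APR-2007
table_b = [9, 18]

def qualifying_pairs(a_times, b_times):
    """Pairs (a1, a2) from TABLEA with no TABLEB event between them."""
    pairs = []
    for a1 in a_times:
        # earliest TABLEB event after a1 (None when no later event exists)
        later_b = [b for b in b_times if b > a1]
        cutoff = min(later_b) if later_b else None
        for a2 in a_times:
            if a1 < a2 and (cutoff is None or a2 < cutoff):
                pairs.append((a1, a2))
    return pairs

print(qualifying_pairs(table_a, table_b))
# → [(10, 12), (10, 15), (10, 17), (12, 15), (12, 17), (15, 17)]
```

Running it reproduces exactly the six pairs listed above: 00:07 is cut off by the 00:09 event, and 00:20 has no later partner.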

> Let's reformat the query that you provided so that it can use the
> sample tables and indexes that I provided previously, in order to see
> if it is an efficient starting point for the OP:
> SELECT /*+ GATHER_PLAN_STATISTICS */
> TA1.V1,
> TA2.V1
> FROM
> T1 TA1,
> T2 TA2
> WHERE
> TA2.V1 > TA1.V1
> AND TA2.V1 <= NVL(
> (SELECT
> MIN(V1)
> FROM
> T2 B
> WHERE
> B.V1 > TA1.V1),
> (SELECT
> MAX(V1)
> FROM
> T1));
>

> SELECT
> *
> FROM
> TABLE (DBMS_XPLAN.DISPLAY_CURSOR(NULL,NULL,'ALLSTATS LAST'));
>

> -------------------------------------------------------------------------------------------------------------------------------
> | Id | Operation                      | Name    | Starts | E-Rows | A-Rows | A-Time      | Buffers |  OMem |  1Mem |  Used-Mem |
> -------------------------------------------------------------------------------------------------------------------------------
> |* 1 |  FILTER                        |         |      1 |        |   1000 | 00:00:00.47 |      79 |       |       |           |
> |  2 |   MERGE JOIN                   |         |      1 |  15000 |   167K | 00:00:01.17 |       8 |       |       |           |
> |  3 |    SORT JOIN                   |         |      1 |    300 |    300 | 00:00:00.01 |       1 | 11264 | 11264 | 10240 (0) |
> |  4 |     INDEX FULL SCAN            | T2_IND1 |      1 |    300 |    300 | 00:00:00.01 |       1 |       |       |           |
> |* 5 |    SORT JOIN                   |         |    300 |   1000 |   167K | 00:00:00.50 |       7 | 36864 | 36864 | 32768 (0) |
> |  6 |     INDEX FAST FULL SCAN       | T1_IND1 |      1 |   1000 |   1000 | 00:00:00.01 |       7 |       |       |           |
> |  7 |   SORT AGGREGATE               |         |  71341 |      1 |  71341 | 00:00:02.09 |      69 |       |       |           |
> |  8 |    FIRST ROW                   |         |  71341 |     15 |  71341 | 00:00:01.21 |      69 |       |       |           |
> |* 9 |     INDEX RANGE SCAN (MIN/MAX) | T2_IND1 |  71341 |     15 |  71341 | 00:00:00.47 |      69 |       |       |           |
> | 10 |   SORT AGGREGATE               |         |      1 |      1 |      1 | 00:00:00.01 |       2 |       |       |           |
> | 11 |    INDEX FULL SCAN (MIN/MAX)   | T1_IND1 |      1 |   1000 |      1 | 00:00:00.01 |       2 |       |       |           |
> -------------------------------------------------------------------------------------------------------------------------------
>

> Predicate Information (identified by operation id):
> ---------------------------------------------------
> 1 - filter("TA2"."V1"<=NVL(,))
> 5 - access(INTERNAL_FUNCTION("TA2"."V1")>INTERNAL_FUNCTION("TA1"."V1"))
>     filter(INTERNAL_FUNCTION("TA2"."V1")>INTERNAL_FUNCTION("TA1"."V1"))
> 9 - access("B"."V1">:B1)
>

> Notice the Starts column in the above - that is the number of times
> that portion of the plan was executed. Also note the significant
> difference between estimated and actual rows, as well as what is
> reported in the actual time column. Where did that bind variable in
> the predicate information for ID 9 come from (Oracle optimization)?
>

> Here is the plan for the final SQL statement that I provided, just for
> comparison:
> -----------------------------------------------------------------------------------------------------------------------------------------------
> | Id | Operation                               | Name    | Starts | E-Rows | A-Rows | A-Time      | Buffers | Reads |  OMem |  1Mem |  Used-Mem |
> -----------------------------------------------------------------------------------------------------------------------------------------------
> |* 1 |  CONNECT BY WITHOUT FILTERING           |         |      1 |        |   1652 | 00:00:00.06 |      23 |    12 |       |       |           |
> |  2 |   VIEW                                  |         |      1 |   1035 |   1194 | 00:00:00.07 |      23 |    12 |       |       |           |
> |  3 |    WINDOW NOSORT                        |         |      1 |   1035 |   1194 | 00:00:00.06 |      23 |    12 | 73728 | 73728 |           |
> |  4 |     VIEW                                |         |      1 |   1035 |   1194 | 00:00:00.06 |      23 |    12 |       |       |           |
> |  5 |      WINDOW SORT                        |         |      1 |   1035 |   1194 | 00:00:00.05 |      23 |    12 | 43008 | 43008 | 38912 (0) |
> |  6 |       VIEW                              |         |      1 |   1035 |   1194 | 00:00:00.05 |      23 |    12 |       |       |           |
> |  7 |        SORT ORDER BY                    |         |      1 |   1035 |   1194 | 00:00:00.05 |      23 |    12 | 57344 | 57344 | 51200 (0) |
> |  8 |         VIEW                            |         |      1 |   1035 |   1194 | 00:00:00.07 |      23 |    12 |       |       |           |
> |  9 |          UNION-ALL                      |         |      1 |        |   1194 | 00:00:00.06 |      23 |    12 |       |       |           |
> |*10 |           HASH JOIN RIGHT OUTER         |         |      1 |   1000 |   1000 | 00:00:00.05 |      14 |    12 |  898K |  898K | 1132K (0) |
> | 11 |            TABLE ACCESS FULL            | T2      |      1 |    300 |    300 | 00:00:00.04 |       7 |     6 |       |       |           |
> | 12 |            TABLE ACCESS FULL            | T1      |      1 |   1000 |   1000 | 00:00:00.01 |       7 |     6 |       |       |           |
> | 13 |           MERGE JOIN ANTI               |         |      1 |     35 |    194 | 00:00:00.01 |       9 |     0 |       |       |           |
> | 14 |            TABLE ACCESS BY INDEX ROWID  | T2      |      1 |    300 |    300 | 00:00:00.01 |       2 |     0 |       |       |           |
> | 15 |             INDEX FULL SCAN             | T2_IND1 |      1 |    300 |    300 | 00:00:00.01 |       1 |     0 |       |       |           |
> |*16 |            SORT UNIQUE                  |         |    300 |   1000 |    106 | 00:00:00.01 |       7 |     0 | 36864 | 36864 | 32768 (0) |
> | 17 |             INDEX FAST FULL SCAN        | T1_IND1 |      1 |   1000 |   1000 | 00:00:00.01 |       7 |     0 |       |       |           |
> -----------------------------------------------------------------------------------------------------------------------------------------------
>

> Predicate Information (identified by operation id):
> ---------------------------------------------------
> 1 - access("TS"+1=PRIOR NULL)
> 10 - access("T1"."V1"="T2"."V1")
> 16 - access("T1"."V1"="T2"."V1")
> filter("T1"."V1"="T2"."V1")
>

> If nothing else, this shows how efficient analytic functions can be
> when compared to other methods.
>

> Regarding Mladen Gogala's suggestion to use LAG(V1,1), LAG(V1,2),
> LAG(V1,3), ... LAG(V1, n) - I looked at using that method initially.
> The problems that I encountered: #1 what should the value of n be so
> that I do not miss any potential matches, #2 how do I create a new
> result row for each of the matches, since all of the matches will be
> returned on the same result row. In other words, you will only be
> able to return one set of matches per result row when using LAG,
> unless you can find some way to uncoil all of the resulting matches
> into new result rows. That may be the reason that Jonathan Lewis
> stated "the option using the analytic lag() would only give you the
> adjacent pairs."
>
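The two problems with fixed-offset LAG can be shown with a toy Python sketch (hypothetical data, not from the thread):

```python
# Toy illustration of why fixed-offset LAG cannot enumerate all pairs.
times = [10, 12, 15, 17]   # TABLEA values between two TABLEB events

# LAG(v1, 1) pairs each row with exactly one earlier row, so it yields
# adjacent pairs only:
lag1_pairs = [(times[i - 1], times[i]) for i in range(1, len(times))]
print(lag1_pairs)
# → [(10, 12), (12, 15), (15, 17)]

# Reaching (10, 17) needs LAG(v1, 3), and every extra offset comes back
# as an extra COLUMN on the SAME result row, not as a new row - the
# "uncoiling" problem described above. The row for v1=17 would carry:
lag_n_row_for_17 = tuple(times[len(times) - 1 - n] for n in range(1, len(times)))
print(lag_n_row_for_17)
# → (15, 12, 10)
```

Picking n too small silently drops matches, and even with n large enough the multi-column result rows still have to be pivoted back into one row per pair.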

> I would also suggest not retrieving all rows from the two tables
> (TableA has ~100 million rows, TableB has ~30 million rows) and
> processing the data client side, since you would then also need to
> consider the latency caused by network traffic when the 130 million
> rows are retrieved from the database and sent to the client for
> processing.
>

> Charles Hooper
> PC Support Specialist
> K&M Machine-Fabricating, Inc.

>

As far as I can tell, the provided query does match the OP's request, but I do agree that the CBO has generated a poor execution plan. I got a better plan with:

 select /*+USE_NL(ta2 ta1)*/ ta1.v1, ta2.v1
 from t1 ta1, t1 ta2
 where ta2.v1 > ta1.v1
   and ta2.v1 <= nvl((select min(v1) from t2 b where b.v1 > ta1.v1),
                     (select max(v1) from t1))


-----------------------------------------------------------------------------------------
| Id | Operation                       | Name  | Rows | Bytes | Cost (%CPU) | Time     |
-----------------------------------------------------------------------------------------
|  0 | SELECT STATEMENT                |       |   25 |   650 |  7514K  (1) | 25:02:57 |
|  1 |  NESTED LOOPS                   |       | 249K | 6346K |  20258  (2) | 00:04:04 |
|  2 |   TABLE ACCESS FULL             | T1    | 9999 |  126K |      9  (0) | 00:00:01 |
|* 3 |   INDEX RANGE SCAN              | IX_T1 |   25 |   325 |      2  (0) | 00:00:01 |
|  4 |    SORT AGGREGATE               |       |    1 |    13 |             |          |
|  5 |     FIRST ROW                   |       |  500 |  6500 |      2  (0) | 00:00:01 |
|* 6 |      INDEX RANGE SCAN (MIN/MAX) | IX_T2 |  500 |  6500 |      2  (0) | 00:00:01 |
|  7 |       SORT AGGREGATE            |       |    1 |    13 |             |          |
|  8 |        INDEX FULL SCAN (MIN/MAX)| IX_T1 | 9999 |  126K |             |          |
-----------------------------------------------------------------------------------------

 The plan still does two index range scans for every row in T1, but
 avoids sorting the union of T1 and T2.

Received on Sun Apr 08 2007 - 15:45:42 CDT

