Oracle FAQ Your Portal to the Oracle Knowledge Grid
HOME | ASK QUESTION | ADD INFO | SEARCH | E-MAIL US
 

Home -> Community -> Usenet -> c.d.o.server -> Re: SQL: Working with huge tables of chronological data

Re: SQL: Working with huge tables of chronological data

From: Maxim Demenko <mdemenko_at_gmail.com>
Date: Mon, 09 Apr 2007 00:25:07 +0200
Message-ID: <46196BC3.5000108@gmail.com>


charely wrote:

> "Charles Hooper" <hooperc2000_at_yahoo.com> wrote in message 
> news:1176044631.189804.86380_at_y66g2000hsf.googlegroups.com...

>> On Apr 8, 4:08 am, "charely" <nos..._at_skynet.be> wrote:
>>> "John" <acide.ascorbi..._at_gmail.com> wrote in message
>>> news:1175788930.059725.19220_at_e65g2000hsc.googlegroups.com...
>>>> Here are the technical details and the query I've been using so far.
>>>> TableA is ~100 million rows and contains (timestamp, evtA).
>>>> TableB is ~30 million rows and contains (timestamp, evtB).
>>>> The following query took ~60h (on a private but quite slow server) to
>>>> compute. ~1h is what I'm aiming for.
>>>>
>>>> select TA1_B.evtA, TA2.evtA
>>>> from
>>>>   (select TA1.evtA, TA1.timestamp timeA1, TB.evtB,
>>>>           min(TB.timestamp) min_timeB
>>>>    from tableA TA1 left outer join tableB TB
>>>>      on (TA1.timestamp < TB.timestamp)
>>>>    group by TA1.evtA, TA1.timestamp, TB.evtB
>>>>   ) TA1_B,
>>>>   tableA TA2
>>>> where
>>>>   TA1_B.timeA1 < TA2.timestamp
>>>>   and (TA2.timestamp < TA1_B.min_timeB or TA1_B.min_timeB is null)
>>>>   and TA1_B.evtA <> TA2.evtA;
>>>>
>>>> Thanks!
>>>> John
>>> What about
>>>
>>>   select ta1.timestamp, ta2.timestamp
>>>   from tablea ta1, tablea ta2
>>>   where ta2.timestamp > ta1.timestamp
>>>     and ta2.timestamp <=
>>>         nvl((select min(timestamp) from tableb b
>>>              where b.timestamp > ta1.timestamp),
>>>             (select max(timestamp) from tablea))
>>>
>>> The nvl function is only needed to also catch events where no later
>>> event in b exists.
>>>
>>> I have not tested this for performance, but assuming you have indexes
>>> on the timestamp columns (or using IOTs for the tables), the optimizer
>>> will probably use range scans on those indexes.

>> I don't think that the SQL statement you provided will return the data
>> that the OP was wanting. Assume the following:
>> TABLEA (T1)
>> 05-APR-2007 00:07:00
>> 05-APR-2007 00:10:00
>> 05-APR-2007 00:12:00
>> 05-APR-2007 00:15:00
>> 05-APR-2007 00:17:00
>> 05-APR-2007 00:20:00
>>

>> TABLEB (T2)
>> 05-APR-2007 00:09:00
>> 05-APR-2007 00:18:00
>>

>> The OP wanted to retrieve pairs of data from two rows of TABLEA (T1)
>> where a value from TABLEB (T2) does not fall between the pairs from
>> TABLEA (T1). In this case, we need to report all pairs of values
>> between the time values 05-APR-2007 00:10:00, 05-APR-2007 00:12:00, 05-
>> APR-2007 00:15:00, and 05-APR-2007 00:17:00 since those fall between
>> the two values from TABLEB (T2). That should yield the following
>> list:
>> 00:10-00:12
>> 00:12-00:15
>> 00:10-00:15
>> 00:15-00:17
>> 00:12-00:17
>> 00:10-00:17
>>
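[Editor's note: the pair-generation rule described above can be sketched in a few lines of Python. This is only an illustration of the logic using the sample timestamps from the post, not Oracle code.]

```python
# Illustration of the rule described above: emit every ordered pair of
# TABLEA timestamps (t1 < t2) such that no TABLEB timestamp falls
# strictly between them.  Sample data from the post (05-APR-2007);
# HH:MM strings compare correctly as plain strings.
table_a = ["00:07", "00:10", "00:12", "00:15", "00:17", "00:20"]
table_b = ["00:09", "00:18"]

def valid_pairs(a_rows, b_rows):
    pairs = []
    for i, t1 in enumerate(a_rows):
        for t2 in a_rows[i + 1:]:
            if not any(t1 < b < t2 for b in b_rows):
                pairs.append((t1, t2))
    return pairs

print(valid_pairs(table_a, table_b))  # the six pairs listed above
```

Running this against the sample data reproduces exactly the six pairs listed above, and excludes pairs such as 00:07-00:10 (split by the 00:09 event in TABLEB).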

>> Let's reformat the query that you provided so that it can use the
>> sample tables and indexes that I provided previously, in order to see
>> if it is an efficient starting point for the OP:
>> SELECT /*+ GATHER_PLAN_STATISTICS */
>> TA1.V1,
>> TA2.V1
>> FROM
>> T1 TA1,
>> T2 TA2
>> WHERE
>> TA2.V1 > TA1.V1
>> AND TA2.V1 <= NVL(
>> (SELECT
>> MIN(V1)
>> FROM
>> T2 B
>> WHERE
>> B.V1 > TA1.V1),
>> (SELECT
>> MAX(V1)
>> FROM
>> T1));
>>

>> SELECT
>> *
>> FROM
>> TABLE (DBMS_XPLAN.DISPLAY_CURSOR(NULL,NULL,'ALLSTATS LAST'));
>>

>> -------------------------------------------------------------------------------------------------------------------------------
>> | Id | Operation                     | Name    | Starts | E-Rows | A-Rows | A-Time      | Buffers |  OMem |  1Mem | Used-Mem |
>> -------------------------------------------------------------------------------------------------------------------------------
>> |* 1 | FILTER                        |         |      1 |        |   1000 | 00:00:00.47 |      79 |       |       |          |
>> |  2 |  MERGE JOIN                   |         |      1 |  15000 |   167K | 00:00:01.17 |       8 |       |       |          |
>> |  3 |   SORT JOIN                   |         |      1 |    300 |    300 | 00:00:00.01 |       1 | 11264 | 11264 | 10240 (0)|
>> |  4 |    INDEX FULL SCAN            | T2_IND1 |      1 |    300 |    300 | 00:00:00.01 |       1 |       |       |          |
>> |* 5 |   SORT JOIN                   |         |    300 |   1000 |   167K | 00:00:00.50 |       7 | 36864 | 36864 | 32768 (0)|
>> |  6 |    INDEX FAST FULL SCAN       | T1_IND1 |      1 |   1000 |   1000 | 00:00:00.01 |       7 |       |       |          |
>> |  7 |  SORT AGGREGATE               |         |  71341 |      1 |  71341 | 00:00:02.09 |      69 |       |       |          |
>> |  8 |   FIRST ROW                   |         |  71341 |     15 |  71341 | 00:00:01.21 |      69 |       |       |          |
>> |* 9 |    INDEX RANGE SCAN (MIN/MAX) | T2_IND1 |  71341 |     15 |  71341 | 00:00:00.47 |      69 |       |       |          |
>> | 10 |  SORT AGGREGATE               |         |      1 |      1 |      1 | 00:00:00.01 |       2 |       |       |          |
>> | 11 |   INDEX FULL SCAN (MIN/MAX)   | T1_IND1 |      1 |   1000 |      1 | 00:00:00.01 |       2 |       |       |          |
>> -------------------------------------------------------------------------------------------------------------------------------
>>

>> Predicate Information (identified by operation id):
>> ---------------------------------------------------
>> 1 - filter("TA2"."V1"<=NVL(,))
>> 5 -
>> access(INTERNAL_FUNCTION("TA2"."V1")>INTERNAL_FUNCTION("TA1"."V1"))
>>

>> filter(INTERNAL_FUNCTION("TA2"."V1")>INTERNAL_FUNCTION("TA1"."V1"))
>> 9 - access("B"."V1">:B1)
>>

>> Notice the Starts column in the above - that is the number of times
>> that portion of the plan was executed. Also note the significant
>> difference between estimated and actual rows, as well as what is
>> reported in the actual time column. Where did that bind variable in
>> the predicate information for ID 9 come from (Oracle optimization)?
>>

>> Here is the plan for the final SQL statement that I provided, just for
>> comparison:
>> -----------------------------------------------------------------------------------------------------------------------------------------------
>> | Id | Operation                            | Name    | Starts | E-Rows | A-Rows | A-Time      | Buffers | Reads |  OMem |  1Mem | Used-Mem |
>> -----------------------------------------------------------------------------------------------------------------------------------------------
>> |* 1 | CONNECT BY WITHOUT FILTERING         |         |      1 |        |   1652 | 00:00:00.06 |      23 |    12 |       |       |          |
>> |  2 |  VIEW                                |         |      1 |   1035 |   1194 | 00:00:00.07 |      23 |    12 |       |       |          |
>> |  3 |   WINDOW NOSORT                      |         |      1 |   1035 |   1194 | 00:00:00.06 |      23 |    12 | 73728 | 73728 |          |
>> |  4 |    VIEW                              |         |      1 |   1035 |   1194 | 00:00:00.06 |      23 |    12 |       |       |          |
>> |  5 |     WINDOW SORT                      |         |      1 |   1035 |   1194 | 00:00:00.05 |      23 |    12 | 43008 | 43008 | 38912 (0)|
>> |  6 |      VIEW                            |         |      1 |   1035 |   1194 | 00:00:00.05 |      23 |    12 |       |       |          |
>> |  7 |       SORT ORDER BY                  |         |      1 |   1035 |   1194 | 00:00:00.05 |      23 |    12 | 57344 | 57344 | 51200 (0)|
>> |  8 |        VIEW                          |         |      1 |   1035 |   1194 | 00:00:00.07 |      23 |    12 |       |       |          |
>> |  9 |         UNION-ALL                    |         |      1 |        |   1194 | 00:00:00.06 |      23 |    12 |       |       |          |
>> |* 10|          HASH JOIN RIGHT OUTER       |         |      1 |   1000 |   1000 | 00:00:00.05 |      14 |    12 |  898K |  898K | 1132K (0)|
>> | 11 |           TABLE ACCESS FULL          | T2      |      1 |    300 |    300 | 00:00:00.04 |       7 |     6 |       |       |          |
>> | 12 |           TABLE ACCESS FULL          | T1      |      1 |   1000 |   1000 | 00:00:00.01 |       7 |     6 |       |       |          |
>> | 13 |          MERGE JOIN ANTI             |         |      1 |     35 |    194 | 00:00:00.01 |       9 |     0 |       |       |          |
>> | 14 |           TABLE ACCESS BY INDEX ROWID| T2      |      1 |    300 |    300 | 00:00:00.01 |       2 |     0 |       |       |          |
>> | 15 |            INDEX FULL SCAN           | T2_IND1 |      1 |    300 |    300 | 00:00:00.01 |       1 |     0 |       |       |          |
>> |* 16|           SORT UNIQUE                |         |    300 |   1000 |    106 | 00:00:00.01 |       7 |     0 | 36864 | 36864 | 32768 (0)|
>> | 17 |            INDEX FAST FULL SCAN      | T1_IND1 |      1 |   1000 |   1000 | 00:00:00.01 |       7 |     0 |       |       |          |
>> -----------------------------------------------------------------------------------------------------------------------------------------------

>> Predicate Information (identified by operation id):
>> ---------------------------------------------------
>> 1 - access("TS"+1=PRIOR NULL)
>> 10 - access("T1"."V1"="T2"."V1")
>> 16 - access("T1"."V1"="T2"."V1")
>> filter("T1"."V1"="T2"."V1")
>>

>> If nothing else, this shows how efficient analytic functions can be
>> when compared to other methods.
>>

>> Regarding Mladen Gogala's suggestion to use LAG(V1,1), LAG(V1,2),
>> LAG(V1,3), ... LAG(V1, n) - I looked at using that method initially.
>> The problems that I encountered: #1, what should the value of n be so
>> that I do not miss any potential matches; #2, how do I create a new
>> result row for each of the matches, since all of the matches will be
>> returned on the same result row. In other words, you will only be
>> able to return one set of matches per result row when using LAG,
>> unless you can find some way to uncoil all of the resulting matches
>> into new result rows. That may be the reason that Jonathan Lewis
>> stated "the option using the analytic lag() would only give you the
>> adjacent pairs."
>>
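[Editor's note: the LAG limitation described above can be seen with a small Python sketch; this is an illustration only, not Oracle code.]

```python
# Why a single LAG(timestamp, 1) is not enough: it pairs each row only
# with its immediate predecessor, so only adjacent pairs are produced.
# Producing all pairs requires either LAG(ts, n) for every possible n,
# or some way to "uncoil" the matches into extra result rows.
a_rows = ["00:10", "00:12", "00:15", "00:17"]  # A events with no B between

adjacent = list(zip(a_rows, a_rows[1:]))  # what a single LAG(ts, 1) yields
all_pairs = [(t1, t2) for i, t1 in enumerate(a_rows)
             for t2 in a_rows[i + 1:]]     # what the OP actually wants

print(len(adjacent), len(all_pairs))  # 3 adjacent pairs vs 6 total pairs
```

With a run of n events between two B boundaries, LAG(1) yields n-1 pairs while the OP needs n*(n-1)/2.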

>> I would also suggest not retrieving all rows from the two tables
>> (TableA is ~100 million rows, TableB is ~30 million rows) and
>> processing the data client side, since you would then also need to
>> consider the latency caused by network traffic when the 130 million
>> rows are retrieved from the database and sent to the client for
>> processing.
>>

>> Charles Hooper
>> PC Support Specialist
>> K&M Machine-Fabricating, Inc.
>>
> 
>  As far as I can tell, the provided query does match the OP's request,
>  but I do agree that the CBO has generated a poor execution plan.
>  I got a better plan with
> 
>  select /*+ USE_NL(ta2 ta1) */ ta1.v1, ta2.v1
>  from t1 ta1, t1 ta2
>  where ta2.v1 > ta1.v1
>    and ta2.v1 <= nvl((select min(v1) from t2 b where b.v1 > ta1.v1),
>                      (select max(v1) from t1))
> 
> 
> -----------------------------------------------------------------------------------------
> | Id  | Operation                       | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
> -----------------------------------------------------------------------------------------
> |   0 | SELECT STATEMENT                |       |    25 |   650 |  7514K  (1)| 25:02:57 |
> |   1 |  NESTED LOOPS                   |       |   249K|  6346K| 20258   (2)| 00:04:04 |
> |   2 |   TABLE ACCESS FULL             | T1    |  9999 |   126K|     9   (0)| 00:00:01 |
> |*  3 |   INDEX RANGE SCAN              | IX_T1 |    25 |   325 |     2   (0)| 00:00:01 |
> |   4 |    SORT AGGREGATE               |       |     1 |    13 |            |          |
> |   5 |     FIRST ROW                   |       |   500 |  6500 |     2   (0)| 00:00:01 |
> |*  6 |      INDEX RANGE SCAN (MIN/MAX) | IX_T2 |   500 |  6500 |     2   (0)| 00:00:01 |
> |   7 |       SORT AGGREGATE            |       |     1 |    13 |            |          |
> |   8 |        INDEX FULL SCAN (MIN/MAX)| IX_T1 |  9999 |   126K|            |          |
> -----------------------------------------------------------------------------------------
> 
>  The plan still does two index range scans for every row in T1,
>  but avoids sorting the union of T1 and T2.
> 

An interesting approach, I think. This can be rewritten (to avoid the mentioned min/max index range scans) as

with u as
  (select 'A' event, timestamp from tablea
   union all
   select 'B' event, timestamp from tableb),
u1 as
  (select u.*,
          last_value(decode(event,'B',timestamp) ignore nulls)
            over (order by timestamp desc) b_timestamp
   from u)
select *
from u1, u1 u2
where u1.timestamp < u2.b_timestamp
  and u2.timestamp <= nvl(u1.timestamp, (select max(timestamp) from tablea))
  and u1.event = 'A' and u2.event = 'A'

which in my tests on relatively small datasets yields a better plan (and runtime). However, it seems (again on small datasets) that the earlier suggested query with connect_by_root still performs better...
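[Editor's note: the idea behind the union-based rewrite - tag every A row with the next B boundary, then pair only rows that share that boundary - can be sketched procedurally. This Python sketch illustrates the approach on the sample data from earlier in the thread; it is not a translation of the SQL above.]

```python
import bisect
from collections import defaultdict

def pairs_by_segment(a_rows, b_rows):
    # Tag each A timestamp with the first B timestamp strictly after it;
    # two A rows can be paired iff they carry the same boundary, i.e. no
    # B event lies between them.  (Assumes A and B timestamps are distinct.)
    b_sorted = sorted(b_rows)
    segments = defaultdict(list)
    for t in sorted(a_rows):
        i = bisect.bisect_right(b_sorted, t)             # first B > t
        boundary = b_sorted[i] if i < len(b_sorted) else None
        segments[boundary].append(t)
    out = []
    for seg in segments.values():                        # all pairs per segment
        for i, t1 in enumerate(seg):
            out.extend((t1, t2) for t2 in seg[i + 1:])
    return out

print(pairs_by_segment(
    ["00:07", "00:10", "00:12", "00:15", "00:17", "00:20"],
    ["00:09", "00:18"]))
```

Each A row is compared against only the rows in its own segment, which is what lets the set-based rewrite avoid a min/max index probe per row.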

Best regards

Maxim

Received on Sun Apr 08 2007 - 17:25:07 CDT

