charely wrote:
> "Charles Hooper" <hooperc2000_at_yahoo.com> schreef in bericht
> news:1176044631.189804.86380_at_y66g2000hsf.googlegroups.com...
>> On Apr 8, 4:08 am, "charely" <nos..._at_skynet.be> wrote:
>>> "John" <acide.ascorbi..._at_gmail.com> schreef in
>>> berichtnews:1175788930.059725.19220_at_e65g2000hsc.googlegroups.com...
>>>> Here are the technical details and the query I've been using so far.
>>>> TableA is ~100 million rows and contains (timestamp, evtA).
>>>> TableB is ~30 million rows and contains (timestamp, evtB).
>>>> The following query took ~60h to compute (on a private but quite
>>>> slow server); ~1h is what I'm aiming for.
>>>> select TA1_B.evtA, TA2.evtA
>>>> from
>>>>   (select TA1.evtA, TA1.timestamp timeA1, TB.evtB,
>>>>           min(TB.timestamp) min_timeB
>>>>    from tableA TA1 left outer join tableB TB
>>>>      on (TA1.timestamp < TB.timestamp)
>>>>    group by TA1.evtA, TA1.timestamp, TB.evtB
>>>>   ) TA1_B,
>>>>   tableA TA2
>>>> where
>>>>   TA1_B.timeA1 < TA2.timestamp
>>>>   and (TA2.timestamp < TA1_B.min_timeB or TA1_B.min_timeB is null)
>>>>   and TA1_B.evtA <> TA2.evtA;
>>>> Thanks!
>>>> John
>>> What about
>>>
>>> select ta1.timestamp, ta2.timestamp
>>> from tablea ta1, tablea ta2
>>> where ta2.timestamp > ta1.timestamp
>>>   and ta2.timestamp <=
>>>       nvl((select min(timestamp) from tableb b
>>>            where b.timestamp > ta1.timestamp),
>>>           (select max(timestamp) from tablea))
>>>
>>> The nvl function is only needed to also catch events where no later
>>> event in b exists.
>>>
>>> I have not tested this for performance, but assuming you have
>>> indexes on the timestamp columns (or IOTs for the tables), the
>>> optimizer will probably use range scans on those indexes.
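>>>
>>> For example (a sketch only, with made-up index names - the exact DDL
>>> depends on the schema):
>>>
>>> -- plain b-tree indexes on the timestamp columns
>>> create index tablea_ts_ix on tablea (timestamp);
>>> create index tableb_ts_ix on tableb (timestamp);
>>>
>>> -- or an index-organized table, so the rows themselves are stored in
>>> -- timestamp order without a separate index segment:
>>> -- create table tablea
>>> --   (timestamp date, evtA varchar2(30),
>>> --    constraint tablea_pk primary key (timestamp, evtA))
>>> --   organization index;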
>> I don't think that the SQL statement you provided will return the
>> data that the OP was looking for. Assume the following:
>> TABLEA (T1)
>> 05-APR-2007 00:07:00
>> 05-APR-2007 00:10:00
>> 05-APR-2007 00:12:00
>> 05-APR-2007 00:15:00
>> 05-APR-2007 00:17:00
>> 05-APR-2007 00:20:00
>>
>> TABLEB (T2)
>> 05-APR-2007 00:09:00
>> 05-APR-2007 00:18:00
>>
>> The OP wanted to retrieve pairs of rows from TABLEA (T1) where no
>> value from TABLEB (T2) falls between the two values in the pair. In
>> this case, we need to report all pairs drawn from the time values
>> 05-APR-2007 00:10:00, 05-APR-2007 00:12:00, 05-APR-2007 00:15:00,
>> and 05-APR-2007 00:17:00, since those values fall between the two
>> values from TABLEB (T2). That should yield the following list:
>> 00:10-00:12
>> 00:12-00:15
>> 00:10-00:15
>> 00:15-00:17
>> 00:12-00:17
>> 00:10-00:17
>>
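>> (The sample tables and indexes used in the tests below were defined
>> in an earlier post in this thread. What follows is only a rough,
>> hypothetical reconstruction consistent with the plans - T1 with
>> 1,000 rows, T2 with 300 rows, and an index on each V1 column - the
>> original script may differ.)
>>
>> CREATE TABLE T1 (V1 DATE);
>> CREATE TABLE T2 (V1 DATE);
>>
>> -- 1,000 T1 rows, one minute apart
>> INSERT INTO T1
>>   SELECT TRUNC(SYSDATE)+(LEVEL-1)/1440 FROM DUAL
>>   CONNECT BY LEVEL<=1000;
>>
>> -- 300 T2 rows at a coarser spacing
>> INSERT INTO T2
>>   SELECT TRUNC(SYSDATE)+(LEVEL-1)/300 FROM DUAL
>>   CONNECT BY LEVEL<=300;
>>
>> COMMIT;
>>
>> CREATE INDEX T1_IND1 ON T1 (V1);
>> CREATE INDEX T2_IND1 ON T2 (V1);
>>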
>> Let's reformat the query that you provided so that it can use the
>> sample tables and indexes that I provided previously, in order to see
>> if it is an efficient starting point for the OP:
>> SELECT /*+ GATHER_PLAN_STATISTICS */
>> TA1.V1,
>> TA2.V1
>> FROM
>> T1 TA1,
>> T2 TA2
>> WHERE
>> TA2.V1 > TA1.V1
>> AND TA2.V1 <= NVL(
>> (SELECT
>> MIN(V1)
>> FROM
>> T2 B
>> WHERE
>> B.V1 > TA1.V1),
>> (SELECT
>> MAX(V1)
>> FROM
>> T1));
>>
>> SELECT
>> *
>> FROM
>> TABLE (DBMS_XPLAN.DISPLAY_CURSOR(NULL,NULL,'ALLSTATS LAST'));
>>
>> -------------------------------------------------------------------------------------------------------------------------------
>> | Id  | Operation                     | Name    | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
>> -------------------------------------------------------------------------------------------------------------------------------
>> |*  1 |  FILTER                       |         |      1 |        |   1000 |00:00:00.47 |      79 |       |       |          |
>> |   2 |   MERGE JOIN                  |         |      1 |  15000 |   167K |00:00:01.17 |       8 |       |       |          |
>> |   3 |    SORT JOIN                  |         |      1 |    300 |    300 |00:00:00.01 |       1 | 11264 | 11264 |10240  (0)|
>> |   4 |     INDEX FULL SCAN           | T2_IND1 |      1 |    300 |    300 |00:00:00.01 |       1 |       |       |          |
>> |*  5 |    SORT JOIN                  |         |    300 |   1000 |   167K |00:00:00.50 |       7 | 36864 | 36864 |32768  (0)|
>> |   6 |     INDEX FAST FULL SCAN      | T1_IND1 |      1 |   1000 |   1000 |00:00:00.01 |       7 |       |       |          |
>> |   7 |  SORT AGGREGATE               |         |  71341 |      1 |  71341 |00:00:02.09 |      69 |       |       |          |
>> |   8 |   FIRST ROW                   |         |  71341 |     15 |  71341 |00:00:01.21 |      69 |       |       |          |
>> |*  9 |    INDEX RANGE SCAN (MIN/MAX) | T2_IND1 |  71341 |     15 |  71341 |00:00:00.47 |      69 |       |       |          |
>> |  10 |  SORT AGGREGATE               |         |      1 |      1 |      1 |00:00:00.01 |       2 |       |       |          |
>> |  11 |   INDEX FULL SCAN (MIN/MAX)   | T1_IND1 |      1 |   1000 |      1 |00:00:00.01 |       2 |       |       |          |
>> -------------------------------------------------------------------------------------------------------------------------------
>>
>> Predicate Information (identified by operation id):
>> ---------------------------------------------------
>> 1 - filter("TA2"."V1"<=NVL(,))
>> 5 - access(INTERNAL_FUNCTION("TA2"."V1")>INTERNAL_FUNCTION("TA1"."V1"))
>>     filter(INTERNAL_FUNCTION("TA2"."V1")>INTERNAL_FUNCTION("TA1"."V1"))
>> 9 - access("B"."V1">:B1)
>>
>> Notice the Starts column in the above - that is the number of times
>> that portion of the plan was executed. Also note the significant
>> difference between the estimated and actual row counts, as well as
>> what is reported in the A-Time column. Where did the bind variable
>> in the predicate information for Id 9 come from (an internal Oracle
>> optimization)?
>>
>> Here is the plan for the final SQL statement that I provided, just for
>> comparison:
>> -----------------------------------------------------------------------------------------------------------------------------------------------
>> | Id  | Operation                        | Name    | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads |  OMem |  1Mem | Used-Mem |
>> -----------------------------------------------------------------------------------------------------------------------------------------------
>> |*  1 |  CONNECT BY WITHOUT FILTERING    |         |      1 |        |   1652 |00:00:00.06 |      23 |    12 |       |       |          |
>> |   2 |   VIEW                           |         |      1 |   1035 |   1194 |00:00:00.07 |      23 |    12 |       |       |          |
>> |   3 |    WINDOW NOSORT                 |         |      1 |   1035 |   1194 |00:00:00.06 |      23 |    12 | 73728 | 73728 |          |
>> |   4 |     VIEW                         |         |      1 |   1035 |   1194 |00:00:00.06 |      23 |    12 |       |       |          |
>> |   5 |      WINDOW SORT                 |         |      1 |   1035 |   1194 |00:00:00.05 |      23 |    12 | 43008 | 43008 |38912  (0)|
>> |   6 |       VIEW                       |         |      1 |   1035 |   1194 |00:00:00.05 |      23 |    12 |       |       |          |
>> |   7 |        SORT ORDER BY             |         |      1 |   1035 |   1194 |00:00:00.05 |      23 |    12 | 57344 | 57344 |51200  (0)|
>> |   8 |         VIEW                     |         |      1 |   1035 |   1194 |00:00:00.07 |      23 |    12 |       |       |          |
>> |   9 |          UNION-ALL               |         |      1 |        |   1194 |00:00:00.06 |      23 |    12 |       |       |          |
>> |* 10 |           HASH JOIN RIGHT OUTER  |         |      1 |   1000 |   1000 |00:00:00.05 |      14 |    12 |  898K |  898K | 1132K (0)|
>> |  11 |            TABLE ACCESS FULL     | T2      |      1 |    300 |    300 |00:00:00.04 |       7 |     6 |       |       |          |
>> |  12 |            TABLE ACCESS FULL     | T1      |      1 |   1000 |   1000 |00:00:00.01 |       7 |     6 |       |       |          |
>> |  13 |           MERGE JOIN ANTI        |         |      1 |     35 |    194 |00:00:00.01 |       9 |     0 |       |       |          |
>> |  14 |            TABLE ACCESS BY INDEX ROWID | T2 |    1 |    300 |    300 |00:00:00.01 |       2 |     0 |       |       |          |
>> |  15 |             INDEX FULL SCAN      | T2_IND1 |      1 |    300 |    300 |00:00:00.01 |       1 |     0 |       |       |          |
>> |* 16 |            SORT UNIQUE           |         |    300 |   1000 |    106 |00:00:00.01 |       7 |     0 | 36864 | 36864 |32768  (0)|
>> |  17 |             INDEX FAST FULL SCAN | T1_IND1 |      1 |   1000 |   1000 |00:00:00.01 |       7 |     0 |       |       |          |
>> -----------------------------------------------------------------------------------------------------------------------------------------------
>>
>> Predicate Information (identified by operation id):
>> ---------------------------------------------------
>> 1 - access("TS"+1=PRIOR NULL)
>> 10 - access("T1"."V1"="T2"."V1")
>> 16 - access("T1"."V1"="T2"."V1")
>> filter("T1"."V1"="T2"."V1")
>>
>> If nothing else, this shows how efficient analytic functions can be
>> when compared to other methods.
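>>
>> As a rough illustration of the analytic style (a sketch only, not
>> the exact statement behind the plan above): a single pass over the
>> union of the two tables can compute, for each T1 row, the smallest
>> T2 value at or after it, so that T2 need not be probed once per T1
>> row:
>> SELECT
>>   V1,
>>   NEXT_B
>> FROM
>>   (SELECT
>>     SRC,
>>     V1,
>>     MIN(CASE SRC WHEN 'B' THEN V1 END) OVER
>>       (ORDER BY V1 ROWS BETWEEN CURRENT ROW
>>         AND UNBOUNDED FOLLOWING) NEXT_B
>>   FROM
>>     (SELECT 'A' SRC, V1 FROM T1
>>     UNION ALL
>>     SELECT 'B' SRC, V1 FROM T2))
>> WHERE
>>   SRC='A';
>>
>> Joining that result to itself on ranges of V1 then produces the
>> qualifying pairs without the per-row MIN/MAX index probes.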
>>
>> Regarding Mladen Gogala's suggestion to use LAG(V1,1), LAG(V1,2),
>> LAG(V1,3), ... LAG(V1,n) - I looked at using that method initially.
>> The problems that I encountered: #1, what should the value of n be
>> so that no potential matches are missed; #2, how do I create a new
>> result row for each of the matches, since all of the matches appear
>> in the same result row? In other words, you can only return one set
>> of matches per result row when using LAG, unless you can find some
>> way to uncoil all of the resulting matches into new result rows.
>> That may be the reason that Jonathan Lewis stated "the option using
>> the analytic lag() would only give you the adjacent pairs."
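>>
>> A minimal sketch of the LAG approach against the sample table T1
>> shows the problem - each fixed offset contributes one extra column,
>> not one extra row, so n offsets surface at most n candidate matches
>> per result row:
>> SELECT
>>   V1,
>>   LAG(V1,1) OVER (ORDER BY V1) PREV_1,
>>   LAG(V1,2) OVER (ORDER BY V1) PREV_2,
>>   LAG(V1,3) OVER (ORDER BY V1) PREV_3
>> FROM
>>   T1;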
>>
>> I would also suggest not retrieving all rows from the two tables
>> (TableA is ~100 million rows, TableB is ~30 million rows) and
>> processing the data client-side, since you would then also need to
>> consider the latency caused by network traffic when the 130 million
>> rows are retrieved from the database and sent to the client for
>> processing.
>>
>> Charles Hooper
>> PC Support Specialist
>> K&M Machine-Fabricating, Inc.
>>
>
> As far as I can tell, the provided query does match the OP's request,
> but I do agree that the CBO has generated a poor execution plan.
> I got a better plan with
>
> select /*+ USE_NL(ta2 ta1) */ ta1.v1, ta2.v1
> from t1 ta1, t1 ta2
> where ta2.v1 > ta1.v1
>   and ta2.v1 <= nvl((select min(v1) from t2 b where b.v1 > ta1.v1),
>                     (select max(v1) from t1))
>
>
> -----------------------------------------------------------------------------------------
> | Id  | Operation                       | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
> -----------------------------------------------------------------------------------------
> |   0 | SELECT STATEMENT                |       |    25 |   650 |  7514K  (1)| 25:02:57 |
> |   1 |  NESTED LOOPS                   |       |  249K | 6346K |  20258  (2)| 00:04:04 |
> |   2 |   TABLE ACCESS FULL             | T1    |  9999 |  126K |      9  (0)| 00:00:01 |
> |*  3 |   INDEX RANGE SCAN              | IX_T1 |    25 |   325 |      2  (0)| 00:00:01 |
> |   4 |    SORT AGGREGATE               |       |     1 |    13 |            |          |
> |   5 |     FIRST ROW                   |       |   500 |  6500 |      2  (0)| 00:00:01 |
> |*  6 |      INDEX RANGE SCAN (MIN/MAX) | IX_T2 |   500 |  6500 |      2  (0)| 00:00:01 |
> |   7 |    SORT AGGREGATE               |       |     1 |    13 |            |          |
> |   8 |     INDEX FULL SCAN (MIN/MAX)   | IX_T1 |  9999 |  126K |            |          |
> -----------------------------------------------------------------------------------------
>
> The plan still does two index range scans for every row in T1, but
> avoids sorting the union of T1 and T2.
>
An interesting approach, I think. This can be rewritten (to avoid the
mentioned min/max index range scans) into

with u as (select 'A' event, timestamp from tablea
           union all
           select 'B' event, timestamp from tableb),
u1 as (select u.*,
              last_value(decode(event,'B',timestamp) ignore nulls)
                over (order by timestamp desc) b_timestamp
       from u)
select *
from u1, u1 u2
where u1.timestamp < u2.timestamp
  and u2.timestamp <= nvl(u1.b_timestamp,
                          (select max(timestamp) from tablea))
  and u1.event = 'A' and u2.event = 'A'
which in my tests on relatively small datasets yields a better plan
(and better runtime). However, it seems (again, on small datasets)
that the earlier suggested query with connect_by_root still performs
better...
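
To illustrate what u1 computes, here is the window function applied
to the sample timestamps from Charles' post above (time portion
only):

  timestamp  event  b_timestamp
  00:07      A      00:09
  00:09      B      00:09
  00:10      A      00:18
  00:12      A      00:18
  00:15      A      00:18
  00:17      A      00:18
  00:18      B      00:18
  00:20      A      (null)

Because the window is ordered by timestamp desc, last_value(...ignore
nulls) carries the most recently seen 'B' timestamp down to each row,
so b_timestamp is the earliest 'B' timestamp at or after the row's
own timestamp.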
Best regards
Maxim