Re: SQL: Working with huge tables of chronological data

From: Maxim Demenko <mdemenko_at_gmail.com>
Date: Fri, 06 Apr 2007 01:44:50 +0200
Message-ID: <461589F2.8010206@gmail.com>


Charles Hooper schrieb:
> On Apr 5, 3:50 pm, "John" <acide.ascorbi..._at_gmail.com> wrote:

>>> Is this the real problem or a simplification?
>> It's a simplification but not that much. The real problem involves
>> user_ids but this part can be skipped here.
>>
>>> Is this something that will be run once or repeatedly?
>> Only once.
>>
>>> Is it possible for the same time to be in A and B?
>> No, A and B are completely different data.
>>
>>> Is it possible to have a B before an A beginning the sequence?
>>> Is it possible for there to be multiple Bs between As?
>> Yes, everything is possible; A events and B events happen
>> independently.
>>
>> Thanks for being interested in my problem!
>>
>> John

>
> I would be inclined to not handle this by using just SQL. You
> could potentially have a nearly full Cartesian join on the first table
> to itself, for example:
>
> Charles Hooper
> PC Support Specialist
> K&M Machine-Fabricating, Inc.
>

I suggest a pure SQL solution could look like this:

-- test case: 60 random 'A' events and 15 random 'B' events within the current month
create table a as
with t as (
  select 'A' event, dbms_random.value(1,31) + trunc(sysdate,'MM') edate
  from dual
  connect by level <= 60)
select * from t;

create table b as
with t as (
  select 'B' event, dbms_random.value(1,31) + trunc(sysdate,'MM') edate
  from dual
  connect by level <= 15)
select * from t;
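
To sanity-check the generated test data, one could run something like the
following (just my own helper query, not part of the solution itself):

select event, count(*) cnt, min(edate) min_date, max(edate) max_date
from   (select * from a union all select * from b)
group by event;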

WITH u as (
         SELECT * FROM a
         UNION ALL
         SELECT * FROM b
         ORDER BY 2)
         ,
         t as
         (SELECT u.*,
                 row_number() over(order by edate) id, -- to enumerate all events
                 decode(lag(event) over (order by edate), event, 0, 1) start_of_group
         FROM    u
         ORDER BY edate
         )
         ,
         t1 as
         (SELECT id,
                 event,
                 edate,
                 sum(start_of_group) over(order by edate) group_id
         FROM    t
         )
         ,
         t2 as
         (SELECT id,
                 event,
                 edate,
                 group_id,
                 count(*) over(partition by group_id) cnt_in_group,
                 row_number() over(partition by group_id order by edate) row_num
         FROM    t1
         )
         ,
         t3 as
         (SELECT connect_by_root(id)    first_event,
                 connect_by_root(edate) first_date,
                 id    last_event,
                 edate last_date
         FROM    t2
         WHERE   cnt_in_group > 1
             AND event        = 'A'
         connect by prior row_num = row_num - 1
             AND prior group_id   = group_id
         )
SELECT *
FROM   t3
WHERE  first_event != last_event
ORDER BY first_event,
         last_event;

However, considering the data volumes (1e8 A events and 3e7 B events), the worst case (all B events chronologically before or after all the A events) could produce on the order of C(1e8,2) = 1e8!/(2!*(1e8-2)!) combinations, which is about 5e15 rows. In the best case, with the B events evenly distributed among the A events, the result would be only about 1e7 rows. Assuming the real distribution lies somewhere in between, I would still expect a very large result set, so I would second Charles's suggestion to process the data programmatically, possibly dividing the source data into chunks.
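
As a rough illustration of what such a programmatic pass could look like
(just a sketch against the test tables a and b above; the cursor, variable
names and output format are my own assumptions, and in the real case one
would restrict the cursor to one chunk of dates or user_ids at a time
rather than scan everything at once):

DECLARE
  CURSOR c IS
    SELECT event, edate
    FROM   (SELECT event, edate FROM a
            UNION ALL
            SELECT event, edate FROM b)
    ORDER BY edate;
  l_first_date DATE;
  l_last_date  DATE;
  l_run_length PLS_INTEGER := 0;
BEGIN
  FOR r IN c LOOP
    IF r.event = 'A' THEN
      IF l_run_length = 0 THEN
        l_first_date := r.edate;   -- start of a new run of A events
      END IF;
      l_last_date  := r.edate;
      l_run_length := l_run_length + 1;
    ELSE
      IF l_run_length > 1 THEN     -- a B event closes a run of A events
        dbms_output.put_line(to_char(l_first_date,'DD.MM.YYYY HH24:MI:SS')
          || ' - ' || to_char(l_last_date,'DD.MM.YYYY HH24:MI:SS')
          || ' (' || l_run_length || ' A events)');
      END IF;
      l_run_length := 0;
    END IF;
  END LOOP;
  IF l_run_length > 1 THEN         -- flush a trailing run of A events
    dbms_output.put_line(to_char(l_first_date,'DD.MM.YYYY HH24:MI:SS')
      || ' - ' || to_char(l_last_date,'DD.MM.YYYY HH24:MI:SS')
      || ' (' || l_run_length || ' A events)');
  END IF;
END;
/

A single ordered scan like this avoids the self-join entirely and only
materializes the qualifying ranges.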

Best regards

Maxim

Received on Thu Apr 05 2007 - 18:44:50 CDT
