Oracle FAQ | Your Portal to the Oracle Knowledge Grid |
Home -> Community -> Usenet -> c.d.o.server -> Re: SQL: Working with huge tables of chronological data
On Apr 5, 7:44 pm, Maxim Demenko <mdeme..._at_gmail.com> wrote:
> Charles Hooper schrieb:
> > On Apr 5, 3:50 pm, "John" <acide.ascorbi..._at_gmail.com> wrote:
> >>> Is this the real problem or a simplification?
> >> It's a simplification but not that much. The real problem involves
> >> user_ids but this part can be skipped here.
>
> >>> Is this something that will be run once or repeatedly?
> >> Only once.
>
> >>> Is it possible for the same time to be in A and B?
> >> No, A and B are completely different data.
>
> >>> Is it possible to have a B before an A beginning the sequence?
> >>> Is it possible for there to be multiple Bs between As?
> >> Yes everything is possible, A events and B events happen
> >> independently.
>
> >> Thanks for being interested in my problem!
>
> >> John
>
> > I would be inclined to not handle this by not just using SQL. You
> > could potentially have a nearly full Cartesian join on the first table
> > to itself, for example:
>
> > Charles Hooper
> > PC Support Specialist
> > K&M Machine-Fabricating, Inc.
>
> I suggest, pure SQL solution could be like this:
(SNIP)
> WITH u as ( SELECT * FROM a UNION ALL SELECT * FROM b ORDER BY 2)
> ,
> t as
> (SELECT u.*,
> row_number() over(order by edate) id, -- to enumerate
> all events
>
> decode(lag(event) over (order by edate), event, 0, 1)
> start_of_group
> FROM u
> ORDER BY edate
> )
> ,
> t1 as ( SELECT id,event,edate,sum(start_of_group) over(order by
> edate) group_id FROM t)
> ,
> t2 as
> (SELECT id,
> event,
> edate,
> group_id,
> count(*) over(partition by group_id) cnt_in_group,
> row_number() over(partition by group_id order by edate)
> row_num
> FROM t1
> )
> ,
> t3 as
> (SELECT connect_by_root(id) first_event,
> connect_by_root(edate) first_date,
> id last_event,
> edate last_date
> FROM t2
> WHERE cnt_in_group >1
> AND event ='A' connect by prior row_num=row_num-1
> AND prior group_id=group_id
> )
> SELECT *
> FROM t3
> WHERE first_event != last_event
> ORDER BY first_event,
> last_event;
>
> However, considering the data volumes ( 1e8 for A Events and 3e7 for B
> Events), this could result in worst case ( all B events are
> chronologically after or before A events) in 1e8!/2!(1e8-2)!
> permutations, what nearly equals 1e16 rows, in best case, if all B
> events are evenly distributed among A events, this will result in
> approximately 1e7 rows, assuming the distribution is somewhere in the
> middle - still would expect a lot of data in result set - i would second
> the Charles suggestion to process programmatically, possibly dividing
> the source data in chunks.
>
> Best regards
>
> Maxim
Nice. I stopped short of providing a solution in my previous post. Here is an example of an all SQL solution, continuing from my previous post:
Let's try an experiment to see if we can generate a sequential counter that skips when either TIME_SLOT or NEXT_TIME_SLOT is not A: SELECT
TIME_DATE, NEXT_TIME_DATE, (ROW_NUMBER() OVER (ORDER BY TIME_DATE))*DECODE(TIME_SLOT,'A',1,NULL)*DECODE(NEXT_TIME_SLOT,'A',1,NULL) TT, TIME_SLOT,
==================== ==================== =========== ========= ============== ===== ========== 05-APR-2007 00:02:00 05-APR-2007 00:05:00 1 A A 00:02 1 05-APR-2007 00:05:00 05-APR-2007 00:07:00 2 A A 00:05 2 05-APR-2007 00:07:00 05-APR-2007 00:09:00 A B 00:07 3 05-APR-2007 00:09:00 05-APR-2007 00:10:00 B A 05-APR-2007 00:10:00 05-APR-2007 00:12:00 5 A A 00:10 4 05-APR-2007 00:12:00 05-APR-2007 00:15:00 6 A A 00:12 5 05-APR-2007 00:15:00 05-APR-2007 00:17:00 7 A A 00:15 6 05-APR-2007 00:17:00 05-APR-2007 00:18:00 A B 00:17 7 05-APR-2007 00:18:00 05-APR-2007 00:20:00 B A 05-APR-2007 00:20:00 05-APR-2007 00:22:00 10 A A 00:20 8 05-APR-2007 00:22:00 05-APR-2007 00:25:00 11 A A 00:22 9 05-APR-2007 00:25:00 05-APR-2007 00:27:00 12 A A 00:25 10 05-APR-2007 00:27:00 05-APR-2007 00:28:00 A B 00:27 11 05-APR-2007 00:28:00 05-APR-2007 00:30:00 B A 05-APR-2007 00:30:00 05-APR-2007 00:32:00 15 A A 00:30 12 05-APR-2007 00:32:00 05-APR-2007 00:35:00 16 A A 00:32 13 05-APR-2007 00:35:00 05-APR-2007 00:37:00 A AB 00:35 14 05-APR-2007 00:37:00 05-APR-2007 00:40:00 AB A 00:37 15 05-APR-2007 00:40:00 05-APR-2007 00:42:00 19 A A 00:40 16 05-APR-2007 00:42:00 05-APR-2007 00:45:00 20 A A 00:42 17 05-APR-2007 00:45:00 05-APR-2007 00:47:00 A AB 00:45 18 05-APR-2007 00:47:00 05-APR-2007 00:50:00 AB A 00:47 19 05-APR-2007 00:50:00 05-APR-2007 00:52:00 23 A A 00:50 20 05-APR-2007 00:52:00 05-APR-2007 00:55:00 24 A A 00:52 21 05-APR-2007 00:55:00 05-APR-2007 00:56:00 A B 00:55 22 05-APR-2007 00:56:00 05-APR-2007 00:57:00 B A 05-APR-2007 00:57:00 05-APR-2007 01:00:00 27 A A 00:57 23 05-APR-2007 01:00:00 05-APR-2007 01:02:00 28 A A 01:00 24 05-APR-2007 01:02:00 05-APR-2007 01:05:00 A AB 01:02 25 05-APR-2007 01:05:00 05-APR-2007 01:07:00 AB A 01:05 26 05-APR-2007 01:07:00 05-APR-2007 01:10:00 31 A A 01:07 27...
Now, let's use SYS_CONNECT_BY_PATH to connect the start and end times
together:
SELECT
TIME_DATE,
NEXT_TIME_DATE,
TO_CHAR(TIME_DATE,'HH24:MI') ||'-' ||
SUBSTR(SUBSTR(SYS_CONNECT_BY_PATH(TO_CHAR(NEXT_TIME_DATE,'HH24:MI'),','),
2,50)||',', 1,INSTR(SUBSTR(SYS_CONNECT_BY_PATH(TO_CHAR(NEXT_TIME_DATE,'HH24:MI'),','), 2,50)||',',',')-1) TIME_RANGE
TIME_DATE, NEXT_TIME_DATE, (ROW_NUMBER() OVER (ORDER BY TIME_DATE))*DECODE(TIME_SLOT,'A',1,NULL)*DECODE(NEXT_TIME_SLOT,'A',1,NULL) TS, TIME_SLOT,
TIME_DATE NEXT_TIME_DATE TIME_RANGE
==================== ==================== ========================================================= 05-APR-2007 00:02:00 05-APR-2007 00:05:00 00:02-00:05 05-APR-2007 00:05:00 05-APR-2007 00:07:00 00:05-00:07 05-APR-2007 00:02:00 05-APR-2007 00:05:00 00:02-00:07 05-APR-2007 00:10:00 05-APR-2007 00:12:00 00:10-00:12 05-APR-2007 00:12:00 05-APR-2007 00:15:00 00:12-00:15 05-APR-2007 00:10:00 05-APR-2007 00:12:00 00:10-00:15 05-APR-2007 00:15:00 05-APR-2007 00:17:00 00:15-00:17 05-APR-2007 00:12:00 05-APR-2007 00:15:00 00:12-00:17 05-APR-2007 00:10:00 05-APR-2007 00:12:00 00:10-00:17 05-APR-2007 00:20:00 05-APR-2007 00:22:00 00:20-00:22 05-APR-2007 00:22:00 05-APR-2007 00:25:00 00:22-00:25 05-APR-2007 00:20:00 05-APR-2007 00:22:00 00:20-00:25 05-APR-2007 00:25:00 05-APR-2007 00:27:00 00:25-00:27 05-APR-2007 00:22:00 05-APR-2007 00:25:00 00:22-00:27 05-APR-2007 00:20:00 05-APR-2007 00:22:00 00:20-00:27 05-APR-2007 00:30:00 05-APR-2007 00:32:00 00:30-00:32 05-APR-2007 00:32:00 05-APR-2007 00:35:00 00:32-00:35 05-APR-2007 00:30:00 05-APR-2007 00:32:00 00:30-00:35 05-APR-2007 00:40:00 05-APR-2007 00:42:00 00:40-00:42 05-APR-2007 00:42:00 05-APR-2007 00:45:00 00:42-00:45 05-APR-2007 00:40:00 05-APR-2007 00:42:00 00:40-00:45 05-APR-2007 00:50:00 05-APR-2007 00:52:00 00:50-00:52 05-APR-2007 00:52:00 05-APR-2007 00:55:00 00:52-00:55 05-APR-2007 00:50:00 05-APR-2007 00:52:00 00:50-00:55 05-APR-2007 00:57:00 05-APR-2007 01:00:00 00:57-01:00 05-APR-2007 01:00:00 05-APR-2007 01:02:00 01:00-01:02 05-APR-2007 00:57:00 05-APR-2007 01:00:00 00:57-01:02 05-APR-2007 01:07:00 05-APR-2007 01:10:00 01:07-01:10 05-APR-2007 01:10:00 05-APR-2007 01:12:00 01:10-01:12 05-APR-2007 01:07:00 05-APR-2007 01:10:00 01:07-01:12 05-APR-2007 01:17:00 05-APR-2007 01:20:00 01:17-01:20 05-APR-2007 01:20:00 05-APR-2007 01:22:00 01:20-01:22 05-APR-2007 01:17:00 05-APR-2007 01:20:00 01:17-01:22 05-APR-2007 01:25:00 05-APR-2007 01:27:00 01:25-01:2705-APR-2007 01:27:00 05-APR-2007 01:30:00 01:27-01:30 05-APR-2007 01:25:00 05-APR-2007 01:27:00 01:25-01:30 ...
(Rows 1652) CONNECT BY WITHOUT FILTERING (cr=23 pr=0 pw=0 time=31559 us)
(Rows 1194) VIEW (cr=23 pr=0 pw=0 time=29665 us) (Rows 1194) WINDOW NOSORT (cr=23 pr=0 pw=0 time=24882 us) (Rows 1194) VIEW (cr=23 pr=0 pw=0 time=20154 us) (Rows 1194) WINDOW SORT (cr=23 pr=0 pw=0 time=14131 us) (Rows 1194) VIEW (cr=23 pr=0 pw=0 time=15595 us) (Rows 1194) SORT ORDER BY (cr=23 pr=0 pw=0 time=12007 us) (Rows 1194) VIEW (cr=23 pr=0 pw=0 time=28726 us) (Rows 1194) UNION-ALL (cr=23 pr=0 pw=0 time=23946 us) (Rows 1000) HASH JOIN RIGHT OUTER (cr=14 pr=0 pw=0 time=11251 us) (Rows 300) TABLE ACCESS FULL T2 (cr=7 pr=0 pw=0 time=662 us) (Rows 1000) TABLE ACCESS FULL T1 (cr=7 pr=0 pw=0 time=2026 us) (Rows 194) MERGE JOIN ANTI (cr=9 pr=0 pw=0 time=4647 us) (Rows 300) TABLE ACCESS BY INDEX ROWID T2 (cr=2 pr=0 pw=0 time=2128 us) (Rows 300) INDEX FULL SCAN T2_IND1 (cr=1 pr=0 pw=0 time=918 us) (Rows 106) SORT UNIQUE (cr=7 pr=0 pw=0 time=2528 us) (Rows 1000) INDEX FAST FULL SCAN T1_IND1 (cr=7 pr=0pw=0 time=2026 us)
Charles Hooper
PC Support Specialist
K&M Machine-Fabricating, Inc.
Received on Thu Apr 05 2007 - 21:51:28 CDT