Re: A new proof of the superiority of set oriented approaches: numerical/time serie linear interpolation
Date: 3 May 2007 04:41:25 -0700
Message-ID: <1178192485.028723.208580_at_l77g2000hsb.googlegroups.com>
On May 3, 4:33 am, Cimode <cim..._at_hotmail.com> wrote:
> On May 3, 10:12 am, "Brian Selzer" <b..._at_selzer-software.com> wrote:> "Cimode" <cim...@hotmail.com> wrote in message
>
> >news:1178173985.960152.227740_at_l77g2000hsb.googlegroups.com...
> > [snip]
>
> > >> Eliminating self-joins is beneficial regardless of the implementation.
> > > Beneficial to what ? Performance ? What performance ? Response
> > > time ? concurrency ? cost of administration ? (Please answer these
> > > precise questions). Do you realize that despite some lengthy (maybe
> > > worthy) attempts at clarifying your point, some people here have no
> > > clue what you are talking about.
>
> > I'll try to be as concise as possible. Eliminating multiple self-joins
> > reduces the number of passes through the data required to answer a query.
>
> So you are taking about physical IO's. Right? Can you explain to me
> how you reduce physical IO's when itterating on the basis of number of
> lines multiplied by the number of operations required. Can you
> explain to me how
>
> > This reduces (1) execution time,
>
> How ? Could you establish that execution time is reduced in any
> concurrent environment.
>
> (2) the duration that locks are held,
> How ? What kind of lock ? What isolation level are you refering to ?
>
> > (3) the amount of I/O required,
>
> What IO's ? physical ? logical IO?
>
> > (4) CPU utilization, and
>
> Do you mean percentage of usage ?
>
> > (5) the amount of
> > memory required. The reduction in execution time improves response time.
>
> I am not sure I understand I understand this last sentence. Are you
> aware that most of the time used in a process is mostly (compile +
> binding). Execution is a minor part of the resources used.
>
> > The reduction in the duration that locks are held improves concurrency.
>
> What locks ? How does the duration of a lock improve concurrency in
> most modern systems. Are your familiar with lock sharing ? Do you
> realize that most SQL DBMS do mutualize object physical accesses in a
> multi transactional context?
>
> > The
> > reduction in I/O required, CPU utilization, and memory required improves
> > overall throughput. I would venture a guess that the cost of administration
> > or at least of maintenance would be reduced as well: I personally think it's
> > a lot easier to follow a simple fetch loop followed by a simple UPDATE FROM
> > than a complicated set-based solution with four outer theta-joins, an outer
> > equijoin, an inner equijoin, an aggregation and a union.
>
> I still do not understand how having one view would be more cumbersome
> to maintain then several operations (update, creation of additional
> objects)...
>
> Cursors are the cause of most timeouts I face daily. (whether due to
> dealock, excessive HD/RAM swapping or exlusive object locking).
>
> Regards...
>
>
>
> > [snip]- Hide quoted text -
>
> - Show quoted text -
Not sure if it advances the conversation, but I thought I'd look into
this a bit further.
I decided to look into this a bit further. Obviously, no by-hand
iteration-based code is going to beat a declarative solution executed
by a "sufficiently intelligent" optimizer. Yet, I think the query can
be expressed much more concisely (allowing the optimizer a much better
chance at an efficient execution) than what Cimode posted.
To test this theory, I created an N-row table in Oracle called K1:
K1 (n int, d float, t time) -- "Station <n> is <d> miles from previous station and recorded the train at time <t>"
The table was populated as follows (given some number N):
<n = sequence (2..N-1),
d = random float between 0 and 1,
t = "coin-toss", if "heads" then current_time + <n>minutes +
random(0,1) minutes, else NULL>
... plus 2 boundary tuples:
<1,0,current_time>
<N, 1, current_time+N minutes>
A sample N=10 table:
1 0 5/2/2007 12:50:13 PM 2 0.47 3 0.94 5/2/2007 12:53:38 PM 4 0.65 5/2/2007 12:54:51 PM 5 0.64 6 0.98 7 0.6 8 0.65 9 0.12 10 1 5/2/2007 1:00:13 PM
I then wrote a query which produced this output for the above K1 (which can be checked for accuracy):
1 0 5/2/2007 12:50:13 PM 2 0.47 5/2/2007 12:51:21 PM 3 0.94 5/2/2007 12:53:38 PM 4 0.65 5/2/2007 12:54:51 PM 5 0.64 5/2/2007 12:55:43 PM 6 0.98 5/2/2007 12:57:02 PM 7 0.6 5/2/2007 12:57:50 PM 8 0.65 5/2/2007 12:58:43 PM 9 0.12 5/2/2007 12:58:52 PM 10 1 5/2/2007 1:00:13 PM
I then loaded K1 with N=1000000 (1 million), and submitted the query, storing all results into a new table, K2. The index was built on K1(n) in 5 seconds, and the 1-million-row query ran in 1 minute 42 seconds, giving a total runtime of 1 minute 47 seconds. Based on 50k, 100k, and 500k tests, the performance is O(N log N), so you can project this to any K1 tablesize you'd like.
I'll forgo a complete proof of correctness, and will simply note that Card(K2) = 1000000, and that a spotcheck comparison of K1 and the resulting K2 using the 10 ID's between 400000 and 400009 yields:
In K1:
400000 0.22 400001 0.53 2/4/2008 8:10:19 AM 400002 0.17 400003 0.27 2/4/2008 8:11:50 AM 400004 1 2/4/2008 8:12:57 AM 400005 0.63 400006 0.97 400007 0.37 400008 0.56 400009 0.36 2/4/2008 8:18:06 AM and in K2: 400000 0.22 2/4/2008 8:08:43 AM 400001 0.53 2/4/2008 8:10:19 AM 400002 0.17 2/4/2008 8:10:54 AM 400003 0.27 2/4/2008 8:11:50 AM 400004 1 2/4/2008 8:12:57 AM 400005 0.63 2/4/2008 8:14:04 AM 400006 0.97 2/4/2008 8:15:48 AM 400007 0.37 2/4/2008 8:16:28 AM 400008 0.56 2/4/2008 8:17:28 AM 400009 0.36 2/4/2008 8:18:06 AM
2 caveats before posting the query.
Caveat1: Remember my words, "sufficiently intelligent" optimizer? The
Oracle optimizer (aka CBO) is good, but comes up short in approaching
theta-joins. Here, the CBO consistently fails to realize that joining
K1 (n) against some relation R (n_start, n_end, ...) with criteria
"WHERE n BETWEEN n_start AND n_end" is performed in O(r_n log k1_n) by
indexing K1(n) then scanning R and finding matching rows in K1 using
the index.
Lacking this insight, the Oracle CBO consistently chose to join K1
with R using a O(r_n*k_n) approach: match each row in K1 with every
row in R where n >= n_start, then scan the results for all rows where
n <=n_end. Thus, I needed to manually build an index on K1(n), and
force the CBO to use the O(n logn) algorithm via "hints" (embedded
comments that force the CBO's hand), e.g. /*+ ordered use_nl(md)
index(md) */.
Caveat2:
W/ respect to the window-fucntion "RANK() over(ORDER BY n)", I'd be
curious to see a solution that returns the same results as in the "hd"
subquery, yet runs in O(n log n), using only standard relational
operators. I could only come up with the following O(n^2) version:
select k1.n, k1.d, k1.t, count(*) rnk
from k1, k1 t2
where k1.t is not null
and t2.t is not null
and t2.n <= k1.n
group by k1.n, k1.d, k1.t;
That said, here is the query:
with hd as (select n, d, t, rank() over (order by n) rnk
from k1 where t is not null), rng as (select /*+ ordered use_nl(k1) */ s.n ns, e.n ne, s.t ts, e.t te, sum(k1.d) rd from hd s, hd e, k1 where e.rnk = s.rnk+1 and k1.n > s.n and k1.n <= e.n group by s.n , e.n , s.t , e.t ) select /*+ ordered use_nl(md) use_nl(prev) */ md.n, md.d, ts+(sum(prev.d) / rd)*(te-ts) tfrom rng, k1 md, k1 prev
where md.n > rng.ns and md.n < rng.ne
and prev.n > rng.ns and prev.n <= md.n group by md.n, md.d, rd, ts, te
union
select n, d, t from hd;
I should add another bonus of declarative programming - the ease of which I can take advantage of concurrency. The query took 107 seconds. Using a simple parallel hint of degree 8 the query ran in 45 seconds (indexing K1 with parallel degree 8 took about 2 seconds). Degree 8 happens to be the max concurrency of the dev box used for this test; but that's a fraction of what our production machine could do (were I to implement this in our production system with full allocation of all parallel resources, it'd conceivably run in 15 seconds or less).
Brian - does this compare favorably with your procedural approach? Received on Thu May 03 2007 - 13:41:25 CEST