Re: Summing hierarchical data SQL

From: vc <boston103_at_hotmail.com>
Date: 5 Oct 2006 11:40:54 -0700
Message-ID: <1160073653.968940.210260@h48g2000cwc.googlegroups.com>

Michel Cadot wrote:
[...]

> -----------------------------------------
>
> When I said I don't think I meant I don't think you can build
> an *efficient* query without a correlated subquery.

Nice try. What you actually said was "I don't think so" without qulifying your message with any performance concerns (as I did in my original message):

> Yes, you can write many awful queries like yours building
> twice the hierarchy with nested loops between the two.

How about some statistical evidence to backup your "awful" adjective ?

>
> A little less worse version is:
>
> with
> data as (
> select /*+ materialized */
> id, name, parent, frequency,
> sys_connect_by_path (id, '/') id_path
> from t
> connect by prior id = parent
> start with parent is null
> )
> select a.id, a.name, a.parent, sum(b.frequency) frequency
> from data a, data b
> where instr(b.id_path,a.id_path) > 0
> group by a.id, a.name, a.parent
> order by a.id
> /
>
> which build a temporary table with the result of the connect by
> but still leads to nested loops.

Sure one can optimize the original query by factoring out the subquery at the expense of materiaizing the latter.

>
> My first query is built around an hash join between the original
> table and the full hierarchy generated by the connect by.

You are confused, amigo. The hash join you see in the execution plan is used to create the hierarchy even for a simple query like:

select id, frequency

             from t1
             connect by prior id = parent
             start with parent is null

The hash join has got zip to do with the correlated subquery.

What actually happens with the subquery is that many smaller hierarchies are created on fly, one per each row of the original table (that why the query is called 'correlated'), not the "full hierarchy" as you've claimed..

As for some stat. evidence:

sys_connect

SQL> select
  2 a.id,
  3 a.name,
  4 sum(b.frequency)
  5 from
  6 (select id, name, sys_connect_by_path(id, '/') path   7 from t1 connect by prior id = parent start with parent is null   8 ) a,
  9 (select frequency, sys_connect_by_path(id, '/') path 10 from t1 connect by prior id = parent start with parent is null 11 ) b
12 where instr(b.path, a.path) = 1
13 group by a.id, a.name
14 /

Statistics

          0  recursive calls
          0  db block gets
        196  consistent gets
          0  physical reads
          0  redo size
        595  bytes sent via SQL*Net to client
        500  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
         29  sorts (memory)
          0  sorts (disk)
          6  rows processed

SQL>

--
B. correlated

SQL> select id, name, parent,
  2             ( select sum(frequency)
  3               from t1 t2
  4               connect by prior id = parent
  5               start with t2.id = t1.id ) frequency
  6  from t1 t1
  7  /


Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
        120  consistent gets
          0  physical reads
          0  redo size
        655  bytes sent via SQL*Net to client
        500  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
         16  sorts (memory)
          0  sorts (disk)
          6  rows processed

SQL>

----

C.  "with"

SQL> with
  2    data as (
  3      select /*+ materialized */
  4             id, name, parent, frequency,
  5             sys_connect_by_path (id, '/') id_path
  6      from t1
  7      connect by prior id = parent
  8      start with parent is null
  9    )
 10  select a.id, a.name, a.parent, sum(b.frequency) frequency
 11  from data a, data b
 12  where instr(b.id_path,a.id_path) =1
 13  group by a.id, a.name, a.parent
 14  /

6 rows selected.



Statistics
----------------------------------------------------------
          2  recursive calls
          8  db block gets
         51  consistent gets
          1  physical reads
        604  redo size
        655  bytes sent via SQL*Net to client
        500  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          5  sorts (memory)
          0  sorts (disk)
          6  rows processed



As can be seen,  with respect to consistent gets A (sys_connect)  is
worse than B  (correlated) by 50%, and C( sys_connect optimized with
"with") is actually two times better than B at the expense of course of
having to create the materialized  hierarchy.    One may actually want
to try different approaches before finally deciding what is "awful" and
what is not.



> 

> Regards

> Michel Cadot

Received on Thu Oct 05 2006 - 13:40:54 CDT