Re: Functions and Relations

From: vc <boston103_at_hotmail.com>
Date: 21 Nov 2006 13:38:09 -0800
Message-ID: <1164145089.368588.316380_at_m73g2000cwd.googlegroups.com>


Aloha Kakuikanu wrote:
> vc wrote:
> > Aloha Kakuikanu wrote:
> > > Hash join is an ugly artifact of physical world where we have
> > > random io inferior to sequential io.
> > I think multiblock vs. single block IO for an index access path for NLs
> > is also an important factor, maybe more important than IO randomness
> > depending on a specific IO/RAID system of course.
>
> By "random IO" i slopilly meant single block IO, and by "sequential IO"
> i meant multiblock IO.
>
> > With solid state disks, reading large portions of memory vs. smaller
> > chunks may still make hash joins more performant(performance complexity
> > O(n) for NL may be higher than O(m+n) for HJ where m and n are
> > relation cardinalities).
>
> I'm not sure I follow your calculation. Anyhow, 5 years ago I had hard
> time finding a performance case where Hash Join was merely 10 times
> faster as indexed nested loop. That was on traditional disc system.
> Now, given no performance penalty for randomly accessed index
> nodes/records, I doubt there is any difference today. With nested loops
> you scan the outer table, navigate about 3-4 levels in the B-tree and
> find the match in the inner table. With hash join you scan the outer
> table, the match in the hash table and retrieve thae matching portion
> from the inner table. That is logically about the same amount of work.

In Oracle 9i:

create table t1/t2 as select * from all_objects; create index t1/t2_idx on t1/t2(object_id);

select/*+ use_nl(t1,t2) */ * from t1, t2 where t1.object_id = t2.object_id
/
Execution Plan


   0 SELECT STATEMENT Optimizer=CHOOSE (Cost=48778 Card=24372 Bytes=4265100)

   1 0 TABLE ACCESS (BY INDEX ROWID) OF 'T2' (Cost=2 Card=1 Bytes=87)

   2    1     NESTED LOOPS (Cost=48778 Card=24372 Bytes=4265100)
   3    2       TABLE ACCESS (FULL) OF 'T1' (Cost=34 Card=24372
Bytes=2144736)
   4    2       INDEX (RANGE SCAN) OF 'T2_IDX' (NON-UNIQUE) (Cost=1
Card=1)

Statistics


      30617 consistent gets

select * from t1, t2 where t1.object_id = t2.object_id /

Execution Plan


   0 SELECT STATEMENT Optimizer=CHOOSE (Cost=142 Card=24372 Bytes=4265100)

   1 0 HASH JOIN (Cost=142 Card=24372 Bytes=4265100)    2 1 TABLE ACCESS (FULL) OF 'T2' (Cost=34 Card=24373 Bytes=2120451)

   3 1 TABLE ACCESS (FULL) OF 'T1' (Cost=34 Card=24372 Bytes=2144736)

Statistics


       2216 consistent gets

The nested loops plan does 15 times more work (30617 c.g. vs. 2216 c.g.) than the hash join. It's hardly "logically about the same amount of work".

> Plus Nested Loops is more elegant from theoretical point of view.
> Remember hash indexes? Nobody uses them today, because B-tree is more
> elegant and more general.
>
> > How do you suggest to materialize 'f(x)' as a relation and why do you
> > think the join will be more performant than just calculating 'f(x)' per
> > each row ?
>
> See the example with finite relation/function `x=1`

That answers "why you think it'll be more performant" but does not answer "How do you suggest to materialize 'f(x)' as a relation ?". So how do you suggest to do the trick ?

>
> > > We can sample a function as a normal relation
> > > and derive statistics.
> >
> > Oracle statistics for example provide food for the cost based optimizer
> > which is mainly interested in IO. Well, versions 9i and above take
> > into account CPU as well, but I believe IO is still the optimizer
> > staple. I am not sure how representing a function as a relation will
> > improve the optimizer bahavior. Could you please explain ?
> >
> > > Even if there is no new access paths, we'll
> > > still benefit from more accurate statistics.
> >
> > How exactly ?
>
> Are you aware that selectivity/cardinality estimation related to
> function invokations in SQL is finicky in today's RDBMS implementations?

I am aware of the cardinality problem with 'select * from t1 where x in (select x from table(function_1(a,b,c))' and its variations, but the problem can be cured with the cardinality hint. It may not be esthetically pleasing, granted, but what is the alternative ? One can use a temp table too which perhaps essentially the same as your suggestion (materializing a function as a temp relation). However, there may be unexceptable overhead when using temp relations, or such temp relation may be infinite and therefore impossible. Received on Tue Nov 21 2006 - 22:38:09 CET

Original text of this message