Re: Join Two Extremely Large Tables

From: Jonathan Lewis <jonathan_at_jlcomp.demon.co.uk>
Date: Mon, 21 Apr 2003 22:52:11 +0100
Message-ID: <b81p43$bcn$1$8302bc10@news.demon.co.uk>

You are quite right about the quantity of logical I/O, and probably physical I/O, but I think you probably missed the start of the thread. A key feature of the processing is the need to collate all the 'B' rows for a given 'A' row. The only way to do this through a hash join is through an explicit 'order by'.

Whilst an unsorted output could be generated quite rapidly by a hash join (given a hash_area_size somewhat larger than the first input) the cost of the subsequent necessary sort could easily outweigh the benefit of the hash.

As the OP also says, the nested loop solution has better scalability (although I would have used the term "predictability"). Hash joins, like sorts, have a 'catastrophe' performance profile - there are two points where a small change in the inputs results in a large change in the response time. Sometimes it is simply better to go for the consistent, slow, solution, rather than live on the edge with a solution that is sometimes faster, and sometimes much slower.

--
Regards

Jonathan Lewis
http://www.jlcomp.demon.co.uk

  The educated person is not the person
  who can answer the questions, but the
  person who can question the answers -- T. Schick Jr


One-day tutorials:
http://www.jlcomp.demon.co.uk/tutorial.html

____UK_______April 22nd
____USA_(FL)_May 2nd
____Denmark__May 21-23rd
____Sweden___June
____Finland__September
____Norway___September

Three-day seminar:
see http://www.jlcomp.demon.co.uk/seminar.html
____UK_(Manchester)_May x 2
____Estonia___June (provisional)
____Australia_June (provisional)
____USA_(CA, TX)_August

The Co-operative Oracle Users' FAQ
http://www.jlcomp.demon.co.uk/faq/ind_faq.html


"D.Y." <dyou98_at_aol.com> wrote in message
news:f369a0eb.0304211322.21f44f6c_at_posting.google.com...


> bchorng_at_yahoo.com (Bass Chorng) wrote in message

news:<bd9a9a76.0304211032.681cd7ba_at_posting.google.com>...


> > "Jonathan Lewis" <jonathan_at_jlcomp.demon.co.uk> wrote in message

news:<b7mull$l1b$1$8300dec7_at_news.demon.co.uk>...


> ...

> > > Sounds like the nested loop is it.  You should get

> > > a noticeable improvement by doing the join  (hinted

> > > very carefully to do exactly what you want) and

> > > then using array fetches of a few dozen to a few

> > > hundred rows at a time.  It's messier to code

> > > at the boundary conditions than one row at a

> > > time - but significantly more efficient.

> >

> >   I think nested loop gives you scalability too.

> >   If I use hash join, I have to calculate the size

> >   of hash_area_size to make sure it performs right.

> >   This makes the memory requirement unscalable.

> >

>


> I am not sure nested loop is the best choice, simply because it

> requires many more times logical I/Os (compared to hash join),

> and most of them will be turned into physical I/Os for tables and

> indexes this large. If you can pull all the colums you need in one

> join, and your temp tablespace is not created on a very, very slow

> drive, hash join should be the winner. Of course your temp

tablespace


> needs to be large enough. If nested loop turns out to be faster for

> joining these tables I'd be very interested to know.

Received on Mon Apr 21 2003 - 16:52:11 CDT