Re: Join Two Extremely Large Tables

From: D.Y. <dyou98_at_aol.com>
Date: 22 Apr 2003 13:48:15 -0700
Message-ID: <f369a0eb.0304221248.75fc4371@posting.google.com>

I did miss the requirement that the output is to be sorted. Like you said, an "order by" should do it. Depending on the parallel degree and how the tables are partitioned, hash join with "order by" could do a much better job than one might believe.

The point you raised about a small change in inputs causing a large change in join times is new to me, if you are not talking about the cardinality of the join columns. I'd be interested to know what your thoughts are.

"Jonathan Lewis" <jonathan_at_jlcomp.demon.co.uk> wrote in message news:<b81p43$bcn$1$8302bc10_at_news.demon.co.uk>...
> You are quite right about the quantity of
> logical I/O, and probably physical I/O, but
> I think you probably missed the start of the
> thread. A key feature of the processing is the
> need to collate all the 'B' rows for a given 'A'
> row. The only way to do this through a hash
> join is through an explicit 'order by'.
>
> Whilst an unsorted output could be generated
> quite rapidly by a hash join (given a hash_area_size
> somewhat larger than the first input) the cost
> of the subsequent necessary sort could easily
> outweigh the benefit of the hash.
>
> As the OP also says, the nested loop solution
> has better scalability (although I would have used
> the term "predictability"). Hash joins, like sorts,
> have a 'catastrophe' performance profile - there
> are two points where a small change in the inputs
> results in a large change in the response time.
> Sometimes it is simply better to go for the consistent,
> slow, solution, rather than live on the edge with a
> solution that is sometimes faster, and sometimes
> much slower.
>
>
> --
> Regards
>
> Jonathan Lewis
> http://www.jlcomp.demon.co.uk
>
> The educated person is not the person
> who can answer the questions, but the
> person who can question the answers -- T. Schick Jr
>
>
> One-day tutorials:
> http://www.jlcomp.demon.co.uk/tutorial.html
>
> ____UK_______April 22nd
> ____USA_(FL)_May 2nd
> ____Denmark__May 21-23rd
> ____Sweden___June
> ____Finland__September
> ____Norway___September
>
> Three-day seminar:
> see http://www.jlcomp.demon.co.uk/seminar.html
> ____UK_(Manchester)_May x 2
> ____Estonia___June (provisional)
> ____Australia_June (provisional)
> ____USA_(CA, TX)_August
>
> The Co-operative Oracle Users' FAQ
> http://www.jlcomp.demon.co.uk/faq/ind_faq.html
>
>
> "D.Y." <dyou98_at_aol.com> wrote in message
> news:f369a0eb.0304211322.21f44f6c_at_posting.google.com...
> > bchorng_at_yahoo.com (Bass Chorng) wrote in message
> news:<bd9a9a76.0304211032.681cd7ba_at_posting.google.com>...
> > > "Jonathan Lewis" <jonathan_at_jlcomp.demon.co.uk> wrote in message
> news:<b7mull$l1b$1$8300dec7_at_news.demon.co.uk>...
> ...
> > > > Sounds like the nested loop is it. You should get
> > > > a noticeable improvement by doing the join (hinted
> > > > very carefully to do exactly what you want) and
> > > > then using array fetches of a few dozen to a few
> > > > hundred rows at a time. It's messier to code
> > > > at the boundary conditions than one row at a
> > > > time - but significantly more efficient.
> > >
> > > I think nested loop gives you scalability too.
> > > If I use hash join, I have to calculate the size
> > > of hash_area_size to make sure it performs right.
> > > This makes the memory requirement unscalable.
> > >
> >
> > I am not sure nested loop is the best choice, simply because it
> > requires many more times logical I/Os (compared to hash join),
> > and most of them will be turned into physical I/Os for tables and
> > indexes this large. If you can pull all the colums you need in one
> > join, and your temp tablespace is not created on a very, very slow
> > drive, hash join should be the winner. Of course your temp
> tablespace
> > needs to be large enough. If nested loop turns out to be faster for
> > joining these tables I'd be very interested to know.
Received on Tue Apr 22 2003 - 15:48:15 CDT