Re: Outer Join vs. Function call

From: Laconic2 <laconic2_at_comcast.net>
Date: Wed, 4 Aug 2004 12:09:17 -0400
Message-ID: <4PudnSV7mrAplIzcRVn-tA@comcast.com>

"tnabil" <tarek__at_hotmail.com> wrote in message news:f6ce088b.0408040640.83c128_at_posting.google.com...
> Thanks ed. I wasn't really expecting this answer, but I come from the
> development world, so maybe databases are different. I always thought
> that doing the join once and for all, would be faster than executing a
> query for every single row. On second thoughts, maybe such a judgement
> needs more knowledge of the internals of the database engine, so like
> you said, it's database specific. I think I'll take your advise and do
> some testing and try to figure out which is faster.

Without getting into database specifics, I want to point out that database engines are merely software layered on top of hardware, just like many other systems. The algorithms for sorting and searching are very important. IIRC, Volume one of Knuth's book was about exactly that: sorting and searching.

With regard to joins, there are two join algorithms you should at least have a passing acquaintance with. My own knowledge is limited to describing these algorithms. If I had to implement them, I'd have to think long and hard, even though I was once a "software hotshot" in a former lifetime.

The first algorithm is the "loop join". It sounds like you were thinking of this algorithm in your post. In this algorithm, we form an outer loop, composed of a few entries that are to be selected. Then, for each entry in the outer loop, we look up one or more matching entries, in the inner loop.

The second algorithm is the "merge join". In this algorithm, we access both tables, in the same order. If there's a sorted index on the matching column, on both tables, then no sorting is needed. All we have to do is read the rows in the order presented by the index. The reason it's called a merge join is that the algorithm, in detail, looks much like the algorithm for merging two (sorted) data streams together.

Let's say we got two tables, ORDERS and ORDER_ITEMS. Let's say we have sorted indexes on ORDER_NUMBER on both tables. Naturally, the index on ORDERS can forbid duplicates, while the index on ORDER_ITEMS has to permit duplicates.

Now, in this case, which algortihm is faster? It depends.

Let's say we want to look up a single order. This happens in OLTP systems a lot. The loop join is probably faster.
The outer loop will find a single order number, and that means the inner loop will have to probe the index on ORDER_ITEMS just once. This is true even if we have to scan the order table, based on CUSTOMER_ID and ORDER_DATE. Now let's say we want a report for all the reports, with details for April. The merge join is probably faster. With hundreds of orders to process, walking the index on ORDER_ITEMS once beats the heck out of doing hundreds of probes.

All this ought to be clear to a software engineer, even though my description is a bit on the informal side.

Who should pick between the two algorithms? In general, the optimizer inside the database engine should.

A really good optimizer is really good at generating alternate strategies for producing the same result, and at estimating which strategy will result in the least cost.

Again, without getting product specific, I know from experience of a product whose optimizer would make the right choices in the cases outlined above. And this product had a decent optimizer even 15 years ago.

The optimizer is your friend (unless you have a bad DBMS product).

>
> I have a small question, though. How do I know how long it takes to
> execute a single query? I use toad, which displays a number in
> milliseconds whenever you execute some query, but this number
> decreases if you execute the query more than once. Is this due to some
> optimizations in the engine? Which number should I consider, the
> worst?
>

Most database engines reuse memory. If the right stuff is already in some buffer in pooled memory, it doesn't do another disk access, because it doesn't have to. This causes subsequent searches to run faster. The effect is more pronounced, the more memory is used to cache recently accessed database blocks. Received on Wed Aug 04 2004 - 11:09:17 CDT