Re: Efficiency and usage of IN operator

From: Alexander Staubo <earlybird_at_mop.no>
Date: Tue, 1 Jun 1999 02:50:35 +0200
Message-ID: <MPG.11bd4cc9e9974880989682@news.mop.no>

In article <3751cd8a_at_news3.us.ibm.net>, bpmargo_at_ibm.net says...
> A few thoughts, though possibly not the answer you are seeking.
>
> First the quote is from the SQL Server 6.x documentation. The SQL Server 7.0
> Books Online does not appear to have a similar statement.

I noticed this. In fact the 7.0 documentation implies than an IN predicate is identical to the same statement rewritten with multiple ORs.

> It is incorrect to assume "the query will return _fewer_ rows". If it did
> that then SQL Server would not be returning a correct result set.

Right.

> Predicates with large numbers of values in the IN clause, or equivalently,
> with large numbers of OR clauses are going to be inefficient because,
> ***if*** an index is used, more rows are going to have to read to both in
> the index and then in the data. I emphasize the ***if*** because, it is
> possible that SQL Server will simply decide that, if the number of
> conditions is inordinately large, fewer total I/O's will result from simply
> doing a table scan.

Table scan meaning, I hope, _not_ a brute-force scan. That would be nothing short of disastrous.

In my case I must test for membership, so whether to use IN or whatever is a design question.

Basically, slightly abstracted, each "object" (table row) has information about which "container" (some external object) it's in. The reference is a simple ID referencing a foreign key. Containers are nested in a tree- like structure, though, so when I want to list the contents of MyContainer, I want to include the contents of all sub-containers, and sub-sub-containers of MyContainer, as well. In other words, I'd like to do a recursive query.

Example: If MyContainer has the ID 534, and its subcontainers have IDs 945 and 733, then the SQL query becomes something like...

	select ...
	from objects o
	where o.id in (534, 945, 733)

This is pretty decent for a small number of subcontainers, but for deeply nested hierarchies the list given to IN can be large.

I thought of a marginally "dirty" way of optimizing this query. Or at least I believe it can be faster. Since the containment structure is fairly static -- the tree doesn't change all that much -- I can preprocess the database. For every object I encode the containment path that can refer to the object, and reverse it so that the deepest container is listed first: So in the case above, the object has a character-type field named "preproc_ids" with the value "733,945,534,". To find an object "which is either in container 534 or any of its subcontainers", I perfer this lookup:

	select ...
	from objects o
	where o.preproc_ids like "534,%"

I believe this will work, although I'm still not getting the sheer performance of an equality comparison

Unfortunately the above solution requires the database to be pre- processed. Not very cool.

[snip]
> Bottom line, as a standard rule (and I have never known of an exception to
> this rule), given a choice between IN and multiple OR's, I always go for the
> IN, simply because it takes less typing without effecting the query plan one
> way or the other.

That seems to be the case, yes.

--
Alexander Staubo http://www.mop.no/~alex/ "It has taken the planet Earth 4.5 billion years to discover it is 4.5 billion years old." --George Wald Received on Mon May 31 1999 - 19:50:35 CDT