Re: UNION ALL and SELECT DISTINCT

From: David Cressey <dcressey_at_verizon.net>
Date: Wed, 05 Apr 2006 18:20:25 GMT
Message-ID: <JTTYf.6765$aW5.3238_at_trndny07>


"Marshall Spight" <marshall.spight_at_gmail.com> wrote in message news:1144250677.666808.22650_at_v46g2000cwv.googlegroups.com...
> David Cressey wrote:
> >
> > I think it would have been nicer if they had adopted the same convention
in
> > both cases. I like the way UNION does it. So I'd like it if SELECT,
by
> > default, eliminated duplicates. SELECT ALL could be used for what is
now
> > meant by SELECT.
>
> I agree, but I think I would want it taken farther. SQL's approach of
> we-use-bags-but-we-pretend-they're-sets is screwey. Why not just
> use sets? Bags strike me as being of limited utility compared to sets.
> And I'd rather have-sets-but-implement-bags-with-them for those
> ultra-rare occasions when I want bags.
>
>
I have kinda toyed with the idea of sets as sets myself. Thinking in SQL terms (pardon me for doing that, it's just a convenient language for me to express this stuff in). The way I think of it is that EVERY table would be stored in what I'll call "GROUP BY *" mode. In other words, each distinct row is only stored once.

For tables that allow duplicates, there's an extra item of data, hidden behind the scenes, called the duplicate count. When an insert of a duplicate happens, the count gets incremented. and so on. The trouble is, it means that the over head of removing duplicates at insert time becomes unavoidable.

For an SQL table with a declared PRIMARY KEY, that happens anyway. And not declaring a PRIMARY KEY when there is at least one candidate key is probably a bad idea.

> I have on occasion been quite surprised to discover a SELECT DISTINCT
> working faster than a SELECT that produced identical results. The
> mental
> model of "sets are like bags but you have the added work of removing
> duplicates" is obviously not valid.
>

Yes, this does happen, although rarely. The reason is that the optimizer makes an ESTIMATE of the IO's that an indexed lookup will cost, and the estimate can be way off.

When you do a SELECT DISTINCT, the strategy will probably involve some kind of sort. A smart optimizer will realize that a sorted index can obviate the need for a sort. This may result in
SELECT DISTINCT using the index, but SELECT declining to use the index.

Now, if the use of the index turns out to be MUCH more beneficial than the optimizer estimated that it would be, the SELECT DISTINCT can cost less.

This whole discussion is obliquely related to a topic you brought up months ago.

In that topic, you were discussing something about the mathematics behind projection being somehow different than the mathematics behind restriction and join. I'm hazy on the details, but that discussion is connected to this one, unless I miss my guess. Received on Wed Apr 05 2006 - 20:20:25 CEST

Original text of this message