Re: Extending my question. Was: The relational model and relational algebra - why did SQL become the industry standard?

From: Bob Badour <bbadour_at_golden.net>
Date: Wed, 12 Mar 2003 00:39:17 -0500
Message-ID: <qbAba.1$6A.135301_at_mantis.golden.net>


"Jan Hidders" <jan.hidders_at_REMOVE.THIS.ua.ac.be> wrote in message news:3e6e4e43.0_at_news.ruca.ua.ac.be...
> Bob Badour wrote:
> >"Jan Hidders" <jan.hidders_at_REMOVE.THIS.ua.ac.be> wrote in message
> >news:3e6cc096.0_at_news.ruca.ua.ac.be...
> >> Bob Badour wrote:
> >> >"Jan Hidders" <jan.hidders_at_REMOVE.THIS.ua.ac.be> wrote in message
> >> >news:3e6c869e.0_at_news.ruca.ua.ac.be...
> >> >> Bob Badour wrote:
> >> >> >"Jan Hidders" <jan.hidders_at_REMOVE.THIS.ua.ac.be> wrote in message
> >> >> >news:3e6bd183.0_at_news.ruca.ua.ac.be...
> >> >> >> Bob Badour wrote:
> >> >> >> >
> >> >> >> >[...] I cannot count how many times I have heard alleged
database
> >> >> >> >experts tell users to "Never use DISTINCT." If the result is
> >> >> >> >already distinct, the keyword should have no cost.
> >> >> >>
> >> >> >> Yes. *should* is the right word. Deriving that "at compile time"
is
> >> >> >> not a trival problem.
> >> >> >
> >> >> >It is trivial in a system based on sets and requiring logical
> >> >> >identity.
> >> >>
> >> >> No, it is just as difficult.
> >> >
> >> >The input to every relational operation is a set of distinct tuples
with
> >> >a given predicate and the output from every relational operation is a
> >> >set of distinct tuples with a derivable predicate. Other than
projection
> >> >and summarization, what operations have intermediate duplicate tuples?
> >>
> >> The union, if implemented it in a naive way. But is the above somehow
> >> supposed to be a proof that the problem is trivial?
> >
> >I don't need to prove that it is trivial. The onus is on you to prove
your
> >original assertion that it is not trivial.
>
> Sure, but you asserted that it isn't, so that onus is on you. The
complexity
> of the problem is discussed in the reference I already gave to Lauri:

>

> http://citeseer.nj.nec.com/paulley94exploiting.html
>
> The particular problem they study there is NP-complete.

Okay, you are right. It is an NP-complete problem to avoid duplicate removal in every conceivably avoidable situation. For instance, two range constraints may overlap on a single value or some arbitrarily complex combination of constraint expressions might ensure uniqueness in an obscure way. I was not fully considering all of those possibilities. I was, however, considering all of the situations Paulley and Larson catch with their algorithm in section 4 and a few others.

The NP-complete part doesn't disturb me much because the NP-complete issue turns up a lot when one considers completely and provably optimal solutions. Most optimization techniques use "an efficient algorithm that, although it will not discover all [possible improvements], handles a large subclass of queries" or programmes or whatever.

I note first the authors grant that no duplicate removal avoidance optimization exists if either of the input tables allows duplicate rows (section 3.2). Of course, it would be perfectly valid to compare duplicate removal immediately after a union followed by a project that then avoids duplicate removal with a deferred duplicate removal from the union until after the project.

I note next that the analysis would be much simpler without the need to account for NULL.

I note finally that the analysis would be much simpler if it only examined the project operation without the cartesian product and restriction. It would only have to examine C sub R and neither C sub S nor C sub (R, S). It would not have to consider "h". It could probably omit a few other details well. Received on Wed Mar 12 2003 - 06:39:17 CET

Original text of this message