Re: does a table always need a PK?

From: Paul G. Brown <paul_geoffrey_brown_at_yahoo.com>
Date: 1 Sep 2003 22:46:20 -0700
Message-ID: <57da7b56.0309012146.4760f01d_at_posting.google.com>

Lauri Pietarinen <lauri.pietarinen_at_atbusiness.com> wrote in message news:<bj0da9$ue8$1_at_nyytiset.pp.htv.fi>...

> Thanks, Paul! Nice summary. However, you did not answer the question
> why some (many?) db-scientists and implementors say that bags pose NO
> problems
> what so ever and even if we came up with an efficient implementation
> standing to
> the test of the real world we would not be any happier.

Well, when you find someone who is willing to say "there are no problems with bags", ask them. They might simply respond that there are other problems which are more pressing, or else that bags on balance pose no more DBMS implementation problems than sets (as distinct from database design and usability problems). They might also respond that none of their customers appears to care, and problems by definition are what customers care about.

[ snip ]

> Well, you could always claim that most of the time users include at
> least one candidate key in the
> select list, so no duplicate elimination would be needed if the system
> knew those keys.

Trouble is, you can't build systems for the most of the time. If you're going to implement something it has to cater to the nasty corner cases: the ones where it isn't clear where to add the duplicate discard operations but where it is clear you're going to need a lot of 'em. (Queries sans keys).

[ snip ]

>
> And, if the user issues, say
>
> select city
> from customers -- table with 10M rows
>
> sure he'll star getting answers immediately, but
> what on earth will he do with the result???

You have two choices in this case. You either sort the { city } bag, discarding duplicates as you go, and then return the result. Sorting means ordering, so the last value might be first. So the result first value won't be returned until the last has been read in. (Most DBMS engines use this approach for UNION, DISTINCT and GROUP BY.)

On the other hand you might build a hash table of results already returned and probe it once for each value (except the first). This means holding the entire set of results for the duration of the query execution. (Some DBMSs use this hashing approach because of its nice 'time to first row' properties.)

The first is slow. The second consumes lots of memory. Pick one.

[ snip ]

> Of course this could be circumvented by adding a dummy column
> just to help the optimiser, e.g.
>
> select 'usa',name
> from usa_companies
> union
> select 'canada',name
> from canada_companies

Or you could go with a bag algebra, get the answer you want, and require nothing of the user writing the query. (I'm just pointing it out, in fairness. . )

[ snip ]

> I'm not talking about current SQL-implementations which have started to
> deal with DISTINCT only fairly
> recently. I''m just wondering if all of those hard problems are now
> solvable, after 20 years of research?
> COULD a system be built that overcomes all the negative implications if
> we really wanted to
> (not taking in accout commercial and/or political issues)?

First, you're going to have to design a query optimizer that does the right thing. This is not a trivial undertaking. Then you're going to have to write a non-trivial system to run some experiments on. And *then*, whatever your results, someone will say it had everything to do with the quality of the implementation, not the data model. (See the comments reproduced in an earlier post.)

Could a system using a set algebra be built and have similar performance properties to one built using a bag algebra? I don't know. If someone were to claim that they had one I would want to look at their work very carefully because such a result would fly boldy in the face of conventional wisdom.

But I would encourage anyone who wants to have a go.

Pb Received on Tue Sep 02 2003 - 07:46:20 CEST

This message: [ Message body ]
Next message: byrmol: "Missing Information & Empty Sets."
Previous message: Jonathan Leffler: "Re: does a table always need a PK?"
Maybe in reply to: Lauri Pietarinen: "Re: does a table always need a PK?"
In reply to Lauri Pietarinen: "Re: does a table always need a PK?"
Next in thread: Lauri Pietarinen: "Re: does a table always need a PK?"
Reply: Lauri Pietarinen: "Re: does a table always need a PK?"
Reply: Bob Badour: "Re: does a table always need a PK?"

Contemporary messages sorted: [ by date ] [ by thread ] [ by subject ] [ by author ]

Original text of this message