Re: does a table always need a PK?

From: Paul G. Brown <paul_geoffrey_brown_at_yahoo.com>
Date: 1 Sep 2003 22:46:20 -0700
Message-ID: <57da7b56.0309012146.4760f01d_at_posting.google.com>


Lauri Pietarinen <lauri.pietarinen_at_atbusiness.com> wrote in message news:<bj0da9$ue8$1_at_nyytiset.pp.htv.fi>...

> Thanks, Paul! Nice summary. However, you did not answer the question
> why some (many?) db-scientists and implementors say that bags pose NO
> problems
> what so ever and even if we came up with an efficient implementation
> standing to
> the test of the real world we would not be any happier.

   Well, when you find someone who is willing to say "there are no problems   with bags", ask them. They might simply respond that there are other   problems which are more pressing, or else that bags on balance pose no   more DBMS implementation problems than sets (as distinct from database   design and usability problems). They might also respond that none of their   customers appears to care, and problems by definition are what customers   care about.    

 [ snip ]  

> Well, you could always claim that most of the time users include at
> least one candidate key in the
> select list, so no duplicate elimination would be needed if the system
> knew those keys.

   Trouble is, you can't build systems for the most of the time. If you're   going to implement something it has to cater to the nasty corner cases: the   ones where it isn't clear where to add the duplicate discard operations but   where it is clear you're going to need a lot of 'em. (Queries sans keys).

   [ snip ]

>
> And, if the user issues, say
>
> select city
> from customers -- table with 10M rows
>
> sure he'll star getting answers immediately, but
> what on earth will he do with the result???
   

   You have two choices in this case. You either sort the { city } bag,   discarding duplicates as you go, and then return the result. Sorting   means ordering, so the last value might be first. So the result first value   won't be returned until the last has been read in. (Most DBMS engines use   this approach for UNION, DISTINCT and GROUP BY.)

   On the other hand you might build a hash table of results already   returned and probe it once for each value (except the first). This means   holding the entire set of results for the duration of the query execution.   (Some DBMSs use this hashing approach because of its nice 'time to first   row' properties.)

   The first is slow. The second consumes lots of memory. Pick one.  

 [ snip ]

> Of course this could be circumvented by adding a dummy column
> just to help the optimiser, e.g.
>
> select 'usa',name
> from usa_companies
> union
> select 'canada',name
> from canada_companies

   Or you could go with a bag algebra, get the answer you want, and require   nothing of the user writing the query. (I'm just pointing it out, in   fairness. . )

 [ snip ]

> I'm not talking about current SQL-implementations which have started to
> deal with DISTINCT only fairly
> recently. I''m just wondering if all of those hard problems are now
> solvable, after 20 years of research?
> COULD a system be built that overcomes all the negative implications if
> we really wanted to
> (not taking in accout commercial and/or political issues)?

   First, you're going to have to design a query optimizer that does the   right thing. This is not a trivial undertaking. Then you're going to have   to write a non-trivial system to run some experiments on. And *then*,   whatever your results, someone will say it had everything to do with the   quality of the implementation, not the data model. (See the comments   reproduced in an earlier post.)  

   Could a system using a set algebra be built and have similar performance   properties to one built using a bag algebra? I don't know. If someone were   to claim that they had one I would want to look at their work very   carefully because such a result would fly boldy in the face of conventional   wisdom.

   But I would encourage anyone who wants to have a go.

    KR

          Pb Received on Tue Sep 02 2003 - 07:46:20 CEST

Original text of this message