Re: does a table always need a PK?
Date: 1 Sep 2003 12:47:43 -0700
Message-ID: <57da7b56.0309011147.5e13fa52_at_posting.google.com>
Lauri Pietarinen <lauri.pietarinen_at_atbusiness.com> wrote in message news:<3F531A01.10709_at_atbusiness.com>...
[ snip ]
> However, this is now 20 years later and we are fixed with SQL for the
> time being. So would
> you agree with me that the reasons for not "fixing" this problem
> (duplicates) has more to do
> with large installed base and commercial interests than technology?
Building non-trivial systems is certainly hard (which means expensive). The challenge you have is to justify the expense. And on a list of customer problems with 'R'DBMS products duplicate rows doesn't loom nearly as large as, say, reliability, scalability and the poor tooling we surround the products with. So that's where the investment and research goes.
>
> You must be familiar with BS12 of which Darwen was one of the architects
> (http://www.mcjones.org/System_R/bs12.html).
I was not. Interesting read.
> Of course, BS12 did not have a very long life so I suppose there is not
> so much data on how
> it faired in the "real world".
Another system sacrificed on the alter of 'standards'. (Which is a whole other rant.)
[ snip ]
> Can you help me? Why do so many posters (some of who are
> db-researchers, some of who are implementers)
> claim that the bag model poses no problems? Or is this question
> undecidable?
Well, my perception is that the situation is as follows:
Q: Boris, what's a set?
- Well Gretchen, a set is a group without duplicate elements.
- And what's a bag?
- A group that can have duplicate elements.
- What can you do with sets and bags?
- A number of operations we will call 'group operators' like selection, union, disjunction, intersection and so on.
- The same operations for both sets and bags?
- Yup. Pretty much.
- Pretty much?
- Well, the operators have different results when applied to sets and to bags.
- Can you give an example?
- Sure. Suppose I have two groups: A := {'x,'y,'z} and B := {'v,'w'x}. A and B can be either a set or a bag. Now, what happens when you combine these two groups with a union? Well, group operators are "closed". That is, whatever kind of thing(s) an operator takes as input(s) is the same kind of thing it produces as output. Sets yield sets and bags yield bags. So if A and B are bags, then A BAG_UNION B (let's call it C) := {'x,'y,'z,'v,'w,'x}. But if A and B are sets the result won't include the redundant 'x value. So the answer would be A SET_UNION B := {'v,'w,'x,'y,'z}. All of the group operators for sets and bags have this flavor.
- Boris - that's not such a big difference.
- The hell it is, Gretchen! It's a really big deal.
- Why?
- Well, remember what we're trying to do here. We're trying to build a computer program to help remember and reason about facts. Now, a fact is a singular thing. It doesn't matter how often you say it, or write it down, it's only ever true 'once'. Bags then, they're kind of conceptually inefficient vessels for facts and they can cause ambiguities. For example, how many 'things' are there in C? With a set the answer is obvious: 5. But with a bag, is the answer 5 or 6?
- But all of the group operators work with both?
- Sure. And we can figure out how the operators work with each other to build more complex expressions, how to structure them efficiently, and so on. Look - defining and implementing a bag algebra is not difficult. Implementation is more awkward because there are things you can assume about sets that you can't assume about bags, and these missing assumptions complicate the program. But it's possible.
- So, if sets are simpler to implement and are better suited for the task, why are bags used instead?
- Well, remember that little thing, 'closure'? It's really, really important. It says that whenever you take a set (or a bag), you give a set (or a bag). And if you don't, if you get it wrong and give a bag instead of a set, then a compound expression can return bogus answers. And that's very, very bad.
- I can see that. But what's it got to do with why bags are preferred over sets?
- Closure means that you can't have a bag edge between any of the physical programs used to implement a set algebra. Sometimes you can tell when an intermediate result is going to be a set from the properties of the sets involved in the compound expression. But often you can't. And when you can't you've got to start adding code to eliminate the duplicates from the edges.
- Why is that so bad? It gives you the right answer.
- It sure does. But it slows the system down, and reduces its overall scalability. It slows it down because duplicate elimination is a "blocking" operation. All of the input to the operator must be examined before the first output can be handed on. It also means that the physical operator must hold onto all the intermediate results, consuming memory as it does so. Bags don't do this. You can start giving results back straight away with a bag algebra.
- And fast is important? Isn't it more important to be right than fast?
- In my value system Gretchen, yes. But it's apparently not a widely held sentiment in this hard, cruel world. I've heard business people say that it's more important that a customer get a product, *any* product, quickly, than that they get the right product eventually.
- Oh. I see.
- Yes. Building non-trivial computer programs is expensive, and risky. And it turns out to require a lot more than a sound knowledge of the underlying theory. Folk who write big checks don't like taking risks. If you want to build a system using a set algebra you're going to have to argue that getting it right is more important than doing it fast, and that's pushing on rope where the check writers of the world are concerned.
>
> >But they have negative implications.
> > And it isn't reasonable to ignore their dark side.
> >
> It would be interesting to see a current assesment on the negative
> implications.
Take a standard query workload. Add a DISTINCT to all of the queries. Measure a) the difference in performance/throughput, and b) the difference in the results. I'm gonna bet that the DISTINCT is a lot slower and that there is little difference between the results. This is not a completely fair comparison, though, as there are queries where the difference between bag and set over an intermediate result is going to yield different answers and cases where a smart system would just shrug at the DISTINCT because it knows the results will be distinct anyway.
KR
Pb Received on Mon Sep 01 2003 - 21:47:43 CEST