Re: Bags versus sets; are they needed?

From: Peter Koch Larsen <pkl_at_mailme.dk>
Date: 3 Apr 2002 05:35:39 -0800
Message-ID: <61c84197.0204030535.76d68d2a_at_posting.google.com>


71062.1056_at_compuserve.com (--CELKO--) wrote in message news:<c0d87ec0.0204021232.43950d90_at_posting.google.com>...
> Remember that the original SQL databases were implementated on
> sequential file systems, so the duplicate rows were actually duplicate
> records under the covers. Cleaning out redundant duplicates was hard,
> but they were also seen as information.
>
> In his last book on the Relational Model, Dr. Codd introduced a
> "degree of Duplication" operator which behaves something (but exactly)
> like a GROUP BY or COUNT(*).
I would like to hear more about that one - I will look for his book.
>
> When the question of duplicates came up in SQL committee, we decided
> to leave it in the standard. The example we used internally, and
> which Len Gallagher used in a reply letter to DATABASE PROGRAMMING &
> DESIGN magazine, was a cash register receipt with multiple occurrences
> of cans of cat food on it. That is how this got to be the "cat food
> problem" in the literature.
That problem could be solved easily - e.g. with a surrogate key. A more appropriate solution to this particular problem, though, would be an extra field "qty" holding the number of items. Don't you ever have a "buy three, pay for two" offer?
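Something along these lines (the table and column names are of course just made up for the example):

    CREATE TABLE ReceiptLine (
        receipt_no INTEGER     NOT NULL,
        item       VARCHAR(20) NOT NULL,
        qty        INTEGER     NOT NULL CHECK (qty > 0),
        PRIMARY KEY (receipt_no, item)
    );

    -- three cans of cat food become one row with qty = 3,
    -- not three duplicate rows:
    INSERT INTO ReceiptLine (receipt_no, item, qty) VALUES (1, 'cat food', 3);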
>
> The fundamental question is; What are you modeling in a table? Codd
> and Date's position is that a table is a FACT. Dr. Codd pointed out
> that, once one states that a fact is true, it serves no useful purpose
> to state it again.
>
> The other position is that a table can represent an ENTITY or a
> RELATIONSHIP among entities. With that approach, a duplicate row
> means more than one occurrence of an entity.
Again: just have an extra field - quantity. That seems like a far better solution than having each item represented many times in the same table.
>
> I now can deal with different fundamental relationships among
> "duplicates", such as:
>
> Identity = "Clark Kent is Superman!" We really have only one entity,
> but multiple expressions of it. These expressions are not
> substitutable (Clark Kent does not fly until he changes into
> Superman).
I'm not quite sure I understand your point here. Could this not be done even if duplicates were not allowed?
>
> Equality = "Two plus two is four." We really have only one entity with
> multiple expressions that are substitutable.
>
> Equivalency = "You use only half as much Concentrated Sudso as your
> old detergent to get the same cleaning power!" We have two distinct
> entities, substitutable both ways under all conditions.
Again, I'm not sure I understand your point. Or rather - I'm quite sure I do NOT understand your point!
>
> Substitutability = "We are out of Brandy; would you like a vodka?" We
> have two distinct entities, whose replacement for each other is not
> always in both directions or under all conditions (do not make a lemon
> sauce with a cup of vodka).
>
> Then we are into philosophy and arguing about levels of
> aggregation. Is a forest more than a collection of individual trees?
I'm not philosophical enough for that.
>
> Dr. Codd later added a "degree of duplication" operator to his model
> as a way of handling duplicates when he realized that there is
> information in duplication that has to be handled. The DoD is not
> exactly a COUNT(*) or a quantity column in the relation. It does not
> behave like a numeric column. For example, given two tables A and B
>
> A B
> item qty item qty
> ======== ========
> a 4 a 7
> b 3 b 3
>
> when I UNION them, I eliminate duplicates and get a result of
>
> A UNION B
> item qty
> ========
> a 4
> b 3
> a 7
>
> Assume that the quantity is replaced by the DoD and then do a UNION:
>
> A UNION B
> item DoD
> =========
> a 11
> b 6
>
> See the difference? It is an operator, not a value.
I do. But it is not an essential operator. You could do this (admittedly ugly) instead:
SELECT item, SUM(qty) FROM (SELECT 1 AS src, item, qty FROM A
    UNION SELECT 2 AS src, item, qty FROM B) AS T GROUP BY item;
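For completeness, the setup I have in mind (mirroring your example data; the multi-row INSERT syntax may need splitting on some engines):

    CREATE TABLE A (item CHAR(1), qty INTEGER);
    CREATE TABLE B (item CHAR(1), qty INTEGER);
    INSERT INTO A VALUES ('a', 4), ('b', 3);
    INSERT INTO B VALUES ('a', 7), ('b', 3);

The constant tag (1 or 2) keeps UNION's duplicate elimination from collapsing rows that come from different tables, so the GROUP BY yields a = 11 and b = 6 - the same answer as your DoD version.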
>
> Having said all of this, I try to only use duplicate rows for loading
> data into a SQL database from legacy sources. It is very frequent when
> you get data from the real world -- like cash register tapes.
>
> Otherwise, I might leave duplicates in results because using a SELECT
> DISTINCT to remove them will: (1) cost too much sorting time and (2)
> the sort will force an ordering in the working table which results in
> a bad performance hit later.
If the original query does contain duplicates, you must of course sort your data (or use some other means to remove them). But think of the end user (or program) too: it will have to perform the elimination instead (or work on the same data several times). So is this not just a question of where to place the workload? I also do not see how the projection forces a particular sort order per se, so I do not quite see where that performance hit would come from. I do, however, see that a performance hit could occur later if the redundancy was NOT removed in the earlier stage.
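To make the "where does the workload go" point concrete (the table and column names are just invented for the example):

    -- option 1: the server removes the duplicates, paying the sort or hash once:
    SELECT DISTINCT customer FROM Orders;

    -- option 2: ship the duplicates and let every client weed them out itself:
    SELECT customer FROM Orders;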

As an aside, I did a search for "cat food problem" on Google, and I can report that one of your articles came out in front. The second most interesting hit had more to do with porches and pets than with databases. Could you provide me with a link to more information about this problem? I would wag my tail then ;-))

Kind regards
Peter Koch Larsen