Re: Bags versus sets; are they needed?

From: --CELKO-- <71062.1056_at_compuserve.com>
Date: 2 Apr 2002 12:32:19 -0800
Message-ID: <c0d87ec0.0204021232.43950d90_at_posting.google.com>


Remember that the original SQL databases were implementated on sequential file systems, so the duplicate rows were actually duplicate records under the covers. Cleaning out redundant duplicates was hard, but they were also seen as information.

In his last book on the Relational Model, Dr. Codd introduced a
"degree of Duplication" operator which behaves something (but exactly)

like a GROUP BY or COUNT(*).

When the question of duplicates came up in SQL committee, we decided to leave it in the standard. The example we used internally, and which Len Gallagher used in a reply letter to DATABASE PROGRAMMING & DESIGN magazine, was a cash register receipt with multiple occurrences of cans of cat food on it. That is how this got to be the "cat food problem" in the literature.

The fundamental question is; What are you modeling in a table? Codd and Date's position is that a table is a FACT. Dr. Codd pointed out that, once one states that a fact is true, it serves no useful purpose to state it again.

The other position is that a table can represent an ENTITY or a RELATIONSHIP among entities. With that approach, a duplicate row means more than one occurrence of an entity.

I now can deal with different fundamental relationships among
"duplicates", such as:

Identity = "Clark Kent is Superman!" We really have only one entity, but multiple expressions of it. These expression are not substitutable (Clark Kent does not fly until he changes into Superman).

Equality = "Two plus two is four." We really have only one entity with multiple expressions that are substitutable.

Equivalency = "You use only half as much Concentrated Sudso as your old detergent to get the same cleaning power!" We have two distinct entities, substitutable both ways under all conditions.

Substitutability = "We are out of Brandy; would you like a vodka?" We have two distinct entities, whose replacement for each other is not always in both directions or under all conditions (do not make a lemon sauce with a cup of vodka).

Then we are into philosophy and arguing about about levels of aggregation. Is a forest more than a collection of individual trees?

Dr. Codd later added a "degree of duplication" operator to his model as a way of handling duplicates when he realized that there is information in duplication that has to be handled. The DoD is not exactly a COUNT(*) or a quantity column in the relation. It does not behave like a numeric column. For example, given two tables A and B

 A B
 item qty item qty

when I UNION them, I eliminate duplicates and get a result of

 A UNION B
 item qty



  a 4
  b 3
  a 7

Assume that the quantity is replaced by the DoD and then do a UNION:

 A UNION B
 item DoD


 a      11
 b       6

See the difference? It is an operator, not a value.

Having said all of this, I try to only use duplicate rows for loading data into a SQL database from legacy sources. It is very frequent when you get data from the real world -- like cash register tapes.

Otherwise, I might leave duplicates in results because using a SELECT DISTINCT to remove them will: (1) cost too much sorting time and (2) the sort will force an ordering in the working table which results in a bad performance hit later. Received on Tue Apr 02 2002 - 22:32:19 CEST

Original text of this message