Path: news.easynews.com!easynews!news.he.net!cyclone-sf.pbi.net!64.245.249.51!sfo2-feed1.news.algx.net!allegiance!logbridge.uoregon.edu!newsfeed.stanford.edu!postnews1.google.com!not-for-mail
From: 71062.1056@compuserve.com (--CELKO--)
Newsgroups: comp.databases.theory
Subject: Re: Bags versus sets; are they needed?
Date: 2 Apr 2002 12:32:19 -0800
Organization: http://groups.google.com/
Lines: 87
Message-ID: <c0d87ec0.0204021232.43950d90@posting.google.com>
References: <61c84197.0204020335.5aa9c00f@posting.google.com>
NNTP-Posting-Host: 207.8.52.210
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 8bit
X-Trace: posting.google.com 1017779539 23287 127.0.0.1 (2 Apr 2002 20:32:19 GMT)
X-Complaints-To: groups-abuse@google.com
NNTP-Posting-Date: 2 Apr 2002 20:32:19 GMT
Xref: easynews comp.databases.theory:20346
X-Received-Date: Tue, 02 Apr 2002 13:30:05 MST (news.easynews.com)

Remember that the original SQL databases were implementated on
sequential file systems, so the duplicate rows were actually duplicate
records under the covers.  Cleaning out redundant duplicates was hard,
but they were also seen as information.

In his last book on the Relational Model, Dr. Codd introduced a
"degree of Duplication" operator which behaves something (but exactly)
like a GROUP BY or COUNT(*).

When the question of duplicates came up in SQL committee, we decided
to leave it in the standard.  The example we used internally, and
which Len Gallagher used in a reply letter to DATABASE PROGRAMMING &
DESIGN magazine, was a cash register receipt with multiple occurrences
of cans of cat food on it.  That is how this got to be the "cat food
problem" in the literature.

The fundamental question is; What are you modeling in a table?  Codd
and Date's position is that a table is a FACT.  Dr. Codd pointed out
that, once one states that a fact is true, it serves no useful purpose
to state it again.

The other position is that a table can represent an ENTITY or a
RELATIONSHIP among entities.  With that approach, a duplicate row
means more than one occurrence of an entity.

I now can deal with different fundamental relationships among
"duplicates", such as:

Identity = "Clark Kent is Superman!"  We really have only one entity,
but multiple expressions of it.  These expression are not
substitutable (Clark Kent does not fly until he changes into
Superman).

Equality = "Two plus two is four." We really have only one entity with
multiple expressions that are substitutable.

Equivalency = "You use only half as much Concentrated Sudso as your
old detergent to get the same cleaning power!"  We have two distinct
entities, substitutable both ways under all conditions.

Substitutability = "We are out of Brandy; would you like a vodka?" We
have two distinct entities, whose replacement for each other is not
always in both directions or under all conditions (do not make a lemon
sauce with a cup of vodka).

Then we are into philosophy and arguing about about levels of
aggregation.  Is a forest more than a collection of individual trees?

Dr. Codd later added a "degree of duplication" operator to his model
as a way of handling duplicates when he realized that there is
information in duplication that has to be handled. The DoD is not
exactly a COUNT(*) or a quantity column in the relation.  It does not
behave like a numeric column. For example, given two tables A and B

 A         B
 item qty  item qty
 ========  ========
  a    4    a   7
  b    3    b   3

when I UNION them, I eliminate duplicates and get a result of

 A UNION B
 item qty
 ========
  a    4
  b    3
  a    7

Assume that the quantity is replaced by the DoD and then do a UNION:

 A UNION B
 item  DoD
 =========
 a      11
 b       6

See the difference? It is an operator, not a value. 

Having said all of this, I try to only use duplicate rows for loading
data into a SQL database from legacy sources. It is very frequent when
you get data from the real world -- like cash register tapes.

Otherwise, I might leave duplicates in results because using a SELECT
DISTINCT to remove them will: (1) cost too much sorting time and (2)
the sort will force an ordering in the working table which results in
a bad performance hit later.