Re: Question on duplicates

From: --CELKO-- <71062.1056_at_compuserve.com>
Date: 4 Mar 2002 14:03:34 -0800
Message-ID: <c0d87ec0.0203041403.41c5c7bc_at_posting.google.com>


>> What I am having trouble understanding is, if for example, you have
a
primary key value in every relation/table, surely you will have eliminated
the problem of duplicates? so why the argument? <<

  1. Your query results can have redundant duplicates. Do those duplicates have any meaning or not?
  2. If you use an auto-numbering "feature" in your database to create those primary key values, how do you know that they are not duplicates in the reality you are trying to model? Insert the same data 100,000 times into a table using the Sybase/SQL Server family IDENTITY colukmn for a key and watch your data integrity get instantly destroyed.

When the question of duplicates came up in SQL committee, we decided to leave it in the standard. The example we used internally, and which Len Gallagher used in a reply letter to DATABASE PROGRAMMING & DESIGN magazine, was a cash register receipt with multiple occurrences of cans of cat food on it. That is how this got to be the "cat food problem" in the literature.

The fundamental question is; What are you modeling in a table? Codd and Date's position is that a table is a FACT. Dr. Codd pointed out that, once one states that a fact is true, it serves no useful purpose to state it again.

The other position is that a table can represent an ENTITY or a RELATIONSHIP among entities. With that approach, a duplicate row means more than one occurrence of an entity.

I now can deal with different fundamental relationships among "duplicates", such as:

Identity = "Clark Kent is Superman!" We really have only one entity, but multiple expressions of it. These expression are not substitutable (Clark Kent does not fly until he changes into Superman).

Equality = "Two plus two is four." We really have only one entity with multiple expressions that are substitutable.

Equivalency = "You use only half as much Concentrated Sudso as your old detergent to get the same cleaning power!" We have two distinct entities, substitutable both ways under all conditions.

Substitutability = "We are out of wine; would you like a vodka?" We have two distinct entities, whose replacement for each other is not always in both directions or under all conditions (do not make a lemon sauce with a cup of vodka).

Dr. Codd later added a "degree of duplication" operator to his model as a way of handling duplicates when he realized that there is information in duplication that has to be handled. The DoD is not exactly a COUNT(*) or a quantity column in the relation. It does not behave like a numeric column. For example, given two tables A and B

 A B
 x y x y

  • ==== a 4 a 7 b 3 b 3

when I UNION them, I eliminate duplicates and get a result of

 A UNION B
 x y



 a 4
 b 3
 a 7

Assume that column y is replaced by the DoD and then do a UNION:

 A UNION B
 x DoD



 a 11
 b 6

See the difference? It is an operator, not a value.

Having said all of this, I try to only use duplicate rows for loading data into a SQL database from legacy sources. It is very frequent when you get data from the real world -- like cash register tapes.

Otherwise, I might leave duplicates in results because using a SELECT DISTINCT to remove them will: (1) cost too much sorting time and (2) the sort will force an ordering in the working table which results in a bad performance hit later. Received on Mon Mar 04 2002 - 23:03:34 CET

Original text of this message