Re: duplicate rows

From: --CELKO-- <71062.1056_at_compuserve.com>
Date: 24 Feb 2002 08:50:26 -0800
Message-ID: <c0d87ec0.0202240850.428d7b88_at_posting.google.com>


>> 'Why Duplicate rows are prohibited' ...I am just wondering what you all think of the matter? <<

You might want to look at the website www.dbdebunk.com for an update of Chris Date's work. I defend duplicate rows in my books. Let me give you a section of DATA & DATABASES:

04.6 Duplicates

"Entities are a state of mind. No two people agree on what the real world view is." -- Metaxides

" ... an n-ary relation R has the following properties: ... (3) All rows are distinct ..." -- E. F. Codd

The idea of a key on a table is central to the relational model. The purpose of a key is to identify each row uniquely, and, from that property, you can build the rules for the normal forms. The terminology has changed a bit since Codd's first papers. All of the things that Codd was calling primary keys now have more precise terms: All possible keys in a table are called candidate keys; the chosen one is the primary key and the unchosen ones are the alternate keys. The ideas of duplication and uniqueness are central to the way that people think and deal with the world, so your database model should handle these concepts if it is going to reflect the real world.

In the real world, no two entities are exactly alike, but you ignore the differences in order to construct classes and reason with abstractions. You can build the classes using some criteria for matching entities against each other. There are several ways to do this:

Identity = "Clark Kent is Superman!" You have two or more names for exactly the same entity. This is the strongest form of matching.

Equality = "Five kilograms of rubber weigh the same as five kilograms of gold." You have an attribute in both entities that has the same value according to some test or scale. However, the entities are separate and might not match on other attributes, such as current market price or electrical conductivity.

Equivalency = "One teaspoon of concentrate makes one cup of formula." One entity can be transformed into the other in some well-defined manner. This is not quite the same as equality, because some outside agency is needed for the transformation. In this case, you must add water to the concentrate.

Substitutability = "We don't have beer; would you like a glass of wine?" One entity can replace the other in some operation, but it has a distinct identity of its own, is not transformed, and does not have to match exactly on an attribute (i.e., beer and wine do not taste alike, but both are in the superclass of potables).

Relational databases are built on the assumptions of identity or equality. How you treat duplicates in a relational database depends on whether you use identity or equality. I'll explain this statement by looking at the three methods used in relational database models to handle duplication:

  1. Remove duplicates automatically.
  2. Allow duplicates.
  3. Reduce duplicates to a count of the members in the class.

04.6.1 Allow Duplicates

This is the SQL solution. The rationale for allowing duplicate rows was best defined by David Beech in an internal paper for the ANSI X3H2 Committee and again in a letter to DATAMATION ("New Life for SQL," February 1989). This is now referred to as the "cat food argument" in the literature. The name is taken from the example of a cash register slip, where you find several rows, each of which lists a can of cat food at the same price. To quote from the original article:

"For example, the row 'cat food 0.39' could appear three times [on a supermarket checkout receipt] with a significance that would not escape many shoppers...At the level of abstraction at which it is useful to record the information, there are no value components that distinguish the objects. What the relational model does is force people to lower the level of abstraction, often inventing meaningless values to be inserted in an extra column whose purpose is to show what we knew already, that the cans of cat food are distinct."

All cans of cat food are interchangeable, so they have no natural unique identifier. The alternative of tagging every single can of cat food in the database with a unique machine-readable identifier which is pre-printed on the can or keyed in at the register is not only expensive and time-consuming, but it adds no real information to the data model. In the real world, you collect the data as it comes in on the cash register slip, and consolidate it when you debit the count of cans of cat food in the inventory table. The cans of cat food are considered equivalent, but they are not identical.

You also encounter this situation when you do a projection on a table whose result is made up of non-key columns. Counting and grouping queries also imply that duplicate rows exist in a "separate, but equal" way; that is, you treat them as a class or a multiset. Let's make this more concrete with the following two tables:

CREATE TABLE Personnel
(emp CHAR(30) NOT NULL PRIMARY KEY,
 dept CHAR(8) NOT NULL);

CREATE TABLE Automobiles
(owner CHAR(30) NOT NULL,
 tag CHAR(10) NOT NULL,
 color CHAR(5) NOT NULL,
 PRIMARY KEY(owner, tag));

You can use these tables to answer the question: "Do more employees in the accounting department than in the advertising department drive red cars?" You can answer this quickly with the following query:

SELECT dept, COUNT(*)
 FROM Personnel, Automobiles
 WHERE owner = emp
  AND color = 'red'
  AND dept IN ('acct', 'advert')
GROUP BY dept;

Try to do this without knowing that people can own more than one car and that a department has more than one employee! Duplicate values occur in both projections and joins on these tables.

04.6.2 Disallow Duplicates

This is Chris Date's relational model. Date has written several articles on the removal of duplicates (for example, "Toil and Trouble", DATABASE PROGRAMMING & DESIGN, January 1994, and "Why Duplicates are Forbidden," Chapter Four, RELATIONAL DATABASE WRITINGS 1985-1989, Addison-Wesley, 1990). Date's model says that values are drawn from a particular domain, which is a set of scalars. This means that when a column defined on the color domain uses the value "red", it is using the "red" in the domain, and that value occurs once. There might be many occurrences of references to "red" or "3" or "1996-12-25", but they are all pointing to the only red, the only integer three, and the only Christmas Day in 1996. Domains are based on an identity concept and disallow duplicate values. This is the same argument that mathematicians get into about pure numbers.

Date's example of the problems of duplicates uses the following two tables:

 Parts           SupParts
 pno  pname      supno  pno
 ==========      ==========
 p1   screw      s1     p1
 p1   screw      s1     p1
 p1   screw      s1     p2
 p2   screw

and then attempts to write SQL to comply with the criterion: "List the part numbers of screws or parts that come from supplier s1 (or both)." He produces a dozen different queries, each arguably a correct reading of the criterion, that return different numbers of rows. For example, if you assume that a part must have at least one supplier, you can write:

SELECT P.pno
  FROM Parts AS P, SupParts AS SP
 WHERE (SP.supno = 's1' AND SP.pno = P.pno)
    OR P.pname = 'screw';

which gives the result:

 pno
 ===
 p1
 p1
 p1
 p1
 p1   <- 9 duplicates
 p1
 p1
 p1
 p1
 p2
 p2   <- 3 duplicates
 p2

However, the more direct query that translates an OR into a UNION would give:

SELECT pno
  FROM Parts
 WHERE pname = 'screw'
UNION
SELECT pno
  FROM SupParts
 WHERE supno = 's1';

with the desired results:

 pno
 ===
 p1
 p2

The real problem is that you can assign no meaning to the duplicates of (p1, screw) and (s1, p1) rows. If the Parts table models a list of shipments as they come in, then each row would be a shipment (which is what you did with cat food). But if you assume that SupParts is a similar shipment list, then each row in Parts would have to map to a row in SupParts. They don't match up, so the data is wrong. Even if they did match up, you would have put one fact in two places, which is always an incorrect model.

To answer the question, "Which supplier sent us the most screws?" you must have an explicit quantity column in Date's relational model. This is consistent with the idea that any property of an entity should be shown as an explicit column in a table.

In SQL, you can get the effects of duplicate elimination with SELECT DISTINCT, aggregate functions with the DISTINCT option, the UNION operator, and careful design of queries around joins on UNIQUE and PRIMARY KEY columns. It is also possible for an optimizer to eliminate duplicates in certain subquery predicates, such as the IN predicate.
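The difference between the raw multiset and its deduplicated projection is easy to see in any SQL; here is a sketch using sqlite3, with an invented Register table standing in for the checkout-receipt example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE Register
(item CHAR(15) NOT NULL,
 price DECIMAL(5,2) NOT NULL)""")
conn.executemany("INSERT INTO Register VALUES (?, ?)",
                 [('cat food', 0.39)] * 3 + [('tuna', 1.19)])

# The raw projection keeps all four rows, cat food duplicates included.
raw = conn.execute("SELECT item FROM Register").fetchall()
# SELECT DISTINCT collapses the three cat food rows into one.
distinct = conn.execute(
    "SELECT DISTINCT item FROM Register ORDER BY item").fetchall()
# COUNT with and without DISTINCT shows both views at once.
tallies = conn.execute(
    "SELECT COUNT(item), COUNT(DISTINCT item) FROM Register").fetchone()

print(len(raw), distinct, tallies)  # 4 [('cat food',), ('tuna',)] (4, 2)
```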

04.6.3 Consolidate Duplicates

Dr. Codd sent a paper to the ANSI X3H2 Database Standards Committee several years ago in which he proposed a "degree of duplication" function, which I will show as dod(*), in keeping with SQL syntax conventions. He wanted the function to return the number of duplicates as part of the results when you execute a query that produces duplicate rows, instead of dropping them. This function could produce the same results as the GROUP BY and COUNT(*) in SQL, but in a different manner. Let's do the automobile problem ("Do more employees in the accounting department than in the advertising department drive red cars?") in relational algebra, with a dod(*) function:

Q1 := PROJECT Automobiles(owner, color, dod(*))
        WHERE color = 'red';

Q2 := PROJECT Personnel(emp, dept)
        WHERE dept = 'acct' OR dept = 'advert';

Q3 := Q1 JOIN Q2 ON emp = owner;

Q4 := PROJECT Q3(dept, dod(*));

Assume you have the following data:

 Automobiles                Personnel
 owner    tag  color        emp     dept
 =====================      ================
 'John'   123  'red'        'John'  'advert'
 'John'   122  'green'      'Sam'   'acct'
 'Sam'    111  'red'        'Mary'  'acct'
 'Mary'   345  'red'        'Mel'   'sales'
 'Mary'   678  'red'

The intermediate steps look like this:

 Q1 - get the 'red' car owners
 owner    color   dod(*)
 =======================
 'John'   'red'     1
 'Sam'    'red'     1
 'Mary'   'red'     2

 Q2 - get the employees
 emp     dept
 ================
 'John'  'advert'
 'Sam'   'acct'
 'Mary'  'acct'

 Q3 - join them
 owner    color   dod(*)   emp     dept
 ========================================
 'John'   'red'     1      'John'  'advert'
 'Sam'    'red'     1      'Sam'   'acct'
 'Mary'   'red'     2      'Mary'  'acct'

 Q4 - retain departments and counts
 dod(*)  dept
 ================
  1      'advert'
  1      'acct'
  2      'acct'

Oops, the dod(*) cannot operate like a regular function value in joins and projections! You need a rule that says that rows differing by only a dod(*) column are replaced automatically with a single row containing the sum of their dod(*) columns. In this case, what you want is:

 dod(*)  dept
 ================
  1      'advert'
  3      'acct'

I am not going to try to figure out all the rules for those cases in which a JOIN produces two or more dod(*) columns in intermediate result tables, or in which you try to join two intermediate results on their respective dod(*) columns.

This approach recognizes that there are occurrences of values, but it puts them into a single collection. That is, where SQL sees three separate but equivalent cans of cat food, Date's model sees a single class of canned cat food, and Codd's model sees a collection of three cans of cat food.
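Codd's dod(*) is not in SQL, but you can emulate the consolidation rule with nested grouping: compute the per-owner counts first (playing the role of Q1's dod(*) column), then sum them when you group by department. A sketch in sqlite3, using the sample data above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Personnel
(emp CHAR(30) NOT NULL PRIMARY KEY,
 dept CHAR(8) NOT NULL);

CREATE TABLE Automobiles
(owner CHAR(30) NOT NULL,
 tag CHAR(10) NOT NULL,
 color CHAR(5) NOT NULL,
 PRIMARY KEY (owner, tag));

INSERT INTO Personnel VALUES
 ('John', 'advert'), ('Sam', 'acct'), ('Mary', 'acct'), ('Mel', 'sales');
INSERT INTO Automobiles VALUES
 ('John', '123', 'red'), ('John', '122', 'green'),
 ('Sam', '111', 'red'), ('Mary', '345', 'red'), ('Mary', '678', 'red');
""")

# Inner query = Q1 with its dod(*) column; outer SUM() applies the
# rule that rows differing only in dod(*) merge into one row whose
# dod(*) is the sum.
rows = conn.execute("""
SELECT dept, SUM(n)
  FROM (SELECT owner, COUNT(*) AS n
          FROM Automobiles
         WHERE color = 'red'
         GROUP BY owner) AS Q1, Personnel
 WHERE emp = owner
   AND dept IN ('acct', 'advert')
 GROUP BY dept
 ORDER BY dept;
""").fetchall()
print(rows)  # [('acct', 3), ('advert', 1)]
```

This reproduces the consolidated Q4 table: one red-car owner in advertising, three red cars in accounting.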

04.6.4 Row Identifiers

The solution for the problems of duplicates in many SQL products is to expose some of the physical representation of the tables to the users. Oracle lets its SQL see and record the physical locations of the rows in the tables on the disk; Informix has a similar construct. The ROWID guarantees that all rows are unique, and you can remove redundant duplicates by keeping the one in each group with the lowest ROWID value. But there are serious problems with a row identifier.
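SQLite gives every ordinary table an implicit rowid, which is close enough to the Oracle construct to demonstrate the deduplication idiom just described; the Register table and its rows are invented for the illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE Register
(item CHAR(15) NOT NULL,
 price DECIMAL(5,2) NOT NULL)""")
conn.executemany("INSERT INTO Register VALUES (?, ?)",
                 [('cat food', 0.39)] * 3 + [('tuna', 1.19)])

# Keep only the row with the lowest rowid in each group of duplicates.
conn.execute("""
DELETE FROM Register
 WHERE rowid NOT IN (SELECT MIN(rowid)
                       FROM Register
                      GROUP BY item, price);
""")
remaining = conn.execute("SELECT COUNT(*) FROM Register").fetchone()[0]
print(remaining)  # 2
```

The trick works, but only because the engine quietly numbered the rows for you, which is exactly the dependence on physical representation criticized below.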

First, you are now committed to a physical storage model which has to put rows in physical pages; you cannot have a model which uses bit vectors, inverted lists, dynamic hashing or other techniques which do not have a table represented as a contiguous storage of actual values.  These approaches work better with VLDBs (Very Large Databases) and with distributed databases.

Reorganizing the database changes these row identifiers, so any program that saved them is invalidated. Building a distributed database becomes considerably harder because different physical devices can hold similar addresses.

There is also a philosophical question about changing the primary key of a row but not the row identifier. Object-oriented people have the concept of an object identifier which remains constant in the face of changes to all the other attributes, including key attributes. If I treat the row identifier like an object identifier, then I have to ask more questions. If I have an axe and change the handle, is it still the same axe? If I then change the head, is it still the same axe? If I replace the whole axe all at once, is it still the same axe?

Another approach is to allow a column in a table to be declared with a special property that automatically numbers the rows. In the Sybase family, they use the keyword IDENTITY on a column. But maintaining this column is another issue in itself.

A third approach is to provide a function which numbers a column in a set of rows. This is the NUMBER(*) function in WATCOM SQL. While not strictly a column property, it can be used to attach a unique identifier to a table. You have to ensure uniqueness with a column constraint on the table. The problem is that the numbering is dependent upon the physical ordering of the data in the sets, which can be completely random.

These approaches all seem to fail when the technology changes because they tie the backend to a traditional "files and records" data model.

04.6.5 Uniqueness

There is a rule of thumb in SQL database design that says that all base tables should have a primary key. This will avoid a lot of the problems involved with duplicates and will make your SQL database look more like a pure relational database. However, there is a little problem in that a primary key in relational database theory means one chosen from several candidate keys. In SQL, the keywords PRIMARY KEY (I will uppercase the keywords to differentiate the SQL construct from the relational theory term) imply other things.

In SQL, I might not want to have a PRIMARY KEY, but instead use multiple candidate keys (shown by NOT NULL and UNIQUE constraints), because many systems will see the keywords PRIMARY KEY and set up access based on those columns. The PRIMARY KEY is the default target for the FOREIGN KEY...REFERENCES constraint clause, and many SQL products will set up special index structures (clustering, hashing, inverted lists, and so on) that favor joins and accesses from the referencing table to those columns. Consider a table for a school schedule:

CREATE TABLE Schedule
(period INTEGER NOT NULL,
 teacher CHAR(15) NOT NULL,
 room INTEGER NOT NULL,
 CONSTRAINT tr UNIQUE (teacher, room), -- candidate keys
 CONSTRAINT pr UNIQUE (period, room),
 CONSTRAINT pt UNIQUE (period, teacher),
 CONSTRAINT ptr UNIQUE (period, teacher, room));

Yes, the rules imposed by the UNIQUE constraints are a bit weird, but bear with me. The following is one possible solution set that does not violate any of the four constraints:

 Schedule
 period  teacher   room
 ======================
    1    'Curly'    101
    1    'Larry'    102
    1    'Moe'      103
    2    'Curly'    102
    2    'Larry'    101
    3    'Curly'    103
    3    'Moe'      101

I constructed this table by attempting to insert all 27 possible rows (3 teachers x 3 rooms x 3 periods) into the table. This is a handy, if inelegant, testing trick for a table with multiple constraints.

Which UNIQUE constraint should be made into the PRIMARY KEY? And how did you decide? The relational model does not have to worry about performance, but you do. At first glance, it looks like the ptr constraint implies the other three constraints; but it does not. The ptr constraint by itself would allow all 27 possible rows to appear in the table.

Using the ptr constraint as a PRIMARY KEY violates another rule of thumb: A PRIMARY KEY should not be a super key. A super key is defined as a set of columns that is a key and also contains a subset that is a key (in other words, it's too fat). This rule of thumb has the practical advantage of leading the designer toward shorter keys, which will become smaller indexes, which will search faster. It sounds good so far.

If you follow this rule, which of the two-column constraints do you use as the PRIMARY KEY? Unless I have a good reason, I would assume that searches are equally likely to use any pair or all three columns. That means I would use all four constraints, but not declare any of them to be the PRIMARY KEY. Yes, updates and insertions would be slow, but my searches would have the best average search time possible.

The usual next step is to try to break this table into smaller tables and then reconstruct the original table from them. However, this does not work. Break the table into three two-column tables, each of which uses a two-column constraint as its PRIMARY KEY, and insert some legal rows into each of them.

CREATE TABLE period_teacher
(period INTEGER NOT NULL,
 teacher CHAR(15) NOT NULL,
 PRIMARY KEY (period, teacher));

INSERT INTO period_teacher
VALUES (1, 'Curly'),
       (1, 'Larry'),
       (1, 'Moe'),
       (2, 'Curly'),
       (2, 'Larry'),
       (3, 'Larry'),
       (3, 'Moe');

Here are the other two tables. Then execute a query to join the three tables together, replacing the ptr constraint with a SELECT DISTINCT. (I have used a GROUP BY instead to get the count of the duplicates for each row.)

CREATE TABLE period_room
(period INTEGER NOT NULL,
 room INTEGER NOT NULL,
 PRIMARY KEY (period, room));

INSERT INTO period_room
VALUES (1, 101),
       (1, 102),
       (1, 103),
       (2, 101),
       (2, 102),
       (3, 101),
       (3, 103);

CREATE TABLE teacher_room
(teacher CHAR(15) NOT NULL,
 room INTEGER NOT NULL,
 PRIMARY KEY (teacher, room));

INSERT INTO teacher_room
VALUES ('Curly', 101),
       ('Curly', 102),
       ('Curly', 103),
       ('Larry', 101),
       ('Larry', 102),
       ('Moe', 101),
       ('Moe', 103);

The idea is to reconstruct the original table from these three derived tables. But your query will not work and you will get a lot of false data that did not exist in the original table. In fact, you can try the other three possible table joins and still not avoid false data.

SELECT PT.period, TR.teacher, PR.room, COUNT(*) AS dups
  FROM period_teacher AS PT,
       period_room AS PR,
       teacher_room AS TR
 WHERE PR.period = PT.period
   AND TR.room = PR.room
   AND TR.teacher = PT.teacher
 GROUP BY PT.period, TR.teacher, PR.room;

 Results
 period  teacher   room  dups
 ============================
    1    'Curly'    101    1
    1    'Curly'    102    1
    1    'Curly'    103    1
    1    'Larry'    101    1
    1    'Larry'    102    1
    1    'Moe'      101    1
    1    'Moe'      103    1
    2    'Curly'    101    1
    2    'Curly'    102    1
    2    'Larry'    101    1
    2    'Larry'    102    1
    3    'Larry'    101    1
    3    'Moe'      101    1
    3    'Moe'      103    1
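You can verify the lossy decomposition mechanically. This sketch loads the three two-column tables exactly as given and joins them; fourteen rows come back where a legal Schedule table could hold only seven:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE period_teacher
(period INTEGER NOT NULL, teacher CHAR(15) NOT NULL,
 PRIMARY KEY (period, teacher));
CREATE TABLE period_room
(period INTEGER NOT NULL, room INTEGER NOT NULL,
 PRIMARY KEY (period, room));
CREATE TABLE teacher_room
(teacher CHAR(15) NOT NULL, room INTEGER NOT NULL,
 PRIMARY KEY (teacher, room));

INSERT INTO period_teacher VALUES
 (1,'Curly'), (1,'Larry'), (1,'Moe'), (2,'Curly'), (2,'Larry'),
 (3,'Larry'), (3,'Moe');
INSERT INTO period_room VALUES
 (1,101), (1,102), (1,103), (2,101), (2,102), (3,101), (3,103);
INSERT INTO teacher_room VALUES
 ('Curly',101), ('Curly',102), ('Curly',103), ('Larry',101),
 ('Larry',102), ('Moe',101), ('Moe',103);
""")

rows = conn.execute("""
SELECT PT.period, TR.teacher, PR.room
  FROM period_teacher AS PT, period_room AS PR, teacher_room AS TR
 WHERE PR.period = PT.period
   AND TR.room = PR.room
   AND TR.teacher = PT.teacher;
""").fetchall()

print(len(rows))                   # 14
print((2, 'Curly', 101) in rows)   # a false row the original never held
```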

As an aside, many first-time SQL database designers working with entity-relationship modeling tools think that "if it is a table, then it is an entity." Look at the Schedule table: nothing in its possible keys implies an entity; rather, the table holds nothing but relationships among three entities.

04.6.6 Levels of Aggregation

Chuck Reinke wrote to C. J. Date in response to his column in DATABASE PROGRAMMING & DESIGN magazine, countering the cat food example with another example. He was interested in lab rats and ratlets. Whenever there is a new litter, he wants to create an entry for each new ratlet. When just born, they are indistinguishable, and tattooing is out of the question. Yet, as they grow, he can distinguish certain ones by color or behavior, and at that time he needs to assign them a unique key. This is usually done with colored markers on their tails.

Mr. Reinke argued that one should not record the ratlets prior to the stage at which they become distinguishable from their litter mates. Assigning each ratlet an arbitrary unique key implies non-existent information: that the ratlets are distinguishable (in the mind of God, perhaps).

Chris Date responded with the design for the ratlets problem:

CREATE TABLE Litters
(litter_id INTEGER NOT NULL PRIMARY KEY,
 ratlet_tally INTEGER NOT NULL);

CREATE TABLE Ratlets
(ratlet_id CHAR(15) NOT NULL PRIMARY KEY,
 litter_id INTEGER NOT NULL
           REFERENCES Litters(litter_id));

When there is a new litter, we make the obvious entry in Litters. When an individual ratlet becomes "interesting" (unlike Reinke, Date did not like the word "distinguishable", because distinguishability presupposes identity), we make the obvious entry in Ratlets.
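A minimal sketch of Date's two-table design in sqlite3 (the sample litter and the 'blue-tail' ratlet are invented for the illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE Litters
(litter_id INTEGER NOT NULL PRIMARY KEY,
 ratlet_tally INTEGER NOT NULL);

CREATE TABLE Ratlets
(ratlet_id CHAR(15) NOT NULL PRIMARY KEY,
 litter_id INTEGER NOT NULL
           REFERENCES Litters(litter_id));
""")

# A new litter of six indistinguishable ratlets: one row, one tally.
conn.execute("INSERT INTO Litters VALUES (1, 6)")
# Later, one ratlet becomes interesting and earns a key of its own.
conn.execute("INSERT INTO Ratlets VALUES ('blue-tail', 1)")

named = conn.execute(
    "SELECT COUNT(*) FROM Ratlets WHERE litter_id = 1").fetchone()[0]
tally = conn.execute(
    "SELECT ratlet_tally FROM Litters WHERE litter_id = 1").fetchone()[0]
print(named, tally)  # 1 6
```

The tally records how many cans of cat food, so to speak, while named rows exist only for the ratlets that have earned identities.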

Let us consider what we are actually doing in this case. One table is viewing a litter of ratlets as a single entity to be modeled in a table, while the other table is breaking the litter into its individual members. The Litters table is a set based on a collective noun, and the Ratlets table is based on a singular noun, if you want to think of it in those terms.

You see this same model in the GROUP BY operator, which creates a set at a higher level of aggregation from existing data. It is not a bad thing, per se, but you need to be aware of it and not mix levels of aggregation in a query or in the same table.

Received on Sun Feb 24 2002 - 17:50:26 CET
