Re: A Normalization Question

From: Neo <neo55592_at_hotmail.com>
Date: 12 Jul 2004 15:18:09 -0700
Message-ID: <4b45d3ad.0407121418.2a26f625_at_posting.google.com>


> > With respect to dbs, normalization is the process of eliminating or
> > replacing duplicate things with a reference to the original thing
> > being represented.
>
> Nope. That's just tokenizing data.

You have crossed to the physical layer when talking about tokenizing data. How a logical reference is implement physically is irrelevant at the logical level. If a web page allowed me to easily draw lines from one point to another, then I could show the refs as ---->'s instead of ->#'s.

> If this is normalization, can you tell me what data
> anomalies you avoid by doing this? I know Jan already
> asked you this, but apparently I missed your answer.

I'll repeat it. While there are three different things (person, color, street), each of them is named by the same string 'brown' and that string is redundant. To create an update anomaly, we need to perform an update. This is unusual with strings, because strings typically don't change and if they change we consider them to be a completely different string.

I realize the following are unusual examples, however a general data model can't (application above it can) have prejudices as to what updates are unusual. Suppose, the world is taken oven by Islam and they desire every string in a computer to be spelled backwards, thus 'brown' needs to be updated to 'nworb'. Or suppose, the French take over and want every string to end in an additional 't'. In the above tuple, updating one and not the others, creates an update anomaly. Below is approximately how XDb2 normalizes the three strings and updating it from 'brown' to 'nworb' or 'brownt' does not create an update anomaly.

Thing Person Color Street
1 ->2 ->3 ->4

Person Name
2 ->5

Color Name
3 ->5

Street Name
4 ->5

String Sym1 Sym2 Sym3 Sym4 Sym5 ....
5 ->6, ->7, ->8, ->9, ->10

Symbol

6      b
7      r
8      o
9      w
10     n

> > Within the context of a db, duplicate references are not considered
> > redundant because they are unrelated to the thing being represented.
>
> Ah, how convenient for you.

It isn't a matter of convenience. It's a matter of separation between logical and physical layer.

> How about numbers? Do you tokenize those? Given the
> following schema:
>
> create table person_pets (
> person_id numeric(5) not null,
> number_of_dogs int,
> number_of_cats int,
> number_of_fish int,
> primary key (person_id)
> )
>
> Given the tuple (12345, 2, 2, 2) what do you do with
> those horrific redundant 2's?

Below is approximately how XDb2 normalizes the redundant 2's.

Thing Person Dogs Cats Fishes
1 ->2 ->30 ->40 ->50

Person Name
2 ->5

Integer Symbol

30      ->60
40      ->60
50      ->60

String  Sym1 Sym2 Sym3 Sym4 Sym5 ....
5       ->6, ->7, ->8, ->9, ->10

Symbol
6      b
7      r
8      o
9      w
10     n
...
60     2

The hardware value of refs is a hardware issue, thus irrevalent for the purpose of normalizing at the logical level. While physical refs to the same thing typically have the same value on PCs, such need not be the case on other hardware (ie brain where the ref is a probably connection to the same neuron(s) and different connections to the same neuron(s) don't have the same value. Received on Tue Jul 13 2004 - 00:18:09 CEST

Original text of this message