In article <ulkigftlfm4i62_at_corp.supernews.com>,
Manatee Morton <manatee_at_nwlink.com> wrote:
>
>What would be a good defination of normalization? Could someone explain what
>the first through third normal forms are?
So you want a "Database normalization for Dummies", preferrably in just a
few paragraphs? :-) Let me give it a try. If you allow me I will present it
as a little dialogue.
Q: Can you explain to me what normalization is?
- I will tell you in a minute, but first let me tell you something about
the relational model. A fundamental assumption of the relational model is
that all the tuples in a relation represent a statement that (we hope) is
a fact. You probably already understood this but this is just so you
understand the terminology that follows.
- Oh, ok. So will you now tell me what normalization is?
- In a minute, but first let me explain that we can have different facts
mixed into the same relation. For example, consider the following two
relations:
Has_address(name, address) and
Has_date_of_birth(name, date_of_birth)
Both represent different facts but we can mix them also into one relation:
Person(name, address, date_of_birth)
Q: Sure, so we can mix different facts into the same relation. Is that a
problem?
- Not in this case, but sometimes it can be. Suppose we mix the following
two relations:
Has_address(name, address) and
Has_hobbies(name, hobby)
That results in:
Person2(name, address, hobby)
This is not a pleasant mix if you look for example at a possible content
of the relation:
name address hobby
--------------------------------
John Broadway 3 Origami
John Broadway 3 Ice Hockey
Mary Oak Ridge 2 Wrestling
Mary Oak Ridge 2 Scuba diving
Harry Broadway 3 Cooking
Harry Broadway 3 Origami
If we compare it with the original table there are some obvious problems
here: the address information is repeated unnecessarily and we cannot
store the address of somebody who doesn't have a hobby unless we use some
trick like NULL values.
Q: I see, but what has this to do with normalization?
- Normalization is about finding out if a relation contains a bad mix or
not and what we can do about it.
- Sounds useful, so how do we do that?
- By looking at the functional dependencies and the candidate keys, and if
you really want to know that your mix is without these problems then you
also have to look at the multi-valued dependencies and the join
dependencies. But I'll save that for another lesson.
- Great. So how can we see if a mix is bad?
- For a good mix it must hold that:
if there is a functional dependency A1,...,An -> B such that
- the set {A1,...,An} is minimal, i.e., B is not functionally dependent
upon a proper subset of this set, and
- the dependency is not trivial, i.e., B is not in the set {A1,...,An}
then {A1,...,An} must be a candidate key.
So if there is a functional dependency that satisfies 1. and 2. but
{A1,...,An} is not a candidate key, then it is a bad mix.
Q: Wait, wait! You haven't explained yet what functional dependencies and
candidate keys are!
- That's in your book. You can read that for yourself. If you have you will
see that our relation Person2 has only one candidate key ({name, hobby})
and only one functional dependency that satisfies 1. and 2. (name ->
address). Since {name} is not a candidate key the previous rule tells us
that this is a bad mix.
- But this doesn't look like the definition of 3NF in my book.
- There are several ways to define the normal forms, but the main reason is
probably that it defines the Boyce-Codd Normal Form and not 3NF. If we
don't consider the multi-valued and join-dependencies then it's the
strongest but also the easiest NF and it tells us exactly if there are
functional dependencies that cause redundancies or not. I will explain
more about 3NF later on.
- Ok. So now we know that it is a bad mix, what do we do about it?
- First note that a dependency actually corresponds with a certain fact in
the mix. In the Person2 relation it is the fact that a person lives at a
certain address that corresponds with the offending dependency. So what
we have to do is take this fact out of the mix and put it into its own
relation. This new relation will have the columns {A1,...,An,B} (in this
case {name, address}) and from the old table we remove the B column (so
we are left with {name, hobbie}).
After this we check again if the mix is bad and if it is we again give
the offending dependency its own relation.
Q: Ok. But if BCNF is such a great normal form, why then do we still have
3NF?
- Because if you normalize all the way to BCNF you sometimes unmix a little
too much.
- What does that mean exactly?
- That in the result there will be dependencies that hold between columns
in different relations and do not logically follow from dependencies
between columns within the relations.
- Why would that be a problem?
- Because these dependencies are database constraints that should be
maintained by the database. Dependencies that hold over columns in
different relations are usually computationally expensive to maintain.
- So how does 3NF prevent this?
- It allows certain dependencies to stay within the relation even though in
BCNF they would give them their own relation.
- Which dependencies?
- The ones where B is in at least one of the candidate keys?
- Why those?
- Because if you put them in a separate relation you are removing a column
from the relation that was part of a candidate key. For example, if we
have a relation R(a, b, c, d) with candidate key {a, b} and a dependency
c -> b then we should give c -> b its own relation and remove b from the
old relation. But then the dependencies corresponding to the candidate
key ( a, b -> c and a, b -> d) would hold over two relations. So you see
that if we remove a column that belongs to a candidate key we get
dependencies over more than one relation, and that is why 3NF doesn't
consider them offending dependencies.
- Ok. I think I understand. But you said that BCNF exactly indicated if
there were still redundancy problems or not, so that means that 3NF actually
still leaves some of those problems in the relations where BCNF would
remove them, right?
- Exactly. There is a trade-off here, either you may end up with
dependencies over different relations or you are left with redundancies
and anomalies.
- And there is no clever trick to get around that?
- No.
- Bummer.
- Yeah.
- Are you going to say something about 1NF and 2NF?
- Well, if you insist, but I'm running a bit short on time here, so I will
be brief.
- If you must.
- I must. The 1NF only says that the columns in you relations should
contain atomic values. This requirement is the same as one of the
fundamental assumptions of the relational model, so it actually just says
that your relations should satisfy the relational model.
- And the 2NF?
- That's actually not a very useful normal form but it is often used to
give an introduction to the 3NF.
- What's the differenct between 2NF and 3NF then?
- Where 3NF says that {A1,...,An} must be a candidate key, (i.e. is not a
proper subset and not a proper superset of a candidate key) the
requirement in 2NF is just that {A1,...,An} must not a proper subset of a
candidate key.
- I'm not sure if that helps me a lot to understand 3NF.
- Well, it helps a little to understand the usual 'pons asinorum' for 3NF
and BCNF: 'Every (non-key) attribute depends upon the key, the whole key
and nothing but the key. So help me Codd.'
- Sounds cute, but I'm not sure if I follow. What's this about *the* key? I
thought there could be more than one key!? Can you explain this a bit
more? And what is a pons asinorum anyway?
- Sorry. Really have to run now. See you next time. Bye bye.
- But, we haven't even discussed 4NF and 5NF yet! ... bye.
Kind regards,
Received on Wed Aug 14 2002 - 15:35:30 CDT