Re: How to normalize this?

From: Jan Hidders <hidders_at_gmail.com>
Date: Thu, 9 May 2013 13:09:26 +0200
Message-ID: <518b83e6$0$617$e4fe514c_at_dreader34.news.xs4all.nl>


On 2013-05-08 21:58:01 +0000, nvitacolonna_at_gmail.com said:

> On Wednesday, May 8, 2013 12:30:45 PM UTC+2, Erwin wrote:

>> Op dinsdag 7 mei 2013 16:50:03 UTC+2 schreef nvitac..._at_gmail.com het volgende:

>
>> "And, in the general case, there is no algorithm that is guaranteed to 
>> give you an answer!"
>> 
>> In this example, it seemed to me like the "pullback rule" was 
>> sufficient to find the answer.  Applying the pullback rule to a finite 
>> set of explicitly stated INDs, their sources and their targets, 
>> certainly seems like a feasible piece of algorithm.
>> 
>> So what is "the general case" ?

>
> The general case is when you allow a database schema to have FDs
> together with an arbitrary set of INDs, in particular INDs that may
> form cycles. The pullback rule is not enough to infer all the
> consequences. In fact, there is no complete recursively enumerable set
> of axioms for finite implication (that is, something like Armstrong
> axioms for FDs). For all the details, see:
>
> J. C. Mitchell. The implication problem for functional and inclusion
> dependencies. Information and Control, 56:154–173, 1983.
> M. A. Casanova, R. Fagin, and C. H. Papadimitriou. Inclusion
> dependencies and their interaction with functional dependencies.
> Journal of Computer and System Sciences, 28:29–59, 1984.
> A. K. Chandra and M. Y. Vardi. The implication problem for functional
> and inclusion dependencies is undecidable. SIAM J. Comput.,
> 14(3):671–677, Aug. 1985.
>
> I believe that these negative results have basically discouraged
> research on normal forms including INDs, although, for many subclasses,
> problems become decidable. Ling and Goh's approach, if my memory
> assists me, was to define a normal form based only on those
> dependencies derivable with the pullback rule (so, using an incomplete
> inference system); Mannila and Räihä's approach, later studied by
> Levene and Millist, was to impose that FDs and INDs do not interact (so
> that the implication problems for FDs and INDs could be treated
> independently).

I believe that part of the problem is also that it is not so clear anymore what we would like to / can achieve. What does it mean to be non-redundant? The old goal was to remove redundancy, something we can define exactly and could always achieve, albeit sometimes at the cost of inter-relational dependencies, but with INDs there's always a little bit of redundancy that we cannot really remove and probably don't want to. So can we precisely formulate the new goals? Let me give it a try.

First let me define what I mean with redundancy: a database schema allows redundancy if there is an instance of that schema where there is a a particar field of a tuple that is derivable from the remainder of the instance. In other words, if we replace the value of the field with a variable x the we can derive the value of x. I will call a schema that does not allow redundancy, redundancy-free.

I will also assume that normalization is being done by taking projections. I will get back to that later because there are reasons to also include equi-joins. I will make a distinction between liberal normalization (the dependencies in the normalized schema imply all dependencies in the original schema, but not necessarily vice versa) and conservative normalization (also vice versa).

Clearly the goal is not being completely redundancy-free, although it is of course better if we can achieve that. The type of redundancy we don't seem to mind is where INDs are copying values between different columns. It's also clear that you cannot get rid of that type of redundancy by taking projections unless you are projecting columns away completely, i.e., it does not return anywhere in the normalized schema. So we are maximally normalized if:
1. There are no columns that can be omitted, and 2. The schema is redundancy-free if we consider all derivable dependencies except the INDs.

You can generalize this for EGDs and TGDs by replacing "INDs" with "TGDs that copy values between different columns."

So can we always reach this normal form? For liberal normalization this seems straightforward, but what about conservative normalization? What about the case where we start with a single relation with only FDs?

How am I doing sofar? :-)

Final note on the type of operations we can use when normalizing; I suspect we should also allow equi-joins, or actualy any conjunctive query. Consider schema R(A,B) S(B,C) with { R[B]=>S[B}, S[B}=>R[B] }. According to the above this is fully normalized, but clearly the equivalent R'(A,B,C) with { } is a schema with less redundancy and the DMBS has less to check. But it's not clear to me at the moment how this changes the definition of the normal form. Probably adding an extra rule something like: "you cannot join relations without introducing redundancy as defined under rule 2."

  • Jan Hidders
Received on Thu May 09 2013 - 13:09:26 CEST

Original text of this message