Re: How to normalize this?

From: Jan Hidders <hidders_at_gmail.com>
Date: Fri, 10 May 2013 12:39:16 +0200
Message-ID: <518cce54$0$6343$e4fe514c_at_dreader35.news.xs4all.nl>


On 2013-05-09 11:09:26 +0000, Jan Hidders said:

> On 2013-05-08 21:58:01 +0000, nvitacolonna_at_gmail.com said:
>

>> On Wednesday, May 8, 2013 12:30:45 PM UTC+2, Erwin wrote:
>>> Op dinsdag 7 mei 2013 16:50:03 UTC+2 schreef nvitac..._at_gmail.com het volgende:
>> 
>>> "And, in the general case, there is no algorithm that is guaranteed to 
>>> give you an answer!"
>>> 
>>> In this example, it seemed to me like the "pullback rule" was 
>>> sufficient to find the answer.  Applying the pullback rule to a finite 
>>> set of explicitly stated INDs, their sources and their targets, 
>>> certainly seems like a feasible piece of algorithm.
>>> 
>>> So what is "the general case" ?
>> 
>> The general case is when you allow a database schema to have FDs 
>> together with an arbitrary set of INDs, in particular INDs that may 
>> form cycles. The pullback rule is not enough to infer all the 
>> consequences. In fact, there is no complete recursively enumerable set 
>> of axioms for finite implication (that is, something like Armstrong 
>> axioms for FDs). For all the details, see:
>> 
>> J. C. Mitchell. The implication problem for functional and inclusion 
>> dependencies. Information and Control, 56:154–173, 1983.
>> M. A. Casanova, R. Fagin, and C. H. Papadimitriou. Inclusion 
>> dependencies and their interaction with functional dependencies. 
>> Journal of Computer and System Sciences, 28:29–59, 1984.
>> A. K. Chandra and M. Y. Vardi. The implication problem for functional 
>> and inclusion dependencies is undecidable. SIAM J. Comput., 
>> 14(3):671–677, Aug. 1985.
>> 
>> I believe that these negative results have basically discouraged 
>> research on normal forms including INDs, although, for many subclasses, 
>> problems become decidable. Ling and Goh's approach, if my memory 
>> assists me, was to define a normal form based only on those 
>> dependencies derivable with the pullback rule (so, using an incomplete 
>> inference system); Mannila and Räihä's approach, later studied by 
>> Levene and Millist, was to impose that FDs and INDs do not interact (so 
>> that the implication problems for FDs and INDs could be treated 
>> independently).

>
> I believe that part of the problem is also that it is not so clear
> anymore what we would like to / can achieve. What does it mean to be
> non-redundant? The old goal was to remove redundancy, something we can
> define exactly and could always achieve, albeit sometimes at the cost
> of inter-relational dependencies, but with INDs there's always a little
> bit of redundancy that we cannot really remove and probably don't want
> to. So can we precisely formulate the new goals? Let me give it a try.
>
> First let me define what I mean with redundancy: a database schema
> allows redundancy if there is an instance of that schema where there is
> a a particar field of a tuple that is derivable from the remainder of
> the instance. In other words, if we replace the value of the field with
> a variable x the we can derive the value of x. I will call a schema
> that does not allow redundancy, redundancy-free.
>
> I will also assume that normalization is being done by taking
> projections. I will get back to that later because there are reasons to
> also include equi-joins. I will make a distinction between liberal
> normalization (the dependencies in the normalized schema imply all
> dependencies in the original schema, but not necessarily vice versa)
> and conservative normalization (also vice versa).
>
> Clearly the goal is not being completely redundancy-free, although it
> is of course better if we can achieve that. The type of redundancy we
> don't seem to mind is where INDs are copying values between different
> columns. It's also clear that you cannot get rid of that type of
> redundancy by taking projections unless you are projecting columns away
> completely, i.e., it does not return anywhere in the normalized schema.
> So we are maximally normalized if:
> 1. There are no columns that can be omitted, and
> 2. The schema is redundancy-free if we consider all derivable
> dependencies except the INDs.
>
> You can generalize this for EGDs and TGDs by replacing "INDs" with
> "TGDs that copy values between different columns."
>
> So can we always reach this normal form? For liberal normalization this
> seems straightforward, but what about conservative normalization? What
> about the case where we start with a single relation with only FDs?

To answer my own question: no, you cannot. Consider the example that started this thread:

R{A, B, C, D, E} with { AB -> E, BC -> E }

Suppose we split into R1{A, B, C, D} and R2{A, B, E}. Consider the following instance:

R1:
a1 b1 c1 d1
a2 b1 c1 d2

R2:
a1 b1 e1
a2 b1 e1(*)

The marked e1 is derivable. The case for splitting off the other dependency is the same.

The normal form does become reachable if we switch to semi-conservative normalization where in the normalized schema we require in the normalized schema not all dependencies derivable from those in the original schema, but only the local ones. That seems to be the position that conventional normalization is taking, or at least the Bernstein synthesis algorithm. Note that under that regime (semi-conservative normalization) you also cannot fully normalize the standard 3NF-but-not-BCNF example: R{A, B, C} with { AB -> C, C -> B }.

  • Jan Hidders
Received on Fri May 10 2013 - 12:39:16 CEST

Original text of this message