Re: How to normalize this?

From: Jan Hidders <hidders_at_gmail.com>
Date: Fri, 10 May 2013 10:29:36 +0200
Message-ID: <518caff0$0$6353$e4fe514c_at_dreader35.news.xs4all.nl>


On 2013-05-10 01:52:39 +0000, Jan Hidders said:

> On 2013-05-09 18:38:09 +0000, Erwin said:
> 
>> On Thursday 9 May 2013 13:09:26 UTC+2, Jan Hidders wrote the following:

>>> On 2013-05-08 21:58:01 +0000, nvita..._at_gmail.com said:
>>>
>>>> On Wednesday, May 8, 2013 12:30:45 PM UTC+2, Erwin wrote:
>>>>
>>>>> On Tuesday 7 May 2013 16:50:03 UTC+2, nvitac..._at_gmail.com wrote the following:
>>>>>
>>>>> "And, in the general case, there is no algorithm that is guaranteed 
>>>>> to give you an answer!"
>>>>>
>>>>> In this example, it seemed to me like the "pullback rule" was 
>>>>> sufficient to find the answer.  Applying the pullback rule to a 
>>>>> finite set of explicitly stated INDs, their sources and their 
>>>>> targets, certainly seems like a feasible piece of algorithm.
>>>>>
>>>>> So what is "the general case"?
>>>>
>>>> The general case is when you allow a database schema to have FDs 
>>>> together with an arbitrary set of INDs, in particular INDs that may 
>>>> form cycles. The pullback rule is not enough to infer all the 
>>>> consequences. In fact, there is no complete recursively enumerable 
>>>> set of axioms for finite implication (that is, something like the 
>>>> Armstrong axioms for FDs). For all the details, see:
>>>>
>>>> J. C. Mitchell. The implication problem for functional and inclusion 
>>>> dependencies. Information and Control, 56:154–173, 1983.
>>>>
>>>> M. A. Casanova, R. Fagin, and C. H. Papadimitriou. Inclusion 
>>>> dependencies and their interaction with functional dependencies. 
>>>> Journal of Computer and System Sciences, 28:29–59, 1984.
>>>>
>>>> A. K. Chandra and M. Y. Vardi. The implication problem for functional 
>>>> and inclusion dependencies is undecidable. SIAM J. Comput., 
>>>> 14(3):671–677, Aug. 1985.
>>>>
>>>> I believe that these negative results have basically discouraged 
>>>> research on normal forms including INDs, although, for many 
>>>> subclasses, problems become decidable. Ling and Goh's approach, if 
>>>> my memory assists me, was to define a normal form based only on 
>>>> those dependencies derivable with the pullback rule (so, using an 
>>>> incomplete inference system); Mannila and Räihä's approach, later 
>>>> studied by Levene and Vincent, was to impose that FDs and INDs do 
>>>> not interact (so that the implication problems for FDs and INDs 
>>>> could be treated independently).

>>>
>>> I believe that part of the problem is also that it is not so clear 
>>> anymore what we would like to / can achieve. What does it mean to be 
>>> non-redundant? The old goal was to remove redundancy, something we 
>>> can define exactly and could always achieve, albeit sometimes at the 
>>> cost of inter-relational dependencies, but with INDs there's always a 
>>> little bit of redundancy that we cannot really remove and probably 
>>> don't want to. So can we precisely formulate the new goals? Let me 
>>> give it a try.
>>>
>>> First let me define what I mean by redundancy: a database schema 
>>> allows redundancy if there is an instance of that schema in which a 
>>> particular field of a tuple is derivable from the remainder of the 
>>> instance. In other words, if we replace the value of the field with 
>>> a variable x, then we can derive the value of x. I will call a 
>>> schema that does not allow redundancy redundancy-free.
>>>
>>> I will also assume that normalization is being done by taking 
>>> projections. I will get back to that later because there are reasons 
>>> to also include equi-joins. I will make a distinction between 
>>> liberal normalization (the dependencies in the normalized schema 
>>> imply all dependencies in the original schema, but not necessarily 
>>> vice versa) and conservative normalization (also vice versa).
>>>
>>> Clearly the goal is not being completely redundancy-free, although it 
>>> is of course better if we can achieve that. The type of redundancy we 
>>> don't seem to mind is where INDs are copying values between different 
>>> columns. It's also clear that you cannot get rid of that type of 
>>> redundancy by taking projections unless you are projecting columns 
>>> away completely, i.e., the column does not return anywhere in the 
>>> normalized schema. So we are maximally normalized if:
>>>
>>> 1. There are no columns that can be omitted, and
>>> 2. The schema is redundancy-free if we consider all derivable 
>>>    dependencies except the INDs.
>>>
>>> You can generalize this for EGDs and TGDs by replacing "INDs" with 
>>> "TGDs that copy values between different columns".
>>>
>>> So can we always reach this normal form? For liberal normalization 
>>> this seems straightforward, but what about conservative 
>>> normalization? What about the case where we start with a single 
>>> relation with only FDs?
>>>
>>> How am I doing so far? :-)
>>>
>>> Final note on the type of operations we can use when normalizing; I 
>>> suspect we should also allow equi-joins, or actually any conjunctive 
>>> query. Consider schema R(A,B) S(B,C) with { R[B]=>S[B], S[B]=>R[B] }. 
>>> According to the above this is fully normalized, but clearly the 
>>> equivalent R'(A,B,C) with { } is a schema with less redundancy and 
>>> the DBMS has less to check. But it's not clear to me at the moment 
>>> how this changes the definition of the normal form. Probably by 
>>> adding an extra rule, something like: "you cannot join relations 
>>> without introducing redundancy as defined under rule 2."
>>>
>>> -- Jan Hidders
>> 
>> 
>> 
>> "Consider schema R(A,B) S(B,C) with { R[B]=>S[B], S[B]=>R[B] }.
>> According to the above this is fully normalized, but clearly the
>> equivalent R'(A,B,C) with { } is a schema with less redundancy and the
>> DBMS has less to check."
>> 
>> If there is a JD *(R,S) then conventional normalization theory says 
>> "maintaining the split" is the better option.  Interesting.  So let's 
>> elaborate the two cases.
>> 
>> (A) one schema SR{A,B,C} with the constraint expressing the JD : SR === 
>> SR{A,B} JOIN SR{B,C}.  That's two INDs
>> 
>> SR SUBSETOF ( SR{A,B} JOIN SR{B,C} )
>> ( SR{A,B} JOIN SR{B,C} ) SUBSETOF SR
>> 
>> Eurhm, well, if I'm right that INDs can also include arbitrary RA expressions.
>> 
>> This is "hard" on all stakeholders (user, designer & DBMS) of this database :
>> 
>> The computations involved for the DBMS in checking the constraint are 
>> not exactly trivial.
>> Determining which access paths (indexes) need to be available for 
>> efficient checking to be possible might be not exactly trivial either.
>> If we wanted to write a DELETE statement that always passes the 
>> constraint (not very sure of how relevant this is), then deleting a 
>> tuple a1,b1,c1 will involve DELETE (SR{A,B} WHERE B=b1) JOIN (SR{B,C} 
>> WHERE B=b1) FROM SR;
>> Something similar, but utterly more horrendous, applies to INSERTs of a 
>> tuple a1,b1,c1.
>> 
>> 
>> 
>> (B) two schemata R{A,B} and S{B,C} with the constraint expressing the "no 
>> join orphans" in either schema : R{B} === S{B}.  Once again two INDs
>> 
>> R{B} SUBSETOF S{B}
>> S{B} SUBSETOF R{B}
>> 
>> The computations needed for checking these seem simpler.
>> Determining needed indexes for efficiency of checking is fairly trivial.
>> Writing INSERTs and DELETEs that are "guaranteed to pass" for a given 
>> a1,b1,c1 (once again not sure though how relevant this is) is trivial 
>> for INSERT   (just INSERT {a1,b1} INTO R, INSERT {b1,c1} INTO S; ) but 
>> substantially more intricate for DELETE.
>> 
>> 
>> 
>> All in all, it seems to me like if there is a JD between the attributes 
>> of the decomposed schemata, the conventional rule for 5NF should 
>> prevail despite the "redundancy" you considered.
> 
> Thanks for pointing that out. You are of course completely right. For 
> some reason I had forgotten that there was a join-dependency in the 
> result of the join (duh), and those are indeed hard to check for the 
> database.

… unless they are implied by keys. So the proper example should be:

R{A, B} S{B, C} with { R.B -> R.A, S.B -> S.C, R[B] => S[B], S[B] => R[B] }

which perhaps should be normalized to:

T{A, B, C} with { T.B -> T.A, T.B -> T.C }
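For what it's worth, the equivalence is easy to sanity-check on a toy instance. The sketch below (pure Python; the helper names and instance data are mine, invented for illustration, not anything from the thread) starts from an instance of T{A,B,C} satisfying B -> A and B -> C, projects it onto R{A,B} and S{B,C}, and confirms that the two INDs hold and that the join dependency, being implied by the key B, reconstructs T losslessly:

```python
def holds_fd(rel, lhs, rhs):
    """Check the FD lhs -> rhs on a list of dict-tuples."""
    seen = {}
    for t in rel:
        key = tuple(t[a] for a in lhs)
        val = tuple(t[a] for a in rhs)
        if seen.setdefault(key, val) != val:
            return False
    return True

def project(rel, attrs):
    """Projection: keep only the listed attributes, as a set of tuples."""
    return {tuple(t[a] for a in attrs) for t in rel}

# An instance of T{A,B,C} satisfying T.B -> T.A and T.B -> T.C:
T = [
    {"A": 1, "B": 10, "C": "x"},
    {"A": 2, "B": 20, "C": "y"},
    {"A": 1, "B": 30, "C": "x"},
]
assert holds_fd(T, ["B"], ["A"]) and holds_fd(T, ["B"], ["C"])

R = project(T, ["A", "B"])  # R{A,B}
S = project(T, ["B", "C"])  # S{B,C}

# The two INDs R[B] => S[B] and S[B] => R[B] hold by construction:
assert {b for _, b in R} == {b for b, _ in S}

# And because B is a key of T, the join dependency is implied:
# the natural join of the projections gives back exactly T.
joined = {(a, b, c) for (a, b) in R for (b2, c) in S if b == b2}
assert joined == {(t["A"], t["B"], t["C"]) for t in T}
```

Of course a three-tuple instance proves nothing in general; it just illustrates why the JD needs no separate checking once it is implied by the key.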

So I need to recant my recantation. :-) Joining should perhaps also be considered a normalizing operation. But it complicates the theory, so it's probably better to first try to understand normalization without it.

With respect to another comment you made: "Eurhm, well, if I'm right that INDs can also include arbitrary RA expressions."

Yes, they can. Strictly speaking normal INDs don't allow that, but we need to move here to a more general and powerful class of dependencies known as TGDs (tuple-generating dependencies) and EGDs (equality-generating dependencies). TGDs can indeed be described as INDs where, on the left-hand side, you can use in the place of the relation name a conjunctive query, i.e., an RA expression containing relation names, equality selections, cartesian products, projections and renamings (but not the other RA operations). As you already saw, that class then also generalizes MVDs and JDs.
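To make that concrete, here is a toy check of such a generalized IND (the relation names and data are invented for illustration): the left-hand side is the conjunctive query pi[A,C](R JOIN S), and the dependency just asserts that its result is contained in Q[A,C].

```python
# Instances of R{A,B}, S{B,C} and Q{A,C} as sets of tuples.
R = {(1, 10), (2, 20)}
S = {(10, "x"), (20, "y")}
Q = {(1, "x"), (2, "y"), (3, "z")}

# Evaluate the conjunctive query pi[A,C](R JOIN S): a cartesian
# product with an equality selection on B, then a projection on A, C.
lhs = {(a, c) for (a, b) in R for (b2, c) in S if b == b2}

# The generalized IND (a TGD): the query result is contained in Q[A,C].
assert lhs <= Q
```

Checking such a dependency on a concrete instance is thus just query evaluation plus a set-inclusion test; the hard problems are about implication between dependencies, not about checking a given instance.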

The EGDs generalize functional dependencies: you specify a conjunctive query and then state that two of its columns are identical. For example, the FD A->B in R means that in

SELECT[A=A'](R TIMES RENAME[A to A', B to B'](R))

it holds that B and B' always contain the same value.
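On a concrete toy instance (data invented for illustration), that check looks like this in pure Python:

```python
# An instance of R{A,B} as a set of tuples (duplicates collapse in a set).
R = {(1, "p"), (2, "q")}

# SELECT[A=A'](R TIMES RENAME[A to A', B to B'](R)): cartesian product
# of R with a renamed copy, keeping only the pairs where A = A'.
selected = [(a, b, a2, b2) for (a, b) in R for (a2, b2) in R if a == a2]

# The EGD: in every selected tuple, columns B and B' carry the same
# value -- which is exactly the statement that A -> B holds in R.
assert all(b == b2 for (_, b, _, b2) in selected)
```

If R instead contained, say, both (1, "p") and (1, "q"), the selection would produce a tuple with B = "p" and B' = "q", and the EGD (i.e., the FD) would be violated.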

-- Jan Hidders
Received on Fri May 10 2013 - 10:29:36 CEST
