Re: How to normalize this?

From: Erwin <e.smout_at_myonline.be>
Date: Thu, 9 May 2013 11:38:09 -0700 (PDT)
Message-ID: <f34c0b20-a4c3-4fb7-997e-c59cba0d23ab_at_googlegroups.com>


On Thursday, May 9, 2013 13:09:26 UTC+2, Jan Hidders wrote the following:
> On 2013-05-08 21:58:01 +0000, nvita..._at_gmail.com said:
>
> > On Wednesday, May 8, 2013 12:30:45 PM UTC+2, Erwin wrote:
> >> On Tuesday, May 7, 2013 16:50:03 UTC+2, nvitac..._at_gmail.com wrote the following:
> >>
> >> "And, in the general case, there is no algorithm that is guaranteed to
> >> give you an answer!"
> >>
> >> In this example, it seemed to me like the "pullback rule" was
> >> sufficient to find the answer. Applying the pullback rule to a finite
> >> set of explicitly stated INDs, their sources and their targets,
> >> certainly seems like a feasible piece of algorithm.
> >>
> >> So what is "the general case"?
> >
> > The general case is when you allow a database schema to have FDs
> > together with an arbitrary set of INDs, in particular INDs that may
> > form cycles. The pullback rule is not enough to infer all the
> > consequences. In fact, there is no complete recursively enumerable set
> > of axioms for finite implication (that is, something like the Armstrong
> > axioms for FDs). For all the details, see:
> >
> > J. C. Mitchell. The implication problem for functional and inclusion
> > dependencies. Information and Control, 56:154–173, 1983.
> > M. A. Casanova, R. Fagin, and C. H. Papadimitriou. Inclusion
> > dependencies and their interaction with functional dependencies.
> > Journal of Computer and System Sciences, 28:29–59, 1984.
> > A. K. Chandra and M. Y. Vardi. The implication problem for functional
> > and inclusion dependencies is undecidable. SIAM J. Comput.,
> > 14(3):671–677, Aug. 1985.
> >
> > I believe that these negative results have basically discouraged
> > research on normal forms including INDs, although, for many subclasses,
> > the problems become decidable. Ling and Goh's approach, if my memory
> > serves me, was to define a normal form based only on those
> > dependencies derivable with the pullback rule (so, using an incomplete
> > inference system); Mannila and Räihä's approach, later studied by
> > Levene and Vincent, was to impose that FDs and INDs do not interact (so
> > that the implication problems for FDs and INDs could be treated
> > independently).
>
> I believe that part of the problem is also that it is no longer so clear
> what we would like to / can achieve. What does it mean to be
> non-redundant? The old goal was to remove redundancy, something we can
> define exactly and could always achieve, albeit sometimes at the cost
> of inter-relational dependencies; but with INDs there is always a little
> bit of redundancy that we cannot really remove and probably don't want
> to. So can we precisely formulate the new goals? Let me give it a try.
>
> First, let me define what I mean by redundancy: a database schema
> allows redundancy if there is an instance of that schema in which a
> particular field of a tuple is derivable from the remainder of the
> instance. In other words, if we replace the value of the field with a
> variable x, then we can derive the value of x. I will call a schema
> that does not allow redundancy redundancy-free.
>
> I will also assume that normalization is being done by taking
> projections. I will get back to that later, because there are reasons
> to also include equi-joins. I will make a distinction between liberal
> normalization (the dependencies in the normalized schema imply all
> dependencies in the original schema, but not necessarily vice versa)
> and conservative normalization (also vice versa).
>
> Clearly the goal is not being completely redundancy-free, although it
> is of course better if we can achieve that. The type of redundancy we
> don't seem to mind is where INDs are copying values between different
> columns. It's also clear that you cannot get rid of that type of
> redundancy by taking projections unless you project a column away
> completely, i.e., it does not reappear anywhere in the normalized
> schema. So we are maximally normalized if:
> 1. There are no columns that can be omitted, and
> 2. The schema is redundancy-free if we consider all derivable
> dependencies except the INDs.
>
> You can generalize this for EGDs and TGDs by replacing "INDs" with
> "TGDs that copy values between different columns."
>
> So can we always reach this normal form? For liberal normalization this
> seems straightforward, but what about conservative normalization? What
> about the case where we start with a single relation with only FDs?
>
> How am I doing so far? :-)
>
> Final note on the type of operations we can use when normalizing; I
> suspect we should also allow equi-joins, or actually any conjunctive
> query. Consider schema R(A,B), S(B,C) with { R[B]=>S[B], S[B]=>R[B] }.
> According to the above this is fully normalized, but clearly the
> equivalent R'(A,B,C) with { } is a schema with less redundancy and the
> DBMS has less to check. But it's not clear to me at the moment how this
> changes the definition of the normal form. Probably by adding an extra
> rule, something like: "you cannot join relations without introducing
> redundancy as defined under rule 2."
>
> -- Jan Hidders
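
(An aside before I come to my actual point: to make my earlier "feasible piece of algorithm" remark concrete, here is a quick Python sketch of how I picture applying the pullback rule to a finite, explicitly stated set of FDs and INDs. The representation of FDs and INDs is entirely my own invention, and, as Jan points out, this inference is incomplete in the general case.)

# Sketch: closing a set of FDs under the "pullback rule", given explicitly
# stated INDs. Representations are my own: an FD is (relation, lhs, rhs)
# with lhs/rhs tuples of attribute names; an IND (r, r_attrs, s, s_attrs)
# means r[r_attrs] SUBSETOF s[s_attrs], matched positionally.

def pullback_closure(fds, inds):
    """From r[X] SUBSETOF s[Y] and an FD Y1 -> Y2 on s (Y1, Y2 within Y),
    infer the corresponding FD X1 -> X2 on r. Incomplete in general."""
    inferred = set(fds)
    changed = True
    while changed:
        changed = False
        for r, r_attrs, s, s_attrs in inds:
            to_r = dict(zip(s_attrs, r_attrs))  # s attribute -> r attribute
            for rel, lhs, rhs in list(inferred):
                if rel != s:
                    continue
                if not (set(lhs) <= set(s_attrs) and set(rhs) <= set(s_attrs)):
                    continue  # FD mentions attributes the IND doesn't cover
                new_fd = (r, tuple(to_r[a] for a in lhs),
                             tuple(to_r[a] for a in rhs))
                if new_fd not in inferred:
                    inferred.add(new_fd)
                    changed = True
    return inferred

# R[E,D] SUBSETOF S[Emp,Dept] plus Emp -> Dept on S yields E -> D on R:
fds = {("S", ("Emp",), ("Dept",))}
inds = [("R", ("E", "D"), "S", ("Emp", "Dept"))]
print(pullback_closure(fds, inds))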

"Consider schema R(A,B) S(B,C) with { R[B]=>S[B}, S[B}=>R[B] }. According to the above this is fully normalized, but clearly the equivalent R'(A,B,C) with { } is a schema with less redundancy and the DMBS has less to check."

If there is a JD *(R,S) then conventional normalization theory says "maintaining the split" is the better option. Interesting. So let's elaborate the two cases.

(A) one schema SR{A,B,C} with the constraint expressing the JD: SR === SR{A,B} JOIN SR{B,C}. That's two INDs:

SR SUBSETOF ( SR{A,B} JOIN SR{B,C} )
( SR{A,B} JOIN SR{B,C} ) SUBSETOF SR

Eurhm, well, that is if I'm right that INDs can also include arbitrary RA expressions.

This is "hard" on all stakeholders (user, designer & DBMS) of this database :

The computations involved for the DBMS in checking the constraint are not exactly trivial. Determining which access paths (indexes) need to be available for efficient checking to be possible might not be exactly trivial either. If we wanted to write a DELETE statement that is always guaranteed to pass the constraint (not very sure how relevant this is), then deleting a tuple (a1,b1,c1) would involve: DELETE (SR{A,B} WHERE B=b1) JOIN (SR{B,C} WHERE B=b1) FROM SR; Something similar, but utterly more horrendous, applies to INSERTs of a tuple (a1,b1,c1).
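
To make that checking burden concrete, here is a little Python sketch of what the DBMS would in effect have to verify on every update; the relation encoding and all the names are mine, not anything a real DBMS does:

# Sketch of the constraint check in case (A): SR === SR{A,B} JOIN SR{B,C}.
# A relation value here is a set of frozensets of (attribute, value) pairs.

def project(rel, attrs):
    return {frozenset((a, v) for a, v in t if a in attrs) for t in rel}

def natural_join(r, s):
    out = set()
    for t1 in r:
        for t2 in s:
            d1, d2 = dict(t1), dict(t2)
            if all(d1.get(a, d2[a]) == d2[a] for a in d2):  # agree on B
                d1.update(d2)
                out.add(frozenset(d1.items()))
    return out

def satisfies_jd(sr):
    # SR SUBSETOF (SR{A,B} JOIN SR{B,C}) holds for *every* SR, so only the
    # second inclusion can actually fail; still, the DBMS must recompute a
    # projection-join on each update.
    return natural_join(project(sr, {"A", "B"}), project(sr, {"B", "C"})) <= sr

sr = {frozenset({("A", 1), ("B", 1), ("C", 1)}),
      frozenset({("A", 2), ("B", 1), ("C", 2)})}
print(satisfies_jd(sr))  # False: the join also produces (1,1,2) and (2,1,1)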

(B) two schemata R{A,B} and S{B,C} with the constraint expressing "no join orphans" in either schema: R{B} === S{B}. Once again, two INDs:

R{B} SUBSETOF S{B}
S{B} SUBSETOF R{B}

The computations needed for checking these seem simpler. Determining the indexes needed for efficient checking is fairly trivial. Writing INSERTs and DELETEs that are "guaranteed to pass" for a given (a1,b1,c1) (once again, not sure how relevant this is) is trivial for INSERT (just INSERT {a1,b1} INTO R; INSERT {b1,c1} INTO S;) but substantially more intricate for DELETE.
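
For contrast, a Python sketch of the checks in (B), in the same made-up encoding as above: the INDs reduce to subset tests on single columns, and the "guaranteed to pass" INSERT really is just two inserts (the intricate DELETE direction is only described in a comment):

# Sketch of case (B): R{B} === S{B}, same encoding as above.

def column(rel, attr):
    return {dict(t)[attr] for t in rel}

def satisfies_inds(r, s):
    # Two single-column subset tests, i.e. set equality on B.
    return column(r, "B") == column(s, "B")

def insert_abc(r, s, a1, b1, c1):
    # The "guaranteed to pass" INSERT: {a1,b1} into R and {b1,c1} into S.
    r.add(frozenset({("A", a1), ("B", b1)}))
    s.add(frozenset({("B", b1), ("C", c1)}))

r, s = set(), set()
insert_abc(r, s, 1, 1, 1)
insert_abc(r, s, 2, 1, 2)
print(satisfies_inds(r, s))  # True
# DELETE is the intricate direction: removing (a1,b1) from R only passes if
# some other R-tuple still has B = b1, or all S-tuples with B = b1 go too.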

All in all, it seems to me that if there is a JD between the attributes of the decomposed schemata, the conventional rule for 5NF should prevail despite the "redundancy" you considered.
