Re: How to normalize this?

From: Jan Hidders <hidders_at_gmail.com>
Date: Fri, 10 May 2013 17:34:34 +0200
Message-ID: <518d138a$0$6351$e4fe514c_at_dreader35.news.xs4all.nl>


On 2013-05-10 11:24:35 +0000, nvitacolonna_at_gmail.com said:

>> 
>> The old goal was to remove redundancy, something we can define exactly
>> and could always achieve, albeit sometimes at the cost of
>> inter-relational dependencies, but with INDs there's always a little
>> bit of redundancy that we cannot really remove and probably don't want
>> to.

>
> Indeed.
>
>> First let me define what I mean with redundancy: a database schema
>> allows redundancy if there is an instance of that schema where there
>> is a particular field of a tuple that is derivable from the remainder
>> of the instance. In other words, if we replace the value of the field
>> with a variable x then we can derive the value of x. I will call a
>> schema that does not allow redundancy, redundancy-free.
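To make that definition concrete, here is a small sketch of a checker for this notion of redundancy with respect to FDs only (the relation, attribute names and values are made up for illustration):

```python
# Sketch: detect "value redundancy" w.r.t. a set of FDs. A field t[B] of
# tuple t is derivable if some other tuple t' agrees with t on the
# left-hand side X of an FD X -> B: replacing t[B] with a variable x,
# the FD forces x = t'[B].

def redundant_fields(instance, fds):
    """instance: list of dicts; fds: list of (lhs_attrs, rhs_attr) pairs."""
    hits = []
    for i, t in enumerate(instance):
        for j, u in enumerate(instance):
            if i == j:
                continue
            for lhs, rhs in fds:
                if all(t[a] == u[a] for a in lhs):
                    hits.append((i, rhs))  # t[rhs] is derivable from u
    return sorted(set(hits))

# With FD course -> lecturer, each 'Smith' is derivable from the other tuple:
rows = [
    {"student": "ann", "course": "db", "lecturer": "Smith"},
    {"student": "bob", "course": "db", "lecturer": "Smith"},
]
print(redundant_fields(rows, [(("course",), "lecturer")]))
# -> [(0, 'lecturer'), (1, 'lecturer')]
```

A schema is redundancy-free in this sense if no legal instance produces any such hit.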

>
> Your definition is the so-called “value redundancy”, which
> characterizes BCNF. It has already been proved that any database schema
> with non-trivial INDs cannot be devoid of value redundancy (see Levene
> & Vincent's paper, Lemma 4.2). In that same paper, they give a more
> appropriate definition of redundancy and they prove that IDNF
> (inclusion dependency normal form) is equivalent to avoiding that kind
> of redundancy (a database schema in IDNF also avoids insertion and
> modification anomalies, in a sense that can be precisely defined). So,
> I think that this direction has been already explored and, unless one
> disagrees with the definitions, IDNF captures the salient notions
> adequately.
>
> For completeness, a database schema with a set of FDs F and a set of
> INDs I is in IDNF iff it is in BCNF with respect to F and the INDs
> represent acyclic referential integrity constraints (where the targets
> are keys, not super-keys).
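Just to make the structural part of that definition concrete: viewing the INDs as edges between relation names, acyclicity can be checked with an ordinary depth-first search. A toy sketch (relation names made up; the BCNF condition and the "targets are keys" condition are not modelled here):

```python
# Sketch: check only the acyclicity requirement of IDNF. Each IND
# R[X] <= S[Y] is abstracted to a directed edge (R, S); the schema
# passes this part of the test iff the resulting graph has no cycle.

def inds_acyclic(inds):
    """inds: list of (source_relation, target_relation) pairs."""
    graph = {}
    for src, tgt in inds:
        graph.setdefault(src, set()).add(tgt)

    WHITE, GREY, BLACK = 0, 1, 2
    colour = {}

    def visit(node):
        # a GREY node reached again means a cycle through the DFS stack
        colour[node] = GREY
        for nxt in graph.get(node, ()):
            c = colour.get(nxt, WHITE)
            if c == GREY or (c == WHITE and not visit(nxt)):
                return False
        colour[node] = BLACK
        return True

    return all(visit(n) for n in graph if colour.get(n, WHITE) == WHITE)

# Enrolment[course] <= Course[code] is fine on its own; adding the
# reverse IND creates a cycle, taking the schema out of IDNF.
print(inds_acyclic([("Enrolment", "Course")]))                           # True
print(inds_acyclic([("Enrolment", "Course"), ("Course", "Enrolment")]))  # False
```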

You are right (I have looked at that paper in the past, but apparently had forgotten the details), but I don't agree that this is a closed subject. By no means. Note that I am, in a sense, trying to generalize what they did to larger classes of dependencies. I have to do that because I want to address the question of achievability of the normal forms (under liberal/semi-conservative/conservative normalization, as I called them). Under conservative normalization it can happen that, starting from a schema with FDs and INDs, I end up with a schema with more complex TGDs than just INDs. Therefore I need to be able to take those into account.

What is also missing in their work is a characterization of when you can get to IDNF, and an algorithm that gets you there when this is possible. I realize that, strictly speaking, such an algorithm cannot exist because of the undecidability of reasoning over the dependencies in question, but it would still be interesting to see whether one exists if we assume an oracle that can do the reasoning for us. It might be practical for certain restricted sets of dependencies where inference is computable. Maybe this is the case, for example, for the dependencies implied by conservative normalization when starting out with only FDs.

Note that the type of analysis I want to do is the kind that tells you, for example, that even if you start with just FDs, (generalized) RFNF / BCNF is not always reachable, assuming you normalize conservatively. That actually came as a bit of a surprise to me. I had always assumed you could reach a redundancy-free normal form if you were willing to pay the price of a non-local dependency, but that is actually not true. So what about the cases where you can? Can I use the usual "split off the violating dependencies" strategy without backtracking? It's that type of question that I'd like to investigate.
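For readers who want the textbook obstacle made concrete: the classic schema R(A, B, C) with F = {AB -> C, C -> B} has a BCNF-violating FD that no dependency-preserving decomposition can eliminate. Detecting the violation is just an attribute-closure computation, sketched here (encoding of FDs as tuples is my own, for illustration):

```python
# Sketch: find BCNF violations via attribute closure. An FD X -> Y
# violates BCNF on schema R iff the closure of X under F is not all of R
# (i.e. X is not a superkey).

def closure(attrs, fds):
    """Attribute closure of `attrs` under a list of (lhs, rhs) FDs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def bcnf_violations(schema, fds):
    """Return the FDs whose left-hand side is not a superkey of `schema`."""
    return [(lhs, rhs) for lhs, rhs in fds
            if closure(lhs, fds) != set(schema)]

F = [(("A", "B"), ("C",)), (("C",), ("B",))]
print(bcnf_violations(("A", "B", "C"), F))   # C -> B is the lone violation
```

Splitting off C -> B yields R1(C, B) and R2(A, C), but then AB -> C becomes a non-local dependency; that is exactly the price referred to above.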

>> So we are maximally normalized if:
>> 1. There are no columns that can be omitted, and
>> 2. The schema is redundancy-free if we consider all derivable
>> dependencies except the INDs.

>
> It would be interesting to verify how this compares with the above. I am
> not sure whether I understand exactly what you mean by 2.

It generalizes RFNF. Note that "INDs" should be replaced with "TGDs and EGDs that copy/equate values between different columns", where "different" is defined as "is in a different relation or has a different name, or both". (I forgot about EGDs in the previous posting, but they should also be considered.) This happens in a TGD if a variable appears in the head in a different column than it does in the tail, and in an EGD if the two variables in the equation in the head appear in different columns in the tail. Note this means that FDs, JDs and MVDs stay.
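Under a simplified encoding of TGDs (atoms as relation/column-to-variable maps; the representation is made up for this sketch, not standard notation), the "copies values between different columns" test looks like this:

```python
# Sketch: a TGD "copies values between different columns" iff some head
# variable occurs at a (relation, column) position it never occupies in
# the tail. Atoms are encoded as (relation_name, {column: variable}).

def copies_between_columns(tail, head):
    tail_pos = {}
    for rel, cols in tail:
        for col, var in cols.items():
            tail_pos.setdefault(var, set()).add((rel, col))
    for rel, cols in head:
        for col, var in cols.items():
            if var in tail_pos and (rel, col) not in tail_pos[var]:
                return True
    return False

# An IND R[a] <= S[b] copies x from column R.a to column S.b:
ind = ([("R", {"a": "x"})], [("S", {"b": "x"})])
print(copies_between_columns(*ind))   # True

# An MVD written as a TGD keeps every variable in its own column:
mvd = ([("R", {"a": "x", "b": "y1", "c": "z1"}),
        ("R", {"a": "x", "b": "y2", "c": "z2"})],
       [("R", {"a": "x", "b": "y1", "c": "z2"})])
print(copies_between_columns(*mvd))   # False
```

The EGD case is analogous: check whether the two equated head variables occupy different (relation, column) positions in the tail. This is why FDs, JDs and MVDs survive the filter while INDs do not.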

So instead of the original set of dependencies we consider a set that implies the same set of dependencies except those that copy/equate values between different columns. So INDs, for example, are ignored, and in that sense it is more liberal than IDNF. The reasoning here is that the redundancy caused by these dependencies cannot be removed by taking projections anyway, except if you can omit some of the involved columns. So it makes sense to ignore that type of redundancy.

>> Final note on the type of operations we can use when normalizing; I
>> suspect we should also allow equi-joins, or actually any conjunctive
>> query.

> So, your starting point is an arbitrary set of schemas instead of one,
> to which you add FDs and inter-relational constraints as per your
> requisites, and then you apply some rule to split as well as merge the
> schemas. Sounds challenging :)

That's because I'm first trying to establish what the problem is that we *should* be solving. In practice you would probably like to join if it means you can get rid of a few INDs without introducing new redundancy. However, I will no doubt start making simplifications once I really start to investigate this, and assuming that we only do projections would very likely be one of them. :-) But I think it needs to be made clear that this is indeed a simplifying assumption.

> Btw, is there any intrinsic difference in considering several schemas
> instead of a single universal schema, as far as this process is
> concerned?

I don't know. The only way of finding out is to formulate both cases and see if there is a difference. It is not a priori clear to me whether, if I focus on a single relation within a database schema, its normalization will be influenced by non-local dependencies, especially since I'm considering such large classes of them.

  • Jan Hidders
Received on Fri May 10 2013 - 17:34:34 CEST
