Re: How to normalize this?

From: Jan Hidders <hidders_at_gmail.com>
Date: Sun, 12 May 2013 19:16:59 +0200
Message-ID: <518fce8b$0$6063$e4fe514c_at_dreader36.news.xs4all.nl>


On 2013-05-10 15:34:34 +0000, Jan Hidders said:

> On 2013-05-10 11:24:35 +0000, nvitacolonna_at_gmail.com said:
> 
>>> 
>>> The old goal was to remove redundancy, something we can> define exactly 
>>> and could always achieve, albeit sometimes at the cost> of 
>>> inter-relational dependencies, but with INDs there's always a little> 
>>> bit of redundancy that we cannot really remove and probably don't want> 
>>> to.

>>
>> Indeed.
>>
>>> First let me define what I mean with redundancy: a database schema> 
>>> allows redundancy if there is an instance of that schema where there 
>>> is> a a particar field of a tuple that  is derivable from the remainder 
>>> of> the instance. In other words, if we replace the value of the field 
>>> with> a variable x the we can derive the value of x. I will call a 
>>> schema>that does not allow redundancy, redundancy-free.

>>
>> Your definition is the so-called “value redundancy”, which
>> characterizes BCNF. It has already been proved that any database schema
>> with non trivial INDs cannot be devoid of value redundancy (see Levene
>> & Millist's paper, Lemma 4.2). In that same paper, they give a more
>> appropriate definition of redundancy and they prove that IDNF
>> (inclusion dependency normal form) is equivalent to avoiding that kind
>> of redundancy (a database schema in IDNF also avoids insertion and
>> modification anomalies, in a sense that can be precisely defined). So,
>> I think that this direction has been already explored and, unless one
>> disagrees with the definitions, IDNF captures the salient notions
>> adequately.
>>
>> For completeness, a database schema with a set of FDs F and a set of
>> INDs I is in IDNF iff it is in BCNF with respect to F and the INDs
>> represent acyclic referential integrity constraints (where the targets
>> are keys, not super-keys).
> 
> You are right (I have looked in the past at that paper, but apparently 
> had forgotten about the details) but I don't agree that this is closed 
> subject. By no means. Note that I am in a sense trying to generalize 
> what they did for larger classes of dependencies. I have to do that 
> since I want to address the questions of achievabality of the normal 
> forms (under liberal/semi-conservative/conservative normalization, as I 
> called them).  Under conservative normalization it can happen that 
> starting from a schema with FDs and INDs I end up with a schema with 
> more complex TGDs then just INDs. Therefore I need to be able to take 
> those into account.

At the risk of turning this group into a monologue, let me explain this a bit more.

The class of dependencies that Levene and Vincent are considering consists of FDs and INDs. But that class is not closed under decomposition when conservatively normalizing. Let me explain what I mean by that. Assume we have a schema R(A, B, C) with { R:A->B, R:B-> C }. If I split into R1(A, B) and R2(A, C) which dependencies would hold there. If I just consider those in FD+IND I get: { R1:A->B, R2:A->C, R1[A]=>R2[A}, R2[A]=>R1[A] }. (I'm using here => to indicate the IND.) But to be really information equivalent the schema should have an additional dependency that represents the old dependency R:B->C. So we need a larger class of dependencies such that if the original schema is normalized by splitting relations the result can always be turned into an information equivalent schema by simply adding all dependencies in that class that follow from the dependencies in the original schema. It is in that sense that the class of dependencies needs to be closed under decomposition.

Another requirement is that the class of dependencies should also be closed under removal of redundant columns and relations. Note that this is posible if we allow in our normalization procedure that the new schema is defined as a set of relations where each is a projection of one of the old relations. This is subtly different from the more usual definition where the new schema is equal to the old except that some relations are decomposed into projections such that these cover all columns of the original relation. The latter definiition does not allow the removal of redundant columns and relations.

An interesting (large) class that seems to satisfy these requirements is the class containing EGDs and TGDs, but there might be also smaller classes possible. For example, the class where we generalize FDs to Q[X->Y] where Q is an algebra expression consisting of natural joins and projections, and it has the meaning the for the result of Q the FD X->Y holds. Likewise we generalize INDs to Q1=>Q1 with the semantics that the result of Q1 is a subset of Q2. I will call these gFDs and gINDs. Note that strictly speaking gINDs do not generalize INDs because we can only have relationships between columns with the same name. (So R[A,B]=>S[A,C] is not allowed, the result of the two queries must have the seam header.) This class seems sufficient if you start of with just one relation and only dependencies of this class. It might even be possible that reasoning over these dependencies is decidable.

Anyway, the problem then is that the trick of L&V to separate the FDs and INDs in the definition of the normal form does not work anymore because the gINDs sometimes express dependencies that cause local redundancy. In particular they allow us to express MVDs and JDs: the JD *(AB, AC) for relation R can be expressed as R[A,B] * R[A,C] => R where * represents the natural join. So which gINDs need to be considered when checking value-redundancy, and which can we ignore? In this case its simple: for checking value-dependency you only consider the gINDs that copy value to a column in a relation only if it was already in that column. E.g., (R[A,B] * S[B,C])[A,B] => R[A,B] should be considered but (R[A,B] * S[B,C])[A,B] => S[A,B] should not. The coneptual / philosophica justification of that is that redundancy that is caused by such dependencies cannot be resolved by taking projections anyway.

  • Jan Hidders
Received on Sun May 12 2013 - 19:16:59 CEST

Original text of this message