Re: Normalisation

From: Jon Heggland <heggland_at_idi.ntnu.no>
Date: Tue, 5 Jul 2005 18:09:32 +0200
Message-ID: <MPG.1d34cf104874d4119896e0_at_news.ntnu.no>


In article <42ca8086$0$30524$ed2619ec_at_ptn-nntp-reader01.plus.net>, paul_at_test.com says...
> > You can do outer joins without nulls.
>
> An outer join is just syntactic shorthand for an inner join followed by
> a union (of the relevant missing rows). It could easily be extended to
> remove the NULLs problem by forcing the user to specify default values
> instead.

Yes, but that is still not quite satisfying. Consider an outer join of persons and dogs based on dog ownership. A person can own multiple dogs, or none, while a dog has but a single owner. An outer join is supposed to list all persons, along with the dogs they own. With default values, this would result in a relation specifying that a lot of persons own the same default dog.  

> I don't quite understand how nested relations gets round the problem,
> though. Because a nested empty relation takes the place of a NULL?

Yes. A corresponding outer join with nested relations would have *one* tuple for each person, with the set of dogs they own as a (single) attribute value. For person who don't own dogs, the nested relation would be empty; for persons with four dogs, it would contain four dog tuples. I think it is a cleaner solution.

> > Also, nested relations are used
> > heavily in the definitions of the operators for handling interval types
> > in temporal relational DBMSs (Date, Darwen and Lorentzos, "Temporal data
> > and the relational model").
>
> I confess I've not read this and can't find a link to a summary of why
> this requires nested relations. It seems to my uninformed mind that you
> can talk about time in first order logic, so what is the problem in
> dealing with temporal data using the standard relational model? I can
> understand how specialised domains like intervals might be useful, or
> even some syntactic shorthand for complicated constructs like
> overlapping periods etc., but not the need for serious amendments to the
> relational model like nested relations.

I'm not sure if they are required (they probably aren't), but they are convenient. The authors are very insistent that they indeed do not amend the RM, they just introduce syntactic shorthands. In any case, grouped (or nested) relations are used only as intermediate results to construct/explain the operators/shorthands they define.

It would be too much work for me to try to summarise their work here, but Date's "An Introduction to Database Systems" 8th ed. has a chapter about it.

> >>I agree that there is nothing to stop someone having a "relation" domain
> >>but I think this should all be encapsulated in the domain and not
> >>pollute the relational model.
> >
> > I must admit I am unsure of the significance of encapsulation here.
> > Would anyone care to explain?
>
> Just that all the domain stuff is hidden away from view. So the domain
> values are just like "black boxes" to the relational part. The domain
> part knows that the internal structure of the values are relations, but
> the relational part doesn't. I'm not sure if "encapsulation" is the
> official terminology, but I'm just making up the jargon here.

I think the terminology is just fine, I'm just wondering what the implications of a relation not being encapsulated is. We don't do anything with it except use operators on it, do we? What would be the difference between an encapsulated and a non-encapsulated relation?

> > Also, I'm not sure what the "pollution" consists of. Is it the
> > introduction of GROUP/UNGROUP (nest/unnest)operators? GROUP can be
> > defined as a shorthand for a certain kind of EXTEND, iirc---I'm not sure
> > about UNGROUP.
>
> The pollution is the breaking of the idea that the values in the columns
> are handled by the domain part of the DBMS, and are atomic from the
> point of view of the relational part. (I've used the terms "relational
> engine" and "domain/type engine" in the past to describe this.)

Well, yeah, but relation types are domains too.

> Maybe this is all just a matter of perception though, and it's not
> really important to make this separation?

That is what I am wondering. Date & co. also make that separation (though they prefer the terms "scalar" and "non-scalar") but I don't have deep understanding of the reason or the significance.

> Looking at the Third Manifesto book, it looks like their GROUP/UNGROUP
> operators are shorthand for something involving EXTEND, but also
> "RELATION", which is a type generator. What I'm not getting is why the
> relational model needs a relation type generator - shouldn't that be
> part of the domain engine and kept separate from the relational engine?

The relational engine needs to know how to perform relational operators. Relational operators are (generic) operators associated with the (generic) domain of relations. Thus, the relation engine *must* know about this domain. Indeed, you could view a relational database as nothing more than the domain of relations, and (persistent) variables of that type. (And facilities for creating other domains.)

You use a relation type generator each time you create a table, to use SQL terminology. CREATE TABLE T ( A INT ) is the declaration of a variable named T, with the type RELATION { A INT } (disregarding irrelevant details about SQL).

> > Yes, you shouldn't use relation-valued attributes just because you can.
> > In most cases, they are a bad design, which Date is quick to point out.
>
> OK, I'm going one step further here and claiming they are *always* a bad
> design. :) I'm offering no real proof of this (and I don't think one is
> really possible) other than a vague appeal to Occam's Razor.

Fair enough. :) The way I see it, they should be allowed just because disallowing them would be a special case, with weak justification.

-- 
Jon
Received on Tue Jul 05 2005 - 18:09:32 CEST

Original text of this message