Re: How to normalize this?

From: Jan Hidders <hidders_at_gmail.com>
Date: Thu, 23 May 2013 18:43:42 +0200
Message-ID: <519e473e$0$581$e4fe514c_at_dreader34.news.xs4all.nl>


On 2013-05-23 12:43:46 +0000, hugokornelis_at_gmail.com said:

> Sorry for my long absence, everyone. When there were no new replies
> after a week, I had stopped checking. And now I come back and see that
> in the mean time, a long and very interesting discussion has taken
> place.
>
> Thanks for your additional insights, Jan, Erwin, Philip, and Nicola! (I
> hope I missed nobody)
>
>
> On Wednesday, May 1, 2013 11:20:47 AM UTC+2, Jan Hidders wrote:

>> 
>> But if you have a realistic example that would show otherwise,>> that 
>> would be interesting.

>
> I do indeed have a realistic (well, somewhat) example that illustrates
> the problem that started this topic. I have already presented it
> earlier in the discussion, but I'll repeat it. And I will reword it
> slightly to address the concern, expressed in some messages, that the
> issue can be an artifact of the initial table design.
>
> As mentioned before - this is not an example I came across for real, I
> made it up to check for myself if the set of FDs I presented in the
> first message would even be possible. The example is a bit far-fetched,
> but I think realistic enough that it could be real.

No problem. Not real, but realistic, is fine. And I agree that it is that.

> A charity is organizing a big fundraiser event. Several rock bands and
> several opera singers will participate. Throughout the country, in
> different locations, there will be performances where one rock band and
> one opera singer will sing one song together. This event will last a
> whole month. On any given date in that month, there may be performances
> in different locations. At any given location, there may be multiple
> performances at different dates.
> A band can play more than one song on a given date, but not with the
> same opera singer. They might have another performance with the same
> singer at another date, either at the same or another location, and
> either with the same or another song. If a band does play with multiple
> opera singers at the same date, it may be the same or a different song.
> An opera singer can play more than one song on a given date, but not
> with the same band. They might have another performance with the same
> band at another date, either at the same or another location, and
> either with the same or another song. If an opera singe does play with
> multiple bands at the same date, it may be the same or a different song.
>
> Translating the above in functional dependencies results in: (Band,
> Singer, Date) --> Location and (Band, Singer, Date) --> Song.
>
> But there is one additional requirement. If a band plays multiple songs
> on the same date, they all have to be in the same location. And the
> same restriction goes for opera singers. Though technically possible to
> have a band or singer perform in New York and London on the same day,
> the organization lacks the travel budget for it. So now we have (Band,
> Date) --> Location and (Singer, Date) --> Location. One of the original
> FDs, (Band, Singer, Date) --> Location, is now not a full FD anymore,
> as it is implied by these two new FDs.
>
>
> Note that the entire description makes no mention at all of tables.
> It's just the functional requirements, and the resulting FDs. So there
> is no initial table design that can cause artifacts.

It's to some extent nitpicking, but FDs only make sense in the context of a single table, so in that sense you have assumed the existence of a table. However, I still think it is a good example of what you would like to demonstrate.

> As far as I can see, there are no other FDs implied by the scenario. I
> apply Bernsteins algorithm to the FDs, and get three relations:
> R1: {Band, Singer, Date, Song} with (Band, Singer, Date --> Song)
> R2: {Band, Date, Location} with (Band, Date --> Location)
> R3: {Singer, Date, Location} with (Singer, Date --> Location)
>
> The inclusion dependencies are totally obvious, but for completeness
> sake I'll mention them:
> 1: R1.(Band, Date) --> R2.(Band, Date)
> 2: R2.(Band, Date) --> R1.(Band, Date) (*)
> 3: R1.(Singer, Date) --> R3.(Singer, Date)
> 4: R3.(Singer, Date) --> R1.(Singer, Date) (*)
> (*) These two are, unfortunately, not enforcable by most popular
> RDBMS's. But let's assume for now that we have one that doesn't suffer
> from this limitation. This restriction on most RDBMS's is sufficiently
> known that I expect the application code to enforce the dependency.

These are indeed all inclusion dependencies, but not all dependencies that should be maintained. If you are conservatively normalizing (i.e. do not allow the liberalization of the meaning of the components, and want your targe schema to be infomation equivalent) you should also add the dependency that states that if a band and singer are performing a song together on a certain date, then both band and singer should be in the same location. That's representable as an EGD, but not FD or IND.

> Now suppose that Placido Domingo and Fun are scheduled to play
> "Imagine" in London on July 15, 2013. Placido calls and says he can't
> make it to London, he has to be in New York for other obligations. The
> organization has three options: cancel this performance, reschedule for
> another date, or move it to New York.
> Choice 1: Cancel. Remove the tuple from R3, and INDs 4 and 1 force you
> to remove (or do it automatically, if defined as cascading) the
> performance of Placido and Fun of Imagine, and the appearance of Fun in
> London. That is okay; this is what you would expect in this scenario if
> a singer has to cancel.
> Choice 2: Reschedule. Change the date in R3 to July 17, and again INDs
> 4 and 1 force you to change (or do it automatically, if defined as
> cascading) the date in R1 and R2 as well. So Imagine will now be played
> by Fun and Placido on May 17, and Fun will be in London at that time.
> Again, just what we want.
> Choice 3: And here's the problem. I can update the location for
> (Placido, July 15) in R3 to New York, and the RDBMS will accept the
> location. So unless someone wakes up at the right time, we will have
> Fun in London and Placido in New York for their joint performance of
> Imagine. Let's hope someone sets up the sattelite connection in time! ;)

Let's indeed. :-) But with the missing dependecy added this update would not be possbiel.

> I expect that, based on this realistic example, you'll reaffirm that
> there is no better way than this, and that I'll just need more code to
> ensure that singers and bands that are supposed to perform together
> will, actually, do so at the same location. But if this example does
> open the door to alternative solutions, I'm all ears!

With the risc of stating the obvious: strictly speaking normalization theory does not tell you what to do or what is best for you, it only tells you what potential problems exist when en what you can (or cannot) do about them. Just saying. ;-)

What I like about your example, and for me it's most important point, as that indeed there are cases where you do want to normalize conservatively. Being liberal here would mean for example that we reinterpret R2 and R2 for example as "are present but not necessarily performing", and R1 as "are scheduled to sing". But that would not even conserve all information, unless you assume that if band and singer are scheduled and both present they will indeed perfrom.

Secondly it illustrates that if you are normalizing conservatively it is actually not true that you can always reach 3NF and be dependency preserving in the sense that you don't need to introduce (rather unusual) inter-relational dependencies. The latter notion of dependency preserving is not how it is usually defined, but can under certain practical circumstances be the more relevant notion.

So, well done, and thanks. :-)

  • Jan Hidders
Received on Thu May 23 2013 - 18:43:42 CEST

Original text of this message