candidate keys in abstract parent relations

From: Forrest L Norvell <spankysyourpal_at_gmail.com>
Date: 17 Jan 2006 19:50:43 -0800
Message-ID: <1137556243.101621.119250_at_o13g2000cwo.googlegroups.com>



Hi. I'm a newbie data modeler, and I'm having a tough time with what seems like a simple data modeling / identity modeling problem, but which has been giving me fits.

For the purposes of this discussion, I need four relations, working from the "top" down: Albums are put out in the form of one or more Releases, and each Album consists of one or more Tracks. A Track is a particular instance of a Recording, so each Recording may appear as different Tracks on multiple Albums.

  1. A Release is an Album that is put out on a Release Date by a Label in a Locale.
  2. Because the Album's identity is the main tough spot, let's leave that alone in this definition, except to say that I'm struggling valiantly to avoid using a surrogate key.
  3. Because Labels reissue Albums all the time, the Release Date is necessary.
  4. While the Label is generally different from Locale to Locale, there are instances in which the same Label releases the same Album on different Release Dates in different Locales (and yes, people really do care about this stuff).

So the candidate key for Releases is Release("Album", Label, Locale, Release Date). (Please correct me if I'm using the notation incorrectly.)

2. The Recording reaches outside of the subset of the model we're discussing here, so assume it has a straightforward natural key as well.

3. The Track is really an association between Recordings and Albums with some additional attributes (track length, credited performer, credited title, etc). It also has a possible unique ID in the ISRC
(International Standard Recording Code), which is issued for pretty
much every track available on a semi-modern format (but not, of course, on bootlegs, which I would also like my model to encompass).

4. The hard part: An Album is a collection of Tracks by a Credited Performer with a Title. The difficulty comes in because I'm trying to be scrupulous, and defining uniqueness on those attributes is proving to be very difficult.

  1. Often, singles aren't titled, although the convention is to use a concatenation of the track titles separated by slashes, but since singles are sometimes explicitly titled, I think you'll agree using a convention here doesn't cut it.
  2. Even worse, consider the case of the first four Peter Gabriel records, all of which were entitled "Peter Gabriel" (there's also the matter of the fact that several of those had both English-language and German-language versions, but that's a separate issue). The important part isn't that the Artist Name matches the Album Name, but that the four Album Names are the same (or, perhaps, absent).

Therefore, Album(Name, Credited Performer) is not guaranteed to be unique.

Releases often do have generated unique IDs (aka ISBN / UPC / EAN codes), but not all of them, and these are tied to a specific Release. "Original release dates" are a functional dependency of the earliest known Release, and "recording date" is a functional dependency of the Recording's "Session", and doesn't work for many kinds of albums (i.e. compilations, best-ofs) anyway.

Upon thinking about it, Albums and Tracks are defined in terms of each other, which is to say the uniqueness of the Album is defined by the collection of Tracks associated with it, and those tracks don't really exist independent of the album. Also, as the collection of tracks can theoretically be entirely distinct between two Albums with the same name (which comes remarkably close to happening sometimes in reality), properties like the number of tracks on the album are not only a functional dependency, but not guarantors of uniqeuness.

There have been at least a few attempts to solve this problem within the recording industry, but I'm not really satisfied with any of them. The Library of Congress has its own system for filing sound recordings, but as far as I'm concerned as an end user with no ability to generate my own LoC classification codes, that's just another generated key
(also, the LoC doesn't even try to catalog all sound recordings).
Musicbrainz (http://www.musicbrainz.org/) uses GUIDs for both Albums and Tracks, and I'd rather just use my own semi-semantic sequence generator (based on one of Celko's non-repeating, pseudorandom sequence generators) than go with GUIDs, although I might stash away the Musicbrainz GUID as an attribute, purely to integrate my application transparently with Musicbrainz (I personally think my data model is richer and cleaner than theirs).

Is there a method to generate a primary key based on entities that are children of a parent relation (and if there is, is it all a good idea to use it?)? Is the Album an overly-abstract construct that should be pushed into the Tracks and Releases? Or am I being overly finicky and should I just write a sequence generator / use an autoincrementing column? In the real world, labels issue the same Album at different times on different formats in different Locales that are struck from the same masters with the same name (but as separate Releases) and the same Recordings on the Release, and I think most people would intuitively understand the relationships as I've sketched them out above, as well as intuitively understanding that two albums with the same title and artist can, on occasion, contain different Tracks, and therefore do not share the same identity.

Any hints on what I should look at next? Received on Wed Jan 18 2006 - 04:50:43 CET

Original text of this message