candidate keys in abstract parent relations
Date: 17 Jan 2006 19:50:43 -0800
Message-ID: <1137556243.101621.119250_at_o13g2000cwo.googlegroups.com>
Hi. I'm a newbie data modeler, and I'm having a tough time with what seems like a simple data modeling / identity modeling problem, but which has been giving me fits.
For the purposes of this discussion, I need four relations, working from the "top" down: Albums are put out in the form of one or more Releases, and each Album consists of one or more Tracks. A Track is a particular instance of a Recording, so each Recording may appear as different Tracks on multiple Albums.
- A Release is an Album that is put out on a Release Date by a Label in a Locale.
- Because the Album's identity is the main tough spot, let's leave that alone in this definition, except to say that I'm struggling valiantly to avoid using a surrogate key.
- Because Labels reissue Albums all the time, the Release Date is necessary.
- While the Label is generally different from Locale to Locale, there are instances in which the same Label releases the same Album on different Release Dates in different Locales (and yes, people really do care about this stuff).
So the candidate key for Releases is Release("Album", Label, Locale, Release Date). (Please correct me if I'm using the notation incorrectly.)
2. The Recording reaches outside of the subset of the model we're discussing here, so assume it has a straightforward natural key as well.
3. The Track is really an association between Recordings and Albums
with some additional attributes (track length, credited performer,
credited title, etc). It also has a possible unique ID in the ISRC
(International Standard Recording Code), which is issued for pretty
much every track available on a semi-modern format (but not, of course,
on bootlegs, which I would also like my model to encompass).
- Often, singles aren't titled, although the convention is to use a concatenation of the track titles separated by slashes, but since singles are sometimes explicitly titled, I think you'll agree using a convention here doesn't cut it.
- Even worse, consider the case of the first four Peter Gabriel records, all of which were entitled "Peter Gabriel" (there's also the matter of the fact that several of those had both English-language and German-language versions, but that's a separate issue). The important part isn't that the Artist Name matches the Album Name, but that the four Album Names are the same (or, perhaps, absent).
Therefore, Album(Name, Credited Performer) is not guaranteed to be unique.
Releases often do have generated unique IDs (aka ISBN / UPC / EAN codes), but not all of them, and these are tied to a specific Release. "Original release dates" are a functional dependency of the earliest known Release, and "recording date" is a functional dependency of the Recording's "Session", and doesn't work for many kinds of albums (i.e. compilations, best-ofs) anyway.
Upon thinking about it, Albums and Tracks are defined in terms of each other, which is to say the uniqueness of the Album is defined by the collection of Tracks associated with it, and those tracks don't really exist independent of the album. Also, as the collection of tracks can theoretically be entirely distinct between two Albums with the same name (which comes remarkably close to happening sometimes in reality), properties like the number of tracks on the album are not only a functional dependency, but not guarantors of uniqeuness.
There have been at least a few attempts to solve this problem within
the recording industry, but I'm not really satisfied with any of them.
The Library of Congress has its own system for filing sound recordings,
but as far as I'm concerned as an end user with no ability to generate
my own LoC classification codes, that's just another generated key
(also, the LoC doesn't even try to catalog all sound recordings).
Musicbrainz (http://www.musicbrainz.org/) uses GUIDs for both Albums
and Tracks, and I'd rather just use my own semi-semantic sequence
generator (based on one of Celko's non-repeating, pseudorandom sequence
generators) than go with GUIDs, although I might stash away the
Musicbrainz GUID as an attribute, purely to integrate my application
transparently with Musicbrainz (I personally think my data model is
richer and cleaner than theirs).