Re: candidate keys in abstract parent relations

From: Forrest L Norvell <spankysyourpal_at_gmail.com>
Date: 20 Jan 2006 12:14:42 -0800
Message-ID: <1137788082.530515.284880_at_o13g2000cwo.googlegroups.com>


-CELKO- wrote:
> Observation:
>
> Nobody here proposed looking at Industry Standards. Why do we like to
> re-invent the wheel?

Well, yeah, except for maybe me. If you go back and take a closer look at my original post (which I'll admit is prolix, as my technical writing generally is), you'll see that I specified the industry-related codes which I have considered, and rejected, for use in this model. Just a quick, enumerated recap:

  1. The primary identity used by the international intellectual property establishment (which transcends the RIAA and the various music licensors) is the ISRC. I found two main weaknesses for using it for my purposes:
  2. It is attached at the track, rather than the album level.
  3. While the ISRC is supposed to be unique per track, ISRCs are recycled between albums if the label believes the track to be identical. This is a problem because labels frequently hide uncredited edits of the tracks, so you can have tracks with identical ISRCs but which differ by minutes in length. To use ISRCs as part of a candidate key, you need Tracks(isrc, duration), and not Tracks(isrc) alone.
  4. The primary identity used by music retailers and distributors is the ISBN / UPC / EAN associated with the barcode(s) on the release. Anyone familiar with buying and selling independent media knows that the glaring problem with using that as an identifier for releases is the plethora of independent labels (and distributors) who are free-wheelin' and fun-lovin' and can't be bothered with obtaining either. As DIY distribution has grown more sophisticated in the Internet age, this has become less prevalent, but there's still a HUGE volume of music released without these codes.
  5. Musicbrainz has taken a whack at providing unique identifiers for both albums (independent of release status, which is exactly what I'm looking for) and tracks. It's a noble effort to bring some stability to the chaotic sea of musical metadata on the Internet (Musicbrainz was invented to help identify random audio files you have sitting on your hard drive), but it's not an authoritative source. Also, it uses GUIDs, and you yourself said GUIDs were a bad choice for keys, Mr. Celko.
  6. Various libraries (including the Library of Congress) have defined categorization schemas for audio recordings. They're useful for classification but collapse under the weight of popular music, do not cover much music actually actually released and consumed by average listeners, and are designed for classification and not identification.
  7. Muze, Gracenote and the FreeDB project all provide lookup services for CDs (which is to say, releases rather than albums). They use the embedded table of contents on the CD to define identity (although anyone who's used one of the covered services knows that there are disc collisions all the time), but the identifiers they use are, again, basically surrogate keys into their own schemas. Also, none of them even pretend to have uniqueness or sound relational principles as a design goal. And some of them cost money to use.

> But why did Roy
> miss the entire music industry? This is the guy who knows about
> additive congruential methods of generating values in pseudo-random
> order!

Roy's answers have been extremely helpful to me. Without everyone's input, I never would have arrived at my present schema, which I think is going to work. When I get it a little fleshed out, I'll post the DDL for the relevant tables for everyone's dissection / approval / castigation.

> I offer that we are still thinking like "procedural coders" and NOT
> like "database people" instead.

I submit that this has nothing to do with procedural vs relational thinking. I included the results of my research in my original post, which while not exhaustive are certainly pretty extensive, to show that I had hit a wall because nobody's really tried to solve this problem exactly the same way as me. After my research, I realized I was still screwed. Also, even though I don't think it's particularly relevant, my day job is working for a major music download service, so I'm working closely with a real-world schema dealing with similar data on a daily basis. My day job uses auto-generated sequence IDs like crazy, because it turns out defining this model is hard. Received on Fri Jan 20 2006 - 21:14:42 CET

Original text of this message