Re: Is an URL a good primary key? (or what else?)

From: Bob Badour <bbadour_at_pei.sympatico.ca>
Date: Wed, 06 Oct 2010 15:14:57 -0300
Message-ID: <4cacbca2$0$14800$9a566e8b_at_news.aliant.net>


Lorenzo Lazzeri wrote:

> I'm designing a database for an application which has to do with web
> URLs.
> I'll have tables to store sites attributes as well as specific pages
> and RSS/Atom feeds. I'll have millions of them.
> Everything has an URL which is by definition unique. I'm in the
> "natural key" bandwagon, but I have doubts to use the URLs themselves
> as primary key, for the reason that they can be very (very) long. The
> HTTP standard doesn't set a limit, and the smaller max size limit
> imposed by a browser is 2048 characters. So I need to have URLs field
> of at least CHAR(2048).
> If I use URL as primary key, will that size create problems of
> performance in joins?
>
> So I tried with a surrogate key, the sha1() hash of the URL. It is
> very useful, still long 40 chars, but it has all the problems of
> surrogate key: it says nothing about the data it refers to if taken by
> itself, and it needs some triggers and on update cascade costraints to
> keep the whole schema coherent if something (the url...) changes.
> Moreover, I need a further key on the url to speed up searches.
>
> I want to avoid auto incremental id for the well known reasons:
> moreover it has to do with the data even less than the hash (i.e.
> nothing).
>
> So I'm looking for and advice on it. Should I use URLs as primary key
> or what else? Is that performance problem on join really an issue or
> can I ignore it? May be because the index is already treated as an
> hash internally by the db?
>
> In the meantime I'm exploring another solution. Since I'll have at
> least seven tables with URL as discriminant (and couting as I add
> features), maybe I should remove the URL from that tables and store
> them in another table as schema, domain, port, path, query_string,
> auth_user, auth_password. Then I could use the url hash on the other
> tables as foreign key as well as primary key. Am I totally wrong on
> this way?
>
> For the sake of completeness I'm designing for MySQL 5.1, even if I'd
> like to avoid non-standard constructs.
>
> Thank you in advance for any answers.
>
> Lorenzo Lazzeri
>
> P.S. Please forgive my english syntax and don't hesitate to ask me
> clarifications where needed.

Lorenzo, the design criteria for primary keys are: uniqueness, irreducibility, stability, simplicity and familiarity.

Your URLs are unique (so you say), irreducible and familiar. Are they simple and stable?

While a sha1 hash is almost as unique as the original URL, it is not quite as unique, and you have already noted such a hash is not familiar. The hash is neither more stable nor less reducible than the original URL. The question then becomes not only "Is the hash simpler than the URL", but "Is the hash sufficiently simpler to compensate for the loss of familiarity and uniqueness?"

I am unaware of any well known reasons to avoid auto increment numbers. What are they? Received on Wed Oct 06 2010 - 20:14:57 CEST

Original text of this message