Re: Primary Key Theory Question

From: D Guntermann <guntermann_at_hotmail.com>
Date: Tue, 30 Mar 2004 03:55:07 GMT
Message-ID: <HvDEvv.IEH_at_news.boeing.com>


"Bob Hairgrove" <wouldnt_you_like_at_to_know.com> wrote in message news:9aie60p88f5p07u0osaud08km55aq03cri_at_4ax.com...
> On Sun, 28 Mar 2004 18:56:03 GMT, Ben <BenNOSPAM_at_NOSPAMknieff2.com>
> wrote:
>
> >Hello group-
> >I am re-working the structure of a database, and I have decided to do
> >the design work as 'properly' as possible. So, I've been reading through
> >posts in this group for most of a day. While I wait for Celko's books to
> >arrive, I thought I'd put a question to the gurus here.
>
> Although I'm certainly not a "guru", I've been down this path before,
> maybe what I say can be of a little help...
>
> >I've come to a conceptual impasse regarding primary keys. In doing my
> >pen and paper design work and I've come up with two issues: Some of my
> >entities can only be uniquely identified by all of their properties and
> >some of my entities have good candidate keys, but in real life, that
> >data might not be known.
> >
> >For example - I see an address as an entity. The pseudo-DDL would look
> >like this:
> > table Address
> > Street Address 1 alpha,
> > Street Address 2 alpha,
> > City alpha,
> > State alpha,
> > Postal code alpha,
> > Country alpha
> >
> >It seems to me the only unique key would be combination of all of these
> >properties.
>
> Correct. Until we get DNA databases, HR data is very often just like
> this.
>
> >In the past, I would add an integer ID field as the key and
> >leave it at that. I have read numerous postings suggesting that this is
> >poor relational design, many hinging on the fact that the generation of
> >of these IDs tends to be specific to each DBMS vendor. As portability is
> >a concern, I don't want to make a _big_ mistake. Additionally, I have
> >run into issues on the application development side when using generated
> >keys. So, what is the best way to define a primary key for an entity
> >such as this one?
>
> You need a surrogate key. I believe Joe Celko will probably recommend
> some kind of hash function over all the columns ... although the hash
> function will probably be more DBMS specific than a sequence, it does
> have certain advantages (e.g. you might be able to sort more
> efficiently with a hashed key).

Hi Bob.

It seems a hash function over all (or a subset) of columns would suffer from some problems. Why would Celko recommend this over a random generator or a sequence for a surrogate. If one has to access all key values to create the hash, then if the key structure changes, wouldn't one have to change both the underlying logical schema *and* the hash function (i.e. code)? What if other relations/tables have foreign key constraints and a value in one of the column row intersections changes? The propagating affects from a -rehashing refresh- would seem like a nightmare to manage.

Obviously, portability and interoperability of the hashing functioin is an issue if there is vertical fragmentation is employed as well.

Thanks,

Dan
>
> If you want to have a "portable sequence", you can add an integer
> column and select max()+1 of that column when you do an update. Be
> sure that you have a mechanism for locking the table when more than 1
> user is inserting data, though ... this can be a real drag on
> performance.
>

For concurrency control granularities employing row-level locking, max() + 1 could suffer from vulnerabilities in terms of the lost update problem, phantom reads, rollback, etc. One way of getting around this is to have the generator guarantee one and only one issuance of any given value during its lifetime. Thus, when rollbacks or transaction interleaving occurs, even though it might be serializable, duplicates can be avoided. The alternative is to rely on the candidate key constraints to reject a duplicate key value generation upon insert and have the mechanisms in place to get a new value.

> >Example 2 - I see a person as an entity. SSN is a good candidate key for
> >a person, but in my situation, that information will frequently be
> >unknown. I see no other candidate keys. Yet if it is poor practice to
> >'manufacture' a key, what approach should be taken?
>
> SSN is also not a good choice ... it can change, it is often unknown
> (foreigners to the USA and children don't have SSN), and it is not
> guaranteed to be unique in all cases (don't ask me for the source on
> this ... Joe will have a much better answer<g>)
>
> >I fear that I have missed something fundamental in my understanding of
> >relational design, because I know I'm not the only one to come up
> >against this. Any thoughts or resources are very much appreciated.
> >
> >Thank you,
> >Ben
>
> Just my 2 cents' worth ... you'll need some kind of auto-sequence or
> else a "before insert" trigger to generate either a hash key or get
> the next sequence value ... Better yet, do all updates/inserts/deletes
> through some kind of API so that you can better control security (i.e.
> access to the database). Oracle has roles for this sort of thing,
> other databases have similar mechanisms (except for MS-Access, which
> isn't really a database...)
>
>
> --
> Bob Hairgrove
> NoSpamPlease_at_Home.com
Received on Tue Mar 30 2004 - 05:55:07 CEST

Original text of this message