Re: what are keys and surrogates?

From: rpost <rpost_at_pcwin518.campus.tue.nl>
Date: Fri, 04 Jan 2008 22:37:19 +0100
Message-ID: <bf6b0$477ea70f$839b4533$2250_at_news1.tudelft.nl>


JOG wrote:

>> >1) A 'Key' is a set of attributes, used at the logical layer to
>> >uniquely identify a proposition.
>> >2) An 'Identifier' is a set of attributes, used at the conceptual
>> >layer to uniquely identify an entity.
>>
>> Why use two different terms?
>
>Because they apply to different things, propositions and entities.
>Don't confuse the two.

What is the difference?

For me, an entity is a proposition treated as a thing (i.e. used as a reference rather than a description).

>> The definition I was taught is different: a key is a constraint on
>> a relation, namely a set of attributes such that any two different
>> tuples in the relation differ in at least one of these attributes.
>
>Makes sense, but whoever mentioned relations? My definitions were
>generalized, not specific to E/R or RM.

E/R and RM is as generalized as it gets, as far as I'm concerned. Unless you have some other mathematical formulation in mind.

>> This is quite different from your notion,
>

[...]

OK, I must have misunderstood what you wrote.

>> >* A 'Surrogate Identifier' is an attribute designated at design time
>> >to stand in place of natural identifiers that are either: (i)
>> >Impractical to observe, (ii) Impractical to manipulate.
>>
>> My try: an "artificial" identifier is an relation attribute
>> that is a key, and on which equality is the only meaningful operation.
>> (I.e. it only serves to cross-identify the tuples of the relation with
>> something else, e.g. tuples in another relation, or something outside
>> the database).
>>
>> A "surrogate" identifier is a *hidden* artificial
>> identifier, i.e. even the database queries only use them in equalities.
>
>No attributes should ever be hidden. Ever. If they are, they can be
>deleterious to the system. Myself and Brian had a /long/ debate about
>this in the past, and one of the reasons I still consider consider his
>points with diligence and an open mind, is that he showed a lot of
>balls towards the end when he acceded the logic of that point -
>although we heartily debate many others still ;)

So I looked that up: you appear to be referring to

  http://groups.google.com/group/comp.databases.theory/msg/7c0de7c37713bf0a

What I'm seeing in that thread is a single person (Brian Seltzer) who simply states the fundamental cornerstone of not only relational databases, but predicate logic in general (namely, the fundamental "hiddenness" of attribute values, a principle that in many theoretical articles isn't just reserved for special "surrogate" attributes, but is simply assumed for all attributes), and keeps getting thrashed by the rest of you lot until he's so dizzy that "a light bulb comes on" and he "concedes". No one in that entire thread appears to understand what it means for a surrogate to be hidden: Brian is quite close in his descriptions, but doesn't find the switch, while you and Bob Badour are outright wrong about what it means for an attribute to be "hidden" or "a surrogate", because you overlook the possibility of "perfect auto-id" attributes that can only be created by automatic generation and can only be compared for equality. These are standard in relational database systems and were introduced into database theory by this article:

  http://citeseer.ist.psu.edu/context/21993/0

You may have arguments against them (I have some of my own) and some of what is said in that thread may apply to them, but the thread doesn't even mention them. Brian's eventual argument for conceding doesn't apply to them.

>I certainly can't face going through that battle all again I'm afraid,
>safe to say that as a programmer I understand the intuition behind
>what you think (after all I believe Codd made the same error in RM/T),
>but know that it is a mistake and can end up in situations where the
>database gets screwed up.

I'm not sure I want to take up the battle axe, but your database is safe.

>> (So unless we have specific value domains to safeguard this, this is
>> a property of the queries we use rather than the attribute itself.)
>>
>> >* Once a 'Surrogate Identifier' has been created it merely becomes an
>> >unfamiliar 'Natural Identifier' (it too must become an observable
>> >attribute of the entity it applies to).
>>
>> This is not true: they can serve as foreign keys without ever being
>> published (if your queries are careful enouygh). I'd like to use
>> a separate term for arbitrary but published identifiers.
>
>As I previously stated, no attributes should ever be hidden. A tuple
>is merely an encoded proposition. That proposition orginates prior to
>entry in the database, and as such all of its contents come from the
>outside world.

Yes, that is your dogma. Making that assumption doesn't make it a necessity. Making valid arguments for it might: I haven't seen any.

>In another thread I used a car example to argue against
>OID's, but one might have used a hidden surrogate instead of a VIN
>attribute, and ended up with the same broken database.

Yes, when you confuse hidden attributes with published, assignable ones, as you stubbornly insist on doing, things will break horribly.

>> >* A 'Surrogate key' is merely a key that contains an attribute that
>> >was a surrogate identifier at the conceptual layer.
>>
>> For me there is no such thing as a surrogate identifier at the
>> conceptual layer. A surrogate is a "1NF" counterpart of an
>> arrow in a conceptual model.
>
>Surrogate means in "place of".

Exactly: to be more exact, it stands for a reference to the whole tuple.

>Your phone company quite sensibly
>chooses a customerID for you "in place of" of your naturally
>identifying dna code. Once the surrogate identifier is created it
>becomes another statable natural attribute (and indeed the phone
>company asks for it when you call up to complain about your bill).

That is why it isn't hidden. Its not being hidden fundamentally breaks it as a surrogate: I can now pose as another customer by using his customerID, and customerIDs will start to accrue all kinds of additional meaning (e.g. if they are issued in increasing order, applications may start to rely on that being the case in one way or another).

>It is unfamiliar compared to your other attributes, an no doubt like me
>you have to check the top of your bills to find out what it is. And
>this all has nothing to do with databases, it all happens outside.

Hidden attributes, on the other hand, never leave the database; in queries they are only used in equality tests (joins). This is why it makes sense to distinguish hidden attributes from published "arbitrary" identifiers.
I don't blame you for having overlooked this, it wasn't exactly obvious to me when I first read about it, but I hope you will in the future.

>Surely that is all straightforward and entirely sensible!

Yes, but it also completely misses the point.

>Merry Xmas, J.

Belated best wishes for 2008 ...

-- 
Reinier
Received on Fri Jan 04 2008 - 22:37:19 CET

Original text of this message