Re: Surrogate Keys: an Implementation Issue

From: Brian Selzer <>
Date: Thu, 27 Jul 2006 09:09:33 GMT
Message-ID: <hp%xg.138387$>

"paul c" <> wrote in message news:lWXxg.243067$iF6.132287_at_pd7tw2no...
> Brian Selzer wrote:

>> "Bob Badour" <> wrote in message 
>> news:PPMxg.21411$
>>> Brian Selzer wrote:
>>>> "Bob Badour" <> wrote in message 
>>>> news:X1gxg.15744$
>>>>> Brian Selzer wrote:
>>>>>> "Bob Badour" <> wrote in message 
>>>>>> news:XQ5xg.14298$
>>>>>>> Brian Selzer wrote:
>>>>>>>> "Bernard Peek" <> wrote in message 
>>>>>>>>> In comp.databases.theory Paul Mansour wrote:
>>>>>>>>>> The second is perhaps more profound. Consider a rollback 
>>>>>>>>>> database, or
>>>>>>>>>> a database that must provide a complete audit trail of every 
>>>>>>>>>> change.
>>>>>>>>>> For example, the database must provide the answer to "who changed
>>>>>>>>>> this SS number from  X to Y, and when did they change it? As far 
>>>>>>>>>> as I
>>>>>>>>>> can tell, if there is no way to answer this without an immutable
>>>>>>>>>> identifier. ( I suppose you could design the DB to handle 
>>>>>>>>>> specific
>>>>>>>>>> cases, but I'm interested in DBMS with native rollback and audit
>>>>>>>>>> trail support.)
>>>>>>>>> If what you think of as the key is mutable it's not a natural key.
>>>>>>>>> Natural keys aren't mutable, at all, ever.
>>>>>>>> Untrue!  Natural keys are often mutable--especially compound keys.
>>>>>>>>> The problem here is that there isn't a usable natural key for
>>>>>>>>> identifying people. So all that's left are surrogates of varying
>>>>>>>>> quality. The SSN is a surrogate that works most of the time. If 
>>>>>>>>> you
>>>>>>>>> assign someone a payroll number and tell tham that thay won't get 
>>>>>>>>> paid
>>>>>>>>> unless they can quote it then you have a close to immutable key. 
>>>>>>>>> But
>>>>>>>>> it's still a surrogate.
>>>>>>>>> I think the original post was part correct. Use a natural key when 
>>>>>>>>> you
>>>>>>>>> have one. I'd just add to that, don't use a surrogate unless there 
>>>>>>>>> is a
>>>>>>>>> compelling reason for it. Efficiency is unlikely to be a 
>>>>>>>>> compelling
>>>>>>>>> argument in most cases.
>>>>>>>> There is always a compelling reason to use surrogates: natural keys 
>>>>>>>> can change.  This makes it difficult--if not impossible--to detect 
>>>>>>>> changes to rows.  For example, Bob is preparing to update a 
>>>>>>>> row--that is, he has read the row and is in the process of keying 
>>>>>>>> in a change. During that time, another process updated several rows 
>>>>>>>> in the same table.  Unless the key is immutable, when Bob issues 
>>>>>>>> the update, there's no guarantee that the row he's updating is the 
>>>>>>>> same one that was read out.
>>>>>>> So? Bob issues an update statement to change some attribute 
>>>>>>> identified by a logical identifier. As long as the data Bob is 
>>>>>>> entering is correct, why should he care what happens to any other 
>>>>>>> attributes?
>>>>>> He may be overwriting a change made by another user.
>>>>> The only relevant concern is whether the update is correct. If the 
>>>>> update is correct, it matters not at all that it overwrites something, 
>>>>> which is presumably no longer correct.
>>>>>>> This problem is magnified
>>>>>>>> if there are rows related via a foreign key constraint because it's 
>>>>>>>> possible for the referenced row to appear unchanged.  So you're 
>>>>>>>> left with either maintaining an exclusive lock on the row until Bob 
>>>>>>>> returns
>>>>>>> >from the golf outing, or adding additional columns and code in 
>>>>>>> >order to
>>>>>>>> determine with certainty whether or not a change occurred between 
>>>>>>>> the time that a row was read and the time of the update.
>>>>>>> Or you can just write better applications that don't update anything 
>>>>>>> that didn't change. By introducing some 'under-the-covers' 
>>>>>>> identifying attribute, you create a risk that Bob will change some 
>>>>>>> data identified by a familiar logical identifier and some other 
>>>>>>> process in the meantime will associate that identifier with a 
>>>>>>> different surrogate. Your application will then record the updates 
>>>>>>> against the wrong logical identifier.
>>>>>> You're missing the point.  The change Bob's making may change the 
>>>>>> row, but because several changes occurred to the table while Bob was 
>>>>>> keying in his change, the row he's about to change may represent some 
>>>>>> other entity altogether.
>>>>> Using logical identity and natural keys, that's not possible. The key 
>>>>> that Bob specifies identifies the entity. And if Bob updates the 
>>>>> entire row, the entire row needs updating. I draw your attention, in 
>>>>> particular, to my observation that one can just write better 
>>>>> applications that don't update anything that didn't change.
>>>> Of course it's possible.  The key that Bob specifies identifies the 
>>>> entity at the time of the read from the database.  By the time that the 
>>>> write occurs, that same key now refers to a different entity.
>>> Are you stupid or just ignorant? What you state above can only happen if 
>>> one assumes one is using a hidden surrogate. If one uses explicit values 
>>> that Bob interacts with, then the value identifies a single entity. That 
>>> identity does not change.
>> From Bob's perspective, what he would expect is that the entity that he's 
>> working on, identified by the key values he used to retrieve it, would 
>> not change; however, since the database is no longer the same when Bob 
>> attempts to commit, the entity that he was working on may either no 
>> longer exist, or may have been changed by another user.  During that same 
>> interval, a second entity may have been changed so that by the time Bob 
>> attempts to commit, that second entity may be identified with respect to 
>> the new database state by the exact same key values that Bob read out.
>> ...

> I think this is misreading the information principle and perhaps it is due
> to falling into the trap of thinking typical lock manager behaviour has
> something to do with RT. Even if it's not that, it's a mistake. It
> doesn't matter if the rows involved (not necessarily the ones updated)
> were deleted twenty times and re-inserted. If the invoking transaction
> were required to re-iterate the 'rows' it read whose values it considered
> to be crucial for a correct update, then there would less chance of
> falling into this trap.

The problem is with the theory, not the mechanisms used to get around its limitations. As I stated before, the Relational Model doesn't take into account duration. What do I mean by that? The Relational Model takes into consideration only two successive database states: the current state, which is always consistent, and the proposed state, which may or may not be consistent. These states are separated by a single operation, either a relational assignment or a multiple assignment. If the proposed state violates the database predicate, then it is rejected, otherwise, upon completion of the operation, the proposed state becomes the current state. The operation occurs instantaneously (at least from a theoretical standpont), meaning that it has no duration; however, there are no simultaneous operations, and order is important. (In a concurrent environment, simultaneous assignments are aggregated into a single multiple assignment.) Relational assignment is set-based--the new relation value replaces the old relation value; multiple assignment is a set of relational assignments and is also set-based--the set of new relation values replaces the set of old relation values. There are no row-based operations. This poses several problems. I'll use contrived examples for brevity to illustrate these problems. To be useful, a database is a representation of some aspect of reality. Each table is a container for a specific type of entity, and each row represents a real instance of that type of entity. Consider the following states for a table that contains people:

Current: Jane Jones Married Proposed: Jane Smith Divorced

Do Jane Jones and Jane Smith represent different people? Or did Jane Jones get Divorced? Because the only key is mutable, there isn't enough information to answer that. Another problem involves temporal constraints. Consider the following states for the people table and a transition constraint, Single people can't become Divorced:

Current: Jane Jones Married        Proposed: Jane Jones Married
Current: Jane Smith Single           Proposed: Jane Smith Divorced

Should the proposed state be rejected? Or is it possible that Jane Jones got Divorced becoming Jane Smith and Jane Smith married Bob Jones? Because the only key is mutable and because there or no row-based operations, there isn't enough information to answer that. The lack of row-based operations means that if all keys are mutable, it is not possible to enforce transition constraints because there is no way to correlate the rows in the Current state with those in the Proposed state. Another problem involves the situation I described above. Add the following states for a children table:

Read: Jane Jones Brian             Write: Jane Jones Beth
Read: Jane Jones Lynn             Write: Jane Smith Brian
                                                Write: Jane Smith Lynn

When Bob reads Jane Jones' row from the people table, Jane Jones has two children, Brian and Lynn, but when Bob is ready to update the people table, Jane Jones only has one child, Beth. Clearly these are two different people, even though the row in the people table is identical. Because the only key is mutable, Bob will be updating the wrong row.

It should be obvious that surrogates solve these problems.

> p
Received on Thu Jul 27 2006 - 11:09:33 CEST

Original text of this message