Re: Help finding natural keys

From: Tim X <timx_at_spamto.devnul.com>
Date: 19 Jan 2003 13:42:44 +1100
Message-ID: <878yxhn9ln.fsf_at_tiger.rapttech.com.au>


>>>>> "Alan" == Alan Gutierrez <ajglist_at_izzy.net> writes:

 Alan> Tim Cross wrote:
>>>>>>> "Alan" == Alan Gutierrez <ajglist_at_izzy.net> writes:

 Alan> ISSUE 2
>>
>> >> What I found of far more concern was the way you generate your
>> >> surrogate keys. Determining a new primary key by looking at
>> >> the largest current primary key and adding one to it is very
>> >> dangerous. The big danger is that first selecting the maximum
>> >> value from a field, adding some amount to it and then inserting
>> >> the resulting key back into the table is not an atomic action.
>>
 Alan> INSERT INTO Patient (firm_id, patient_id) VALUES (:firm_id, 0)
 Alan> ;
>>

 Alan> The above is atomic in PostgreSQL. The first transaction to
 Alan> execute this statement will block any other transaction
 Alan> attempting to execute this statement until this first
 Alan> transaction rolls back or commits.

>>
 Alan> UPDATE Patient SET patient_id = (SELECT MAX(patient_id) FROM
 Alan> Patient AS P1 WHERE P1.firm_id = firm_id) WHERE firm_id =
 Alan> :firm_id AND patient_id = 0
 Alan> ;

>>

 Alan> Voila.

>> The actual insert/update operations might be atomic in themselves,
>> but not between each other. For example, in your above example,
>> what would happen if another process has inserted a new record
>> after you inserted a new record, but before you issue the update
>> command to set the patient id?

 Alan> In PostgreSQL the second process would wait until the first
 Alan> process ended its transaction.

>> My interpretation of what you have above would indicate the
>> possibility both records would get the same patient id - the
>> possibility might be slight because of the firm_id part in the
>> where clause and it is 'unlikely' two records from the same firm
>> will be inserted and waiting for update at the same time. However
>> 'unlikely' is not never.

 Alan> I understand race conditions. I have tested this sequence of
 Alan> statements to ensure that with concurrent transactions, the
 Alan> first one to perform the insert statement will force other
 Alan> transactions to wait until transaction end to resume
 Alan> execution. This method for generating unique ids does work.


>> >> Most commercial (and even most free) DBMS have some facility
>> >> for generating id numbers which are guaranteed to be unique
>> >> (e.g. sequence). If your system has such a facility, use it
>> >> rather than the max(some_column)+1 approach. Apart from
>> >> guaranteeing your id numbers will be unique, it will probably
>> >> be faster than doing a max() lookup followed by an insert.
>>
 Alan> I'm going to put more thought into how I generate my surrogate
 Alan> keys. You are not the only one to raise an eyebrow at SELECT
 Alan> MAX. I chose this over sequences since a sequence was one more
 Alan> database object to maintain. This is an area where I have
 Alan> research to do.
>>
>> Trust me on this one. If your database provides some sort of
>> object for creating unique values in an atomic manner, use it. The
>> maintenance overhead will be far less than dealing with problems
>> due to transactions which attempt to use an existing unique
>> value. From your clients perspective, it will impress them a lot
>> less if they need to keep contacting you to ask why once or twice
>> a week they are unable to insert new records etc because inserts
>> attempt to use a non-unique value as the primary key.
 Alan> The maintenance overhead of sequences still exists, and the
 Alan> problem of generating duplicate keys has been resolved. Are
 Alan> sequences still the better bet?

 Alan> Thank you for your concern. Data races and deadlocks are
 Alan> frequently a source of trouble for those new to multi-process
 Alan> environments. They still trip me up on (not-so-rare)
 Alan> occasion. I appreciate all the frantic hand waving. I'll post a
 Alan> sanity check over at a PostgreSQL forum. But this does work.

First, I preface what follows with the admission I know nothing about postgres and what follows is based largely on assumption.

I would still go with the sequence and here is why -

  1. Sequences are designed to fulfill exactly the role you are after.
  2. The maintenance of a sequence is minimal. Essentially, you just define the sequence and then use it. The only real maintenance would be watching for when you run out of sequence numbers, and that's not very different from watching for when your ids get too large for the column.
  3. Your method relies on how the database is implemented and is not portable. If you had to move to a new dbms, it is possible it would not work. For example, in Oracle, writers do not block readers and readers do not block writers, so your solution would possibly result in duplicate id numbers being generated.
  4. If your solution does work under postgres, it indicates that writers block other writers and readers. The system probably uses table level locking when doing inserts/updates etc. This approach is fairly common because row level locking is difficult to implement well and reliably. However, table level locking is inefficient and prevents the system from exploiting concurrency, because access becomes serialised. It is therefore possible postgres will change its implementation in the future to improve performance. When this happens, your method of implementing unique id values is likely to break. Sequences, on the other hand, are guaranteed to provide unique values, so even if the way they are implemented changes, it will not affect how you use them.
  5. Sequences are likely to be far more efficient than your technique. While your method probably won't involve a full table scan, because it's based on primary keys which are usually indexed, a sequence is usually designed to provide efficient generation of new values and is likely more efficient than finding the max value in a column (of course this assumption depends on a number of unknown variables such as the type of index used, caching, locking etc.).
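To be concrete about points 1 and 2, using a sequence involves little more than creating it once and referencing it in the insert. This is only a sketch - as I said, I don't know postgres, so the name patient_id_seq is my own invention and you should check the postgres manual for the exact syntax:

```sql
-- One-off definition; after this there is essentially nothing to maintain.
CREATE SEQUENCE patient_id_seq;

-- Each call to nextval() atomically hands out a new, unique value,
-- even under concurrent inserts - no SELECT MAX() needed.
INSERT INTO Patient (firm_id, patient_id)
VALUES (:firm_id, nextval('patient_id_seq'));
```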

Finally, you say you have tested your technique to make sure it works. I just want to point out how difficult it is to test for something like this. It is not sufficient to just create a number of concurrent sessions and try it out, because there are too many variables which can affect the results: machine speed, number of CPUs, amount of memory, network latency, load from other processes, timing etc. Really the only way to be certain it works is to do an in-depth analysis of how the dbms works, its locking mechanisms etc. This would involve an audit of the source code. I'm not trying to be critical of your testing skills, simply to point out how complex it can be to make a solid determination that any form of race condition does not exist - you really need to prove it theoretically, from the source code, rather than practically. Note also that you do NOT need multiple processors for this situation to occur. It can happen with a single processor because there is no guarantee on the order of execution of commands between two processes - it all depends on issues like how the OS schedules jobs, the priorities of the processes etc.
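To illustrate the sort of interleaving a test can easily fail to hit, consider two sessions using the max()+1 pattern under a dbms where writers do not block readers (a sketch only - the values and firm_id are invented for the example):

```sql
-- Session A                              -- Session B
BEGIN;
SELECT MAX(patient_id) FROM Patient
 WHERE firm_id = 1;   -- sees 41
                                          BEGIN;
                                          SELECT MAX(patient_id) FROM Patient
                                           WHERE firm_id = 1;   -- also sees 41
INSERT INTO Patient (firm_id, patient_id)
VALUES (1, 42);
COMMIT;
                                          INSERT INTO Patient (firm_id, patient_id)
                                          VALUES (1, 42);       -- duplicate id
                                          COMMIT;
```

Whether a test run ever falls into this window is purely a matter of timing, which is exactly why a test that happens not to hit it proves nothing.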

HTH Tim

-- 
Tim Cross
The e-mail address on this message is FALSE (obviously!). My real e-mail is
to a company in Australia called rapttech and my login is tcross - if you 
really need to send mail, you should be able to work it out!
Received on Sun Jan 19 2003 - 03:42:44 CET
