Re: How to Implement Versioned Rows?

From: Jonathan Leffler <jleffler_at_earthlink.net>
Date: Sun, 12 Jan 2003 00:13:33 GMT
Message-ID: <3E20B297.1060604_at_earthlink.net>


Alan Gutierrez wrote:
> My application is gathering personnel and patient data for a hospice. It must
> keep track of all changes to patient stats, chart, and med sheet over time.
> Therefore, I would like to keep versions of the rows in many of my tables.

This sounds rather like a temporal database. You should probably review the material in two books (at least):

    R T Snodgrass, "Developing Time-Oriented Database Applications in SQL"
    C J Date et al, "Temporal Data and the Relational Model"

Both are tough going in their own (very different) ways, but also illuminating.
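
For a flavour of the Snodgrass approach: instead of pinning each version to a transaction number, every row carries the period during which it was current. A minimal sketch - the table and column names here are mine, purely illustrative:

    -- patient_stat is an invented example table, not from the original post.
    create table patient_stat (
        patient_id integer not null,
        stat_value varchar(32),
        valid_from timestamp not null,  -- when this version became current
        valid_to   timestamp not null,  -- when it was superseded ('infinity' for the current row)
        primary key (patient_id, valid_from)
    );

    -- "Current" rows are the ones whose period covers now.
    select * from patient_stat
     where valid_from <= current_timestamp
       and valid_to   >  current_timestamp;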

> This is my stab at a PostgreSQL implementation. I would greatly
> appreciate any input, criticisms, dire warnings, etc.
>
> I plan on adding a transaction table with a transaction_id column fed by a
> sequence. The transaction_id will indicate the order of creation as well as
> time of creation.

So, in general, there'll be many rows identified as being modified by any given transaction ID.

> -- A row for each of our customers who subscribe to the application.
>
> create table firm ( -- firm => hospice, clinic, hospital
> firm_id integer not null,
> name varchar(32), -- Just one example data column
> primary key (firm_id)
> )
> \g

This \g is a PostgreSQL thingy - equivalent to a semi-colon in orthodox SQL?

No temporal/transactional information in the 'firm' table? So you'll never get new clients or lose old ones? Or it doesn't matter whether you track them?
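
If firms do need the same treatment, the obvious move within your own scheme is to version them like everything else - a hypothetical firm_history, sketched only to show the shape:

    -- firm_history is an invented name, mirroring person_history below.
    create table firm_history (
        firm_id        integer not null,
        transaction_id integer not null,
        name           varchar(32),
        deleted        boolean not null,
        primary key (firm_id, transaction_id)
    );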

> -- A table to keep a row for each transaction.
>
> create table transaction (
> firm_id integer not null,
> transaction_id integer not null,
>     -- Fed by sequence, one for each firm so we can part and merge
>     -- databases by firm without collision!
> modified timestamp not null,
> modified_by person_id not null,
> primary key (firm_id, transaction_id),
> foreign key (firm_id) references firm
> )
> \g

Oh, hmm...so the combination of firm_id and transaction_id is unique...that's going to make the sequencing of operations trickier.
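
If you do stay with per-firm numbering, one way to hand out the next transaction_id (rather than one global sequence) is a counter row per firm, bumped inside the same transaction that writes the history rows; the update also serialises concurrent writers for that firm. A sketch only, not tested - firm_counter is an invented helper table:

    create table firm_counter (
        firm_id integer not null primary key references firm,
        last_transaction_id integer not null
    );

    -- Allocate the next ID for firm 42; the update row-locks the counter
    -- until the surrounding transaction commits.
    update firm_counter
       set last_transaction_id = last_transaction_id + 1
     where firm_id = 42;

    select last_transaction_id
      from firm_counter
     where firm_id = 42;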

> -- Example versioned table.
>
> create table person_history ( -- Base for patient and employee
> firm_id integer not null,
> person_id integer not null,
> transaction_id integer not null,
> first_name varchar(32), -- Just two example data columns
> last_name varchar(32) not null,
> deleted boolean not null,
> primary key (firm_id, person_id, transaction_id)
> )
> \g

In the absence of the tracking, you'd have person_id as the primary key? And is there a cross-reference between the transaction table (modified_by column) and this table? Suppose a new person is created with person ID P1 by a person from firm F1 with transaction ID T1; subsequently, the information is updated by a person from firm F2 with transaction ID T2. Does the primary key now change from {F1,P1,T1} to {F2,P1,T2}? Does that mean all cross-references to {F1,P1,T1} now need to be updated to reference {F2,P1,T2}? Or do they continue to reference just P1 and let the temporal information {F1,T1} vs {F2,T2} go by the wayside?
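
To make the cross-reference question concrete, here is a hypothetical chart table done both ways (the chart_entry_* tables are my names, not yours):

    -- Referencing the full temporal key pins the note to one specific
    -- version of the person; later versions are invisible to it unless
    -- every referencing row is updated as well.
    create table chart_entry_full (
        firm_id        integer not null,
        person_id      integer not null,
        transaction_id integer not null,
        note           varchar(255),
        foreign key (firm_id, person_id, transaction_id)
            references person_history
    );

    -- Referencing just the person is stable across versions, but
    -- person_id alone is not a key of person_history, so the database
    -- cannot enforce the relationship.
    create table chart_entry_bare (
        person_id integer not null,
        note      varchar(255)
    );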

> -- Show latest row view.
>
> create view person as
> select *
> from person_history
> where transaction_id = ( -- In explain this subselect appears to use index!
> select max(transaction_id)
> from person_history as ph1
> where firm_id = ph1.firm_id
> and person_id = ph1.firm_id
> )
> and deleted = 0
> \g

So, after I delete someone, they continue to exist in stasis as of the last update before the delete? This condition is not anywhere near complex enough, I think.
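
For what it's worth, the shape I would expect - the subselect correlated to the outer row's firm and person, and a person hidden only when their most recent row is the delete marker - is roughly this (untested):

    create view person as
    select ph.*
      from person_history ph
     where ph.transaction_id = (
               select max(ph1.transaction_id)
                 from person_history ph1
                where ph1.firm_id   = ph.firm_id
                  and ph1.person_id = ph.person_id
           )
       and not ph.deleted;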

> -- Time travel view.
> create view person_as_of as
> select *
> from person_history
> where transaction_id = (
> select max(transaction_id)
> from person_history as ph1
> where firm_id = ph1.firm_id
> and person_id = ph1.firm_id
> and transaction_id <= (
> select transaction_id
> from past_transaction
> limit 1
> )
> )
> and deleted = 0
> \g

This doesn't seem to make sense either; since transaction IDs are only unique within a firm, the code above does not prevent mismatching transaction IDs from different firms.
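
One way to keep the firms apart is to derive the cut-off from the transaction table's timestamp rather than comparing bare transaction IDs across firms - a sketch, with the as-of point hard-coded because a view cannot take a parameter:

    create view person_as_of as
    select ph.*
      from person_history ph
     where ph.transaction_id = (
               select max(ph1.transaction_id)
                 from person_history ph1, transaction t
                where t.firm_id        = ph1.firm_id
                  and t.transaction_id = ph1.transaction_id
                  and ph1.firm_id      = ph.firm_id
                  and ph1.person_id    = ph.person_id
                  and t.modified      <= '2002-12-02 17:59:00'
           )
       and not ph.deleted;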

> In my application I can travel in time thus:
>
> create temporary table past_transaction as
> select transaction_id
> from transaction
> where modified <= '2002/12/2 17:59:00' -- the minute I turned 31
> order by firm_id desc, transaction_id desc
> limit 1
> \g
> -- If only I could pass the view a parameter!
> select * from person_as_of
> \g

Interesting; many databases would not allow you to create a view on a temporary table.
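
The usual workaround for "if only I could pass the view a parameter" is a set-returning function in place of the view and the temporary table - assuming a PostgreSQL recent enough for table functions, an untested sketch whose body is just the as-of query with $1 where the literal timestamp was (person_at is my name for it):

    -- person_at is an invented name for this hypothetical function.
    create function person_at(timestamp)
    returns setof person_history as '
        select ph.*
          from person_history ph
         where ph.transaction_id = (
                   select max(ph1.transaction_id)
                     from person_history ph1, transaction t
                    where t.firm_id        = ph1.firm_id
                      and t.transaction_id = ph1.transaction_id
                      and ph1.firm_id      = ph.firm_id
                      and ph1.person_id    = ph.person_id
                      and t.modified      <= $1
               )
           and not ph.deleted
    ' language sql;

    -- Time travel then needs no temporary table at all:
    select * from person_at('2002-12-02 17:59:00');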

> Thoughts:
>
> * I can write a query and have it travel back in time by setting one
> variable. Neeto.
>
> * No archive tables or such means no copying, new version is a simple
> insert. Good.

Read Snodgrass...

> * With explain the subselects appear to use the indexes with aggregates,
> and if not I can always sort descending limit 1. Good.
>
> * Even with complex joins on the latest view tables the query plans use
> the primary index for the sub select. Good.
>
> * There is little need for vacuuming, since no updates are made to the
> busy tables of the application. Does this matter?
>
> * Referential integrity goes away from what I can see, since it won't
> understand the deleted column. Pity.
>
> Questions:
>
> * Is this viable or overly clever?

Gut feel: not really viable. And both overly clever and not yet clever enough.

> * Should I have a boolean latest column on a versioned table? This would
> mean update and vacuum, but potentially a faster query.

It converts every non-SELECT operation into a double operation - one to update the 'latest version' flag on the old record plus the operation that creates the new record (or removes the old record).
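
In code, with a hypothetical latest flag added to person_history, every change turns into a pair of statements like this (the values are placeholders):

    begin;

    -- Retire the previous 'latest' row; this is the update that brings
    -- vacuum back into the picture.  ('latest' is the proposed flag,
    -- and the literal values are placeholders.)
    update person_history
       set latest = false
     where firm_id = 1 and person_id = 7 and latest;

    -- ...plus the insert of the new version.
    insert into person_history
        (firm_id, person_id, transaction_id, first_name, last_name,
         deleted, latest)
    values
        (1, 7, 43, 'Alan', 'Gutierrez', false, true);

    commit;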

> * Is there a penalty for long (how do you say?) concatenated keys in
> PostgreSQL?

In general terms, if the rest of the design is OK, then the performance penalty for bigger concatenated keys should not be much of an issue. Of course, the comparisons are harder - they have to check 3 fields to establish equality (though often just one field to establish inequality). But there should not be a huge impact... Maybe...

> * Any reason why this won't work with the coming distributed PostgreSQL?

The only problem with a distributed version of the database is ensuring that the sequences for the transaction IDs are properly managed across multiple machines. That is likely to be tricky.
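
One low-tech dodge, if the per-firm numbering were dropped in favour of a single ID space, is to partition that space per node rather than coordinate one sequence across machines - hypothetical, and it caps the node count at the chosen increment:

    -- transaction_seq is an invented name; node 1 hands out 1, 11, 21, ...
    -- and node 2 hands out 2, 12, 22, ... and so on.
    create sequence transaction_seq
        increment 10
        start 1;    -- use "start 2" on the second node, etc.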

> * Where can I read about alternative implementations for history/versioning?
> Nuances?

See books above, and the (extensive) references in each volume.

> Thank you all for any input whatsoever.
>
> Alan Gutierrez
>

-- 
Jonathan Leffler                   #include <disclaimer.h>
Email: jleffler_at_earthlink.net, jleffler_at_us.ibm.com
Guardian of DBD::Informix 1.04.PC1 -- http://dbi.perl.org/
Received on Sun Jan 12 2003 - 01:13:33 CET
