Re: implementing a database log

From: David BL <davidbl_at_iinet.net.au>
Date: Sun, 27 Apr 2008 23:07:26 -0700 (PDT)
Message-ID: <ae011bd0-63c3-4373-ba35-2fe9fe6be331_at_26g2000hsk.googlegroups.com>


On Apr 28, 12:25 pm, "Brian Selzer" <br..._at_selzer-software.com> wrote:
> "David BL" <davi..._at_iinet.net.au> wrote in message
>
> news:289d3678-7a00-463e-aa99-33c7944ce68b_at_27g2000hsf.googlegroups.com...
>
> > On Apr 24, 10:11 pm, "Brian Selzer" <br..._at_selzer-software.com> wrote:
> >> "David BL" <davi..._at_iinet.net.au> wrote in message
>
> >>news:fd63f466-18f7-4986-b378-f5e9f512bbd8_at_r66g2000hsg.googlegroups.com...
>
> >> > On Apr 22, 11:39 pm, "Brian Selzer" <br..._at_selzer-software.com> wrote:
> >> >> "David BL" <davi..._at_iinet.net.au> wrote in message
>
> >> >>news:265621e9-25fd-46d5-9e2d-7a4f63fa84b4_at_m3g2000hsc.googlegroups.com...
>
> >> >> > On Apr 22, 6:58 am, "Brian Selzer" <br..._at_selzer-software.com>
> >> >> > wrote:
> >> >> >> "Christoph Rupp" <cruppst..._at_gmail.com> wrote in message
>
> >> >> >>news:91140c56-1f05-4b5d-b45f-b34920db2051_at_x41g2000hsb.googlegroups.com...
>
> >> >> >> > Brian,
>
> >> >> >> > On Apr 21, 11:00 pm, "Brian Selzer" <br..._at_selzer-software.com>
> >> >> >> > wrote:
> >> >> >> >> Why not go with #4:
>
> >> >> >> >> 4. a physical log based on modified rows. Whenever a row is
> >> >> >> >> modified, added or removed, it is logged. Then you could also
> >> >> >> >> implement row versioning--just add a row version field to the
> >> >> >> >> physical rows. I believe that this is what snapshot isolation
> >> >> >> >> is built on.
>
> >> >> >> > It's not an SQL database, I don't even have the notion of
> >> >> >> > "rows", but basically I think your #4 is the same as my #1 or #2.
>
> >> >> >> No, it isn't. #1 requires the logging of additional records that
> >> >> >> may not have been affected by an update. #2 doesn't log the
> >> >> >> entire changed record, but only bits and pieces. I would think
> >> >> >> that limiting the units of change to individual records--entire
> >> >> >> records--would simplify the process of marking and isolating
> >> >> >> units of work while at the same time guaranteeing consistency.
>
> >> >> > I don't think an atomic unit of work is always associated with a
> >> >> > change to an individual record. Are you suggesting transactions to
> >> >> > define arbitrarily large units of work aren't needed?
>
> >> >> No, that's not what I'm suggesting. What I'm suggesting is that the
> >> >> atomic unit of work should be a /set/ of /records/--either the old
> >> >> records in the case of a before image or the new records in the case
> >> >> of an after image.
>
> >> > Ok but that sounds like the system snapshots an entire table in the
> >> > before/after images.
>
> >> > For efficiency one would expect to only store the set of records that
> >> > have been added and the set that have been removed by a given
> >> > transaction. It is easy to see how we get the inverse which is
> >> > required for roll back of an uncommitted transaction during recovery.
> >> > However these logical operations aren't idempotent (at least for bags
> >> > of records). How does recovery deal with non-idempotent redo()/
> >> > undo() changes in the log?
>
> >> Why would there be bags of records? At the physical level, each record
> >> has a specific offset in some file, and that offset would uniquely
> >> identify it. Why would you strip off that identification?
> >> Consequently, there wouldn't be bags of records, only sets of records.
>
> > One could say the log records are representing physical changes if
> > they record the physical locations.
>
> >> If you start out with a set of records and you know what is to be
> >> inserted, updated, and deleted, you can compute the resulting set of
> >> records. If you start out with the resulting set of records, and you
> >> know what was inserted, updated and deleted in order to arrive at that
> >> result, then you can compute the original set of records. For
> >> simplicity, even if it isn't necessarily the most efficient, what is
> >> updated could be implemented as a set of ordered pairs of records,
> >> tying each original record to its replacement. So the log would
> >> consist of a sequence of triples (D, U, I) separated by transaction
> >> markers where D is a set of records that were deleted, U is a set of
> >> pairs of records that were updated, and I is a set of records that
> >> were inserted.
>
> >> Now, provided that the log is written before the database--that is,
> >> (1) write the triple to the log, (2) write the database, (3) write the
> >> transaction marker in the log--it should be possible to determine
> >> whether or not what was written to the log actually made it into the
> >> database, and thus it should be possible to roll back any uncommitted
> >> transaction.
>
> > A modern DBMS using WAL will tend to use a "lazy writer" that writes
> > dirty pages to disk in the background. For performance this will
> > tend to write dirty pages in a physical order so the disk head moves
> > uniformly over the platter. This assumes the only constraint between
> > the writing to the log and the dirty pages is the WAL constraint - i.e.
> > dirty pages must be written to disk strictly after the associated log
> > records have been written to disk. To meet this constraint the lazy
> > writer simply needs to ignore dirty pages that haven't (yet) been
> > logged.
>
> > Your suggestion brings far more onerous constraints on the disk
> > writing policies.
>
> No it doesn't. It only requires that the data be written to disk before the
> transaction marker is written to the log. Whether the disk writes are
> queued up to be written lazily is a separate issue.

When is the transaction deemed to have committed? I was assuming your transaction marker was used to commit the transaction. If not, what is it for? Checkpointing? I think there are better ways to checkpoint than that. AFAIK there is no need to checkpoint *individual* transactions with special markers.
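To make my assumption concrete, here is a rough sketch of the conventional arrangement (all names are invented for illustration; this isn't anyone's actual implementation): a transaction is deemed committed the moment its commit record is durably on the log, i.e. the log has been flushed through that record's LSN, and a checkpoint is just a periodic log record carrying the active-transaction table and the dirty-page table, not a per-transaction marker.

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class LogRecord:
    lsn: int
    txn_id: Optional[int]
    kind: str                      # 'UPDATE', 'COMMIT', 'CHECKPOINT', ...
    payload: dict = field(default_factory=dict)

class Log:
    def __init__(self):
        self.records = []          # in-memory tail of the log
        self.flushed_lsn = -1      # highest LSN known to be on stable storage

    def append(self, txn_id, kind, payload=None):
        rec = LogRecord(len(self.records), txn_id, kind, payload or {})
        self.records.append(rec)
        return rec.lsn

    def flush(self, upto_lsn):
        # stand-in for forcing the log to disk through upto_lsn
        self.flushed_lsn = max(self.flushed_lsn, upto_lsn)

class TxnManager:
    def __init__(self, log):
        self.log = log
        self.active_txns = {}      # txn_id -> last LSN written by that txn
        self.dirty_pages = {}      # page_id -> LSN that first dirtied it

    def begin(self, txn_id):
        self.active_txns[txn_id] = None     # nothing logged yet

    def commit(self, txn_id):
        lsn = self.log.append(txn_id, 'COMMIT')
        self.log.flush(lsn)        # commit point: COMMIT record is durable
        del self.active_txns[txn_id]

    def checkpoint(self):
        # "fuzzy" checkpoint: snapshot the tables, no per-transaction marker
        self.log.append(None, 'CHECKPOINT',
                        {'active': dict(self.active_txns),
                         'dirty': dict(self.dirty_pages)})

On that reading the dirty pages don't need to be on disk at commit time at all; only the log does.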

> > In fact to make a transaction durable it seems
> > necessary to write all the dirty pages to disk as part of the
> > transaction. This is certainly not normally the case in a modern
> > DBMS. A database that avoids a "force" policy requires REDO during
> > recovery.
>
> > Often a DBMS that supports long running transactions will allow dirty
> > pages to be written to disk even though the associated transaction
> > hasn't yet committed. This is the so-called "steal" policy and leads
> > to the need for UNDO during recovery.
>
> The only way this can happen is if what was replaced as a result of the
> writes is cached while the transactions are outstanding. In order to
> improve performance in the presence of long-running transactions, it may be
> necessary to maintain that cache on disk. I should emphasize here that such
> caching is strictly a mechanism for improving performance.

With a general-purpose DBMS using strict 2PL there is inevitably a chance of deadlock and the need to abort transactions. This in turn requires UNDO of all changes made by a particular transaction. You can either record before-images of the pages modified by uncommitted transactions, or else use the log directly to roll back a transaction. The latter is economical given caching of the tail of the log and backward chaining of the log records belonging to a given transaction.
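A rough sketch of the backward-chaining approach, assuming each log record carries a prev_lsn field pointing at the previous record written by the same transaction (the field and record names are invented for illustration):

def rollback(log_records, last_lsn, apply_undo):
    """Undo one transaction by walking its prev_lsn chain from the tail."""
    lsn = last_lsn
    while lsn is not None:
        rec = log_records[lsn]
        if rec['kind'] == 'UPDATE':
            # restore the before-image recorded for this change
            apply_undo(rec['page_id'], rec['before_image'])
        lsn = rec['prev_lsn']      # None terminates the chain

# Example: two updates made by one transaction, chained through prev_lsn.
records = {
    10: {'kind': 'UPDATE', 'prev_lsn': None, 'page_id': 3,
         'before_image': b'old A'},
    17: {'kind': 'UPDATE', 'prev_lsn': 10, 'page_id': 9,
         'before_image': b'old B'},
}
rollback(records, last_lsn=17,
         apply_undo=lambda page_id, image: print('restore', page_id, image))

If the tail of the log is still cached, this touches the disk only for the pages being repaired, which is the economy I had in mind.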

> > AFAIK most databases have very flexible disk writing policies (steal/
> > no-force) and require both UNDO and REDO during recovery. See the
> > ARIES algorithm.
Received on Mon Apr 28 2008 - 08:07:26 CEST
