Re: implementing a database log
Date: Mon, 28 Apr 2008 19:49:41 -0700 (PDT)
Message-ID: <e550a0db-d3cc-4fc1-992b-1ccac7f08b32_at_l64g2000hse.googlegroups.com>
On Apr 28, 9:25 pm, "Brian Selzer" <br..._at_selzer-software.com> wrote:
> "David BL" <davi..._at_iinet.net.au> wrote in message
>
> news:ae011bd0-63c3-4373-ba35-2fe9fe6be331_at_26g2000hsk.googlegroups.com...
>
> > On Apr 28, 12:25 pm, "Brian Selzer" <br..._at_selzer-software.com> wrote:
> >> "David BL" <davi..._at_iinet.net.au> wrote in message
>
> >>news:289d3678-7a00-463e-aa99-33c7944ce68b_at_27g2000hsf.googlegroups.com...
>
> >> > On Apr 24, 10:11 pm, "Brian Selzer" <br..._at_selzer-software.com> wrote:
> >> >> "David BL" <davi..._at_iinet.net.au> wrote in message
>
> >> >>news:fd63f466-18f7-4986-b378-f5e9f512bbd8_at_r66g2000hsg.googlegroups.com...
>
> >> >> > On Apr 22, 11:39 pm, "Brian Selzer" <br..._at_selzer-software.com>
> >> >> > wrote:
> >> >> >> "David BL" <davi..._at_iinet.net.au> wrote in message
>
> >> >> >>news:265621e9-25fd-46d5-9e2d-7a4f63fa84b4_at_m3g2000hsc.googlegroups.com...
>
> >> >> >> > On Apr 22, 6:58 am, "Brian Selzer" <br..._at_selzer-software.com>
> >> >> >> > wrote:
> >> >> >> >> "Christoph Rupp" <cruppst..._at_gmail.com> wrote in message
>
> >> >> >> >>news:91140c56-1f05-4b5d-b45f-b34920db2051_at_x41g2000hsb.googlegroups.com...
>
> >> >> >> >> > Brian,
>
> >> >> >> >> > On Apr 21, 11:00 pm, "Brian Selzer"
> >> >> >> >> > <br..._at_selzer-software.com>
> >> >> >> >> > wrote:
> >> >> >> >> >> Why not go with #4:
>
> >> >> >> >> >> 4. a physical log based on modified rows. Whenever a row
> >> >> >> >> >> is modified, added or removed, it is logged. Then you could
> >> >> >> >> >> also implement row versioning--just add a row version field
> >> >> >> >> >> to the physical rows. I believe that this is what snapshot
> >> >> >> >> >> isolation is built on.
>
> >> >> >> >> > It's not an SQL database, I don't even have the notion of
> >> >> >> >> > "rows", but basically I think your #4 is the same as my #1
> >> >> >> >> > or #2.
>
> >> >> >> >> No, it isn't. #1 requires the logging of additional records
> >> >> >> >> that may not have been affected by an update. #2 doesn't log
> >> >> >> >> the entire changed record, but only bits and pieces. I would
> >> >> >> >> think that limiting the units of change to individual
> >> >> >> >> records--entire records--would simplify the process of marking
> >> >> >> >> and isolating units of work while at the same time guaranteeing
> >> >> >> >> consistency.
>
> >> >> >> > I don't think an atomic unit of work is always associated with
> >> >> >> > a change to an individual record. Are you suggesting
> >> >> >> > transactions to define arbitrarily large units of work aren't
> >> >> >> > needed?
>
> >> >> >> No, that's not what I'm suggesting. What I'm suggesting is that
> >> >> >> the atomic unit of work should be a /set/ of /records/--either
> >> >> >> the old records in the case of a before image or the new records
> >> >> >> in the case of an after image.
>
> >> >> > Ok but that sounds like the system snapshots an entire table in the
> >> >> > before/after images.
>
> >> >> > For efficiency one would expect to only store the set of records
> >> >> > that have been added and the set that have been removed by a given
> >> >> > transaction. It is easy to see how we get the inverse which is
> >> >> > required for roll back of an uncommitted transaction during
> >> >> > recovery. However these logical operations aren't idempotent (at
> >> >> > least for bags of records). How does recovery deal with
> >> >> > non-idempotent redo()/undo() changes in the log?
>
> >> >> Why would there be bags of records? At the physical level, each
> >> >> record has a specific offset in some file, and that offset would
> >> >> uniquely identify it. Why would you strip off that identification?
> >> >> Consequently, there wouldn't be bags of records, only sets of
> >> >> records.
>
> >> > One could say the log records represent physical changes if they
> >> > record the physical locations.
>
> >> >> If you start out with a set of records and you know what is to be
> >> >> inserted, updated, and deleted, you can compute the resulting set of
> >> >> records. If you start out with the resulting set of records, and you
> >> >> know what was inserted, updated and deleted in order to arrive at
> >> >> that result, then you can compute the original set of records. For
> >> >> simplicity, even if it isn't necessarily the most efficient, what is
> >> >> updated could be implemented as a set of ordered pairs of records,
> >> >> tying each original record to its replacement. So the log would
> >> >> consist of a sequence of triples (D, U, I) separated by transaction
> >> >> markers where D is a set of records that were deleted, U is a set of
> >> >> pairs of records that were updated, and I is a set of records that
> >> >> were inserted.
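
Just to check I follow, here is roughly how I picture one of those triples
and its inverse - a Python sketch of my own, not a claim about your design,
with records keyed by (file, offset) as you suggest above:

    from dataclasses import dataclass

    # Sketch: a log entry as the (D, U, I) triple, with records keyed by
    # their physical identity (file_id, offset), so D/U/I are sets not bags.

    @dataclass
    class Triple:
        deleted: dict    # rid -> old record bytes
        updated: dict    # rid -> (old bytes, new bytes)
        inserted: dict   # rid -> new record bytes

        def apply(self, table):
            """Redo: take the old set of records to the new set."""
            for rid in self.deleted:
                del table[rid]
            for rid, (_, new) in self.updated.items():
                table[rid] = new
            for rid, new in self.inserted.items():
                table[rid] = new

        def undo(self, table):
            """The inverse, computable because the old values are logged."""
            for rid, old in self.deleted.items():
                table[rid] = old
            for rid, (old, _) in self.updated.items():
                table[rid] = old
            for rid in self.inserted:
                del table[rid]

Note that apply() as written isn't idempotent - replay it a second time and
the deletes and inserts blow up - which is the redo/undo question I raised
earlier.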
>
> >> >> Now, provided that the log is written before the database--that is,
> >> >> (1) write the triple to the log, (2) write the database, (3) write
> >> >> the transaction marker in the log--, it should be possible to
> >> >> determine whether or not what was written to the log actually made
> >> >> it into the database, and thus it should be possible to roll back
> >> >> any uncommitted transaction.
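
If I've read the ordering right, recovery only needs to look for triples
that have no following marker - a toy sketch (the in-memory list stands in
for a log that is forced to disk after each append; the names are mine):

    def commit(log, table, triple, txn_id):
        # (1) write the triple to the log, (2) write the database,
        # (3) write the transaction marker to the log
        log.append(("TRIPLE", txn_id, triple))
        triple.apply(table)
        log.append(("MARKER", txn_id))

    def unfinished(log):
        """Transactions whose triple has no following marker: their changes
        may or may not have reached the database, so they must be rolled
        back using the old values carried in the triple."""
        triples = {txn for kind, txn, *_ in log if kind == "TRIPLE"}
        markers = {txn for kind, txn, *_ in log if kind == "MARKER"}
        return triples - markers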
>
> >> > A modern DBMS using WAL will tend to use a "lazy writer" that writes
> >> > dirty pages to disk in the background. For performance this will
> >> > tend to write dirty pages in a physical order so the disk head moves
> >> > uniformly over the platter. This assumes the only constraint between
> >> > the writing to the log and the dirty pages is the WAL constraint - ie
> >> > dirty pages must be written to disk strictly after the associated log
> >> > records have been written to disk. To meet this constraint the lazy
> >> > writer simply needs to ignore dirty pages that haven't (yet) been
> >> > logged.
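
ie roughly the following, with pageLSN/flushedLSN bookkeeping of the kind
ARIES-style systems use (only a sketch; the names are mine):

    from dataclasses import dataclass

    @dataclass
    class Page:
        page_no: int
        page_lsn: int     # LSN of the last log record that touched the page
        dirty: bool = False

    def lazy_writer_pass(buffer_pool, flushed_lsn, write_page):
        """Sweep dirty pages in physical order, skipping any whose latest
        change hasn't yet reached the log on disk (the WAL constraint)."""
        for page in sorted(buffer_pool, key=lambda p: p.page_no):
            if page.dirty and page.page_lsn <= flushed_lsn:
                write_page(page)     # safe: its log records are already durable
                page.dirty = False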
>
> >> > Your suggestion brings far more onerous constraints on the disk
> >> > writing policies.
>
> >> No it doesn't. It only requires that the data be written to disk
> >> before the transaction marker is written to the log. Whether the disk
> >> writes are queued up to be written lazily is a separate issue.
>
> > When is the transaction deemed to have committed? I was assuming your
> > transaction marker was used to commit the transaction. If not, what is
> > it for? Checkpointing? I think there are better ways to checkpoint
> > than that. AFAIK there is no need to checkpoint *individual*
> > transactions with special markers.
>
> The transaction marker is used to indicate that the transaction is complete.
> What that means is that the changes that are called for in the triple have
> been written to the database. It's not enough to simply write the log and
> then write the database, because without the marker following the triple in
> the log, there is no way to be sure that the changes called for in the
> triple actually made it into the database.
Ok so it relates to checkpointing. I don't see how these markers are
particularly useful - ie for determining where to start replaying the log
during recovery. For example, when ARIES performs a checkpoint it records
the table of active transactions and the dirty page table, where each dirty
page carries a recLSN - the LSN of the oldest log record whose change may
not yet have reached that page on disk. This allows the recovery scan to
determine where to start replaying the log (at the minimum recLSN in that
table) without any per-transaction markers.
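
Roughly the following - a sketch of the idea rather than ARIES verbatim,
and the log interface names are invented:

    def redo(log, fetch_page):
        """Start from the minimum recLSN in the checkpoint's dirty page
        table; replay is made idempotent by comparing LSNs."""
        dpt = log.checkpoint_dirty_page_table      # page_no -> recLSN
        if not dpt:
            return                                 # nothing can be missing from disk
        for rec in log.records_from(min(dpt.values())):
            page = fetch_page(rec.page_no)
            if page.page_lsn < rec.lsn:            # change not yet on the page
                rec.apply(page)
                page.page_lsn = rec.lsn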
> >> > In fact to make a transaction durable it seems
> >> > necessary to write all the dirty pages to disk as part of the
> >> > transaction. This is certainly not normally the case in a modern
> >> > DBMS. A database that avoids a "force" policy requires REDO during
> >> > recovery.
>
> >> > Often a DBMS that supports long running transactions will allow dirty
> >> > pages to be written to disk even though the associated transaction
> >> > hasn't yet committed. This is the so called "steal" policy and leads
> >> > to the need for UNDO during recovery.
>
> >> The only way this can happen is if what was replaced as a result of
> >> the writes is cached while the transactions are outstanding. In order
> >> to improve performance in the presence of long-running transactions,
> >> it may be necessary to maintain that cache on disk. I should emphasize
> >> here that such caching is strictly a mechanism for improving
> >> performance.
>
> > With a general purpose DBMS using strict 2PL there is inevitably a
> > chance of dead-lock and the need to abort transactions. This in turn
> > requires UNDO of all changes made by a particular transaction. You
> > can either record before images of pages of uncommitted transactions,
> > or else use the log directly to roll back a transaction. The latter
> > is economical with caching of the tail of the log and backward
> > chaining of log records for a given transaction.
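
Something along these lines, using prevLSN back-chaining and compensation
records in the style of ARIES (just a sketch, names invented):

    def rollback(txn, log, fetch_page):
        """Undo one transaction by walking its backward chain of log
        records, starting from the transaction's most recent record."""
        lsn = txn.last_lsn
        while lsn is not None:
            rec = log.read(lsn)
            page = fetch_page(rec.page_no)
            rec.undo(page)           # restore the before image
            log.append_clr(rec)      # compensation record, so a crash
                                     # mid-rollback doesn't undo twice
            lsn = rec.prev_lsn       # next older record of this transaction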
>
> Ideally, the dbms engine would serialize physical transactions. The before
> images of uncommitted transactions could be read by other transactions
> instead of the database during writes to the database. Since what is
> written to the log is just the changes required by a particular transaction,
> it should be possible to cache the changes required by other transactions
> while each is committed in turn. Note that by 'serialize physical
> transactions' I'm not referring to transaction isolation levels, which
> involve how information is read, but instead how information is written.
Is this for MVCC?
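ie readers picking the newest committed version that is no later than their
snapshot, rather than blocking on writers - something like this (only a
sketch of what I mean by MVCC):

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Version:
        data: bytes
        commit_ts: Optional[int]   # None while the writer is uncommitted

    def visible_version(versions, snapshot_ts):
        """Newest version committed at or before the reader's snapshot;
        uncommitted versions (and hence writers) never block the read."""
        committed = [v for v in versions
                     if v.commit_ts is not None and v.commit_ts <= snapshot_ts]
        return max(committed, key=lambda v: v.commit_ts, default=None)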