OMLET: Get Intimate with Oracle Internals

From: <teraknowledgesystems_at_yahoo.com>
Date: 26 Jun 2005 04:34:19 -0700
Message-ID: <1119785659.549647.160540@g44g2000cwa.googlegroups.com>

BLOCK VERSIONS

BLOCK VERSION MEANING

The version of a block for read-consistency purposes is identified by a number (actually a structure) called the "block system commit number" (BSCN). The BSCN for a given block defines the consistent read snapshots that are able to use that block, either as-is (if the block has already been rolled back to some point) or after some rollback.

The version of a block for cache re-read purposes is identified by the block's incarnation number/sequence number combination (SEQ). This value
increases monotonically as the block is changed; each change to the block
results in a new SEQ.

Various BSCN's and BSCN ranges are used to describe the characteristics of a given block version, whether it is on disk or in a buffer cache. These BSCN's are used ONLY for determining the suitability of the block for a given consistent read snapshot, and not (for example) determining whether this is the current version of a block (this is done using SEQ). In general, it is difficult to determine the exact, accurate BSCN's for a given block version. It is particularly difficult because the real values may change over time. For example, if a given block on disk contains all changes made in committed transactions (say, an instance holds a modified copy of the block but none of the changes are committed),
the disk block can be used for any consistent snapshot started after the
last committed transaction. But if one of those uncommitted transactions
commits, the disk block can suddenly only be used for consistent snapshots
started before THAT commit. Keeping track of the REAL BSCN or range for a given block is difficult and not something that the cache can always do.

Therefore, the BSCN's and BSCN ranges maintained are really approximations
of the real ones. The combination of the cache layer and the transaction
layer tries to keep the numbers as accurate as possible, but as long as they err on the correct side, the only problem will be failure to use a block which is really usable.

The SEQ, on the other hand, can be maintained exactly because it is a characteristic of the block, not of the transactions involved with the block.

MAINTAINED VERSIONS

The following BSCN's and SEQ's are maintained:

DISK BLOCKS: For a block on disk, a BSCN and an SEQ are maintained:

disk_high: the highest BSCN such that any snapshot whose SCN is LESS THAN OR EQUAL TO this value can use the block (after rollback). May be infinity if the block contains all committed changes. In the worst case, this value will be set to the current SCN at the time that an instance changes a cache version of the block; however, we can (with some difficulty) do better by (a) deferring modification of the value until the change is committed; or (b) having the transaction layer compute a more accurate value at intervals, and updating it to the higher value. This value is stored in the buffer hash table. It is safe if the value stored is lower than the "real" value.

disk_seq: the SEQ of the block on disk (the stored value could be lower than that of the block actually on disk if the block is being written). This value is always set by the instance that locks the block (perhaps also by those who just read it). Using this value, an instance can determine whether the block it has is as current as the block on disk (by comparing the SEQ's), and using the dirty_in_cache bit in the hash value, it can determine whether there is a more recent version in somebody's cache. Because of a shortage of space in the hash value, we can only store part of the SEQ in disk_seq; refer to HASH VALUES below.
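
The disk_seq comparison described above can be sketched as follows. This is only an illustration: the Seq type, its field order, the lexicographic comparison, and both function names are assumptions, and the sketch ignores the fact that only part of the SEQ fits in the hash value.

```python
from typing import NamedTuple


class Seq(NamedTuple):
    """Block version for cache re-read purposes (layout is an assumption)."""
    incarnation: int
    sequence: int


def as_current_as_disk(block_seq: Seq, disk_seq: Seq) -> bool:
    # Tuples compare lexicographically: incarnation first, then sequence.
    return block_seq >= disk_seq


def may_be_outdated(block_seq: Seq, disk_seq: Seq, dirty_in_cache: bool) -> bool:
    # For an unlocked copy: either the disk holds a newer version, or
    # dirty_in_cache says a more recent version sits in somebody's cache.
    return (not as_current_as_disk(block_seq, disk_seq)) or dirty_in_cache
```

A copy whose SEQ matches disk_seq while dirty_in_cache is clear is as recent as anything known to the hash value; in every other case the instance has to assume a newer version exists somewhere.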

CACHE BLOCKS: Blocks in the cache fall into several categories, which are
somewhat overlapping.

Current block: A current block is a block for which a lock is held, and is
therefore guaranteed to be the most recent copy of the block. A cache may
contain at most one current version of a block.

Stale block: A stale block is a block which at some point was a current block, but which is no longer guaranteed to be current either because the lock was dropped or because the block was read from disk without acquiring the lock. A block in the cache is either current or stale, but
not both. A current block can become stale if its lock is dropped; a stale block can become current if the lock is reacquired AND it can be determined that the block has not changed (by comparing SEQ). A cache may
contain multiple stale versions of a block, in addition to or instead of a current version.

CR (consistent read) block: A CR block is a block version which contains
a consistent version of the block; in other words, it contains only data
that was committed as of some time (plus, possibly, other changes made by the local transaction). A current or stale block may also be a CR block, if the current or stale block required no rollbacks to be consistent. There may also be CR blocks that are neither current nor stale.
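
The current/stale transitions described above can be sketched as a small state model. The Buffer class, its fields, and the function names are hypothetical; the only rule taken from the text is that a stale block becomes current again when the lock is reacquired AND its SEQ shows that the block has not changed.

```python
from dataclasses import dataclass


@dataclass
class Buffer:
    seq: tuple          # (incarnation, sequence) of this cached version
    state: str          # "current" or "stale", never both


def drop_lock(buf: Buffer) -> None:
    # Without the lock the copy is no longer guaranteed to be the latest.
    buf.state = "stale"


def reacquire_lock(buf: Buffer, latest_seq: tuple) -> bool:
    """Call once the lock is held again; latest_seq is the block's latest SEQ.

    Returns True if the stale copy could be promoted back to current.
    """
    if buf.seq == latest_seq:
        buf.state = "current"
        return True
    # The block changed while unlocked; this version stays stale and the
    # caller has to obtain the latest version some other way.
    return False
```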

block_high: the highest BSCN such that any snapshot whose SCN is LESS THAN
OR EQUAL TO this value can use the block (after rollback).

      current blocks: the value is always infinity (the block can be
      used in any snapshot)

      stale blocks: in the worst case, the value is set when the block
      is read to the lower of disk_high and the current SCN value; it
      is set when a lock is dropped to the current SCN value (we can do
      better by (a) resetting the value to disk_high anytime we happen
      to notice that block_seq is equal to disk_seq, or (b) having the
      transaction layer compute a more accurate value at intervals, and
      updating it to the higher value)

      CR blocks: the value is set by the transaction layer when it
      constructs the CR block, or establishes an existing current or
      stale block as a CR block

This value is stored in the cache buffer header. It is safe if the value stored is lower than the "real" value.

block_low (ONLY for CR blocks): the lowest value such that snapshots with a time >= block_low can use this block WITHOUT doing any rollback.

This value is set by the transaction layer when it constructs the CR block, or establishes an existing current or stale block as a CR block. The value is stored in the cache buffer header. It is safe if the value stored is higher than the "real" value.

block_seq: the SEQ value for this block (or, for CR blocks, for the block it was derived from). This value is set from the block's header value when the block is read; for current blocks, it is advanced as the block is changed.

Some other BSCN values may be stored in order to implement the above values.
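
As a rough sketch of how block_high and block_low bound the snapshots a block version can serve, assuming SCNs are plain integers and infinity is modelled with math.inf (the function name and return strings are made up for illustration):

```python
import math

INFINITY = math.inf   # block_high of a current block


def cr_usability(snapshot_scn, block_high, block_low=None):
    """Classify a block version for a snapshot, using the stored bounds."""
    if snapshot_scn > block_high:
        # The block may contain changes the snapshot must not see and
        # that cannot be removed from this version.
        return "unusable"
    if block_low is not None and snapshot_scn >= block_low:
        # Only CR blocks carry block_low.
        return "usable without rollback"
    return "usable, may need rollback"
```

Because block_high is safe when it is lower than the real value, and block_low is safe when it is higher, an approximation can only turn a usable block into "unusable" or force an unnecessary rollback check; it can never let a snapshot see data it should not.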

LOCAL VERSIONS

The above approach is adequate for reading blocks changed by other transactions. However, it does not cover the situation where a user reads data that he himself has changed in the same transaction. We allow users to see changes made in statements that completed before the snapshot time, even if those changes are not committed. Therefore, blocks (notably CR blocks) must contain some information that describes the vintage of the block with respect to the user's statements, in addition to the vintage with respect to other transactions.

This is simpler than dealing with other transactions, for two reasons: First, the changes of interest all have occurred on the same instance as the reader (since the user's own changes can only occur on the instance he's connected to), so there is no need to remember version information for blocks that have been written to disk or are in another instance's cache. Second, a given user's statements proceed sequentially, so expressing the consistent snapshot information is easier than for transactions, which can start, run and commit in parallel. The UBA (undo block address) is used to express a point in time in the user's update stream.

However, there is one thing that makes this more complicated: the version information is local to a given transaction, whereas the version information for different transactions is global to the system. Keeping all of the information around would require keeping a table of transactions associated with each block; each entry would express, for that transaction, the snapshots in which that block is valid.

To make things manageable, we've adopted a compromise. A given block header
can contain version information for at most one transaction. When a transaction modifies a current block, the cache manager finds all other cached block versions for that block, and updates the version information so
that if this transaction needs to read the block, it knows the limits of its
validity. If the version information has already been used by another transaction, then a bit is set to indicate that a second transaction put a
constraint on the use of the block, but the version information is not available; in this case, no other user who might have modified the block
will be allowed to use the snapshot. To repeat: a transaction ID in the buffer header means that changes were made to the current version of this
block in that transaction, and those changes past the corresponding UBA are
missing from this version; the local_private bit in the buffer header means
that changes were made to the current version of this block in some unknown
transaction, and those changes past some unknown point are missing from this
version.

The following local version information is maintained for each block in the cache:

local_trans: the transaction ID for the transaction to which the remaining local_* fields apply; always empty for current blocks, and the transaction that local_high refers to in stale and CR blocks.

local_high: a UBA such that this block version is guaranteed to contain all changes made by local_trans that were made before UBA.

local_private: a bit which, if set, indicates that some transaction other than local_trans modified the current version of the block on this instance, and this version of the block may be missing some of those changes (this bit is never set unless local_trans is filled in)

As with the BSCN values, these items are set in two different ways. First,
the cache can deduce for itself that they should be set. Whenever the current version of a block is changed, all other versions (stale and CR)
must be marked as follows: If local_trans is not set, it should be set to the changing transaction ID, and local_high should be set to the UBA for the change (these must therefore be passed in to the cache change operation). If local_trans is already set and the transaction is the same, and local_high is greater than the UBA for the change, it should be set to the UBA for the change. If local_trans is already set and the transaction is different, local_private should be set.

Second, the transaction layer can set these values when creating a CR version, using the mark_CR(..., CR_trans, CR_uba, CR_private) operation to
indicate that the block is now missing changes. The transaction layer can specify the validity limit for a single transaction, and can also specify (via the boolean CR_private) that there are other (unknown) validity limits for other (unspecified) transactions. The cache layer will combine this information with other information already in the block (for example, if local_trans is already set, then CR_trans is ignored
and local_private is set even if CR_private is not set).
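
The first of the two mechanisms (the cache marking every other version when the current version is changed) can be sketched as below. The LocalVersion class and function name are hypothetical, and UBAs are modelled as integers that grow along the transaction's undo stream, which is an assumption. The mark_CR path is not modelled, since the text gives only one example of how the cache combines its arguments.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LocalVersion:
    local_trans: Optional[int] = None   # transaction ID, None if unset
    local_high: Optional[int] = None    # UBA limit for local_trans
    local_private: bool = False


def note_current_change(other: LocalVersion, trans_id: int, uba: int) -> None:
    """Mark one stale or CR version after the current version was changed."""
    if other.local_trans is None:
        # First transaction to constrain this version.
        other.local_trans = trans_id
        other.local_high = uba
    elif other.local_trans == trans_id:
        # Same transaction: keep the earliest point at which changes
        # started to be missing from this version.
        if other.local_high > uba:
            other.local_high = uba
    else:
        # A second transaction: no room for its limit, so only flag it.
        other.local_private = True
```

The cache change operation would call this once for every non-current version of the block, passing the transaction ID and UBA of the change.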

The cache decides whether a given block can be used for a given CR request as follows:

snapshot        !local_trans     local_trans &&           local_private
                                 !local_private
--------        ------------     --------------           -------------
no trans.       use if SCN ok    use if SCN ok            use if SCN ok

trans. =        use if SCN ok    use if SCN               use if SCN
local_trans                      and UBA ok               and UBA ok

trans. !=       use if SCN ok    use if SCN ok            DO NOT USE
local_trans

"no trans" means that no changes had been made in the current transaction
when the snapshot time occurred; "trans" means that changes had been made
at that time, and the transaction is the transaction that was current at
that time. "use if XXX ok" means that the block can be used if the block_high
(for SCN) and/or local_high (for UBA) is not violated by the snapshot
time.
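
One reading of the table above, written as a single predicate. The function and parameter names are invented for this sketch, SCNs and UBAs are treated as comparable integers, and "UBA ok" is taken to mean that the snapshot's UBA does not exceed local_high; the boundary case is an assumption.

```python
def can_use_for_cr(snap_scn, snap_trans, snap_uba,
                   block_high, local_trans, local_high, local_private):
    """Return True if the block version may serve the CR request.

    snap_trans is None when no changes had been made in the current
    transaction at snapshot time ("no trans." in the table).
    """
    scn_ok = snap_scn <= block_high
    if snap_trans is None:
        return scn_ok                       # first row
    if local_trans is None:
        return scn_ok                       # !local_trans column
    if snap_trans == local_trans:
        # Our own limits are recorded, with or without local_private.
        return scn_ok and snap_uba <= local_high
    # Some other transaction owns the recorded limits. If local_private
    # is set, we may be the unknown transaction whose changes are missing.
    return scn_ok and not local_private
```

The only cell that rejects a block outright is the last one: a transaction that cannot find its own limit in a block marked local_private has no way to prove the block contains its earlier changes.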

HASH VALUES

As mentioned before, the SEQ value for a block is stored in the hash value. Unfortunately, there is not enough room in the hash value for all the stuff we need:

   DBA              4 (block this hash entry refers to)
   dirty_in_cache   1 (redo thread number)
   disk_high        6 (SCN)
   disk_seq         8 (SEQ, 4 incarnation + 4 sequence)
                   --
                   19

In VMS, only sixteen bytes are available. Therefore, we need to make some
compromises. Other than increasing the hash value size by using two locks
for each buffer, which is unattractive because of the extra locking overhead,
there seem to be two solutions:

If we could guarantee that no cache would ever contain blocks with an incarnation number other than the current one, we could eliminate the incarnation number from the hash value, which gets us four bytes, one more than the three we need. This could be guaranteed if we provide some sort of segment locking mechanism that will cause blocks in every instance's cache to be flushed (writing if necessary) whenever a segment is dropped (or wrapped, in the case of the before image).

Otherwise, we must shave bytes here and there.

disk_high: We can remove two or three bytes here, and assume that it is within 2^32 or 2^24 of the current system commit number. This is a pretty
good assumption, since disk_high is set to infinity when a block is written,
and blocks are written at each checkpoint. As long as we set the checkpoint
frequency to be faster than the trimmed SCN wrap frequency, there will never
be any error, because a block has a non-infinity SCN only while it is dirty
in some cache.

incarnation: We can remove one byte here, and assume that a cache will never hold a block version with an incarnation number 2^24 away from the
block's current incarnation number. This is a pretty good assumption, though
if we were extraordinarily unlucky the maximizing process of combining freed
extents could cause the incarnation number to jump exactly that amount all
at once.

sequence: We can remove one byte here, and assume that a cache will never hold a block version with a sequence number 2^24 away from the block's
current sequence number. This is a good assumption; a cache would have to
hold a block, without flushing it, for the time it takes the other instances
to make exactly 2^24 changes to the block (this would require pathological
activity on the block, and pathological dormancy in the instance whose cache contains the old version).

So, we could recover a total of four or five bytes if we had to. Since we
only need three, we should probably leave the incarnation/sequence number
alone and cut disk_high to three.
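
The byte budget behind this conclusion can be checked with a few lines of arithmetic. The field sizes and the sixteen-byte limit come from the text above; the dictionary and the helper name are only for illustration.

```python
# Field sizes in bytes, copied from the hash value layout above.
FIELDS = {"DBA": 4, "dirty_in_cache": 1, "disk_high": 6, "disk_seq": 8}
AVAILABLE = 16   # bytes available in the VMS hash value


def total_after(cuts):
    """Total size once `cuts` (field name -> bytes removed) are applied."""
    return sum(size - cuts.get(name, 0) for name, size in FIELDS.items())


shortfall = total_after({}) - AVAILABLE                 # 19 - 16
drop_incarnation = total_after({"disk_seq": 4})         # first solution
trim_disk_high = total_after({"disk_high": 3})          # chosen compromise
trim_everything = total_after({"disk_high": 3, "disk_seq": 2})
```

Cutting disk_high to three bytes lands exactly on the sixteen-byte limit, while dropping the incarnation number or trimming every field leaves spare room.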

OTHER CACHING NOTES

We must write blocks before making the corresponding hash table BSCN changes, and must read the hash table BSCN values before reading the corresponding blocks. This guarantees that if a block is read out of sync with the BSCN values, the block is newer and is therefore usable for older snapshots.

The transformation among CR, stale and current versions is a matter of strategy. For example, when a current block is changed, and it has been accessed recently as a CR version, it is probably a good idea to spin off the CR version before making the change. Some hints from higher layers will probably be needed to help make these decisions.

When reading a block in CR mode, the locking of the block is a matter of strategy. Locking the block in share mode allows continued use of the block in subsequent snapshots without rechecking the BSCN (since it is guaranteed current), at least until the lock is dropped, but requires more work at read time. Not locking it causes less work at read time, but the validity of the block is limited and it must be revalidated (sometimes unsuccessfully) by looking at the hash table value.

Since circular arithmetic is used, a bit, rather than a special value, is needed to represent infinity. We can use dirty_in_cache == NONE to mean "infinity" as long as we don't intend to defer changing disk_high until commit.

None of the hash values, including dirty_in_cache and disk_high, need be updated until the redo log corresponding to the block change is written. Perhaps we could just use the SCN at that time, and keep a list of un-logged blocks that gets emptied (and hash values forced out) whenever some redo is forced. Dealing with blocks that have redo spread over several blocks may be complicated, but a significant advantage is that we may tend to deal with the hash values in batches, which will work much better in disk implementations of the lock manager.
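
The trimmed disk_high from HASH VALUES and the circular arithmetic mentioned in these notes can be sketched together. This assumes a three-byte disk_high, SCNs as plain integers, a stored value that is never ahead of the current SCN and never more than 2^24 behind it, and a separate flag for infinity; all function names are hypothetical.

```python
BITS = 24                    # disk_high trimmed to three bytes
MODULUS = 1 << BITS


def trim(scn: int) -> int:
    """Keep only the low-order bits that fit in the hash value."""
    return scn % MODULUS


def expand(trimmed: int, current_scn: int) -> int:
    """Recover the full SCN, assuming it lies in (current - 2^24, current]."""
    behind = (current_scn - trimmed) % MODULUS
    return current_scn - behind


def circular_le(a: int, b: int) -> bool:
    """a <= b for trimmed values that are less than half the modulus apart."""
    return (b - a) % MODULUS < MODULUS // 2


def snapshot_can_use(snapshot_scn: int, disk_high_trimmed: int,
                     is_infinity: bool) -> bool:
    # Infinity cannot be a reserved value in circular arithmetic, so it
    # travels as a separate flag.
    if is_infinity:
        return True
    return circular_le(trim(snapshot_scn), disk_high_trimmed)
```

The comparison stays correct across a wrap of the trimmed value only while the two SCNs are close together, which is why the checkpoint frequency has to stay ahead of the trimmed SCN wrap frequency.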

Received on Sun Jun 26 2005 - 06:34:19 CDT