OMLET: Get Intimate with Oracle Internals
BLOCK VERSIONS
BLOCK VERSION MEANING
The version of a block for read-consistency purposes is identified by a
number (actually a structure) called the "block system commit number"
(BSCN). The BSCN for a given block defines the consistent read snapshots
that are able to use that block, either as-is (if the block has already
been rolled back to some point) or after some rollback.
The version of a block for cache re-read purposes is identified by the
block's incarnation number/sequence number combination (SEQ). This value
increases monotonically as the block is changed; each change to the
block results in a new SEQ.
Various BSCN's and BSCN ranges are used to describe the characteristics
of a given block version, whether it is on disk or in a buffer cache.
These BSCN's are used ONLY for determining the suitability of the block
for a given consistent read snapshot, and not (for example) determining
whether this is the current version of a block (this is done using
SEQ).
In general, it is difficult to determine the exact, accurate BSCN's for
a given block version. It is particularly difficult because the real
values may change over time. For example, if a given block on disk
contains all changes made in committed transactions (say, an instance
holds a modified copy of the block but none of the changes are
committed), the disk block can be used for any consistent snapshot
started after the last committed transaction. But if one of those
uncommitted transactions commits, the disk block can suddenly be used
only for consistent snapshots started before THAT commit. Keeping track
of the REAL BSCN or range for a given block is difficult and not
something that the cache can always do.
Therefore, the BSCN's and BSCN ranges maintained are really
approximations of the real ones. The combination of the cache layer and
the transaction layer tries to keep the numbers as accurate as possible,
but as long as they err on the correct side, the only problem will be
failure to use a block which is really usable.
The SEQ, on the other hand, can be maintained exactly because it is a characteristic of the block, not of the transactions involved with the block.
MAINTAINED VERSIONS
The following BSCN's and SEQ's are maintained:
DISK BLOCKS: For a block on disk, a BSCN and an SEQ are maintained:
disk_high: the highest BSCN such that any snapshot whose SCN is LESS
THAN OR EQUAL TO this value can use the block (after rollback). May be
infinity if the block contains all committed changes. In the worst case,
this value will be set to the current SCN at the time that an instance
changes a cache version of the block; however, we can (with some
difficulty) do better by (a) deferring modification of the value until
the change is committed; or (b) having the transaction layer compute a
more accurate value at intervals, and updating it to the higher value.
This value is stored in the buffer hash table. It is safe if the value
stored is lower than the "real" value.
disk_seq: the SEQ of the block on disk (the stored value could be lower
than that of the block actually on disk if the block is being written).
This value is always set by the instance that locks the block (perhaps
also by those who just read it). Using this value, an instance can
determine whether the block it has is as current as the block on disk
(by comparing the SEQ's), and using the dirty_in_cache bit in the hash
value, it can determine whether there is a more recent version in
somebody's cache. Because of a shortage of space in the hash value, we
can only store part of the SEQ in disk_seq; refer to HASH VALUES below.
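
The disk_seq comparison described above might look roughly like the
following sketch. The names HashValue and check_against_disk are
hypothetical, and modelling SEQ as an (incarnation, sequence) tuple
compared incarnation-first is an assumption, not something the notes
specify.

```python
from dataclasses import dataclass
from typing import Tuple

# Assumption: SEQ is modelled as (incarnation, sequence) and ordered
# lexicographically, incarnation first.
Seq = Tuple[int, int]


@dataclass
class HashValue:
    """Hypothetical stand-in for the per-block hash value."""
    disk_seq: Seq          # SEQ of the block version on disk
    dirty_in_cache: bool   # some cache may hold a newer, unwritten version


def check_against_disk(block_seq: Seq, hv: HashValue) -> Tuple[bool, bool]:
    """Return (as_current_as_disk, newer_copy_may_be_cached)."""
    as_current_as_disk = block_seq >= hv.disk_seq
    newer_copy_may_be_cached = hv.dirty_in_cache
    return as_current_as_disk, newer_copy_may_be_cached
```

A copy is only known to be the latest version when the first result is
true and the second is false.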
CACHE BLOCKS: Blocks in the cache fall into several categories, which
are somewhat overlapping.
Current block: A current block is a block for which a lock is held, and
is therefore guaranteed to be the most recent copy of the block. A cache
may contain at most one current version of a block.
Stale block: A stale block is a block which at some point was a current
block, but which is no longer guaranteed to be current either because
the lock was dropped or because the block was read from disk without
acquiring the lock. A block in the cache is either current or stale, but
not both. A current block can become stale if its lock is dropped; a
stale block can become current if the lock is reacquired AND it can be
determined that the block has not changed (by comparing SEQ). A cache
may contain multiple stale versions of a block, in addition to or
instead of a current version.
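
As a rough illustration of the current/stale transitions just described,
here is a minimal sketch. CachedBlock, drop_lock and reacquire_lock are
hypothetical names, SEQ is simplified to a single integer, and
latest_seq stands for whatever SEQ the instance learns for the most
recent version when it takes the lock.

```python
from dataclasses import dataclass


@dataclass
class CachedBlock:
    """Hypothetical cache buffer holding one version of a block."""
    block_seq: int          # SEQ of this version (simplified to an int)
    state: str = "stale"    # either "current" or "stale", never both


def drop_lock(buf: CachedBlock) -> None:
    # Without the lock the copy is no longer guaranteed to be the latest.
    buf.state = "stale"


def reacquire_lock(buf: CachedBlock, latest_seq: int) -> bool:
    """Promote a stale copy to current only if the block has not changed."""
    if buf.block_seq != latest_seq:
        return False        # block changed elsewhere; this copy stays stale
    buf.state = "current"
    return True
```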
CR (consistent read) block: A CR block is a block version which contains
a consistent version of the block; in other words, it contains only data
that was committed as of some time (plus, possibly, other changes made
by the local transaction). A current or stale block may also be a CR
block, if the current or stale block required no rollbacks to be
consistent. There may also be CR blocks that are neither current nor
stale.
block_high: the highest BSCN such that any snapshot whose SCN is LESS
THAN OR EQUAL TO this value can use the block (after rollback).
  current blocks: the value is always infinity (the block can be used
  in any snapshot)
  stale blocks: in the worst case, the value is set when the block is
  read to the lower of disk_high and the current SCN value; it is set
  when a lock is dropped to the current SCN value (we can do better by
  (a) resetting the value to disk_high anytime we happen to notice that
  block_seq is equal to disk_seq, or (b) having the transaction layer
  compute a more accurate value at intervals, and updating it to the
  higher value)
  CR blocks: the value is set by the transaction layer when it
  constructs the CR block, or establishes an existing current or stale
  block as a CR block
This value is stored in the cache buffer header. It is safe if the value stored is lower than the "real" value.
block_low (ONLY for CR blocks): the lowest value such that snapshots with a time >= block_low can use this block WITHOUT doing any rollback.
This value is set by the transaction layer when it constructs the CR block, or establishes an existing current or stale block as a CR block. The value is stored in the cache buffer header. It is safe if the value stored is higher than the "real" value.
block_seq: the SEQ value for this block (or, for CR blocks, for the block it was derived from). This value is set from the block's header
value when the block is read; for current blocks, it is advanced as the block is changed.
Some other BSCN values may be stored in order to implement the above values.
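
To make the roles of block_high and block_low concrete, here is a small
sketch of how a snapshot SCN could be tested against them. The function
name and the three result strings are hypothetical, and SCNs are
simplified to plain numbers with math.inf standing in for "infinity".

```python
import math
from typing import Optional


def cr_usability(snapshot_scn: float,
                 block_high: float,
                 block_low: Optional[float] = None) -> str:
    """Classify a cached version for a snapshot (illustrative only).

    block_low is None for versions that are not CR blocks.
    """
    if snapshot_scn > block_high:
        return "unusable"
    if block_low is not None and snapshot_scn >= block_low:
        return "usable as-is"
    return "usable after rollback"


# A current block has block_high = infinity, so any snapshot may use it.
print(cr_usability(500, math.inf))
# A CR block valid without rollback for snapshots in [100, 200].
print(cr_usability(150, block_high=200, block_low=100))
```

Storing a block_high that is lower than the real value, or a block_low
that is higher, can only move an answer toward "unusable" or "usable
after rollback", which is why those directions are the safe ones.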
LOCAL VERSIONS
The above approach is adequate for reading blocks changed by other transactions. However, it does not cover the situation where a user reads data that he himself has changed in the same transaction. We allow users to see changes made in statements that completed before the snapshot time, even if those changes are not committed. Therefore, blocks (notably CR blocks) must contain some information that describes the vintage of the block with respect to the user's statements, in addition to the vintage with respect to other transactions.
This is simpler than dealing with other transactions, for two reasons:
First, the changes of interest all have occurred on the same instance
as the reader (since the user's own changes can only occur on the
instance he's connected to), so there is no need to remember version
information for blocks that have been written to disk or are in another
instance's cache. Second, a given user's statements proceed
sequentially, so expressing the consistent snapshot information is
easier than for transactions, which can start, run and commit in
parallel. The UBA (undo block address) is used to express a point in
time in the user's update stream.
However, there is one thing that makes this more complicated: this version information is local to a given transaction, whereas the version information for different transactions is global to the system. Keeping all of the information around would require keeping a table of transactions associated with each block; each entry would express, for that transaction, the snapshots in which that block is valid.
To make things manageable, we've adopted a compromise. A given block
header can contain version information for at most one transaction. When
a transaction modifies a current block, the cache manager finds all
other cached block versions for that block, and updates the version
information so that if this transaction needs to read the block, it
knows the limits of its validity. If the version information has already
been used by another transaction, then a bit is set to indicate that a
second transaction put a constraint on the use of the block, but the
version information is not available; in this case, no other user who
might have modified the block will be allowed to use the snapshot.
To repeat: a transaction ID in the buffer header means that changes were
made to the current version of this block in that transaction, and those
changes past the corresponding UBA are missing from this version; the
local_private bit in the buffer header means that changes were made to
the current version of this block in some unknown transaction, and those
changes past some unknown point are missing from this version.
The following local version information is maintained for each block in the cache:
local_trans: the transaction ID for the transaction to which the remaining local_* fields apply; always empty for current blocks, and the transaction that local_high refers to in stale and CR blocks.
local_high: a UBA such that this block version is guaranteed to contain all changes made by local_trans that were made before that UBA.
local_private: a bit which, if set, indicates that some transaction other than local_trans modified the current version of the block on this instance, and this version of the block may be missing some of those changes (this bit is never set unless local_trans is filled in).
As with the BSCN values, these items are set in two different ways.
First, the cache can deduce for itself that they should be set. Whenever
the current version of a block is changed, all other versions (stale and
CR) must be marked as follows: If local_trans is not set, it should be
set to the changing transaction ID, and local_high should be set to the
UBA for the change (these must therefore be passed in to the cache
change operation). If local_trans is already set and the transaction is
the same, and local_high is greater than the UBA for the change, it
should be set to the UBA for the change. If local_trans is already set
and the transaction is different, local_private should be set.
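
The three marking rules above can be written down as a short sketch.
LocalInfo and note_change are hypothetical names, and the UBA is
simplified to an integer that grows as the transaction makes changes,
which is an assumption about ordering rather than the real UBA format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LocalInfo:
    """Hypothetical local_* fields of one stale or CR buffer header."""
    local_trans: Optional[str] = None
    local_high: Optional[int] = None
    local_private: bool = False


def note_change(version: LocalInfo, trans: str, uba: int) -> None:
    """Mark a non-current version after the current version is changed."""
    if version.local_trans is None:
        # First constraint on this version: remember who and where.
        version.local_trans = trans
        version.local_high = uba
    elif version.local_trans == trans:
        # Same transaction: keep the earliest UBA seen.
        if version.local_high > uba:
            version.local_high = uba
    else:
        # A second transaction: only the fact of a constraint is recorded.
        version.local_private = True
```

The function would be applied to every stale and CR version of the
block, never to the current version itself.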
Second, the transaction layer can set these values when creating a CR
version, using the mark_CR(..., CR_trans, CR_uba, CR_private) operation
to indicate that the block is now missing changes. The transaction layer
can specify the validity limit for a single transaction, and can also
specify (via the boolean CR_private) that there are other (unknown)
validity limits for other (unspecified) transactions. The cache layer
will combine this information with other information already in the
block (for example, if local_trans is already set, then CR_trans is
ignored and local_private is set even if CR_private is not set).
The cache decides whether a given block can be used for a given CR request as follows:
snapshot               !local_trans   local_trans &&          local_private
                                      !local_private
---------------------  -------------  ----------------------  ----------------------
no trans.              use if SCN ok  use if SCN ok           use if SCN ok
trans. = local_trans   (n/a)          use if SCN and UBA ok   use if SCN and UBA ok
trans. != local_trans  use if SCN ok  use if SCN ok           DO NOT USE

"no trans" means that no changes had been made in the current
transaction when the snapshot time occurred; "trans" means that changes
had been made at that time, and the transaction is the transaction that
was current at that time. "use if XXX ok" means that the block can be
used if the block_high (for SCN) and/or local_high (for UBA) is not
violated by the snapshot time.
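
Read as code, the decision table might look like the following sketch.
The function name and parameter names are hypothetical, SCNs and UBAs
are simplified to numbers where larger means later, and "UBA ok" is
taken to mean that the snapshot's UBA does not exceed local_high.

```python
from typing import Optional


def can_use_for_cr(snapshot_scn: float,
                   snapshot_trans: Optional[str],
                   snapshot_uba: Optional[int],
                   block_high: float,
                   local_trans: Optional[str],
                   local_high: Optional[int],
                   local_private: bool) -> bool:
    """Decide whether a cached version may serve a CR request."""
    # "SCN ok" is required in every cell of the table.
    if snapshot_scn > block_high:
        return False
    # Row "no trans." and column "!local_trans": no local constraint.
    if snapshot_trans is None or local_trans is None:
        return True
    # Row "trans. = local_trans": the UBA limit is known, so check it.
    if snapshot_trans == local_trans:
        return snapshot_uba <= local_high
    # Row "trans. != local_trans": usable unless an unknown limit exists.
    return not local_private
```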
HASH VALUES
As mentioned before, the SEQ value for a block is stored in the hash
value. Unfortunately, there is not enough room in the hash value for all
the stuff we need:
  DBA             4  (block this hash entry refers to)
  dirty_in_cache  1  (redo thread number)
  disk_high       6  (SCN)
  disk_seq        8  (SEQ, 4 incarnation + 4 sequence)
                 --
                 19
In VMS, only sixteen bytes are available. Therefore, we need to make
some compromises. Other than increasing the hash value size by using two
locks for each buffer, which is unattractive because of the extra
locking overhead, there seem to be two solutions: trimming disk_high, or
trimming the incarnation and sequence parts of disk_seq:
disk_high: We can remove two or three bytes here, and assume that it is
within 2^32 or 2^24 of the current system commit number. This is a
pretty good assumption, since disk_high is set to infinity when a block
is written, and blocks are written at each checkpoint. As long as we set
the checkpoint frequency to be faster than the trimmed SCN wrap
frequency, there will never be any error, because a block has a
non-infinity SCN only while it is dirty in some cache.
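
One common way to recover a full value from its trimmed low-order bytes
is sketched below. This is not taken from the notes: the function name
is hypothetical, and it assumes the full disk_high is never greater than
the current SCN and lies less than 2^24 (for three stored bytes) below
it. How "infinity" would be encoded is left out.

```python
def expand_truncated_scn(low_bits: int, current_scn: int,
                         width_bits: int = 24) -> int:
    """Rebuild a full SCN from its low width_bits bits.

    Assumes the original SCN is <= current_scn and less than
    2**width_bits below it.
    """
    modulus = 1 << width_bits
    # Borrow the high-order bits from the current SCN.
    candidate = (current_scn & ~(modulus - 1)) | low_bits
    # If that lands in the future, the low bits wrapped once since then.
    if candidate > current_scn:
        candidate -= modulus
    return candidate


full = 0x1FFFFF0                      # SCN recorded as disk_high
stored = full & 0xFFFFFF              # only three bytes are kept
assert expand_truncated_scn(stored, current_scn=0x2000010) == full
```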
incarnation: We can remove one byte here, and assume that a cache will
never hold a block version with an incarnation number 2^24 away from the
block's current incarnation number. This is a pretty good assumption,
though if we were extraordinarily unlucky the maximizing process of
combining freed extents could cause the incarnation number to jump
exactly that amount all at once.
sequence: We can remove one byte here, and assume that a cache will
never hold a block version with a sequence number 2^24 away from the
block's current sequence number. This is a good assumption; a cache
would have to hold a block, without flushing it, for the time it takes
the other instances to make exactly 2^24 changes to the block (this
would require pathological activity on the block, and pathological
dormancy in the instance whose cache contains the old version).
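
The failure case described here can be shown with a tiny sketch. The
helper names are hypothetical; the point is only that a three-byte
sequence number cannot tell a version apart from one that is a multiple
of 2^24 changes newer.

```python
SEQ_MASK = (1 << 24) - 1   # three bytes kept out of four


def truncate_sequence(sequence: int) -> int:
    """Keep only the low three bytes of a sequence number."""
    return sequence & SEQ_MASK


def looks_unchanged(cached_sequence: int, stored_truncated: int) -> bool:
    """Compare a cached version against the truncated stored value."""
    return truncate_sequence(cached_sequence) == stored_truncated


# Ordinary case: one more change is detected.
assert not looks_unchanged(5, truncate_sequence(6))
# Pathological case: exactly 2**24 changes later, the values collide.
assert looks_unchanged(5, truncate_sequence(5 + 2**24))
```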
So, we could recover a total of four or five bytes if we had to. Since
we only need three, we should probably leave the incarnation/sequence
number alone and cut disk_high to three bytes.
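
The byte budget behind that conclusion can be checked with a few lines
of arithmetic. The field sizes and the sixteen-byte limit come from the
text above; the dictionary and function names are just for illustration.

```python
# Field sizes in bytes, as listed for the hash value.
FIELD_BYTES = {"DBA": 4, "dirty_in_cache": 1, "disk_high": 6, "disk_seq": 8}
AVAILABLE_BYTES = 16   # what VMS provides


def shortfall(fields: dict, available: int = AVAILABLE_BYTES) -> int:
    """Number of bytes by which the fields exceed the available space."""
    return max(0, sum(fields.values()) - available)


# Proposed layout: disk_high cut from six bytes to three.
trimmed = dict(FIELD_BYTES, disk_high=3)

print(sum(FIELD_BYTES.values()), shortfall(FIELD_BYTES))
print(sum(trimmed.values()), shortfall(trimmed))
```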
OTHER CACHING NOTES