Oracle FAQ | Your Portal to the Oracle Knowledge Grid |
Home -> Community -> Usenet -> c.d.o.server -> Re: ORA-01578: Datablock corruption
From metalink :
PURPOSE This article discusses block corruptions in Oracle and how they are related
to the underlying operating system and hardware. To better illustrate the
discussion, Unix is taken as the operating system of reference.
SCOPE & APPLICATION For users requiring further understanding as to how a block could become
corrupted.
Block corruption has been a common occurrence on most UNIX based systems and
relational databases for many years. It is one of the most frequent ways to
lose data and cause serious business impact. Through a survey of literary
technical sources, this document will discuss several ways that block
corruptions can occur, provide conclusions and possible solutions.
To fully comprehend all the reasons for block corruption's, it is necessary to
understand how I/O device subsystems work, how memory buffers are used to
support the reading and writing of data blocks, how blocks are sized on both
UNIX and Oracle, and how these three objects work together to maintain data
consistency.
I/O devices are designed specifically for host machines and there have been
few attempts to standardize a particular interface across the industry. Most
software, including Oracle, on UNIX machines uses standard C program calls that
in turn perform system calls to support the reading and writing of data to
disk. These system calls access I/O device software that retrieves or writes
data on disk.
The UNIX system contains two types of devices, block devices and raw or
character devices. Block devices look like random access storage devices to
the rest of the system while character devices include all other devices such
as terminals and network media. (Bach, 1990 314). These device types are
important to understand because different combinations can increase corruptions.
Device drivers are configured by the operating system and the configuration
procedure generates or populates tables that form part of the code of the
kernel. This kernel to device driver interface is described by the block
device switch table and the character device switch table. Each device type
has entries in these tables that direct the kernel to the appropriate driver
interfaces for the system calls. The open and close system calls of a device
file funnel through the two device switch tables, according to file type. The
mount and umount system calls also invoke the device open and close procedures
for block devices. Read and write system calls of character special files pass
through the respective procedures in the character device switch tables. Read
and write system calls of block devices and of files on mounted file systems
invokes the algorithms of the buffer cache, which invoke the device strategy
procedure. (Bach, 1990 314). This buffer cache plays an important role in
block corruptions since it is the location where data blocks are the most
vulnerable.
The difference between the two disk interfaces is whether they deal with the
buffer cache. When accessing the block device interface, the UNIX kernel
follows the same algorithm as for regular files, except that after converting
the logical byte offset into a logical block offset, it treats the logical
block offset as a physical block number in the file system. It then accesses
the data via the buffer cache and, ultimately, the driver strategy interface.
However, when accessing the disk via the raw interface, the kernel does not
convert the byte offset into the file but passes the offset immediately to the
driver. The driver's read or write routine converts the byte offset to a
block offset and copies the data directly to the user address space, bypassing
kernel buffers.
Thus, if one process writes a block device and a second process then reads a
raw device at the same address, the second process may not read the data that
the first process had written, because the data may still be in the buffer
cache and not on disk. However, if the second process had read the block
device, it would automatically pick up the new data, as it exists in the
buffer cache. (Bach, 1990 328).
Use of the raw interface may also introduce strange behavior. If a process
reads or writes a raw device in units smaller than the block size, results are
driver-dependent. For instance, when issuing 1-byte writes to a tape drive,
each byte may appear in different tape blocks. (Bach 1990)
The advantage of using the raw interface is speed, assuming there is no
advantage to caching data for later access. Processes accessing block devices
transfer blocks of data whose size are constrained by the file system logical
block size. Furthermore, use of the block interface entails an extra copy of
data between user address space and kernel buffers, which is avoided in the
raw interface. For example, if a file system has a logical block size 1K
bytes, at most 1K bytes are transferred per I/O operation. However, processes
accessing the disk as a raw device can transfer many disk blocks during a disk
operation, subject to the capabilities of the disk controller.
Disk controllers are hardware devices that control the I/O actions of one or
more disks. These controllers can also create a bottleneck in a system.
(Corey, Abbey, Dechichio 1995). Controllers are the most frequent piece of
hardware to have and cause problems on many systems. When a system has
multiple disks controlled by one controller, the results can be fatal. The
bottleneck on controllers is a common cause of write error.
It is important to remember that Oracle and other products use these device
access methods to perform their work. It is also important to note the added
complexity that the Oracle kernel adds to the I/O game.
The Oracle Relational Database Management System (RDBMS) keeps its
information, including data, in block format. However, the Oracle data block
can be, and in most cases is, composed of several operating system blocks.
An Oracle database block is the physical unit of storage in which all Oracle
database data are stored in files. The Oracle database block size is
determined by setting a parameter called db_block_size when the
database is created. (Millsap, 1995).
The most common UNIX block is 512 bytes but the Oracle block size can range
from 512 to 32K. The difference in block sizing between the operating system
and the Oracle kernel are beneficial for Oracle; boosting performance gains
while allowing UNIX to maintain small files with minimal wasted space. The
Oracle block can be considered a superset of the UNIX file system block size.
Each block of an Oracle data file is formatted with a fixed header that
contains information about the particular block. This information provides a
means to ensure the integrity for each block and in turn, the entire Oracle
database. One component of the fixed header of a data block is called a Data
Block Address (DBA). This DBA is a 48 bit integer that stores the file number
of the Oracle database and the Oracle block number offset relative to the
beginning of the file. (Presley, 1993).
Whenever there is a problem with the DBA, Oracle will signal an Oracle error
ORA-00600[3339][arg1][arg2] and possibly ORA-01578: Data block corrupted in
file # block #. These errors provide information that point to where the
corruption exists and can provide several potential causes. (Presley, 1993).
The ORA-00600[3339] has two arguments the are meaningful to the person
evaluating the corruption. Argument 1 is the DBA that Oracle found in the data
block just read from disk. Argument 2 is the DBA Oracle expected to find in
the data block it just read. If they are different, the ORA-00600[3339] is
signaled.
Oracle uses the standard C system function calls to read and write blocks to
its database files. Once the block has been read it is mapped to shared
memory by the operating system, After the block has been read into shared
memory, the Oracle kernel does verification checks on the block to ensure the
integrity of the fixed header. The DBA check is the first verification made
on the fixed header. So why do DBAs become corrupt and how can we identify
and correct them?
Case One
The first case of block corruption occurs when the first argument of the
ORA-00600[3339] error has the value of zero while argument two contains the DBA which Oracle was trying to retrieve. Remember that argument 1 is the DBA just
read from disk. Usually the first operating system block of an Oracle block
is zeroed out when there was a soft error on disk and the operating system
attempted to repair its block. In addition, disk repair utility programs have
caused this zeroing out effect.
Programs that read from and write to the disk directly can destroy the
consistency of file system data. The file system algorithms coordinate disk
I/O operation to maintain a consistent view of disk data structures, including
linked lists of free disk blocks and pointer from inodes to direct and
indirect data blocks. Processes that access the disk directly bypass these if
they run while other file system activity is going on. For this reason, these
programs should not be run on an active file system. (Bach, 1990 328).
On some early versions of the Oracle RDBMS, a software bug specific to UNIX
platforms also caused the ORA-00600[3339]. This bug was part of the code that
dealt with multiple database writers.
A database writer is a background process that is responsible for managing the
contents of the data block buffer cache and the dictionary cache. It reads
the blocks from the datafiles and stores them in the Shared Global Area
(SGA). The database writer also performs batch writes of changed blocks back
to the datafiles. (Loney, 1994 23). The SGA is a segment of memory allocated
to Oracle the contains data and control information particular to an Oracle
database instance.
Using multiple database writers causes multiple background processes to
perform disk operations at the same time. However, if there are process
conflicts, incorrect values could be stored and corruption can occur. Also,
using multiple database writers with asynchronous I/O has been know to cause
similar results.
Asynchronous I/O allows a process to execute a system call to start and I/O
operation and have the system call return immediately after the operation is
started or queued. Another system call is required to wait for the operation
to complete. The advantage of asynchronous I/O is that a process can overlap
its execution with I/O, or it can overlap I/O between different devices.
(Stevens, 1990 163).
Case Two
In the second case, both arguments returned with the ORA-00600[3339] error are
large numbers. There are several causes that signal this error.
The DBA in the physical block on disk is incorrect. This can happen if the
block was corrupted in memory but was written to disk. This situation is
quite rare and in most cases it is usually caused by memory faults that go
undetected. The DBA found in the block is usually garbage and not a valid
DBA. Argument two that is returned with the error is always a valid DBA.
If there is a possibility of memory problems on the system, the database
administrator can enable further sanity block checking by placing the
following event parameters in the database instance init.ora parameter file:
event = "10210 trace name context forever, level 10"
event = "10211 trace name context forever, level 10"
_db_block_cache_protect= true
These parameters force the Oracle RDBMS kernel to call functions that check
the block header and block trailer to ensure that the block is in the proper
format and has not been corrupted. The 10210 event parameter validates data
blocks for tables while the 10211 validates data blocks for indexes. The
_db_block_cache_protect=true protects the cache layer from becoming corrupted.
This parameter will prevent certain corruption from getting to disk, although
it may crash the foreground of the database instance. It will help catch
stray writes in the cache. When a process tries to write past the buffer size
in the SGA, it will fail first with a stack violation.
If the database writer process detects a corrupted block in cache prior to
writing the block to disk, it will signal an ORA-00600[3398] and will crash the
database instance. The block that is corrupted is never written to disk.
Various arguments including the DBA are passed to the ORA-00600[3398] and after
receiving such an error, simply attempt to restart the database instance.
There is no doubt that this can be a costly workaround to avoid block
corruptions. However, the workaround once a corruption has occurred can be
even costlier.
Blocks are sometimes written into the wrong places in the data file. This is
called "write blocks out of sequence." In this case, both DBAs returned with
the ORA-00600[3339] are valid. This typically happens when the operating system
I/O device driver fails to write the block in the proper location that Oracle
requested via the lseek() system call.
Some hardware and operating system vendors supports large files or "large file
systems" that maintain files larger that 4.2 gigabytes. This is larger than
what can be represented by a 32 bit unsigned integer. Therefore, the
operating system must translate the offset transparent to the Oracle kernel.
Oracle does not support files larger than 2 gigabytes even though the
operating system might. On large file systems, the configuration is such that
even smaller Oracle data files suffer corruptions caused by blocks being
written out of sequence because the lseek() system call did not translate the
correct location. (Velpuri, 1995).
The lseek() system call is one of the most important calls related to block
corruption. The calculations that lseek() performs are often the cause of
block problems. To understand lseek() a brief discussion of byte positioning
is necessary.
Every open file has a "current byte position" associated with it. This is
measured as the number of bytes from the start of the file. The create system
call sets the file's position to the beginning of the file, as does the open
system call. The read and write system calls update the file's position by
the number of bytes read or written. Before a read or write, an open file can
be positioned using lseek(). The format is:
lseek(int fildes, long offset, int whence);
The offset and whence arguments are interpreted as follows: If whence is 0,
the file's position is set to offset bytes from the beginning of the file. If
whence is 1, the file's position is set to its current position plus the
offset. If whence is 2, the file's position is set to the size of the file
plus the offset. The file's offset can be greater than the file's current
size, in which case the next write to the file will extend the file. Lseek()
returns a long integer byte offset of the file. (Stevens, 1990 40).
There is great opportunity for miscalculation of an offset based on the
lseek() system call. Though lseek is not the only system call culprit in the
block corruption problem, it is a major contributor.
Case 3
A third cause for block corruption is the requested I/O not being serviced by
the operating system. In this case, both arguments returned from the
ORA-00600[3339] are valid but the DBA found in argument one is from the previous
block read into shared memory prior to the current read request. The calls
that Oracle makes to lseek() and read() are checked for return error codes.
In addition, Oracle checks to see the number of bytes read in by the read()
system call to ensure that the block size or a multiple of the block size was
read. Since these checks appeared to have been successful, Oracle assumes
that the direct read succeeded. Upon sanity checking, the DBA is incorrect
and the database operation request fails. Therefore, the I/O read request
really never took place. In this case, the DBA found can point to a block of
a different file.
Case 4
Another reason for block corruption is reading the wrong block from the same
device. Typically, this is caused by a very busy disk. In some cases, the
block read was off by 1 block but can range into several hundreds of blocks.
The DBAs returned with the ORA-00600[3339] are valid DBA's but are not the block
requested. Since this occurs when the disk is very busy and under lots of
stress, try spreading datafiles across multiple disks and ensure that the disk
drive can support the load.
In the third and fourth situations, the database files will not be physically
corrupted and the operation can be tried again with success. Most diagnostics
testing will not reveal anything wrong with either the operating system or the
hardware. However, the problem is due to operating system or hardware related
problems. (Velpuri, 1995).
So what causes the operating system calls to behave the way they do and how
can companies try to minimize their risk? To evaluate these questions,
another look into how UNIX works is required.
UNIX vendors, in a attempt to speed performance, have implemented many
features into the filesystem. The filesystem manages a large cache of I/O
buffers, called the buffer cache. This cache allows UNIX to optimize read and
write operations. When a program writes data, the filesystem stores the data
in a buffer rather that writing it to disk immediately. At some later point
in time, the system will send this data to the disk driver, together with
other data that has accumulated in the cache. In other words, the buffer
cache lets the disk driver schedule disk operations in batches. It can make
larger transfers and use techniques such as seek optimization to make disk
access more efficient. This is called write-behind.
When a program reads data, the system first checks the buffer cache to see if
the desired data is already there. If the data is already in the buffer
cache, the filesystem does not need to access the disk for those blocks. It
just gives the user the data it found in its buffer, eliminating the need to
wait for a disk drive. The filesystem only needs to read the disk if the data
isn't already in the cache. To increase efficiency even further, the
filesystem assumes the program will read the file consecutively and read
several blocks from the disk at once. This increases the likelihood that the
data for future read operations will already be in the cache. (Loukides, M.,
As a filesystem gets busy and buffers are being read, modified, written, and
aged out of the cache the chance of the kernel reading or writing the wrong
block increases. Also, the more complex the scheme to read from and write to
disk, the greater the likelihood of function failure.
The UNIX kernel uses the strategy interface to transmit data between the
buffer cache and a device, although the read and write procedures of character
devices sometime use their block counterpart strategy procedure to transfer
data directly between the device and the user address space. The strategy
procedure may queue I/O jobs for a device on a work list or do more
sophisticated processing to schedule I/O jobs. Drivers can set up data
transmission for one physical address or many, as appropriate. The UNIX
kernel passes a buffer header address to the driver strategy procedure. The
header contains a list of addresses and sizes for transmission of data to or
from the device. This is also how the swapping operations work. For the
buffer cache, the kernel transmits data from one address; when swapping, the
kernel transmits data from many data addresses. If data is being copied to or
from the user's address space, the driver must lock the process in memory
until the I/O transfer is complete.
The kernel loses control over a buffer only when it waits for the completion
of I/O between the buffer and the disk. It is conceivable that a disk drive
is corrupt so that it cannot interrupt the CPU, preventing the kernel from
ever releasing the buffer. There are processes that monitor the hardware for
such cases and zero out the block and return an error to the kernel for a bad
disk job. (Bach, 1990 52).
On the UNIX level there are several utilities that will check for bad disk
blocks and zero out any blocks they find corrupted. These utilities do not
realize that the block in question may be an Oracle RDBMS block and zero out
the block by mistake.
In (Silberschatz, Galvin, 1994), the authors consider the possible effect of a
computer crash. In this case, the table of opened files is generally lost,
and with it any changes in the directories of opened files. This event can
leave the file system in an inconsistent structure. Frequently, a special
program is run at reboot time to check for and correct disk inconsistencies.
The consistency checker compares the data in the directory structure with the
data blocks on disk, and tries to fix and inconsistencies it finds.
(Silberschatz, Galvin, 1994) This will often result in the reformatting of
blocks which will cause the Oracle block information to be removed. This will
definitely cause Oracle corruption.
It is important to realize that monitoring of hardware is required for all
operating systems. Hardware monitors can sense electrical signals on the
busses and can accurately record them even at high speed. A hardware monitor
keeps observing the system even when it is malfunctioning, and thus, it can be
used to debug the system. (Jain, 1991 99) These tools can help determine the
cause of the problem and detect problems like controller error and media
faulting which are frequent corruption contributors.
In any case, there are many opportunities for blocks, either on disk or in the
buffer cache, to become corrupt. Fixing the corruption can sometimes provide
even greater opportunities.
Conclusion
Data block corruption is an ongoing problem on all operating systems,
especially UNIX. There are many types and causes of corruptions to consider.
Advanced system configurations can increase the chance and hardware problems
are a common source of corruptions. When receiving block corruption errors,
remember that a couple of them are not physical corruptions but memory
corruptions that are never written to disk.
Oracle Customer Support provides a number of bulletins on block corruption
problems that help recover what is left of the data once corruption has
occurred. If block corruption occurs on a machine, be sure to identify the
type of corruption and establish a plan for its correction.
"Pascal Byrne" <byrne_at_icada.com> wrote in message
news:b0mb02$qmmgv$1_at_ID-147008.news.dfncis.de...
> Hi,
> One of our customers got this error on his production database
> which is Oracle 8.0.5.1 on SuSE Linux 7.1 (kerne2.4.16-4GB)
> using non-raid removable SCSI disks. Database updates are
> normally by JDBC using the Oracle thin driver and Sun JDK 1.1.8
>
> The error message was:
> ORA-00604: error occurred at recursive SQL level 1
> ORA-01578: ORACLE data block corrupted (file # 2, block # 8596)
> ORA-01110: data file 2: '/ora00/oradata/RSM/rbs01.dbf'
>
> He was lucky the problem happened in a rollback segment tablespace
> rather than one with data since the database is *not* running in
> archivelog mode (don't ask) and they don't make cold datafile
> backups (don't ask)!
>
> I checked the disk for bad blocks with 'badblocks' and it came out
> clean. The machine is in a climate controlled environment so (FAB)
> issues like humidity and temperature don't really come into play.
>
> My boss wants to know the reason why this happend and how *I* can
> prevent it from happening again. Any help would be gratfully accepted.
>
> Thanks,
> Pascal
>
Received on Wed Jan 22 2003 - 09:52:30 CST