Oracle FAQ Your Portal to the Oracle Knowledge Grid
HOME | ASK QUESTION | ADD INFO | SEARCH | E-MAIL US
 

Home -> Community -> Usenet -> c.d.o.server -> Re: ORA-01578: Datablock corruption

Re: ORA-01578: Datablock corruption

From: Pascal Byrne <byrne_at_icada.com>
Date: Fri, 24 Jan 2003 17:02:21 +0100
Message-ID: <b0rnuo$s1en5$1@ID-147008.news.dfncis.de>


I had heard of the problems some people had with async on AIX but Linux doesn't really have async in the 2.4.x kernel it only simulates it. I don't use multiple dbwriters either so that rules that out.

Was there ever an 8.0.6 release for Linux? I've checked oracle-ftp.oracle.com without luck. If you have the patch, I would be extremly grateful if you could let me know, since we are not in a position to 'force' our customers to do anything :)

Thanks,
Pascal

koert54 wrote:

> I would also recommend upgrading your DB to at least 8.0.6.
> Releases 8.0.4 & 8.0.5 were quite crappy - I've had a lot of corrupted
> blocks especially on 8.0.4 / AIX due to multiple dbwriters & async I/O.
> Furthermore these releases are not supported anymore by Oracle - a valid
>  reason to 'force' your customer into upgrading ...
> 
> 
> "Pascal Byrne" <byrne_at_icada.com> wrote in message
> news:b0qtel$s3l0l$1_at_ID-147008.news.dfncis.de...
> 

>>This is an interesting article. The trace events outlined may prove
> 
> useful.
> 

>>Thanks,
>>pascal
>>
>>
>>koert54 wrote:
>>
>>>From metalink :
>>>
>>>
>>>
>>>PURPOSE
>>>
>>>This article discusses block corruptions in Oracle and how they are
> 
> related
> 

>>>to the underlying operating system and hardware. To better illustrate
> 
> the
> 

>>>discussion, Unix is taken as the operating system of reference.
>>>
>>>SCOPE & APPLICATION
>>>
>>>For users requiring further understanding as to how a block could become
>>>
>>>corrupted.
>>>
>>>
>>>
>>>Block corruption has been a common occurrence on most UNIX based systems
> 
> and
> 

>>>relational databases for many years. It is one of the most frequent ways
> 
> to
> 

>>>lose data and cause serious business impact. Through a survey of
> 
> literary
> 

>>>technical sources, this document will discuss several ways that block
>>>
>>>corruptions can occur, provide conclusions and possible solutions.
>>>
>>>To fully comprehend all the reasons for block corruption's, it is
> 
> necessary
> 

>>>to
>>>
>>>understand how I/O device subsystems work, how memory buffers are used
> 
> to
> 

>>>support the reading and writing of data blocks, how blocks are sized on
> 
> both
> 

>>>UNIX and Oracle, and how these three objects work together to maintain
> 
> data
> 

>>>consistency.
>>>
>>>I/O devices are designed specifically for host machines and there have
> 
> been
> 

>>>few attempts to standardize a particular interface across the industry.
> 
> Most
> 

>>>software, including Oracle, on UNIX machines uses standard C program
> 
> calls
> 

>>>that
>>>
>>>in turn perform system calls to support the reading and writing of data
> 
> to
> 

>>>disk. These system calls access I/O device software that retrieves or
> 
> writes
> 

>>>data on disk.
>>>
>>>The UNIX system contains two types of devices, block devices and raw or
>>>
>>>character devices. Block devices look like random access storage devices
> 
> to
> 

>>>the rest of the system while character devices include all other devices
>>>such
>>>
>>>as terminals and network media. (Bach, 1990 314). These device types are
>>>
>>>important to understand because different combinations can increase
>>>corruptions.
>>>
>>>Device drivers are configured by the operating system and the
> 
> configuration
> 

>>>procedure generates or populates tables that form part of the code of
> 
> the
> 

>>>kernel. This kernel to device driver interface is described by the block
>>>
>>>device switch table and the character device switch table. Each device
> 
> type
> 

>>>has entries in these tables that direct the kernel to the appropriate
> 
> driver
> 

>>>interfaces for the system calls. The open and close system calls of a
> 
> device
> 

>>>file funnel through the two device switch tables, according to file
> 
> type.
> 

>>>The
>>>
>>>mount and umount system calls also invoke the device open and close
>>>procedures
>>>
>>>for block devices. Read and write system calls of character special
> 
> files
> 

>>>pass
>>>
>>>through the respective procedures in the character device switch tables.
>>>Read
>>>
>>>and write system calls of block devices and of files on mounted file
> 
> systems
> 

>>>invokes the algorithms of the buffer cache, which invoke the device
> 
> strategy
> 

>>>procedure. (Bach, 1990 314). This buffer cache plays an important role
> 
> in
> 

>>>block corruptions since it is the location where data blocks are the
> 
> most
> 

>>>vulnerable.
>>>
>>>The difference between the two disk interfaces is whether they deal with
> 
> the
> 

>>>buffer cache. When accessing the block device interface, the UNIX kernel
>>>
>>>follows the same algorithm as for regular files, except that after
>>>converting
>>>
>>>the logical byte offset into a logical block offset, it treats the
> 
> logical
> 

>>>block offset as a physical block number in the file system. It then
> 
> accesses
> 

>>>the data via the buffer cache and, ultimately, the driver strategy
>>>interface.
>>>
>>>However, when accessing the disk via the raw interface, the kernel does
> 
> not
> 

>>>convert the byte offset into the file but passes the offset immediately
> 
> to
> 

>>>the
>>>
>>>driver. The driver's read or write routine converts the byte offset to a
>>>
>>>block offset and copies the data directly to the user address space,
>>>bypassing
>>>
>>>kernel buffers.
>>>
>>>Thus, if one process writes a block device and a second process then
> 
> reads a
> 

>>>raw device at the same address, the second process may not read the data
>>>that
>>>
>>>the first process had written, because the data may still be in the
> 
> buffer
> 

>>>cache and not on disk. However, if the second process had read the block
>>>
>>>device, it would automatically pick up the new data, as it exists in the
>>>
>>>buffer cache. (Bach, 1990 328).
>>>
>>>Use of the raw interface may also introduce strange behavior. If a
> 
> process
> 

>>>reads or writes a raw device in units smaller than the block size,
> 
> results
> 

>>>are
>>>
>>>driver-dependent. For instance, when issuing 1-byte writes to a tape
> 
> drive,
> 

>>>each byte may appear in different tape blocks. (Bach 1990)
>>>
>>>The advantage of using the raw interface is speed, assuming there is no
>>>
>>>advantage to caching data for later access. Processes accessing block
>>>devices
>>>
>>>transfer blocks of data whose size are constrained by the file system
>>>logical
>>>
>>>block size. Furthermore, use of the block interface entails an extra
> 
> copy of
> 

>>>data between user address space and kernel buffers, which is avoided in
> 
> the
> 

>>>raw interface. For example, if a file system has a logical block size 1K
>>>
>>>bytes, at most 1K bytes are transferred per I/O operation. However,
>>>processes
>>>
>>>accessing the disk as a raw device can transfer many disk blocks during
> 
> a
> 

>>>disk
>>>
>>>operation, subject to the capabilities of the disk controller.
>>>
>>>Disk controllers are hardware devices that control the I/O actions of
> 
> one or
> 

>>>more disks. These controllers can also create a bottleneck in a system.
>>>
>>>(Corey, Abbey, Dechichio 1995). Controllers are the most frequent piece
> 
> of
> 

>>>hardware to have and cause problems on many systems. When a system has
>>>
>>>multiple disks controlled by one controller, the results can be fatal.
> 
> The
> 

>>>bottleneck on controllers is a common cause of write error.
>>>
>>>It is important to remember that Oracle and other products use these
> 
> device
> 

>>>access methods to perform their work. It is also important to note the
> 
> added
> 

>>>complexity that the Oracle kernel adds to the I/O game.
>>>
>>>The Oracle Relational Database Management System (RDBMS) keeps its
>>>
>>>information, including data, in block format. However, the Oracle data
> 
> block
> 

>>>can be, and in most cases is, composed of several operating system
> 
> blocks.
> 

>>>An Oracle database block is the physical unit of storage in which all
> 
> Oracle
> 

>>>database data are stored in files. The Oracle database block size is
>>>
>>>determined by setting a parameter called db_block_size when the
>>>
>>>database is created. (Millsap, 1995).
>>>
>>>The most common UNIX block is 512 bytes but the Oracle block size can
> 
> range
> 

>>>from 512 to 32K. The difference in block sizing between the operating
> 
> system
> 

>>>and the Oracle kernel are beneficial for Oracle; boosting performance
> 
> gains
> 

>>>while allowing UNIX to maintain small files with minimal wasted space.
> 
> The
> 

>>>Oracle block can be considered a superset of the UNIX file system block
>>>size.
>>>
>>>Each block of an Oracle data file is formatted with a fixed header that
>>>
>>>contains information about the particular block. This information
> 
> provides a
> 

>>>means to ensure the integrity for each block and in turn, the entire
> 
> Oracle
> 

>>>database. One component of the fixed header of a data block is called a
> 
> Data
> 

>>>Block Address (DBA). This DBA is a 48 bit integer that stores the file
>>>number
>>>
>>>of the Oracle database and the Oracle block number offset relative to
> 
> the
> 

>>>beginning of the file. (Presley, 1993).
>>>
>>>Whenever there is a problem with the DBA, Oracle will signal an Oracle
> 
> error
> 

>>>ORA-00600[3339][arg1][arg2] and possibly ORA-01578: Data block corrupted
> 
> in
> 

>>>file # block #. These errors provide information that point to where the
>>>
>>>corruption exists and can provide several potential causes. (Presley,
> 
> 1993).
> 

>>>The ORA-00600[3339] has two arguments the are meaningful to the person
>>>
>>>evaluating the corruption. Argument 1 is the DBA that Oracle found in
> 
> the
> 

>>>data
>>>
>>>block just read from disk. Argument 2 is the DBA Oracle expected to find
> 
> in
> 

>>>the data block it just read. If they are different, the ORA-00600[3339]
> 
> is
> 

>>>signaled.
>>>
>>>Oracle uses the standard C system function calls to read and write
> 
> blocks to
> 

>>>its database files. Once the block has been read it is mapped to shared
>>>
>>>memory by the operating system, After the block has been read into
> 
> shared
> 

>>>memory, the Oracle kernel does verification checks on the block to
> 
> ensure
> 

>>>the
>>>
>>>integrity of the fixed header. The DBA check is the first verification
> 
> made
> 

>>>on the fixed header. So why do DBAs become corrupt and how can we
> 
> identify
> 

>>>and correct them?
>>>
>>>Case One
>>>
>>>--------
>>>
>>>The first case of block corruption occurs when the first argument of the
>>>
>>>ORA-00600[3339] error has the value of zero while argument two contains
> 
> the
> 

>>>DBA
>>>
>>>which Oracle was trying to retrieve. Remember that argument 1 is the DBA
>>>just
>>>
>>>read from disk. Usually the first operating system block of an Oracle
> 
> block
> 

>>>is zeroed out when there was a soft error on disk and the operating
> 
> system
> 

>>>attempted to repair its block. In addition, disk repair utility programs
>>>have
>>>
>>>caused this zeroing out effect.
>>>
>>>Programs that read from and write to the disk directly can destroy the
>>>
>>>consistency of file system data. The file system algorithms coordinate
> 
> disk
> 

>>>I/O operation to maintain a consistent view of disk data structures,
>>>including
>>>
>>>linked lists of free disk blocks and pointer from inodes to direct and
>>>
>>>indirect data blocks. Processes that access the disk directly bypass
> 
> these
> 

>>>if
>>>
>>>they run while other file system activity is going on. For this reason,
>>>these
>>>
>>>programs should not be run on an active file system. (Bach, 1990 328).
>>>
>>>On some early versions of the Oracle RDBMS, a software bug specific to
> 
> UNIX
> 

>>>platforms also caused the ORA-00600[3339]. This bug was part of the code
>>>that
>>>
>>>dealt with multiple database writers.
>>>
>>>A database writer is a background process that is responsible for
> 
> managing
> 

>>>the
>>>
>>>contents of the data block buffer cache and the dictionary cache. It
> 
> reads
> 

>>>the blocks from the datafiles and stores them in the Shared Global Area
>>>
>>>(SGA). The database writer also performs batch writes of changed blocks
> 
> back
> 

>>>to the datafiles. (Loney, 1994 23). The SGA is a segment of memory
> 
> allocated
> 

>>>to Oracle the contains data and control information particular to an
> 
> Oracle
> 

>>>database instance.
>>>
>>>Using multiple database writers causes multiple background processes to
>>>
>>>perform disk operations at the same time. However, if there are process
>>>
>>>conflicts, incorrect values could be stored and corruption can occur.
> 
> Also,
> 

>>>using multiple database writers with asynchronous I/O has been know to
> 
> cause
> 

>>>similar results.
>>>
>>>Asynchronous I/O allows a process to execute a system call to start and
> 
> I/O
> 

>>>operation and have the system call return immediately after the
> 
> operation is
> 

>>>started or queued. Another system call is required to wait for the
> 
> operation
> 

>>>to complete. The advantage of asynchronous I/O is that a process can
> 
> overlap
> 

>>>its execution with I/O, or it can overlap I/O between different devices.
>>>
>>>(Stevens, 1990 163).
>>>
>>>Case Two
>>>
>>>--------
>>>
>>>In the second case, both arguments returned with the ORA-00600[3339]
> 
> error
> 

>>>are
>>>
>>>large numbers. There are several causes that signal this error.
>>>
>>>The DBA in the physical block on disk is incorrect. This can happen if
> 
> the
> 

>>>block was corrupted in memory but was written to disk. This situation is
>>>
>>>quite rare and in most cases it is usually caused by memory faults that
> 
> go
> 

>>>undetected. The DBA found in the block is usually garbage and not a
> 
> valid
> 

>>>DBA. Argument two that is returned with the error is always a valid DBA.
>>>
>>>If there is a possibility of memory problems on the system, the database
>>>
>>>administrator can enable further sanity block checking by placing the
>>>
>>>following event parameters in the database instance init.ora parameter
> 
> file:
> 

>>>event = "10210 trace name context forever, level 10"
>>>
>>>event = "10211 trace name context forever, level 10"
>>>
>>>_db_block_cache_protect= true
>>>
>>>These parameters force the Oracle RDBMS kernel to call functions that
> 
> check
> 

>>>the block header and block trailer to ensure that the block is in the
> 
> proper
> 

>>>format and has not been corrupted. The 10210 event parameter validates
> 
> data
> 

>>>blocks for tables while the 10211 validates data blocks for indexes. The
>>>
>>>_db_block_cache_protect=true protects the cache layer from becoming
>>>corrupted.
>>>
>>>This parameter will prevent certain corruption from getting to disk,
>>>although
>>>
>>>it may crash the foreground of the database instance. It will help catch
>>>
>>>stray writes in the cache. When a process tries to write past the buffer
>>>size
>>>
>>>in the SGA, it will fail first with a stack violation.
>>>
>>>If the database writer process detects a corrupted block in cache prior
> 
> to
> 

>>>writing the block to disk, it will signal an ORA-00600[3398] and will
> 
> crash
> 

>>>the
>>>
>>>database instance. The block that is corrupted is never written to disk.
>>>
>>>Various arguments including the DBA are passed to the ORA-00600[3398]
> 
> and
> 

>>>after
>>>
>>>receiving such an error, simply attempt to restart the database
> 
> instance.
> 

>>>There is no doubt that this can be a costly workaround to avoid block
>>>
>>>corruptions. However, the workaround once a corruption has occurred can
> 
> be
> 

>>>even costlier.
>>>
>>>Blocks are sometimes written into the wrong places in the data file.
> 
> This is
> 

>>>called "write blocks out of sequence." In this case, both DBAs returned
> 
> with
> 

>>>the ORA-00600[3339] are valid. This typically happens when the operating
>>>system
>>>
>>>I/O device driver fails to write the block in the proper location that
>>>Oracle
>>>
>>>requested via the lseek() system call.
>>>
>>>Some hardware and operating system vendors supports large files or
> 
> "large
> 

>>>file
>>>
>>>systems" that maintain files larger that 4.2 gigabytes. This is larger
> 
> than
> 

>>>what can be represented by a 32 bit unsigned integer. Therefore, the
>>>
>>>operating system must translate the offset transparent to the Oracle
> 
> kernel.
> 

>>>Oracle does not support files larger than 2 gigabytes even though the
>>>
>>>operating system might. On large file systems, the configuration is such
>>>that
>>>
>>>even smaller Oracle data files suffer corruptions caused by blocks being
>>>
>>>written out of sequence because the lseek() system call did not
> 
> translate
> 

>>>the
>>>
>>>correct location. (Velpuri, 1995).
>>>
>>>The lseek() system call is one of the most important calls related to
> 
> block
> 

>>>corruption. The calculations that lseek() performs are often the cause
> 
> of
> 

>>>block problems. To understand lseek() a brief discussion of byte
> 
> positioning
> 

>>>is necessary.
>>>
>>>Every open file has a "current byte position" associated with it. This
> 
> is
> 

>>>measured as the number of bytes from the start of the file. The create
>>>system
>>>
>>>call sets the file's position to the beginning of the file, as does the
> 
> open
> 

>>>system call. The read and write system calls update the file's position
> 
> by
> 

>>>the number of bytes read or written. Before a read or write, an open
> 
> file
> 

>>>can
>>>
>>>be positioned using lseek(). The format is:
>>>
>>>lseek(int fildes, long offset, int whence);
>>>
>>>The offset and whence arguments are interpreted as follows: If whence is
> 
> 0,
> 

>>>the file's position is set to offset bytes from the beginning of the
> 
> file.
> 

>>>If
>>>
>>>whence is 1, the file's position is set to its current position plus the
>>>
>>>offset. If whence is 2, the file's position is set to the size of the
> 
> file
> 

>>>plus the offset. The file's offset can be greater than the file's
> 
> current
> 

>>>size, in which case the next write to the file will extend the file.
> 
> Lseek()
> 

>>>returns a long integer byte offset of the file. (Stevens, 1990 40).
>>>
>>>There is great opportunity for miscalculation of an offset based on the
>>>
>>>lseek() system call. Though lseek is not the only system call culprit in
> 
> the
> 

>>>block corruption problem, it is a major contributor.
>>>
>>>Case 3
>>>
>>>------
>>>
>>>A third cause for block corruption is the requested I/O not being
> 
> serviced
> 

>>>by
>>>
>>>the operating system. In this case, both arguments returned from the
>>>
>>>ORA-00600[3339] are valid but the DBA found in argument one is from the
>>>previous
>>>
>>>block read into shared memory prior to the current read request. The
> 
> calls
> 

>>>that Oracle makes to lseek() and read() are checked for return error
> 
> codes.
> 

>>>In addition, Oracle checks to see the number of bytes read in by the
> 
> read()
> 

>>>system call to ensure that the block size or a multiple of the block
> 
> size
> 

>>>was
>>>
>>>read. Since these checks appeared to have been successful, Oracle
> 
> assumes
> 

>>>that the direct read succeeded. Upon sanity checking, the DBA is
> 
> incorrect
> 

>>>and the database operation request fails. Therefore, the I/O read
> 
> request
> 

>>>really never took place. In this case, the DBA found can point to a
> 
> block of
> 

>>>a different file.
>>>
>>>Case 4
>>>
>>>------
>>>
>>>Another reason for block corruption is reading the wrong block from the
> 
> same
> 

>>>device. Typically, this is caused by a very busy disk. In some cases,
> 
> the
> 

>>>block read was off by 1 block but can range into several hundreds of
> 
> blocks.
> 

>>>The DBAs returned with the ORA-00600[3339] are valid DBA's but are not
> 
> the
> 

>>>block
>>>
>>>requested. Since this occurs when the disk is very busy and under lots
> 
> of
> 

>>>stress, try spreading datafiles across multiple disks and ensure that
> 
> the
> 

>>>disk
>>>
>>>drive can support the load.
>>>
>>>
>>>
>>>In the third and fourth situations, the database files will not be
>>>physically
>>>
>>>corrupted and the operation can be tried again with success. Most
>>>diagnostics
>>>
>>>testing will not reveal anything wrong with either the operating system
> 
> or
> 

>>>the
>>>
>>>hardware. However, the problem is due to operating system or hardware
>>>related
>>>
>>>problems. (Velpuri, 1995).
>>>
>>>So what causes the operating system calls to behave the way they do and
> 
> how
> 

>>>can companies try to minimize their risk? To evaluate these questions,
>>>
>>>another look into how UNIX works is required.
>>>
>>>UNIX vendors, in a attempt to speed performance, have implemented many
>>>
>>>features into the filesystem. The filesystem manages a large cache of
> 
> I/O
> 

>>>buffers, called the buffer cache. This cache allows UNIX to optimize
> 
> read
> 

>>>and
>>>
>>>write operations. When a program writes data, the filesystem stores the
> 
> data
> 

>>>in a buffer rather that writing it to disk immediately. At some later
> 
> point
> 

>>>in time, the system will send this data to the disk driver, together
> 
> with
> 

>>>other data that has accumulated in the cache. In other words, the buffer
>>>
>>>cache lets the disk driver schedule disk operations in batches. It can
> 
> make
> 

>>>larger transfers and use techniques such as seek optimization to make
> 
> disk
> 

>>>access more efficient. This is called write-behind.
>>>
>>>When a program reads data, the system first checks the buffer cache to
> 
> see
> 

>>>if
>>>
>>>the desired data is already there. If the data is already in the buffer
>>>
>>>cache, the filesystem does not need to access the disk for those blocks.
> 
> It
> 

>>>just gives the user the data it found in its buffer, eliminating the
> 
> need to
> 

>>>wait for a disk drive. The filesystem only needs to read the disk if the
>>>data
>>>
>>>isn't already in the cache. To increase efficiency even further, the
>>>
>>>filesystem assumes the program will read the file consecutively and read
>>>
>>>several blocks from the disk at once. This increases the likelihood that
> 
> the
> 

>>>data for future read operations will already be in the cache. (Loukides,
> 
> M.,
> 

>>>1990) This also increases the chance of block corruption.
>>>
>>>As a filesystem gets busy and buffers are being read, modified, written,
> 
> and
> 

>>>aged out of the cache the chance of the kernel reading or writing the
> 
> wrong
> 

>>>block increases. Also, the more complex the scheme to read from and
> 
> write to
> 

>>>disk, the greater the likelihood of function failure.
>>>
>>>The UNIX kernel uses the strategy interface to transmit data between the
>>>
>>>buffer cache and a device, although the read and write procedures of
>>>character
>>>
>>>devices sometime use their block counterpart strategy procedure to
> 
> transfer
> 

>>>data directly between the device and the user address space. The
> 
> strategy
> 

>>>procedure may queue I/O jobs for a device on a work list or do more
>>>
>>>sophisticated processing to schedule I/O jobs. Drivers can set up data
>>>
>>>transmission for one physical address or many, as appropriate. The UNIX
>>>
>>>kernel passes a buffer header address to the driver strategy procedure.
> 
> The
> 

>>>header contains a list of addresses and sizes for transmission of data
> 
> to or
> 

>>>from the device. This is also how the swapping operations work. For the
>>>
>>>buffer cache, the kernel transmits data from one address; when swapping,
> 
> the
> 

>>>kernel transmits data from many data addresses. If data is being copied
> 
> to
> 

>>>or
>>>
>>>from the user's address space, the driver must lock the process in
> 
> memory
> 

>>>until the I/O transfer is complete.
>>>
>>>The kernel loses control over a buffer only when it waits for the
> 
> completion
> 

>>>of I/O between the buffer and the disk. It is conceivable that a disk
> 
> drive
> 

>>>is corrupt so that it cannot interrupt the CPU, preventing the kernel
> 
> from
> 

>>>ever releasing the buffer. There are processes that monitor the hardware
> 
> for
> 

>>>such cases and zero out the block and return an error to the kernel for
> 
> a
> 

>>>bad
>>>
>>>disk job. (Bach, 1990 52).
>>>
>>>On the UNIX level there are several utilities that will check for bad
> 
> disk
> 

>>>blocks and zero out any blocks they find corrupted. These utilities do
> 
> not
> 

>>>realize that the block in question may be an Oracle RDBMS block and zero
> 
> out
> 

>>>the block by mistake.
>>>
>>>In (Silberschatz, Galvin, 1994), the authors consider the possible
> 
> effect of
> 

>>>a
>>>
>>>computer crash. In this case, the table of opened files is generally
> 
> lost,
> 

>>>and with it any changes in the directories of opened files. This event
> 
> can
> 

>>>leave the file system in an inconsistent structure. Frequently, a
> 
> special
> 

>>>program is run at reboot time to check for and correct disk
> 
> inconsistencies.
> 

>>>The consistency checker compares the data in the directory structure
> 
> with
> 

>>>the
>>>
>>>data blocks on disk, and tries to fix and inconsistencies it finds.
>>>
>>>(Silberschatz, Galvin, 1994) This will often result in the reformatting
> 
> of
> 

>>>blocks which will cause the Oracle block information to be removed. This
>>>will
>>>
>>>definitely cause Oracle corruption.
>>>
>>>It is important to realize that monitoring of hardware is required for
> 
> all
> 

>>>operating systems. Hardware monitors can sense electrical signals on the
>>>
>>>busses and can accurately record them even at high speed. A hardware
> 
> monitor
> 

>>>keeps observing the system even when it is malfunctioning, and thus, it
> 
> can
> 

>>>be
>>>
>>>used to debug the system. (Jain, 1991 99) These tools can help determine
> 
> the
> 

>>>cause of the problem and detect problems like controller error and media
>>>
>>>faulting which are frequent corruption contributors.
>>>
>>>In any case, there are many opportunities for blocks, either on disk or
> 
> in
> 

>>>the
>>>
>>>buffer cache, to become corrupt. Fixing the corruption can sometimes
> 
> provide
> 

>>>even greater opportunities.
>>>
>>>
>>>
>>>Conclusion
>>>
>>>----------
>>>
>>>Data block corruption is an ongoing problem on all operating systems,
>>>
>>>especially UNIX. There are many types and causes of corruptions to
> 
> consider.
> 

>>>Advanced system configurations can increase the chance and hardware
> 
> problems
> 

>>>are a common source of corruptions. When receiving block corruption
> 
> errors,
> 

>>>remember that a couple of them are not physical corruptions but memory
>>>
>>>corruptions that are never written to disk.
>>>
>>>Oracle Customer Support provides a number of bulletins on block
> 
> corruption
> 

>>>problems that help recover what is left of the data once corruption has
>>>
>>>occurred. If block corruption occurs on a machine, be sure to identify
> 
> the
> 

>>>type of corruption and establish a plan for its correction.
>>>
>>>"Pascal Byrne" <byrne_at_icada.com> wrote in message
>>>news:b0mb02$qmmgv$1_at_ID-147008.news.dfncis.de...
>>>
>>>
>>>>Hi,
>>>>One of our customers got this error on his production database
>>>>which is Oracle 8.0.5.1 on SuSE Linux 7.1 (kerne2.4.16-4GB)
>>>>using non-raid removable SCSI disks. Database updates are
>>>>normally by JDBC using the Oracle thin driver and Sun JDK 1.1.8
>>>>
>>>>The error message was:
>>>>ORA-00604: error occurred at recursive SQL level 1
>>>>ORA-01578: ORACLE data block corrupted (file # 2, block # 8596)
>>>>ORA-01110: data file 2: '/ora00/oradata/RSM/rbs01.dbf'
>>>>
>>>>He was lucky the problem happened in a rollback segment tablespace
>>>>rather than one with data since the database is *not* running in
>>>>archivelog mode (don't ask) and they don't make cold datafile
>>>>backups (don't ask)!
>>>>
>>>>I checked the disk for bad blocks with 'badblocks' and it came out
>>>>clean. The machine is in a climate controlled environment so (FAB)
>>>>issues like humidity and temperature don't really come into play.
>>>>
>>>>My boss wants to know the reason why this happend and how *I* can
>>>>prevent it from happening again. Any help would be gratfully accepted.
>>>>
>>>>Thanks,
>>>>Pascal
>>>>
>>>
>>>
>>>

>
> Received on Fri Jan 24 2003 - 10:02:21 CST

Original text of this message

HOME | ASK QUESTION | ADD INFO | SEARCH | E-MAIL US