Re: ORA-01578: Datablock corruption

From: Pascal Byrne <byrne_at_icada.com>
Date: Fri, 24 Jan 2003 17:02:21 +0100
Message-ID: <b0rnuo$s1en5$1@ID-147008.news.dfncis.de>

I had heard of the problems some people had with async on AIX but Linux doesn't really have async in the 2.4.x kernel it only simulates it. I don't use multiple dbwriters either so that rules that out.

Was there ever an 8.0.6 release for Linux? I've checked oracle-ftp.oracle.com without luck. If you have the patch, I would be extremly grateful if you could let me know, since we are not in a position to 'force' our customers to do anything :)

Thanks,
Pascal

koert54 wrote:

> I would also recommend upgrading your DB to at least 8.0.6.
> Releases 8.0.4 & 8.0.5 were quite crappy - I've had a lot of corrupted
> blocks especially on 8.0.4 / AIX due to multiple dbwriters & async I/O.
> Furthermore these releases are not supported anymore by Oracle - a valid
>  reason to 'force' your customer into upgrading ...
> 
> 
> "Pascal Byrne" <byrne_at_icada.com> wrote in message
> news:b0qtel$s3l0l$1_at_ID-147008.news.dfncis.de...
>

>>This is an interesting article. The trace events outlined may prove

> 
> useful.
>

>>Thanks,
>>pascal
>>
>>
>>koert54 wrote:
>>
>>>From metalink :
>>>
>>>
>>>
>>>PURPOSE
>>>
>>>This article discusses block corruptions in Oracle and how they are

> 
> related
>

>>>to the underlying operating system and hardware. To better illustrate

> 
> the
> 


>>>discussion, Unix is taken as the operating system of reference.

>>>

>>>SCOPE & APPLICATION

>>>

>>>For users requiring further understanding as to how a block could become

>>>

>>>corrupted.

>>>

>>>

>>>
>>>Block corruption has been a common occurrence on most UNIX based systems

> 
> and
>

>>>relational databases for many years. It is one of the most frequent ways

> 
> to
>

>>>lose data and cause serious business impact. Through a survey of

> 
> literary
> 


>>>technical sources, this document will discuss several ways that block

>>>

>>>corruptions can occur, provide conclusions and possible solutions.

>>>
>>>To fully comprehend all the reasons for block corruption's, it is

> 
> necessary
> 


>>>to

>>>
>>>understand how I/O device subsystems work, how memory buffers are used

> 
> to
>

>>>support the reading and writing of data blocks, how blocks are sized on

> 
> both
>

>>>UNIX and Oracle, and how these three objects work together to maintain

> 
> data
> 


>>>consistency.

>>>
>>>I/O devices are designed specifically for host machines and there have

> 
> been
>

>>>few attempts to standardize a particular interface across the industry.

> 
> Most
>

>>>software, including Oracle, on UNIX machines uses standard C program

> 
> calls
> 


>>>that

>>>
>>>in turn perform system calls to support the reading and writing of data

> 
> to
>

>>>disk. These system calls access I/O device software that retrieves or

> 
> writes
> 


>>>data on disk.

>>>

>>>The UNIX system contains two types of devices, block devices and raw or

>>>
>>>character devices. Block devices look like random access storage devices

> 
> to
> 


>>>the rest of the system while character devices include all other devices

>>>such

>>>

>>>as terminals and network media. (Bach, 1990 314). These device types are

>>>

>>>important to understand because different combinations can increase

>>>corruptions.

>>>
>>>Device drivers are configured by the operating system and the

> 
> configuration
>

>>>procedure generates or populates tables that form part of the code of

> 
> the
> 


>>>kernel. This kernel to device driver interface is described by the block

>>>
>>>device switch table and the character device switch table. Each device

> 
> type
>

>>>has entries in these tables that direct the kernel to the appropriate

> 
> driver
>

>>>interfaces for the system calls. The open and close system calls of a

> 
> device
>

>>>file funnel through the two device switch tables, according to file

> 
> type.
> 


>>>The

>>>

>>>mount and umount system calls also invoke the device open and close

>>>procedures

>>>
>>>for block devices. Read and write system calls of character special

> 
> files
> 


>>>pass

>>>

>>>through the respective procedures in the character device switch tables.

>>>Read

>>>
>>>and write system calls of block devices and of files on mounted file

> 
> systems
>

>>>invokes the algorithms of the buffer cache, which invoke the device

> 
> strategy
>

>>>procedure. (Bach, 1990 314). This buffer cache plays an important role

> 
> in
>

>>>block corruptions since it is the location where data blocks are the

> 
> most
> 


>>>vulnerable.

>>>
>>>The difference between the two disk interfaces is whether they deal with

> 
> the
> 


>>>buffer cache. When accessing the block device interface, the UNIX kernel

>>>

>>>follows the same algorithm as for regular files, except that after

>>>converting

>>>
>>>the logical byte offset into a logical block offset, it treats the

> 
> logical
>

>>>block offset as a physical block number in the file system. It then

> 
> accesses
> 


>>>the data via the buffer cache and, ultimately, the driver strategy

>>>interface.

>>>
>>>However, when accessing the disk via the raw interface, the kernel does

> 
> not
>

>>>convert the byte offset into the file but passes the offset immediately

> 
> to
> 


>>>the

>>>

>>>driver. The driver's read or write routine converts the byte offset to a

>>>

>>>block offset and copies the data directly to the user address space,

>>>bypassing

>>>

>>>kernel buffers.

>>>
>>>Thus, if one process writes a block device and a second process then

> 
> reads a
> 


>>>raw device at the same address, the second process may not read the data

>>>that

>>>
>>>the first process had written, because the data may still be in the

> 
> buffer
> 


>>>cache and not on disk. However, if the second process had read the block

>>>

>>>device, it would automatically pick up the new data, as it exists in the

>>>

>>>buffer cache. (Bach, 1990 328).

>>>
>>>Use of the raw interface may also introduce strange behavior. If a

> 
> process
>

>>>reads or writes a raw device in units smaller than the block size,

> 
> results
> 


>>>are

>>>
>>>driver-dependent. For instance, when issuing 1-byte writes to a tape

> 
> drive,
> 


>>>each byte may appear in different tape blocks. (Bach 1990)

>>>

>>>The advantage of using the raw interface is speed, assuming there is no

>>>

>>>advantage to caching data for later access. Processes accessing block

>>>devices

>>>

>>>transfer blocks of data whose size are constrained by the file system

>>>logical

>>>
>>>block size. Furthermore, use of the block interface entails an extra

> 
> copy of
>

>>>data between user address space and kernel buffers, which is avoided in

> 
> the
> 


>>>raw interface. For example, if a file system has a logical block size 1K

>>>

>>>bytes, at most 1K bytes are transferred per I/O operation. However,

>>>processes

>>>
>>>accessing the disk as a raw device can transfer many disk blocks during

> 
> a
> 


>>>disk

>>>

>>>operation, subject to the capabilities of the disk controller.

>>>
>>>Disk controllers are hardware devices that control the I/O actions of

> 
> one or
> 


>>>more disks. These controllers can also create a bottleneck in a system.

>>>
>>>(Corey, Abbey, Dechichio 1995). Controllers are the most frequent piece

> 
> of
> 


>>>hardware to have and cause problems on many systems. When a system has

>>>
>>>multiple disks controlled by one controller, the results can be fatal.

> 
> The
> 


>>>bottleneck on controllers is a common cause of write error.

>>>
>>>It is important to remember that Oracle and other products use these

> 
> device
>

>>>access methods to perform their work. It is also important to note the

> 
> added
> 


>>>complexity that the Oracle kernel adds to the I/O game.

>>>

>>>The Oracle Relational Database Management System (RDBMS) keeps its

>>>
>>>information, including data, in block format. However, the Oracle data

> 
> block
>

>>>can be, and in most cases is, composed of several operating system

> 
> blocks.
>

>>>An Oracle database block is the physical unit of storage in which all

> 
> Oracle
> 


>>>database data are stored in files. The Oracle database block size is

>>>

>>>determined by setting a parameter called db_block_size when the

>>>

>>>database is created. (Millsap, 1995).

>>>
>>>The most common UNIX block is 512 bytes but the Oracle block size can

> 
> range
>

>>>from 512 to 32K. The difference in block sizing between the operating

> 
> system
>

>>>and the Oracle kernel are beneficial for Oracle; boosting performance

> 
> gains
>

>>>while allowing UNIX to maintain small files with minimal wasted space.

> 
> The
> 


>>>Oracle block can be considered a superset of the UNIX file system block

>>>size.

>>>

>>>Each block of an Oracle data file is formatted with a fixed header that

>>>
>>>contains information about the particular block. This information

> 
> provides a
>

>>>means to ensure the integrity for each block and in turn, the entire

> 
> Oracle
>

>>>database. One component of the fixed header of a data block is called a

> 
> Data
> 


>>>Block Address (DBA). This DBA is a 48 bit integer that stores the file

>>>number

>>>
>>>of the Oracle database and the Oracle block number offset relative to

> 
> the
> 


>>>beginning of the file. (Presley, 1993).

>>>
>>>Whenever there is a problem with the DBA, Oracle will signal an Oracle

> 
> error
>

>>>ORA-00600[3339][arg1][arg2] and possibly ORA-01578: Data block corrupted

> 
> in
> 


>>>file # block #. These errors provide information that point to where the

>>>
>>>corruption exists and can provide several potential causes. (Presley,

> 
> 1993).
> 


>>>The ORA-00600[3339] has two arguments the are meaningful to the person

>>>
>>>evaluating the corruption. Argument 1 is the DBA that Oracle found in

> 
> the
> 


>>>data

>>>
>>>block just read from disk. Argument 2 is the DBA Oracle expected to find

> 
> in
>

>>>the data block it just read. If they are different, the ORA-00600[3339]

> 
> is
> 


>>>signaled.

>>>
>>>Oracle uses the standard C system function calls to read and write

> 
> blocks to
> 


>>>its database files. Once the block has been read it is mapped to shared

>>>
>>>memory by the operating system, After the block has been read into

> 
> shared
>

>>>memory, the Oracle kernel does verification checks on the block to

> 
> ensure
> 


>>>the

>>>
>>>integrity of the fixed header. The DBA check is the first verification

> 
> made
>

>>>on the fixed header. So why do DBAs become corrupt and how can we

> 
> identify
> 


>>>and correct them?

>>>

>>>Case One

>>>

>>>--------

>>>

>>>The first case of block corruption occurs when the first argument of the

>>>
>>>ORA-00600[3339] error has the value of zero while argument two contains

> 
> the
> 


>>>DBA

>>>

>>>which Oracle was trying to retrieve. Remember that argument 1 is the DBA

>>>just

>>>
>>>read from disk. Usually the first operating system block of an Oracle

> 
> block
>

>>>is zeroed out when there was a soft error on disk and the operating

> 
> system
> 


>>>attempted to repair its block. In addition, disk repair utility programs

>>>have

>>>

>>>caused this zeroing out effect.

>>>

>>>Programs that read from and write to the disk directly can destroy the

>>>
>>>consistency of file system data. The file system algorithms coordinate

> 
> disk
> 


>>>I/O operation to maintain a consistent view of disk data structures,

>>>including

>>>

>>>linked lists of free disk blocks and pointer from inodes to direct and

>>>
>>>indirect data blocks. Processes that access the disk directly bypass

> 
> these
> 


>>>if

>>>

>>>they run while other file system activity is going on. For this reason,

>>>these

>>>

>>>programs should not be run on an active file system. (Bach, 1990 328).

>>>
>>>On some early versions of the Oracle RDBMS, a software bug specific to

> 
> UNIX
> 


>>>platforms also caused the ORA-00600[3339]. This bug was part of the code

>>>that

>>>

>>>dealt with multiple database writers.

>>>
>>>A database writer is a background process that is responsible for

> 
> managing
> 


>>>the

>>>
>>>contents of the data block buffer cache and the dictionary cache. It

> 
> reads
> 


>>>the blocks from the datafiles and stores them in the Shared Global Area

>>>
>>>(SGA). The database writer also performs batch writes of changed blocks

> 
> back
>

>>>to the datafiles. (Loney, 1994 23). The SGA is a segment of memory

> 
> allocated
>

>>>to Oracle the contains data and control information particular to an

> 
> Oracle
> 


>>>database instance.

>>>

>>>Using multiple database writers causes multiple background processes to

>>>

>>>perform disk operations at the same time. However, if there are process

>>>
>>>conflicts, incorrect values could be stored and corruption can occur.

> 
> Also,
>

>>>using multiple database writers with asynchronous I/O has been know to

> 
> cause
> 


>>>similar results.

>>>
>>>Asynchronous I/O allows a process to execute a system call to start and

> 
> I/O
>

>>>operation and have the system call return immediately after the

> 
> operation is
>

>>>started or queued. Another system call is required to wait for the

> 
> operation
>

>>>to complete. The advantage of asynchronous I/O is that a process can

> 
> overlap
> 


>>>its execution with I/O, or it can overlap I/O between different devices.

>>>

>>>(Stevens, 1990 163).

>>>

>>>Case Two

>>>

>>>--------

>>>
>>>In the second case, both arguments returned with the ORA-00600[3339]

> 
> error
> 


>>>are

>>>

>>>large numbers. There are several causes that signal this error.

>>>
>>>The DBA in the physical block on disk is incorrect. This can happen if

> 
> the
> 


>>>block was corrupted in memory but was written to disk. This situation is

>>>
>>>quite rare and in most cases it is usually caused by memory faults that

> 
> go
>

>>>undetected. The DBA found in the block is usually garbage and not a

> 
> valid
> 


>>>DBA. Argument two that is returned with the error is always a valid DBA.

>>>

>>>If there is a possibility of memory problems on the system, the database

>>>

>>>administrator can enable further sanity block checking by placing the

>>>
>>>following event parameters in the database instance init.ora parameter

> 
> file:
> 


>>>event = "10210 trace name context forever, level 10"

>>>

>>>event = "10211 trace name context forever, level 10"

>>>

>>>_db_block_cache_protect= true

>>>
>>>These parameters force the Oracle RDBMS kernel to call functions that

> 
> check
>

>>>the block header and block trailer to ensure that the block is in the

> 
> proper
>

>>>format and has not been corrupted. The 10210 event parameter validates

> 
> data
> 


>>>blocks for tables while the 10211 validates data blocks for indexes. The

>>>

>>>_db_block_cache_protect=true protects the cache layer from becoming

>>>corrupted.

>>>

>>>This parameter will prevent certain corruption from getting to disk,

>>>although

>>>

>>>it may crash the foreground of the database instance. It will help catch

>>>

>>>stray writes in the cache. When a process tries to write past the buffer

>>>size

>>>

>>>in the SGA, it will fail first with a stack violation.

>>>
>>>If the database writer process detects a corrupted block in cache prior

> 
> to
>

>>>writing the block to disk, it will signal an ORA-00600[3398] and will

> 
> crash
> 


>>>the

>>>

>>>database instance. The block that is corrupted is never written to disk.

>>>
>>>Various arguments including the DBA are passed to the ORA-00600[3398]

> 
> and
> 


>>>after

>>>
>>>receiving such an error, simply attempt to restart the database

> 
> instance.
> 


>>>There is no doubt that this can be a costly workaround to avoid block

>>>
>>>corruptions. However, the workaround once a corruption has occurred can

> 
> be
> 


>>>even costlier.

>>>
>>>Blocks are sometimes written into the wrong places in the data file.

> 
> This is
>

>>>called "write blocks out of sequence." In this case, both DBAs returned

> 
> with
> 


>>>the ORA-00600[3339] are valid. This typically happens when the operating

>>>system

>>>

>>>I/O device driver fails to write the block in the proper location that

>>>Oracle

>>>

>>>requested via the lseek() system call.

>>>
>>>Some hardware and operating system vendors supports large files or

> 
> "large
> 


>>>file

>>>
>>>systems" that maintain files larger that 4.2 gigabytes. This is larger

> 
> than
> 


>>>what can be represented by a 32 bit unsigned integer. Therefore, the

>>>
>>>operating system must translate the offset transparent to the Oracle

> 
> kernel.
> 


>>>Oracle does not support files larger than 2 gigabytes even though the

>>>

>>>operating system might. On large file systems, the configuration is such

>>>that

>>>

>>>even smaller Oracle data files suffer corruptions caused by blocks being

>>>
>>>written out of sequence because the lseek() system call did not

> 
> translate
> 


>>>the

>>>

>>>correct location. (Velpuri, 1995).

>>>
>>>The lseek() system call is one of the most important calls related to

> 
> block
>

>>>corruption. The calculations that lseek() performs are often the cause

> 
> of
>

>>>block problems. To understand lseek() a brief discussion of byte

> 
> positioning
> 


>>>is necessary.

>>>
>>>Every open file has a "current byte position" associated with it. This

> 
> is
> 


>>>measured as the number of bytes from the start of the file. The create

>>>system

>>>
>>>call sets the file's position to the beginning of the file, as does the

> 
> open
>

>>>system call. The read and write system calls update the file's position

> 
> by
>

>>>the number of bytes read or written. Before a read or write, an open

> 
> file
> 


>>>can

>>>

>>>be positioned using lseek(). The format is:

>>>

>>>lseek(int fildes, long offset, int whence);

>>>
>>>The offset and whence arguments are interpreted as follows: If whence is

> 
> 0,
>

>>>the file's position is set to offset bytes from the beginning of the

> 
> file.
> 


>>>If

>>>

>>>whence is 1, the file's position is set to its current position plus the

>>>
>>>offset. If whence is 2, the file's position is set to the size of the

> 
> file
>

>>>plus the offset. The file's offset can be greater than the file's

> 
> current
>

>>>size, in which case the next write to the file will extend the file.

> 
> Lseek()
> 


>>>returns a long integer byte offset of the file. (Stevens, 1990 40).

>>>

>>>There is great opportunity for miscalculation of an offset based on the

>>>
>>>lseek() system call. Though lseek is not the only system call culprit in

> 
> the
> 


>>>block corruption problem, it is a major contributor.

>>>

>>>Case 3

>>>

>>>------

>>>
>>>A third cause for block corruption is the requested I/O not being

> 
> serviced
> 


>>>by

>>>

>>>the operating system. In this case, both arguments returned from the

>>>

>>>ORA-00600[3339] are valid but the DBA found in argument one is from the

>>>previous

>>>
>>>block read into shared memory prior to the current read request. The

> 
> calls
>

>>>that Oracle makes to lseek() and read() are checked for return error

> 
> codes.
>

>>>In addition, Oracle checks to see the number of bytes read in by the

> 
> read()
>

>>>system call to ensure that the block size or a multiple of the block

> 
> size
> 


>>>was

>>>
>>>read. Since these checks appeared to have been successful, Oracle

> 
> assumes
>

>>>that the direct read succeeded. Upon sanity checking, the DBA is

> 
> incorrect
>

>>>and the database operation request fails. Therefore, the I/O read

> 
> request
>

>>>really never took place. In this case, the DBA found can point to a

> 
> block of
> 


>>>a different file.

>>>

>>>Case 4

>>>

>>>------

>>>
>>>Another reason for block corruption is reading the wrong block from the

> 
> same
>

>>>device. Typically, this is caused by a very busy disk. In some cases,

> 
> the
>

>>>block read was off by 1 block but can range into several hundreds of

> 
> blocks.
>

>>>The DBAs returned with the ORA-00600[3339] are valid DBA's but are not

> 
> the
> 


>>>block

>>>
>>>requested. Since this occurs when the disk is very busy and under lots

> 
> of
>

>>>stress, try spreading datafiles across multiple disks and ensure that

> 
> the
> 


>>>disk

>>>

>>>drive can support the load.

>>>

>>>

>>>

>>>In the third and fourth situations, the database files will not be

>>>physically

>>>

>>>corrupted and the operation can be tried again with success. Most

>>>diagnostics

>>>
>>>testing will not reveal anything wrong with either the operating system

> 
> or
> 


>>>the

>>>

>>>hardware. However, the problem is due to operating system or hardware

>>>related

>>>

>>>problems. (Velpuri, 1995).

>>>
>>>So what causes the operating system calls to behave the way they do and

> 
> how
> 


>>>can companies try to minimize their risk? To evaluate these questions,

>>>

>>>another look into how UNIX works is required.

>>>

>>>UNIX vendors, in a attempt to speed performance, have implemented many

>>>
>>>features into the filesystem. The filesystem manages a large cache of

> 
> I/O
>

>>>buffers, called the buffer cache. This cache allows UNIX to optimize

> 
> read
> 


>>>and

>>>
>>>write operations. When a program writes data, the filesystem stores the

> 
> data
>

>>>in a buffer rather that writing it to disk immediately. At some later

> 
> point
>

>>>in time, the system will send this data to the disk driver, together

> 
> with
> 


>>>other data that has accumulated in the cache. In other words, the buffer

>>>
>>>cache lets the disk driver schedule disk operations in batches. It can

> 
> make
>

>>>larger transfers and use techniques such as seek optimization to make

> 
> disk
> 


>>>access more efficient. This is called write-behind.

>>>
>>>When a program reads data, the system first checks the buffer cache to

> 
> see
> 


>>>if

>>>

>>>the desired data is already there. If the data is already in the buffer

>>>
>>>cache, the filesystem does not need to access the disk for those blocks.

> 
> It
>

>>>just gives the user the data it found in its buffer, eliminating the

> 
> need to
> 


>>>wait for a disk drive. The filesystem only needs to read the disk if the

>>>data

>>>

>>>isn't already in the cache. To increase efficiency even further, the

>>>

>>>filesystem assumes the program will read the file consecutively and read

>>>
>>>several blocks from the disk at once. This increases the likelihood that

> 
> the
>

>>>data for future read operations will already be in the cache. (Loukides,

> 
> M.,
> 


>>>1990) This also increases the chance of block corruption.

>>>
>>>As a filesystem gets busy and buffers are being read, modified, written,

> 
> and
>

>>>aged out of the cache the chance of the kernel reading or writing the

> 
> wrong
>

>>>block increases. Also, the more complex the scheme to read from and

> 
> write to
> 


>>>disk, the greater the likelihood of function failure.

>>>

>>>The UNIX kernel uses the strategy interface to transmit data between the

>>>

>>>buffer cache and a device, although the read and write procedures of

>>>character

>>>
>>>devices sometime use their block counterpart strategy procedure to

> 
> transfer
>

>>>data directly between the device and the user address space. The

> 
> strategy
> 


>>>procedure may queue I/O jobs for a device on a work list or do more

>>>

>>>sophisticated processing to schedule I/O jobs. Drivers can set up data

>>>

>>>transmission for one physical address or many, as appropriate. The UNIX

>>>
>>>kernel passes a buffer header address to the driver strategy procedure.

> 
> The
>

>>>header contains a list of addresses and sizes for transmission of data

> 
> to or
> 


>>>from the device. This is also how the swapping operations work. For the

>>>
>>>buffer cache, the kernel transmits data from one address; when swapping,

> 
> the
>

>>>kernel transmits data from many data addresses. If data is being copied

> 
> to
> 


>>>or

>>>
>>>from the user's address space, the driver must lock the process in

> 
> memory
> 


>>>until the I/O transfer is complete.

>>>
>>>The kernel loses control over a buffer only when it waits for the

> 
> completion
>

>>>of I/O between the buffer and the disk. It is conceivable that a disk

> 
> drive
>

>>>is corrupt so that it cannot interrupt the CPU, preventing the kernel

> 
> from
>

>>>ever releasing the buffer. There are processes that monitor the hardware

> 
> for
>

>>>such cases and zero out the block and return an error to the kernel for

> 
> a
> 


>>>bad

>>>

>>>disk job. (Bach, 1990 52).

>>>
>>>On the UNIX level there are several utilities that will check for bad

> 
> disk
>

>>>blocks and zero out any blocks they find corrupted. These utilities do

> 
> not
>

>>>realize that the block in question may be an Oracle RDBMS block and zero

> 
> out
> 


>>>the block by mistake.

>>>
>>>In (Silberschatz, Galvin, 1994), the authors consider the possible

> 
> effect of
> 


>>>a

>>>
>>>computer crash. In this case, the table of opened files is generally

> 
> lost,
>

>>>and with it any changes in the directories of opened files. This event

> 
> can
>

>>>leave the file system in an inconsistent structure. Frequently, a

> 
> special
>

>>>program is run at reboot time to check for and correct disk

> 
> inconsistencies.
>

>>>The consistency checker compares the data in the directory structure

> 
> with
> 


>>>the

>>>

>>>data blocks on disk, and tries to fix and inconsistencies it finds.

>>>
>>>(Silberschatz, Galvin, 1994) This will often result in the reformatting

> 
> of
> 


>>>blocks which will cause the Oracle block information to be removed. This

>>>will

>>>

>>>definitely cause Oracle corruption.

>>>
>>>It is important to realize that monitoring of hardware is required for

> 
> all
> 


>>>operating systems. Hardware monitors can sense electrical signals on the

>>>
>>>busses and can accurately record them even at high speed. A hardware

> 
> monitor
>

>>>keeps observing the system even when it is malfunctioning, and thus, it

> 
> can
> 


>>>be

>>>
>>>used to debug the system. (Jain, 1991 99) These tools can help determine

> 
> the
> 


>>>cause of the problem and detect problems like controller error and media

>>>

>>>faulting which are frequent corruption contributors.

>>>
>>>In any case, there are many opportunities for blocks, either on disk or

> 
> in
> 


>>>the

>>>
>>>buffer cache, to become corrupt. Fixing the corruption can sometimes

> 
> provide
> 


>>>even greater opportunities.

>>>

>>>

>>>

>>>Conclusion

>>>

>>>----------

>>>

>>>Data block corruption is an ongoing problem on all operating systems,

>>>
>>>especially UNIX. There are many types and causes of corruptions to

> 
> consider.
>

>>>Advanced system configurations can increase the chance and hardware

> 
> problems
>

>>>are a common source of corruptions. When receiving block corruption

> 
> errors,
> 


>>>remember that a couple of them are not physical corruptions but memory

>>>

>>>corruptions that are never written to disk.

>>>
>>>Oracle Customer Support provides a number of bulletins on block

> 
> corruption
> 


>>>problems that help recover what is left of the data once corruption has

>>>
>>>occurred. If block corruption occurs on a machine, be sure to identify

> 
> the
> 


>>>type of corruption and establish a plan for its correction.

>>>

>>>"Pascal Byrne" <byrne_at_icada.com> wrote in message

>>>news:b0mb02$qmmgv$1_at_ID-147008.news.dfncis.de...

>>>

>>>

>>>>Hi,

>>>>One of our customers got this error on his production database

>>>>which is Oracle 8.0.5.1 on SuSE Linux 7.1 (kerne2.4.16-4GB)

>>>>using non-raid removable SCSI disks. Database updates are

>>>>normally by JDBC using the Oracle thin driver and Sun JDK 1.1.8

>>>>

>>>>The error message was:

>>>>ORA-00604: error occurred at recursive SQL level 1

>>>>ORA-01578: ORACLE data block corrupted (file # 2, block # 8596)

>>>>ORA-01110: data file 2: '/ora00/oradata/RSM/rbs01.dbf'

>>>>

>>>>He was lucky the problem happened in a rollback segment tablespace

>>>>rather than one with data since the database is *not* running in

>>>>archivelog mode (don't ask) and they don't make cold datafile

>>>>backups (don't ask)!

>>>>

>>>>I checked the disk for bad blocks with 'badblocks' and it came out

>>>>clean. The machine is in a climate controlled environment so (FAB)

>>>>issues like humidity and temperature don't really come into play.

>>>>

>>>>My boss wants to know the reason why this happend and how *I* can

>>>>prevent it from happening again. Any help would be gratfully accepted.

>>>>

>>>>Thanks,

>>>>Pascal

>>>>

>>>

>>>
>>>

>
> Received on Fri Jan 24 2003 - 10:02:21 CST