Re: ORA-01578: Datablock corruption
I had heard of the problems some people had with async I/O on AIX, but Linux
doesn't really have async I/O in the 2.4.x kernel; it only simulates it. I
don't use multiple dbwriters either, so that rules that out.
Was there ever an 8.0.6 release for Linux? I've checked oracle-ftp.oracle.com without luck. If you have the patch, I would be extremely grateful if you could let me know, since we are not in a position to 'force' our customers to do anything :)
Thanks,
Pascal
koert54 wrote:
> I would also recommend upgrading your DB to at least 8.0.6.
> Releases 8.0.4 & 8.0.5 were quite crappy - I've had a lot of corrupted
> blocks especially on 8.0.4 / AIX due to multiple dbwriters & async I/O.
> Furthermore these releases are not supported anymore by Oracle - a valid
> reason to 'force' your customer into upgrading ...
>
> "Pascal Byrne" <byrne_at_icada.com> wrote in message
> news:b0qtel$s3l0l$1_at_ID-147008.news.dfncis.de...
>
> > useful. >
> > related >
> > the >>>>
>>>discussion, Unix is taken as the operating system of reference.
>>>
>>>SCOPE & APPLICATION
>>>
>>>For users requiring further understanding as to how a block could become
>>>
>>>corrupted.
>>>
>>>
> > and >
> > to >
> > literary >>>>
>>>technical sources, this document will discuss several ways that block
>>>
>>>corruptions can occur, provide conclusions and possible solutions.
> > necessary >>>>
>>>to
> > to >
> > both >
> > data >>>>
>>>consistency.
> > been >
> > Most >
> > calls >>>>
>>>that
> > to >
> > writes >>>>
>>>data on disk.
>>>
>>>The UNIX system contains two types of devices, block devices and raw or
> > to >>>>
>>>the rest of the system while character devices include all other devices
>>>such
>>>
>>>as terminals and network media. (Bach, 1990 314). These device types are
>>>
>>>important to understand because different combinations can increase
>>>corruptions.
> > configuration >
> > the >>>>
>>>kernel. This kernel to device driver interface is described by the block
> > type >
> > driver >
> > device >
> > type. >>>>
>>>The
>>>
>>>mount and umount system calls also invoke the device open and close
>>>procedures
> > files >>>>
>>>pass
>>>
>>>through the respective procedures in the character device switch tables.
>>>Read
> > systems >
> > strategy >
> > in >
> > most >>>>
>>>vulnerable.
> > the >>>>
>>>buffer cache. When accessing the block device interface, the UNIX kernel
>>>
>>>follows the same algorithm as for regular files, except that after
>>>converting
> > logical >
> > accesses >>>>
>>>the data via the buffer cache and, ultimately, the driver strategy
>>>interface.
> > not >
> > to >>>>
>>>the
>>>
>>>driver. The driver's read or write routine converts the byte offset to a
>>>
>>>block offset and copies the data directly to the user address space,
>>>bypassing
>>>
>>>kernel buffers.
> > reads a >>>>
>>>raw device at the same address, the second process may not read the data
>>>that
> > buffer >>>>
>>>cache and not on disk. However, if the second process had read the block
>>>
>>>device, it would automatically pick up the new data, as it exists in the
>>>
>>>buffer cache. (Bach, 1990 328).
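
The stale-read scenario described above can be sketched with a toy model. This is only an illustration: the `Disk` and `BufferCache` classes are hypothetical stand-ins, not kernel code, and they assume a simple write-behind cache in which a block-interface write is held in memory until flushed.

```python
class Disk:
    """Toy disk: a mapping of block number -> bytes (hypothetical model)."""

    def __init__(self):
        self.blocks = {}

    def read(self, n):
        return self.blocks.get(n, b"")

    def write(self, n, data):
        self.blocks[n] = data


class BufferCache:
    """Toy write-behind cache sitting in front of a Disk."""

    def __init__(self, disk):
        self.disk = disk
        self.cached = {}    # block number -> data held in memory
        self.dirty = set()  # blocks modified but not yet written to disk

    def write(self, n, data):
        # Block-interface write: only the cache is updated (delayed write).
        self.cached[n] = data
        self.dirty.add(n)

    def read(self, n):
        # Block-interface read: served from the cache when possible.
        if n not in self.cached:
            self.cached[n] = self.disk.read(n)
        return self.cached[n]

    def flush(self):
        # Write every dirty block back to the disk.
        for n in sorted(self.dirty):
            self.disk.write(n, self.cached[n])
        self.dirty.clear()


disk = Disk()
disk.write(7, b"old")
cache = BufferCache(disk)

cache.write(7, b"new")      # process 1 writes through the block interface
raw_view = disk.read(7)     # process 2 reads the raw device, bypassing the cache
block_view = cache.read(7)  # process 2 reads the block device instead
cache.flush()
after_flush = disk.read(7)  # the disk only catches up after the flush

print(raw_view, block_view, after_flush)
```

Until the flush happens, the raw reader sees the old contents while the block reader sees the new ones, which is the inconsistency the passage warns about when the two interfaces are mixed on the same device.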
> > process >
> > results >>>>
>>>are
> > drive, >>>>
>>>each byte may appear in different tape blocks. (Bach 1990)
>>>
>>>The advantage of using the raw interface is speed, assuming there is no
>>>
>>>advantage to caching data for later access. Processes accessing block
>>>devices
>>>
>>>transfer blocks of data whose size is constrained by the file system
>>>logical
> > copy of >
> > the >>>>
>>>raw interface. For example, if a file system has a logical block size of 1K
>>>
>>>bytes, at most 1K bytes are transferred per I/O operation. However,
>>>processes
> > a >>>>
>>>disk
>>>
>>>operation, subject to the capabilities of the disk controller.
> > one or >>>>
>>>more disks. These controllers can also create a bottleneck in a system.
> > of >>>>
>>>hardware to have and cause problems on many systems. When a system has
> > The >>>>
>>>bottleneck on controllers is a common cause of write error.
> > device >
> > added >>>>
>>>complexity that the Oracle kernel adds to the I/O game.
>>>
>>>The Oracle Relational Database Management System (RDBMS) keeps its
> > block >
> > blocks. >
> > Oracle >>>>
>>>database data are stored in files. The Oracle database block size is
>>>
>>>determined by setting a parameter called db_block_size when the
>>>
>>>database is created. (Millsap, 1995).
> > range >
> > system >
> > gains >
> > The >>>>
>>>Oracle block can be considered a superset of the UNIX file system block
>>>size.
>>>
>>>Each block of an Oracle data file is formatted with a fixed header that
> > provides a >
> > Oracle >
> > Data >>>>
>>>Block Address (DBA). This DBA is a 48 bit integer that stores the file
>>>number
> > the >>>>
>>>beginning of the file. (Presley, 1993).
> > error >
> > in >>>>
>>>file # block #. These errors provide information that points to where the
> > 1993). >>>>
>>>The ORA-00600[3339] has two arguments that are meaningful to the person
> > the >>>>
>>>data
> > in >
> > is >>>>
>>>signaled.
> > blocks to >>>>
>>>its database files. Once the block has been read it is mapped to shared
> > shared >
> > ensure >>>>
>>>the
> > made >
> > identify >>>>
>>>and correct them?
>>>
>>>Case One
>>>
>>>--------
>>>
>>>The first case of block corruption occurs when the first argument of the
> > the >>>>
>>>DBA
>>>
>>>which Oracle was trying to retrieve. Remember that argument 1 is the DBA
>>>just
> > block >
> > system >>>>
>>>attempted to repair its block. In addition, disk repair utility programs
>>>have
>>>
>>>caused this zeroing out effect.
>>>
>>>Programs that read from and write to the disk directly can destroy the
> > disk >>>>
>>>I/O operation to maintain a consistent view of disk data structures,
>>>including
>>>
>>>linked lists of free disk blocks and pointers from inodes to direct and
> > these >>>>
>>>if
>>>
>>>they run while other file system activity is going on. For this reason,
>>>these
>>>
>>>programs should not be run on an active file system. (Bach, 1990 328).
> > UNIX >>>>
>>>platforms also caused the ORA-00600[3339]. This bug was part of the code
>>>that
>>>
>>>dealt with multiple database writers.
> > managing >>>>
>>>the
> > reads >>>>
>>>the blocks from the datafiles and stores them in the Shared Global Area
> > back >
> > allocated >
> > Oracle >>>>
>>>database instance.
>>>
>>>Using multiple database writers causes multiple background processes to
>>>
>>>perform disk operations at the same time. However, if there are process
> > Also, >
> > cause >>>>
>>>similar results.
> > I/O >
> > operation is >
> > operation >
> > overlap >>>>
>>>its execution with I/O, or it can overlap I/O between different devices.
>>>
>>>(Stevens, 1990 163).
>>>
>>>Case Two
>>>
>>>--------
> > error >>>>
>>>are
>>>
>>>large numbers. There are several causes that signal this error.
> > the >>>>
>>>block was corrupted in memory but was written to disk. This situation is
> > go >
> > valid >>>>
>>>DBA. Argument two that is returned with the error is always a valid DBA.
>>>
>>>If there is a possibility of memory problems on the system, the database
>>>
>>>administrator can enable further sanity block checking by placing the
> > file: >>>>
>>>event = "10210 trace name context forever, level 10"
>>>
>>>event = "10211 trace name context forever, level 10"
>>>
>>>_db_block_cache_protect= true
> > check >
> > proper >
> > data >>>>
>>>blocks for tables while the 10211 validates data blocks for indexes. The
>>>
>>>_db_block_cache_protect=true protects the cache layer from becoming
>>>corrupted.
>>>
>>>This parameter will prevent certain corruption from getting to disk,
>>>although
>>>
>>>it may crash the foreground of the database instance. It will help catch
>>>
>>>stray writes in the cache. When a process tries to write past the buffer
>>>size
>>>
>>>in the SGA, it will fail first with a stack violation.
> > to >
> > crash >>>>
>>>the
>>>
>>>database instance. The block that is corrupted is never written to disk.
> > and >>>>
>>>after
> > instance. >>>>
>>>There is no doubt that this can be a costly workaround to avoid block
> > be >>>>
>>>even costlier.
> > This is >
> > with >>>>
>>>the ORA-00600[3339] are valid. This typically happens when the operating
>>>system
>>>
>>>I/O device driver fails to write the block in the proper location that
>>>Oracle
>>>
>>>requested via the lseek() system call.
> > "large >>>>
>>>file
> > than >>>>
>>>what can be represented by a 32 bit unsigned integer. Therefore, the
> > kernel. >>>>
>>>Oracle does not support files larger than 2 gigabytes even though the
>>>
>>>operating system might. On large file systems, the configuration is such
>>>that
>>>
>>>even smaller Oracle data files suffer corruptions caused by blocks being
> > translate >>>>
>>>the
>>>
>>>correct location. (Velpuri, 1995).
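
The 2-gigabyte limit mentioned above follows from simple offset arithmetic. The sketch below assumes the file offset is held in a signed 32-bit integer and uses a hypothetical block size of 8192 bytes; neither figure is taken from the note itself.

```python
BLOCK_SIZE = 8192  # hypothetical db_block_size, for illustration only


def to_int32(value):
    """Truncate an integer to signed 32 bits, as a 32-bit C long would."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value & 0x80000000 else value


def byte_offset(block_no, block_size=BLOCK_SIZE):
    """Exact byte offset of a block, counted from the start of the file."""
    return block_no * block_size


# Highest block number whose offset still fits in a signed 32-bit integer.
last_safe_block = (2**31 - 1) // BLOCK_SIZE

for block_no in (last_safe_block, last_safe_block + 1, last_safe_block + 2):
    exact = byte_offset(block_no)
    print(block_no, exact, to_int32(exact))
```

With these assumed values, the first block past the 2 GB boundary wraps to a negative offset, so a driver or library that truncates the offset this way would seek to the wrong place rather than fail cleanly.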
> > block >
> > of >
> > positioning >>>>
>>>is necessary.
> > is >>>>
>>>measured as the number of bytes from the start of the file. The create
>>>system
> > open >
> > by >
> > file >>>>
>>>can
>>>
>>>be positioned using lseek(). The format is:
>>>
>>>long lseek(int fildes, long offset, int whence);
> > 0, >
> > file. >>>>
>>>If
>>>
>>>whence is 1, the file's position is set to its current position plus the
> > file >
> > current >
> > Lseek() >>>>
>>>returns a long integer byte offset of the file. (Stevens, 1990 40).
>>>
>>>There is great opportunity for miscalculation of an offset based on the
> > the >>>>
>>>block corruption problem, it is a major contributor.
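
As a minimal sketch of the positioning described above, the following uses Python's `os.lseek`, which wraps the same system call. The block size, the scratch file, and the `read_block` helper are hypothetical and chosen for illustration; this is not how Oracle itself performs its I/O.

```python
import os
import tempfile

BLOCK_SIZE = 2048  # hypothetical db_block_size, for illustration only


def read_block(fd, block_no, block_size=BLOCK_SIZE):
    """Seek to the start of a block and read exactly one block."""
    offset = block_no * block_size
    pos = os.lseek(fd, offset, os.SEEK_SET)  # whence 0: from start of file
    if pos != offset:
        raise OSError(f"lseek landed at {pos}, expected {offset}")
    data = os.read(fd, block_size)
    if len(data) != block_size:
        # A short read must not be treated as a full block.
        raise OSError(f"short read: {len(data)} of {block_size} bytes")
    return data


# Build a scratch file of four blocks, each filled with its own block number.
with tempfile.TemporaryFile() as f:
    for n in range(4):
        f.write(bytes([n]) * BLOCK_SIZE)
    f.flush()
    fd = f.fileno()

    block2 = read_block(fd, 2)
    here = os.lseek(fd, 0, os.SEEK_CUR)                # whence 1: current position
    back_one = os.lseek(fd, -BLOCK_SIZE, os.SEEK_CUR)  # one block backwards
    end = os.lseek(fd, 0, os.SEEK_END)                 # whence 2: end of file

print(here, back_one, end)
```

Checking both the position returned by the seek and the number of bytes returned by the read is cheap, and it catches the two failure modes the note describes: a miscalculated offset and a read that returns less than a full block.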
>>>
>>>Case Three
>>>
>>>----------
> > serviced >>>>
>>>by
>>>
>>>the operating system. In this case, both arguments returned from the
>>>
>>>ORA-00600[3339] are valid but the DBA found in argument one is from the
>>>previous
> > calls >
> > codes. >
> > read() >
> > size >>>>
>>>was
> > assumes >
> > incorrect >
> > request >
> > block of >>>>
>>>a different file.
>>>
>>>Case Four
>>>
>>>---------
> > same >
> > the >
> > blocks. >
> > the >>>>
>>>block
> > of >
> > the >>>>
>>>disk
>>>
>>>drive can support the load.
>>>
>>>
>>>
>>>In the third and fourth situations, the database files will not be
>>>physically
>>>
>>>corrupted and the operation can be tried again with success. Most
>>>diagnostics
> > or >>>>
>>>the
>>>
>>>hardware. However, the problem is due to operating system or hardware
>>>related
>>>
>>>problems. (Velpuri, 1995).
> > how >>>>
>>>can companies try to minimize their risk? To evaluate these questions,
>>>
>>>another look into how UNIX works is required.
>>>
>>>UNIX vendors, in an attempt to speed performance, have implemented many
> > I/O >
> > read >>>>
>>>and
> > data >
> > point >
> > with >>>>
>>>other data that has accumulated in the cache. In other words, the buffer
> > make >
> > disk >>>>
>>>access more efficient. This is called write-behind.
> > see >>>>
>>>if
>>>
>>>the desired data is already there. If the data is already in the buffer
> > It >
> > need to >>>>
>>>wait for a disk drive. The filesystem only needs to read the disk if the
>>>data
>>>
>>>isn't already in the cache. To increase efficiency even further, the
>>>
>>>filesystem assumes the program will read the file consecutively and read
> > the >
> > M., >>>>
>>>1990) This also increases the chance of block corruption.
> > and >
> > wrong >
> > write to >>>>
>>>disk, the greater the likelihood of function failure.
>>>
>>>The UNIX kernel uses the strategy interface to transmit data between the
>>>
>>>buffer cache and a device, although the read and write procedures of
>>>character
> > transfer >
> > strategy >>>>
>>>procedure may queue I/O jobs for a device on a work list or do more
>>>
>>>sophisticated processing to schedule I/O jobs. Drivers can set up data
>>>
>>>transmission for one physical address or many, as appropriate. The UNIX
> > The >
> > to or >>>>
>>>from the device. This is also how the swapping operations work. For the
> > the >
> > to >>>>
>>>or
> > memory >>>>
>>>until the I/O transfer is complete.
> > completion >
> > drive >
> > from >
> > for >
> > a >>>>
>>>bad
>>>
>>>disk job. (Bach, 1990 52).
> > disk >
> > not >
> > out >>>>
>>>the block by mistake.
> > effect of >>>>
>>>a
> > lost, >
> > can >
> > special >
> > inconsistencies. >
> > with >>>>
>>>the
>>>
>>>data blocks on disk, and tries to fix any inconsistencies it finds.
> > of >>>>
>>>blocks which will cause the Oracle block information to be removed. This
>>>will
>>>
>>>definitely cause Oracle corruption.
> > all >>>>
>>>operating systems. Hardware monitors can sense electrical signals on the
> > monitor >
> > can >>>>
>>>be
> > the >>>>
>>>cause of the problem and detect problems like controller error and media
>>>
>>>faulting which are frequent corruption contributors.
> > in >>>>
>>>the
> > provide >>>>
>>>even greater opportunities.
>>>
>>>
>>>
>>>Conclusion
>>>
>>>----------
>>>
>>>Data block corruption is an ongoing problem on all operating systems,
> > consider. >
> > problems >
> > errors, >>>>
>>>remember that a couple of them are not physical corruptions but memory
>>>
>>>corruptions that are never written to disk.
> > corruption >>>>
>>>problems that help recover what is left of the data once corruption has
> > the >>>>
>>>type of corruption and establish a plan for its correction.
>>>
>>>"Pascal Byrne" <byrne_at_icada.com> wrote in message
>>>news:b0mb02$qmmgv$1_at_ID-147008.news.dfncis.de...
>>>
>>>
>>>>Hi,
>>>>One of our customers got this error on his production database
>>>>which is Oracle 8.0.5.1 on SuSE Linux 7.1 (kernel 2.4.16-4GB)
>>>>using non-RAID removable SCSI disks. Database updates are
>>>>normally by JDBC using the Oracle thin driver and Sun JDK 1.1.8
>>>>
>>>>The error message was:
>>>>ORA-00604: error occurred at recursive SQL level 1
>>>>ORA-01578: ORACLE data block corrupted (file # 2, block # 8596)
>>>>ORA-01110: data file 2: '/ora00/oradata/RSM/rbs01.dbf'
>>>>
>>>>He was lucky the problem happened in a rollback segment tablespace
>>>>rather than one with data since the database is *not* running in
>>>>archivelog mode (don't ask) and they don't make cold datafile
>>>>backups (don't ask)!
>>>>
>>>>I checked the disk for bad blocks with 'badblocks' and it came out
>>>>clean. The machine is in a climate controlled environment so (FAB)
>>>>issues like humidity and temperature don't really come into play.
>>>>
>>>>My boss wants to know the reason why this happened and how *I* can
>>>>prevent it from happening again. Any help would be gratefully accepted.
>>>>
>>>>Thanks,
>>>>Pascal
>>>>
>>>