7.0.16, redo logs, multiple db_writers, HP9000

From: Jonathan Lewis <Jonathan_at_jlcomp.demon.co.uk>
Date: Sat, 21 Jan 1995 20:47:37 +0000
Message-ID: <790721256snz_at_jlcomp.demon.co.uk>


The following note describes a recent problem with Oracle whereby the LGWR process stopped abruptly and silently, bringing down the entire database.

Hardware:	HP9000/800 series
O/S:		HP-UX 9.04 D
Oracle:		7.0.14

Possibly relevant Oracle configuration details:

    ARCHIVELOGMODE
    db_writers=4
    redo logs shadowed in software (2 redo logs per group)

The site DBA needed to tidy up the redo logs which had been rather disrupted by an emergency a few weeks previously; to do this, he added 6 redo log groups of 2 files each, sized at 5M. He used the standard syntax, creating both files in a group simultaneously.
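
For reference, creating both files of a group in one statement looks roughly like the sketch below. The group number and file names are invented for illustration and are not the ones used on the site.

```sql
-- Hypothetical group number and file names, for illustration only.
-- Both members are named in a single statement, so the group exists
-- as a pair from the moment it is created.
ALTER DATABASE ADD LOGFILE GROUP 7
    ('/u01/oradata/redo07a.log',
     '/u02/oradata/redo07b.log') SIZE 5M;
```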

He then used 'alter system switch logfile;' to cycle into the new log files, and 'alter database drop logfile' to get rid of the old log files.
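
A minimal sketch of that switch-and-drop sequence follows. The group number is hypothetical, and the v$log query is my addition to show where a status check would fit; I am not claiming the DBA ran it.

```sql
-- See which group is CURRENT and whether the old groups are archived.
SELECT group#, members, archived, status FROM v$log;

-- Force LGWR to move on to the next group.
ALTER SYSTEM SWITCH LOGFILE;

-- Drop an old group (group number hypothetical). The group must not be
-- the current one and, in ARCHIVELOG mode, must already be archived.
ALTER DATABASE DROP LOGFILE GROUP 1;
```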

The next time a logfile switch occurred, LGWR stopped, without dumping a trace file, and without any notification appearing in the alert file. PMON, SMON, DBWR, and ARCH all gave up the ghost thereafter, reporting the usual 447/470/471 errors in their trace files.

He was able to restart the database, automatic recovery cut in properly, and everything seemed okay: however, he then issued a 'switch logfile'; and the system crashed again.

In the course of trying to isolate the problem, we crashed our way through all six of the redo logs he had created: by this time, I had switched the database down to 1 db_writer, and added 5 more redo log groups of 1 file each: switching INTO the first of these files still crashed the database, BUT switching out of it into the second one left the database alive.

I dropped the original 6, and the first single file, and cycled the database through the remaining 4 files a couple of times (lots of alter system switch logfile); and the database stayed up.

First, tentative, conclusion:

    When creating pairs of redo logs, when archiving is on,
    when using multiple db_writers, then something is wrong
    with the redo logs that causes LGWR to fail as it LEAVES
    the log file.

You should note that:
a) we had been running with paired redo logs in the past and
b) one dba had previously created redo logs in pairs but with
    ONLY ONE db_writer at the time, and
c) the other dba had previously created redo log pairs with
    multiple db_writers, but his method was to create a group
    with one file, then add a file to the group.
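
The two-step method in (c) would look something like this. Again, the group number and file names are made up for illustration.

```sql
-- Step 1: create the group with a single member
-- (hypothetical group number and file names).
ALTER DATABASE ADD LOGFILE GROUP 8
    ('/u01/oradata/redo08a.log') SIZE 5M;

-- Step 2: add the second member to the existing group.
-- No SIZE clause: the new member takes its size from the group.
ALTER DATABASE ADD LOGFILE MEMBER '/u02/oradata/redo08b.log' TO GROUP 8;
```

The end result is the same pair of files; only the way the group is built differs.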

It all sounds unlikely to me but it's the best I can do at the moment. The only suggestion from Oracle (UK) at the moment is to increase the O/S parameter maxfiles (which we have set rather low); but I can't manage to find enough open files under one process for this to have been the problem.

Has anyone come across anything similar ?

-- 
Jonathan Lewis
Received on Sat Jan 21 1995 - 21:47:37 CET
