7.0.16, redo logs, multiple db_writers, HP9000
Date: Sat, 21 Jan 1995 20:47:37 +0000
Message-ID: <790721256snz_at_jlcomp.demon.co.uk>
The following note describes a recent problem with Oracle whereby the LGWR process stopped abruptly and silently, bringing down the entire database.
Hardware: HP9000/800 series
O/S: HP-UX 9.04 D
Oracle: 7.0.14
Possibly relevant Oracle configuration details:
ARCHIVELOGMODE
db_writers=4
redo logs shadowed in software (2 redo logs per group)
The site DBA needed to tidy up the redo logs which had been rather disrupted by an emergency a few weeks previously; to do this, he added 6 redo log groups of 2 files each, sized at 5M. He used the standard syntax, creating both files in a group simultaneously.
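For illustration, the standard syntax for creating both members of a group in a single statement looks roughly like the following. The group number and file paths are hypothetical, not the ones used on site.

```sql
-- Hypothetical group number and paths; both members of the
-- group are created together in one statement.
ALTER DATABASE ADD LOGFILE GROUP 7
    ('/u01/oradata/prod/redo07a.log',
     '/u02/oradata/prod/redo07b.log') SIZE 5M;
```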
He then used 'alter system switch logfile;' to cycle into the new log files, and 'alter database drop logfile' to get rid of the old log files.
The next time a logfile switch occurred, LGWR stopped without dumping a trace file and without any notification appearing in the alert file. PMON, SMON, DBWR, and ARCH all gave up the ghost thereafter, reporting the usual 447/470/471 errors in their trace files.
He was able to restart the database, automatic recovery cut in properly, and everything seemed okay. However, he then issued a 'switch logfile', and the system crashed again.
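As a sketch of that switch-and-drop sequence (the group number is hypothetical; Oracle will not drop the group that is currently in use, so it is worth checking V$LOG first):

```sql
-- See which group is current and whether each group has been archived.
SELECT group#, sequence#, members, archived, status
FROM   v$log;

-- Force LGWR to move on to the next redo log group.
ALTER SYSTEM SWITCH LOGFILE;

-- Drop a group that is no longer current (hypothetical group number).
ALTER DATABASE DROP LOGFILE GROUP 1;
```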
In the course of trying to isolate the problem, we crashed our way through all six of the redo logs he had created. By this time, I had switched the database down to 1 db_writer and added 5 more redo log groups of 1 file each. Switching INTO the first of these files still crashed the database, BUT switching out of it into the second one left the database alive.
I dropped the original 6, and the first single file, and cycled the database through the remaining 4 files a couple of times (lots of alter system switch logfile); and the database stayed up.
First, tentative, conclusion:
When redo logs are created in pairs, with archiving on and multiple db_writers in use, something is wrong with the resulting redo logs that causes LGWR to fail as it LEAVES the log file.
You should note that:
a) we had been running with paired redo logs in the past, and
b) one dba had previously created redo logs in pairs but with ONLY ONE db_writer at the time, and
c) the other dba had previously created redo log pairs with multiple db_writers, but his method was to create a group with one file, then add a file to the group.
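For comparison, the two-step method described in (c) would look something like this, again with a hypothetical group number and paths:

```sql
-- Step 1: create the group with a single member.
ALTER DATABASE ADD LOGFILE GROUP 7
    ('/u01/oradata/prod/redo07a.log') SIZE 5M;

-- Step 2: add the second member to the existing group.
ALTER DATABASE ADD LOGFILE MEMBER
    '/u02/oradata/prod/redo07b.log' TO GROUP 7;
```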
It all sounds unlikely to me but it's the best I can do at the moment. The only suggestion from Oracle (UK) at the moment is to increase the O/S parameter maxfiles (which we have set rather low); but I can't manage to find enough open files under one process for this to have been the problem.
Has anyone come across anything similar?
-- Jonathan Lewis
Received on Sat Jan 21 1995 - 21:47:37 CET