7.2.3 Database Hang on AIX 4.2.1

From: Calvin Wang <calvinw_at_relsol.com>
Date: Wed, 18 Nov 1998 00:42:51 -0500
Message-ID: <72tm85$j4e$1@nr1.toronto.istar.net>

An interesting problem...

I have an Oracle 7.2.3 database that hangs about twice a week. When the database hangs, the alert log does not contain any error messages and no trace files are generated.

Hanging symptom: The database first slows to a crawl, and then it freezes up. Any attempt to establish a session leaves the connection waiting forever. I can connect internal through svrmgr and sqldba. However, queries on some v$ tables (such as v$log) also hang and never return a response. There are no lock waits or deadlocks on the system. The OS is operational, and vmstat and iostat both show an idle machine. No Oracle processes are accumulating any CPU time, with the exception of dbwr. At the time of the hangs, the database is being used fairly heavily, with 200+ OLTP users along with a sprinkling of OCI batch applications running concurrently.

Background info: The database originally ran on an AIX 4.2.0 server under the same Oracle version. It was migrated to a beefier server running AIX 4.2.1. The old server used hardware RAID 0+1 for data files and RAID 0 for index files. The new server uses SSA RAID 5 for data files and LV mirroring for index files. The database ran without problems on the new server for 2 weeks, then started hanging once a week, and now hangs twice a week. The number of transactions has not gone up.

Solutions tried so far: After analyzing the oradbx system dump file, Oracle indicated that dbwr is not clearing out the dirty blocks. As a result, I have increased the redo logs to 20m (from 5m), increased dbwr to 8 (from 4), set _db_block_write_batch=16 (from the default), increased db_block_buffers to the equivalent of 70 MB (from 60), and set log_buffer = 1m (from 512k). Oracle / IBM have also applied patches to the AIX kernel (4.2.1.15) and the pw-syscall kernel extension. So far, none of these have worked. IBM has analyzed trace files and found nothing out of the ordinary on the AIX side.

Solutions being tried: Created a script to issue alter system checkpoint every 30 minutes.

Stuff I will try: Since the problem is dirty blocks not being cleared out by dbwr, I will reduce log_checkpoint_interval to the equivalent of 10m (from a huge size) and set log_checkpoints_to_alert to true so the alert log shows whether checkpoints are being triggered. At present, checkpoint_process is set to true.
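
A note on the "equivalent of 10m" arithmetic: log_checkpoint_interval is given in redo log (OS) blocks rather than bytes, so the value depends on the platform's block size. The sketch below assumes 512-byte OS blocks, which is an assumption to verify for this platform; the helper name size_to_blocks is made up for illustration.

```python
def size_to_blocks(size_bytes: int, block_bytes: int = 512) -> int:
    """Convert a byte count to a whole number of blocks, rounding down."""
    if block_bytes <= 0:
        raise ValueError("block_bytes must be positive")
    return size_bytes // block_bytes


TEN_MB = 10 * 1024 * 1024

# Assumption: redo log (OS) blocks are 512 bytes on this platform.
interval_blocks = size_to_blocks(TEN_MB, block_bytes=512)

print(f"log_checkpoint_interval = {interval_blocks}")
```

If the block size turns out to be different, only the block_bytes argument needs to change.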

Stuff that I cannot try (easily): Oracle suggests upgrading to 7.3.

Questions:
When I query v$sysstat, dirty buffers inspected shows a fairly large number (200000+). Some of the documents I came across suggest this statistic should be around 0, and that a higher value indicates dbwr is not clearing out dirty buffers fast enough. Can anyone elaborate on this or offer any suggestions for this problem?
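
Since v$sysstat values are cumulative from instance startup, a raw figure like 200000+ may be easier to judge as a rate between two snapshots. A minimal sketch of that calculation follows; the snapshot numbers are hypothetical placeholders (not measurements from this system), and stat_rates is an illustrative helper, not part of any Oracle tool.

```python
def stat_rates(before: dict, after: dict, elapsed_seconds: float) -> dict:
    """Per-second change for each statistic present in both snapshots."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return {
        name: (after[name] - before[name]) / elapsed_seconds
        for name in before
        if name in after
    }


# Hypothetical values copied by hand from two v$sysstat queries 10 minutes apart.
snap_1 = {"dirty buffers inspected": 200_000, "free buffer requested": 5_000_000}
snap_2 = {"dirty buffers inspected": 200_600, "free buffer requested": 5_060_000}

rates = stat_rates(snap_1, snap_2, elapsed_seconds=600)
for name, rate in rates.items():
    print(f"{name}: {rate:.1f}/s")
```

Comparing the rate during normal load with the rate shortly before a hang would show whether the statistic is actually climbing when dbwr falls behind.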

Thanks in advance,
Calvin Wang
calvin.wang_at_dhltd.com

Received on Tue Nov 17 1998 - 23:42:51 CST