Unexplained Oracle deaths

From: Paul Mailman <pamailman_at_qed.com>
Date: 1996/01/29
Message-ID: <11d7cc$f1426.3ca_at_www.qed.com>#1/1


Minor repost:

On a customer's production system, one of our Oracle client processes has been dying intermittently, without explanation and without leaving a trace. Various Oracle processes have also been dying. These leave evidence in the Oracle alert file of an Oracle internal error (ORA-00600), plus trace files containing a code that Oracle support tells me indicates that communication between the user process and the Oracle shadow process has gotten out of sync. No clue as to why.

Oracle processes die even when ours doesn't, so I believe something is killing the Oracle processes and that this in turn kills our client process, rather than vice versa.

Oracle has rejected the bug report as "Unable to replicate".

The system is running Oracle 7.1.6 on a multiprocessor DEC Alpha under OSF/1. Our process uses OCI to communicate with the server. Our process has limited multi-threading, but the code that makes OCI calls is mutex-protected because the 7.1 OCI library is not thread-safe.

Our process and the server are on the same machine. There are additional clients (Forms 4 apps) running on Windows NT platforms elsewhere. We've had no complaints of spontaneous aborts on the NT platform.

Neither the customer's development group nor we have observed anything similar on our own systems running the same client software.

Various questions:

  • Sound familiar to anyone?
  • Anyone have any experience with multi-processor Alphas? Any OS tuning or Oracle tuning which, if not done correctly, might cause phenomena like this? (The one system on which this is happening is the only multi-processor Alpha among the bunch.)
  • This started occurring a couple of months ago. It has been increasing in frequency as the system has become more heavily loaded. (There has been no obvious time-of-day pattern to the failures that correlates with known times of heavier usage, though.) Anyone familiar with load-related problems that might surface in this way?
  • Are there any non-intrusive monitoring tools that might help look for resource problems that could be causing the Oracle processes to die?
  • My theory of the immediate cause of our process's demise is that when the Oracle process dies, the IPC channel underlying the OCI client/server connection fails. When our process next attempts to communicate with the server, it writes to a broken pipe and receives a SIGPIPE signal that it is unprepared to handle. (Our process establishes no signal handlers, but I have evidence that some library function(s) we pull in, probably RPC-related, do.) Does anyone know whether OCI communication is in fact based on pipes? Does this seem feasible? Any pointers to more information about the underlying communication mechanisms of OCI on this platform?
Received on Mon Jan 29 1996 - 00:00:00 CET
