PL/SQL Help

From: Kresimir Fabijanic <kresimir_at_ozemail.com.au>
Date: Tue, 01 Dec 1998 23:39:34 +1100
Message-ID: <3663E386.CE4A3842_at_ozemail.com.au>



Dear all

I have an interesting problem. (I did logged a TAR with Oracle, and I am still waiting for an answer, but if anyone experienced similar problem, I would like to know). We have a set of procedures that copy data from central server to remote sites. Each procedure is copying a group of tables. The procedures are run by the Oracle scheduling mechanism, (dba_jobs - using dbms_job.submit etc). Interval of the jobs varies (depending of the frequency of the expected updates on the central server) from 5 seconds to 20 days. The various versions of the application has been running for a while (cca last seven months). In the last two weeks, we have experienced global freeze of one of the jobs. The job hung on most of the remote sites, and was hunging for cca two hours. (same procedure) On some sites only that process hung, while others continued as if nothing happened. On some other sites other jobs were returning with error message (one of the ORA-125nn i think it was 'TNS unable to connect to ... database'). On other sites at least other two jobs hung, while others continued. On all sites (except one) the job that hung returned as if nothing happened, and returned with processed data.

The actual procedure is very simple. Initially it sets local variables, does some local queries, gets a run number (each time process is submitted, it grabs a run number - a sequence so we can identify the messages generated by that run number) writes audit log, and commits. After that process queries one table on remote server to determine if there are any records to copy. If there is, it gets the data from the tables that belons to that particular group. For each table, it writes starting and finishing audit logs. In the end the final audit log is written to the database. If there is no records to retrieve, job skips to the end of the processing and just writes final audit log.

In the situation when the job hung, we had initial audit written (once at cca 22:12 second time at cca 01:13), and then after two hours, (first time at cca 00:11, second time at cca 03:14), first audit of the retrieved table - with the same run number.

Usually when something like this happens (a connection to database is lost or similar problem) process rolls back, an error message is writen to database, and job finishes.

If you need any more information, or know where the problem may be (version of SQL*Net, anything) send me an e-mail, either to my address or to this newsgroup.

We are running central server on AIX 7.3.4.1. Most of the clients are 7.3.3.0 on NT 4.0. We are using SQL Net 2.0 on TCP/IP protocol.

We have checked our alert logs, redo logs, everything on the central and one of the client databases. We have not been able to reproduce it at will.

Kresimir Fabijanic Received on Tue Dec 01 1998 - 13:39:34 CET

Original text of this message