Standby hung following a network disconnect

From: Vasu Rajagopal <vrajagopal_at_fiberlink.com>
Date: Fri, 2 Jul 2010 23:15:00 -0400
Message-ID: <BB2BB2935A9F0B4A993E2477A994C8C543356F9EED_at_EXCHANGE.fiberlink.local>


Hi,

I have a Data Guard issue. Our production database is a RAC on 10.2.0.4, using LGWR ASYNC redo transport to a DR site that is also a RAC configuration (physical standby). One of the standby databases (in real-time apply mode) hangs after a network disconnect error. It then generates a few Gb/sec of read I/O on the standby redo logs (SRLs) and appears to stay stuck indefinitely. This has happened roughly twice a week over the last 2 months.

As a temporary workaround, we cancelled the managed recovery process (MRP) and switched to ARCH apply mode (applying archived logs only). That seems to work, but we would like the DR site to run in real-time apply mode.

Oracle Support replied that most ORA-03135 issues have turned out to be router, switch, firewall, or HTTP protocol problems. They asked us, if a Cisco router is in use, to disable the fixup protocol for the SQL*Net port, among other things. However, I am not sure why the LNS/MRP processes are unable to recover and return to normal operation after detecting the timeout.

I'm looking for input on ways to diagnose or resolve this.

Thanks,
Vasu

Here are excerpts from the log files showing the disconnect:

Primary DB (mydb) alert log:

Errors in file /u001/app/oracle/admin/mydb/bdump/mydb1_lns1_22694.trc:
ORA-03135: connection lost contact
Fri Jun 25 16:00:42 2010
LGWR: I/O error 3135 archiving log 2 to '(DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=tcp)(HOST=racsrvr11.mycomp.com)(PORT=1521)))(CONNECT_DATA=(SERVICE_NAME=dgmydb_XPT.mycomp.com)(INSTANCE_NAME=dgmydb2)(SERVER=dedicated)))'

Primary LNS trace file (mydb1_lns1_22694.trc):

Sending online log thread 1 seq 14857 [logfile 2] to standby
Archiving to destination (DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=tcp)(HOST=racsrvr11.mycomp.com)(PORT=1521)))(CONNECT_DATA=(SERVICE_NAME=dgmydb_XPT.mycomp.com)(INSTANCE_NAME=dgmydb2)(SERVER=dedicated))) ASYNC blocks=20480
Log file opened [logno 2]
*** 2010-06-25 16:00:42.157
RFS network connection lost at host '(DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=tcp)(HOST=racsrvr11.mycomp.com)(PORT=1521)))(CONNECT_DATA=(SERVICE_NAME=dgmydb_XPT.mycomp.com)(INSTANCE_NAME=dgmydb2)(SERVER=dedicated)))'
Error 3135 writing standby archive log file at host '(DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=tcp)(HOST=racsrvr11.mycomp.com)(PORT=1521)))(CONNECT_DATA=(SERVICE_NAME=dgmydb_XPT.mycomp.com)(INSTANCE_NAME=dgmydb2)(SERVER=dedicated)))'
ORA-03135: connection lost contact
*** 2010-06-25 16:00:42.170 64208 kcrr.c

Standby alert log:

Mem# 0: /netapp/oracle/dr/redologs11/dgmydb/group_32.1296.718556259
Fri Jun 25 16:00:42 2010
RFS[2]: Possible network disconnect with primary database
Fri Jun 25 18:14:17 2010
Redo Shipping Client Connected as PUBLIC
-- Connected User is Valid
RFS[6]: Assigned to RFS process 10755
RFS[6]: Identified database type as 'physical standby'
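One way to put numbers on "stuck forever" is to pair each disconnect message in an alert log with the next RFS reconnect and measure the gap. Below is a minimal Python sketch, assuming the raw 10.2 alert-log layout where each timestamp (e.g. `Fri Jun 25 16:00:42 2010`) sits on its own line above the messages it applies to. The function names and the match strings (`Possible network disconnect`, `ORA-03135`, `Redo Shipping Client Connected`) are illustrative assumptions taken from the excerpts, not an Oracle-supplied tool.

```python
from datetime import datetime

# Assumed alert-log timestamp format, e.g. "Fri Jun 25 16:00:42 2010".
TS_FORMAT = "%a %b %d %H:%M:%S %Y"

# Assumed marker substrings; adjust to whatever your logs actually contain.
DISCONNECT_MARKERS = ("Possible network disconnect", "ORA-03135")
RECONNECT_MARKER = "Redo Shipping Client Connected"


def parse_ts(line):
    """Return a datetime if the line is a bare alert-log timestamp, else None."""
    try:
        return datetime.strptime(line.strip(), TS_FORMAT)
    except ValueError:
        return None


def disconnect_gaps(lines):
    """Return (disconnect_time, seconds_until_reconnect) tuples.

    Each message is attributed to the most recent timestamp line above it.
    A disconnect with no later reconnect is reported with seconds=None.
    """
    current_ts = None    # most recent timestamp line seen
    outage_start = None  # timestamp of the unresolved disconnect
    in_outage = False
    gaps = []
    for line in lines:
        ts = parse_ts(line)
        if ts is not None:
            current_ts = ts
        elif not in_outage and any(m in line for m in DISCONNECT_MARKERS):
            in_outage, outage_start = True, current_ts
        elif in_outage and RECONNECT_MARKER in line:
            secs = None
            if outage_start is not None and current_ts is not None:
                secs = (current_ts - outage_start).total_seconds()
            gaps.append((outage_start, secs))
            in_outage = False
    if in_outage:
        gaps.append((outage_start, None))
    return gaps


if __name__ == "__main__":
    # The standby excerpt from this post, one log line per line.
    standby_excerpt = """\
Mem# 0: /netapp/oracle/dr/redologs11/dgmydb/group_32.1296.718556259
Fri Jun 25 16:00:42 2010
RFS[2]: Possible network disconnect with primary database
Fri Jun 25 18:14:17 2010
Redo Shipping Client Connected as PUBLIC
-- Connected User is Valid
RFS[6]: Assigned to RFS process 10755
"""
    for start, secs in disconnect_gaps(standby_excerpt.splitlines()):
        print(start, secs)  # prints: 2010-06-25 16:00:42 8015.0
```

On the excerpt above this reports a gap of 8015 seconds (about 2 h 14 min) between the disconnect and the next RFS connection. Running it over several weeks of primary and standby alert logs would show whether outage durations cluster around a fixed value, which may help when discussing idle or session timeouts with the network team.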

--
http://www.freelists.org/webpage/oracle-l
Received on Fri Jul 02 2010 - 22:15:00 CDT
