Re: Requesting a RFS walkthrough

From: Jurijs Velikanovs <j.velikanovs_at_gmail.com>
Date: Mon, 15 Jun 2009 15:41:48 +1000
Message-ID: <d6f0def50906142241g19b96b1bx86447efba556dac3_at_mail.gmail.com>



Hi Charles,

Did you find a good explanation of what happened during the time period reflected in the alert.log snippet you published? My interpretation is as follows:
-A- 10:45:02 2009 Both RFS processes lost their connections to the processes on the Primary database that created them. To my understanding, RFS processes are LOCAL=NO processes on the Standby side, therefore they are created from the Primary side.

Just to confirm that, you can do the following:



RAN ON STANDBY
SQL> SELECT PID, STATUS,PROCESS,CLIENT_PROCESS,thread#, sequence#, block# FROM V$MANAGED_STANDBY where PROCESS='RFS';

       PID STATUS       PROCESS   CLIENT_P    THREAD#  SEQUENCE#     BLOCK#
---------- ------------ --------- -------- ---------- ---------- ----------
      6319 IDLE         RFS       LGWR              1        127         14
      6419 IDLE         RFS       UNKNOWN           0          0          0

SQL>
$ ps -fp 6319

oracle 6319 1 0 07:27 ? 00:00:00 oracleGTST (LOCAL=NO)
$ /usr/sbin/lsof -p 6319 | grep TCP

oracle 6319 oracle 18u IPv4 519196 TCP rhel4_1.testing.com:1528->rhel4_1.testing.com:8248 (ESTABLISHED)
# As both Standby and Primary are on the same host in my configuration, you see
# both ends of the connection. In a real-world case you would see just one record
# here, with the other end on the Primary host.
$ netstat -tp | grep 8248

tcp        0      0 rhel4_1.testing.com:1528   rhel4_1.testing.com:8248   ESTABLISHED 6319/oracleGTST
tcp        0      0 rhel4_1.testing.com:8248   rhel4_1.testing.com:1528   ESTABLISHED 6301/ora_lns1_STST

===== CONCLUSION =====
6301/ora_lns1_STST is the parent (on the Primary) of the RFS LGWR process (on the Standby).

The same checks for the second RFS process (PID 6419) show:

oracle    6419     1  4 07:27 ?        00:00:11 oracleGTST (LOCAL=NO)
oracle  6419 oracle   18u  IPv4 519550 TCP rhel4_1.testing.com:1528->rhel4_1.testing.com:8256 (ESTABLISHED)
tcp        0      0 rhel4_1.testing.com:1528   rhel4_1.testing.com:8256   ESTABLISHED 6419/oracleGTST
tcp        0      0 rhel4_1.testing.com:8256   rhel4_1.testing.com:1528   ESTABLISHED 17964/ora_arc1_STST

===== CONCLUSION =====
17964/ora_arc1_STST is the parent (on the Primary) of the RFS UNKNOWN process (on the Standby).
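
As an extra cross-check (just a sketch; the PIDs 6301 and 17964 come from the netstat output above, and STST is the Primary instance in my test configuration), you can confirm on the Primary host that those endpoints really belong to the LNS/ARCn background processes:

RAN ON PRIMARY
$ ps -fp 6301,17964                      # should list ora_lns1_STST and ora_arc1_STST
$ /usr/sbin/lsof -p 6301 | grep TCP      # should show the socket back to the Standby listener port (1528 here)

If, right after an incident like yours, those PIDs are gone or have changed, that points to reason 2 below.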

It might happen for 2 reasons:
1. The network was interrupted (you said it is unlikely).
2. Both processes disappeared on the Primary side. If this is the case, you would find some messages in the Primary alert.log file (a quick check is sketched just below).
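
Something like this (just a sketch; I am assuming the default 10g background_dump_dest location, so adjust the path and SID to your environment):

RAN ON PRIMARY
$ cd $ORACLE_BASE/admin/STST/bdump             # background_dump_dest (assumed default)
$ grep -n 'Jun  5 10:4' alert_STST.log         # locate the time window of the incident
$ grep -n -E 'ORA-|Error|LNS|ARC' alert_STST.log | tail -50   # errors around the LNS/ARCn processes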
-B- 10:45:15 2009 After a redo log switch, the Primary database successfully re-established the RFS processes, and those started to ship the redo stream again. The Standby DB recognized that it was missing a bit of the redo stream needed to continue real-time redo apply, and contacted the Primary for gap resolution using the FAL_SERVER parameter (a quick way to check the FAL setup is sketched just below).
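
For illustration only (a minimal sketch; sequence 6045 is taken from the alert.log snippet you posted), the FAL setup and any outstanding gap can be checked on the Standby like this:

RAN ON STANDBY
SQL> show parameter fal
SQL> SELECT * FROM v$archive_gap;
SQL> SELECT thread#, sequence#, applied FROM v$archived_log WHERE sequence# BETWEEN 6040 AND 6050 ORDER BY sequence#;

fal_server and fal_client must point at valid TNS aliases; v$archive_gap shows any gap the standby is currently waiting on, and the last query shows which sequences around the reported gap have already been received and applied.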

Those are just my $0.02,
Regards,
Yury

On Sat, Jun 6, 2009 at 4:09 AM, Charles Schultz<sacrophyte_at_gmail.com> wrote:

> Very interesting, but we are not receiving any ora-600 errors. As shown in
> the example, the issue is self-correcting; I am merely trying to understand
> the mechanics at a deeper level.
>
> On Fri, Jun 5, 2009 at 12:20, Goulet, Richard <Richard.Goulet_at_parexel.com>
> wrote:
>>
>> Check out Metalink
>> Doc ID: 386417.1
>>
>> Dick Goulet
>> Senior Oracle DBA
>> PAREXEL International
>>
>>
>> ________________________________
>> From: oracle-l-bounce_at_freelists.org [mailto:oracle-l-bounce_at_freelists.org]
>> On Behalf Of Charles Schultz
>> Sent: Friday, June 05, 2009 12:57 PM
>> To: ORACLE-L
>> Subject: Requesting a RFS walkthrough
>>
>> Good day, list,
>> I am curious if someone would be able to divulge some of the underpinnings
>> of RFS. The snippet below is from our alert.log on the standby;
>> this is one of two 10.2.0.2 physical standby databases with no DB Broker
>> using real-time apply, and we have been seeing this situation happen more
>> and more often. Our networking group tells us there are no issues with the
>> infrastructure, so I am struggling to understand why this is happening.
>> My basic understanding of RFS explains the situation like so:
>>
>> At 10:45:02, the heartbeat process attempts to talk to Primary but is
>> unable to. I assume there are two similar messages because we have two
>> FAL_SERVERS, but I could be way off in that.
>> The next few verbose messages show a connection being reestablished with
>> the primary.
>> At 10:45:20, FAL attempts to get logseq 6045. Why does it fail? The log
>> was archived on the Primary at 10:45:15 (v$archived_log.completion_time).
>> It fails again at 10:45:55, but succeeds at 10:45:56. Why?
>>
>> Fri Jun  5 10:45:02 2009
>> RFS[143]: Possible network disconnect with primary database
>> Fri Jun  5 10:45:05 2009
>> RFS[144]: Possible network disconnect with primary database
>> Fri Jun  5 10:45:15 2009
>> RFS[142]: Successfully opened standby log 22:
>> '+DATA/gcprod/onlinelog/group_22.352.633713051'
>> Fri Jun  5 10:45:15 2009
>> RFS[139]: Successfully opened standby log 25:
>> '+DATA/gcprod/onlinelog/group_25.262.633713253'
>> Fri Jun  5 10:45:16 2009
>> Redo Shipping Client Connected as PUBLIC
>> -- Connected User is Valid
>> RFS[147]: Assigned to RFS process 18916
>> RFS[147]: Identified database type as 'physical standby'
>> Primary database is in MAXIMUM PERFORMANCE mode
>> Primary database is in MAXIMUM PERFORMANCE mode
>> Fri Jun  5 10:45:16 2009
>> Redo Shipping Client Connected as PUBLIC
>> -- Connected User is Valid
>> RFS[148]: Assigned to RFS process 18918
>> RFS[148]: Identified database type as 'physical standby'
>> Primary database is in MAXIMUM PERFORMANCE mode
>> Fri Jun  5 10:45:16 2009
>> RFS[147]: Successfully opened standby log 21:
>> '+DATA/gcprod/onlinelog/group_21.351.633713041'
>> Fri Jun  5 10:45:16 2009
>> Primary database is in MAXIMUM PERFORMANCE mode
>> RFS[148]: Successfully opened standby log 24:
>> '+DATA/gcprod/onlinelog/group_24.282.633713233'
>> Fri Jun  5 10:45:20 2009
>> Fetching gap sequence in thread 1, gap sequence 6045-6045
>> FAL[client]: Error fetching gap sequence, no FAL server specified
>> Fri Jun  5 10:45:55 2009
>> FAL[client]: Failed to request gap sequence
>>  GAP - thread 1 sequence 6045-6045
>>  DBID 1222165961 branch 631792716
>> FAL[client]: All defined FAL servers have been attempted.
>> -------------------------------------------------------------
>> Check that the CONTROL_FILE_RECORD_KEEP_TIME initialization
>> parameter is defined to a value that is sufficiently large
>> enough to maintain adequate log switch information to resolve
>> archivelog gaps.
>> -------------------------------------------------------------
>> Fri Jun  5 10:45:56 2009
>> RFS[145]: Archived Log: '+FRA/gcprod/archivelog/1_6045_631792716.dbf'
>> Fri Jun  5 10:46:30 2009
>> Media Recovery Log +FRA/gcprod/archivelog/1_6045_631792716.dbf
>>
>> --
>> Charles Schultz
>
>
>
> --
> Charles Schultz
>



-- 
Jurijs
+371 29268222 (+2 GMT)
============================================
http://otn.oracle.com/ocm/jvelikanovs.html