Re: Doubt about timeout between nodes of cluster

From: Riyaj Shamsudeen <riyaj.shamsudeen_at_gmail.com>
Date: Thu, 12 Jun 2008 14:57:17 -0500
Message-ID: <48517F9D.3010806@gmail.com>


Hello Waldirio

   Breaking down crsd.log: approximately 30 seconds are spent on the CLSC recv/send failures. The CSS misscount parameter defaults to 30 on Unix platforms, so I would say misscount is controlling this duration, but that would need to be validated by enabling further tracing and examining cssd.log etc., if you want.

2008-06-12 14:19:15.781: [ OCRMSG][1484962144]prom_rpc: CLSC recv failure..ret code 7
2008-06-12 14:19:42.464: [ OCRMSG][1484962144]prom_rpc: CLSC send failure..ret code 6
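The gap can be measured directly from the log timestamps. A minimal sketch in plain Python (the two sample lines are copied from the excerpt above; point it at your own crsd.log lines to validate the theory):

```python
# Sketch: measure the elapsed time between two crsd.log entries.
from datetime import datetime

lines = [
    "2008-06-12 14:19:15.781: [ OCRMSG][1484962144]prom_rpc: CLSC recv failure..ret code 7",
    "2008-06-12 14:19:42.464: [ OCRMSG][1484962144]prom_rpc: CLSC send failure..ret code 6",
]

def stamp(line):
    # The timestamp is the first 23 characters: "YYYY-MM-DD HH:MM:SS.mmm"
    return datetime.strptime(line[:23], "%Y-%m-%d %H:%M:%S.%f")

gap = (stamp(lines[1]) - stamp(lines[0])).total_seconds()
print(f"{gap:.3f} seconds")  # -> 26.683 seconds, close to the 30s misscount window
```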

  Another 26 seconds is spent on the cluster reconfiguration below:

2008-06-12 14:19:46.036: [ OCRSRV][2541411904]proath_init: Failed to retrieve pubdata. Expect a rcfg
2008-06-12 14:20:12.283: [ OCRMAS][1210108256]th_master:12: I AM THE NEW OCR MASTER at incar 1. Node Number 1

  Changing these parameters has a profound effect on availability, especially if the network architecture is not robust enough.
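For reference, the current value can be read with crsctl. A sketch for 10g (changing misscount requires an outage on the other nodes and, per Oracle's guidance, should only be done after consulting Support):

```
$CRS_HOME/bin/crsctl get css misscount
# Changing it (run as root, only after checking with Oracle Support):
# $CRS_HOME/bin/crsctl set css misscount 30
```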

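On the client-side symptom quoted below (queries lost to a timeout when a node dies): even with a fast VIP failover, in-flight queries only survive if Transparent Application Failover is configured in the client's tnsnames.ora. A minimal sketch, assuming the VIP hostnames are cambeba-vip and cangua-vip and the service is ora10gq (all hypothetical names; adjust to your setup):

```
ORA10GQ =
  (DESCRIPTION =
    (ADDRESS_LIST =
      (ADDRESS = (PROTOCOL = TCP)(HOST = cambeba-vip)(PORT = 1521))
      (ADDRESS = (PROTOCOL = TCP)(HOST = cangua-vip)(PORT = 1521))
      (LOAD_BALANCE = yes)
    )
    (CONNECT_DATA =
      (SERVICE_NAME = ora10gq)
      (FAILOVER_MODE = (TYPE = SELECT)(METHOD = BASIC)(RETRIES = 20)(DELAY = 3))
    )
  )
```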
Cheers
Riyaj Shamsudeen
The Pythian Group - www.pythian.com
Personal blog: orainternals.wordpress.com

Waldirio Manhães Pinheiro wrote:
> Hello Friend
>
> Thank you for the answer; let's check.
>
> 2008/6/12, Riyaj Shamsudeen <riyaj.shamsudeen_at_gmail.com>:
>
> Hello Waldirio
> >> the time for the first machine to detect that the second machine
> is powered off is very long (between 1 and 2 min),
> How are you measuring this time? Are you checking the alert log, or
> are you using DB connections to check it?
>
>
> I measured this time from the moment I sent the shutdown to the
> server until the VIP interface came up on the second node (the backup node).
>
> Can you also send crsd.log?
>
>
> OK, because of the size, here is the log at the address below:
> http://rafb.net/p/hqE13995.html
>
> When I power off the first node, the crsd log on the second node (the
> link above) shows, at line 1 of the log, the message "[
> COMMCRS][1147169120]clsc_receive: (0xc6d180) Error receiving, ns
> (12535, 12560), transport (505, 110, 0)", and it keeps logging
> "Connection not active" until line 2045.
>
> PS: Now the VIP address of the first node no longer migrates to the
> second node after a power off... (it may be necessary to reinstall the
> OS and Oracle Clusterware, because I have changed the system a lot
> while testing)
>
> Further, refer to $CRS_HOME/bin/racgvip; there are a few parameters
> there, such as the check interval, restart attempts, etc., that also
> control the behavior of VIP failover. I am not sure they apply when
> the machine is rebooted, since the heartbeat will fail before the VIP check.
>
>
> Yes, I checked this file too, but did not change it.
>
> Now, looking at the crsd log file, I believe Oracle knows when the
> other node is down, but who is responsible for the failover (bringing
> up the VIP alias on the other machine)? (A script, a daemon, an angel :P )
>
> Thank you, friends, for the help.
> Waldirio
>
> Cheers
> Riyaj Shamsudeen
> The Pythian Group - www.pythian.com
> Personal blog: orainternals.wordpress.com
>
> Waldirio Manhães Pinheiro wrote:
>
> Hello Friends
> I'd like to ask about Oracle RAC in a Linux environment. I
> installed two machines with RedHat AS 4 Update 5 and Oracle 10.2.0.3
> with Clusterware. The installation finished successfully and the
> database works fine.
> I checked the availability of my environment with the test below:
> Station cambeba UP
> Station cangua UP
> # crs_stat -t
> Name Type Target State Host
> ------------------------------------------------------------
> ora....BA.lsnr application ONLINE ONLINE cambeba
> ora....eba.gsd application ONLINE ONLINE cambeba
> ora....eba.ons application ONLINE ONLINE cambeba
> ora....eba.vip application ONLINE ONLINE cambeba
> ora....UA.lsnr application ONLINE ONLINE cangua
> ora.cangua.gsd application ONLINE ONLINE cangua
> ora.cangua.ons application ONLINE ONLINE cangua
> ora.cangua.vip application ONLINE ONLINE cangua
> ora.ora10gq.db application ONLINE ONLINE cangua
> ora....q1.inst application ONLINE ONLINE cangua
> ora....q2.inst application ONLINE ONLINE cambeba
> At this point, everything is OK, but when I force a power off on
> cangua or cambeba (the names of my machines), the time for the
> first machine to detect that the second machine is powered off is
> very long (between 1 and 2 min), so if my client was working, it
> will lose its query to a timeout.
> I changed the configuration of the ora.cambeba.vip and
> ora.cangua.vip objects, but without success.
> Any idea how to fix this problem (i.e., decrease the check time
> between the nodes of the cluster)?
> PS: I searched the list archives, but found nothing about
> this problem.
>
> Thanks in advance.
> --
> ______________
> Best regards
> Waldirio
> msn: wmp_at_sinope.com.br
> Site: www.waldirio.com.br
> Blog: blog.waldirio.com.br
> PGP: www.waldirio.com.br/public.html
>
>
>
>
>
> --
> ______________
> Best regards
> Waldirio
> msn: wmp_at_sinope.com.br
> Site: www.waldirio.com.br
> Blog: blog.waldirio.com.br
> PGP: www.waldirio.com.br/public.html

--
http://www.freelists.org/webpage/oracle-l
Received on Thu Jun 12 2008 - 14:57:17 CDT
