Node Eviction and SGA

From: sathish balasubramaniam <sat0789_at_gmail.com>
Date: Mon, 25 Aug 2008 16:22:12 +0400
Message-ID: <92bacc9f0808250522l3853e208i9cb3d1c8689e127a@mail.gmail.com>


Hello All,

   We have a 6-node RAC on 10g Release 2 / Windows 2003 64-bit. It was working well in all respects.

      About 3 weeks back (3 days before I was to go on vacation) our SA needed to add more power modules, so the entire system (including the SAN) was powered down and then brought back up. The DB machines by themselves have undergone a complete reboot before without any issues. This time it was the entire IT system.
Two days after that, all of a sudden, we started witnessing node eviction issues. Every day one node would get evicted, but the machine would not go down. The typical messages seen were (below is the message from ocssd.log on node 2):



[ CSSD]2008-07-27 16:04:14.605 [5540] >WARNING: clssnmPollingThread: node
serv-db01 (1) at 50% heartbeat fatal, eviction in 29.125 seconds
[ CSSD]2008-07-27 16:04:29.605 [5540] >WARNING: clssnmPollingThread: node
serv-db01 (1) at 75% heartbeat fatal, eviction in 14.125 seconds
[ CSSD]2008-07-27 16:04:38.606 [5540] >WARNING: clssnmPollingThread: node
serv-db01 (1) at 90% heartbeat fatal, eviction in 5.125 seconds
[ CSSD]2008-07-27 16:04:39.606 [5540] >WARNING: clssnmPollingThread: node
serv-db01 (1) at 90% heartbeat fatal, eviction in 4.125 seconds
[ CSSD]2008-07-27 16:04:40.606 [5540] >TRACE: clssnmPollingThread: node
serv-db01 (1) is impending reconfig
[ CSSD]2008-07-27 16:04:40.606 [5540] >WARNING: clssnmPollingThread: node
serv-db01 (1) at 90% heartbeat fatal, eviction in 3.125 seconds
[ CSSD]2008-07-27 16:04:40.606 [5540] >TRACE: clssnmPollingThread:
diskTimeout set to (57000)ms impending reconfig status(1)
[ CSSD]2008-07-27 16:04:41.606 [5540] >TRACE: clssnmPollingThread: node
serv-db01 (1) is impending reconfig
[ CSSD]2008-07-27 16:04:41.606 [5540] >WARNING: clssnmPollingThread: node
serv-db01 (1) at 90% heartbeat fatal, eviction in 2.125 seconds
[ CSSD]2008-07-27 16:04:42.606 [5540] >TRACE: clssnmPollingThread: node
serv-db01 (1) is impending reconfig
[ CSSD]2008-07-27 16:04:42.606 [5540] >WARNING: clssnmPollingThread: node
serv-db01 (1) at 90% heartbeat fatal, eviction in 1.125 seconds
[ CSSD]2008-07-27 16:04:43.606 [5540] >TRACE: clssnmPollingThread: node
serv-db01 (1) is impending reconfig
[ CSSD]2008-07-27 16:04:43.606 [5540] >WARNING: clssnmPollingThread: node
serv-db01 (1) at 90% heartbeat fatal, eviction in 0.125 seconds
[ CSSD]2008-07-27 16:04:43.731 [5540] >TRACE: clssnmPollingThread: node
serv-db01 (1) is impending reconfig
[ CSSD]2008-07-27 16:04:43.731 [5540] >TRACE: clssnmPollingThread:
Eviction started for node serv-db01 (1), flags 0x000f, state 3, wt4c 0
[ CSSD]2008-07-27 16:04:43.731 [5640] >TRACE: clssnmDoSyncUpdate:
Initiating sync 8
[ CSSD]2008-07-27 16:04:43.731 [5640] >TRACE: clssnmDoSyncUpdate:
diskTimeout set to (57000)ms
[ CSSD]2008-07-27 16:04:43.731 [5640] >TRACE: clssnmSetupAckWait: Ack
message type (11)
[ CSSD]2008-07-27 16:04:43.731 [5640] >TRACE: clssnmSetupAckWait:
node(1) is ALIVE
[ CSSD]2008-07-27 16:04:43.731 [5640] >TRACE: clssnmSetupAckWait:
node(2) is ALIVE
[ CSSD]2008-07-27 16:04:43.731 [5640] >TRACE: clssnmSetupAckWait:
node(3) is ALIVE
[ CSSD]2008-07-27 16:04:43.731 [5640] >TRACE: clssnmSetupAckWait:
node(4) is ALIVE
[ CSSD]2008-07-27 16:04:43.731 [5640] >TRACE: clssnmSetupAckWait:
node(5) is ALIVE
[ CSSD]2008-07-27 16:04:43.731 [5640] >TRACE: clssnmSendSync:
syncSeqNo(8)
[ CSSD]2008-07-27 16:04:43.731 [5648] >TRACE: clssnmHandleSync:
Acknowledging sync: src[2] srcName[serv-db02] seq[1] sync[8]
[ CSSD]2008-07-27 16:04:43.731 [5648] >TRACE: clssnmHandleSync:
diskTimeout set to (57000)ms
[ CSSD]2008-07-27 16:04:43.731 [4340] >USER: NMEVENT_SUSPEND
[00][00][00][3e]
[ CSSD]2008-07-27 16:04:43.731 [5640] >TRACE: clssnmWaitForAcks: Ack
message type(11), ackCount(4)
[ CSSD]2008-07-27 16:04:43.731 [5640] >TRACE: clssnmWaitForAcks:
node(1) is expiring, msg type(11)
[ CSSD]2008-07-27 16:04:43.731 [5640] >TRACE: clssnmWaitForAcks: done,
msg type(11)
[ CSSD]2008-07-27 16:04:43.731 [5640] >TRACE: clssnmDoSyncUpdate:
Terminating node 1, serv-db01, misstime(60000) state(3)
[ CSSD]2008-07-27 16:04:43.731 [5640] >TRACE: clssnmSetupAckWait: Ack
message type (13)
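
For anyone who wants to line these warnings up across nodes, here is a rough sketch in Python. It assumes the log format is exactly as quoted above (including lines wrapped by the mail client); the function name `heartbeat_warnings` and the field names are made up for illustration and are not part of any Oracle tool.

```python
import re
import sys
from datetime import datetime, timedelta

# Matches clssnmPollingThread heartbeat warnings in the format quoted above.
# Assumption: the real ocssd.log uses the same layout as this excerpt.
WARNING_RE = re.compile(
    r"\[\s*CSSD\](\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) \[\d+\] "
    r">WARNING: clssnmPollingThread: node (\S+) \((\d+)\) "
    r"at (\d+)% heartbeat fatal, eviction in ([\d.]+) seconds"
)


def heartbeat_warnings(log_text):
    """Return one dict per heartbeat warning found in log_text."""
    # Collapse all whitespace so entries wrapped across lines match too.
    flat = " ".join(log_text.split())
    rows = []
    for ts, node, num, pct, secs in WARNING_RE.findall(flat):
        when = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S.%f")
        rows.append({
            "time": when,
            "node": node,
            "node_number": int(num),
            "percent": int(pct),
            "seconds_left": float(secs),
            # Time at which the log says the eviction would start.
            "expected_eviction": when + timedelta(seconds=float(secs)),
        })
    return rows


if __name__ == "__main__":
    # Usage: python script.py <path to a copy of ocssd.log>
    with open(sys.argv[1], encoding="utf-8", errors="replace") as f:
        for row in heartbeat_warnings(f.read()):
            print(row["time"], row["node"], row["percent"],
                  row["seconds_left"], row["expected_eviction"])
```

Running this against the ocssd.log from each node and comparing the `time` and `expected_eviction` columns might show whether every surviving node lost the heartbeat from the same node at the same moment, which is one way to check the network explanation.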

No information was written to the alert logs on any of the nodes. We contacted Oracle Support, and they said it was a network issue, etc. But my SA was adamant that it was an Oracle problem. Anyway, I went on my vacation. There was a suggestion (the SA had an Oracle contact) that the SGA needed to be increased. It was at 800 MB per node. My junior DBA was forced to raise it to 2 GB on each node based on the SA's suggestion. Then, all of a sudden, from the next day, the node evictions stopped. I still cannot believe that increasing the SGA has anything to do with node eviction. I told my upper management that node eviction has nothing to do with the SGA, but the consensus in my IT department is that the SGA increase solved the issue. Does anybody think there is any connection between an increase in SGA and node eviction? I have read the node eviction papers on Metalink, and they do not mention the SGA at all.

I would really appreciate any help in this regard.

Thank You,

Sat

--
http://www.freelists.org/webpage/oracle-l
Received on Mon Aug 25 2008 - 07:22:12 CDT
