Re: Node Eviction and SGA

From: Andrew Kerber <andrew.kerber_at_gmail.com>
Date: Mon, 25 Aug 2008 07:33:54 -0500
Message-ID: <ad3aa4c90808250533kc9728bcm67ce78f4f37cc2af@mail.gmail.com>


I would have expected the problem to appear earlier, but here is one possible scenario in which increasing the SGA might solve it (and by the way, an 800M SGA is pretty small these days):

The "heartbeat" is really just a response to a ping, i.e., one node asking "are you still there?" and the other answering "yes, I am here". If the CPU on one server hits maximum utilization while processing queries (hard parses) and is too busy to respond, the node could get evicted. Adding space to the SGA increases the cache size, which reduces hard parses and frees up CPU cycles to respond to the heartbeat.
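To make the timing concrete, here is a minimal Python sketch that reads the countdown warnings from the ocssd.log excerpt quoted below and works out when each one expects the eviction to happen. The sample lines are copied from that excerpt with the mail line wrapping undone. The names SAMPLE, PATTERN and projected_eviction are made up for this illustration, and the regular expression only assumes the warning format seen in this one log, so treat it as a sketch rather than a general parser.

```python
import re
from datetime import datetime, timedelta

# Warning lines from the quoted ocssd.log, rejoined after mail wrapping.
SAMPLE = [
    "[ CSSD]2008-07-27 16:04:14.605 [5540] >WARNING: clssnmPollingThread: "
    "node serv-db01 (1) at 50% heartbeat fatal, eviction in 29.125 seconds",
    "[ CSSD]2008-07-27 16:04:29.605 [5540] >WARNING: clssnmPollingThread: "
    "node serv-db01 (1) at 75% heartbeat fatal, eviction in 14.125 seconds",
    "[ CSSD]2008-07-27 16:04:38.606 [5540] >WARNING: clssnmPollingThread: "
    "node serv-db01 (1) at 90% heartbeat fatal, eviction in 5.125 seconds",
]

# Hypothetical pattern: matches only the warning layout shown in this thread.
PATTERN = re.compile(
    r"\](?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) .*"
    r"node (?P<node>\S+) \(\d+\) at (?P<pct>\d+)% heartbeat fatal, "
    r"eviction in (?P<left>[\d.]+) seconds"
)


def projected_eviction(line):
    """Return (node, percent, projected eviction time), or None if no match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    logged_at = datetime.strptime(match["ts"], "%Y-%m-%d %H:%M:%S.%f")
    remaining = timedelta(seconds=float(match["left"]))
    return match["node"], int(match["pct"]), logged_at + remaining


if __name__ == "__main__":
    for line in SAMPLE:
        node, pct, when = projected_eviction(line)
        print(f"{node}: {pct}% warning, eviction projected at {when.time()}")
```

For these three lines the projected times come out at 16:04:43.730, 16:04:43.730 and 16:04:43.731, which lines up with the "Eviction started" trace at 16:04:43.731 in the quoted log. In other words, the node was given one fixed deadline to answer and simply never did.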

On Mon, Aug 25, 2008 at 7:22 AM, sathish balasubramaniam <sat0789_at_gmail.com> wrote:

> Hello All,
> We have a 6-node RAC on 10g Release 2 / Windows 2003 64-bit. It was working
> well in all respects.
> About 3 weeks back (3 days before I was to go on my vacation) the SA
> needed to add more power modules, so the entire system (including SAN) was
> powered down and then brought back up. The DB machines by themselves have
> undergone a complete reboot before without any issues. This time it was the
> entire IT system.
> Two days after that, all of a sudden, we started witnessing node
> eviction issues. Every day one node would get evicted, but the machine would
> not go down. The typical messages seen were (below is the message from
> ocssd.log on node 2):
> ----------------
> [ CSSD]2008-07-27 16:04:14.605 [5540] >WARNING: clssnmPollingThread:
> node serv-db01 (1) at 50% heartbeat fatal, eviction in 29.125 seconds
> [ CSSD]2008-07-27 16:04:29.605 [5540] >WARNING: clssnmPollingThread:
> node serv-db01 (1) at 75% heartbeat fatal, eviction in 14.125 seconds
> [ CSSD]2008-07-27 16:04:38.606 [5540] >WARNING: clssnmPollingThread:
> node serv-db01 (1) at 90% heartbeat fatal, eviction in 5.125 seconds
> [ CSSD]2008-07-27 16:04:39.606 [5540] >WARNING: clssnmPollingThread:
> node serv-db01 (1) at 90% heartbeat fatal, eviction in 4.125 seconds
> [ CSSD]2008-07-27 16:04:40.606 [5540] >TRACE: clssnmPollingThread:
> node serv-db01 (1) is impending reconfig
> [ CSSD]2008-07-27 16:04:40.606 [5540] >WARNING: clssnmPollingThread:
> node serv-db01 (1) at 90% heartbeat fatal, eviction in 3.125 seconds
> [ CSSD]2008-07-27 16:04:40.606 [5540] >TRACE: clssnmPollingThread:
> diskTimeout set to (57000)ms impending reconfig status(1)
> [ CSSD]2008-07-27 16:04:41.606 [5540] >TRACE: clssnmPollingThread:
> node serv-db01 (1) is impending reconfig
> [ CSSD]2008-07-27 16:04:41.606 [5540] >WARNING: clssnmPollingThread:
> node serv-db01 (1) at 90% heartbeat fatal, eviction in 2.125 seconds
> [ CSSD]2008-07-27 16:04:42.606 [5540] >TRACE: clssnmPollingThread:
> node serv-db01 (1) is impending reconfig
> [ CSSD]2008-07-27 16:04:42.606 [5540] >WARNING: clssnmPollingThread:
> node serv-db01 (1) at 90% heartbeat fatal, eviction in 1.125 seconds
> [ CSSD]2008-07-27 16:04:43.606 [5540] >TRACE: clssnmPollingThread:
> node serv-db01 (1) is impending reconfig
> [ CSSD]2008-07-27 16:04:43.606 [5540] >WARNING: clssnmPollingThread:
> node serv-db01 (1) at 90% heartbeat fatal, eviction in 0.125 seconds
> [ CSSD]2008-07-27 16:04:43.731 [5540] >TRACE: clssnmPollingThread:
> node serv-db01 (1) is impending reconfig
> [ CSSD]2008-07-27 16:04:43.731 [5540] >TRACE: clssnmPollingThread:
> Eviction started for node serv-db01 (1), flags 0x000f, state 3, wt4c 0
> [ CSSD]2008-07-27 16:04:43.731 [5640] >TRACE: clssnmDoSyncUpdate:
> Initiating sync 8
> [ CSSD]2008-07-27 16:04:43.731 [5640] >TRACE: clssnmDoSyncUpdate:
> diskTimeout set to (57000)ms
> [ CSSD]2008-07-27 16:04:43.731 [5640] >TRACE: clssnmSetupAckWait: Ack
> message type (11)
> [ CSSD]2008-07-27 16:04:43.731 [5640] >TRACE: clssnmSetupAckWait:
> node(1) is ALIVE
> [ CSSD]2008-07-27 16:04:43.731 [5640] >TRACE: clssnmSetupAckWait:
> node(2) is ALIVE
> [ CSSD]2008-07-27 16:04:43.731 [5640] >TRACE: clssnmSetupAckWait:
> node(3) is ALIVE
> [ CSSD]2008-07-27 16:04:43.731 [5640] >TRACE: clssnmSetupAckWait:
> node(4) is ALIVE
> [ CSSD]2008-07-27 16:04:43.731 [5640] >TRACE: clssnmSetupAckWait:
> node(5) is ALIVE
> [ CSSD]2008-07-27 16:04:43.731 [5640] >TRACE: clssnmSendSync:
> syncSeqNo(8)
> [ CSSD]2008-07-27 16:04:43.731 [5648] >TRACE: clssnmHandleSync:
> Acknowledging sync: src[2] srcName[serv-db02] seq[1] sync[8]
> [ CSSD]2008-07-27 16:04:43.731 [5648] >TRACE: clssnmHandleSync:
> diskTimeout set to (57000)ms
> [ CSSD]2008-07-27 16:04:43.731 [4340] >USER: NMEVENT_SUSPEND
> [00][00][00][3e]
> [ CSSD]2008-07-27 16:04:43.731 [5640] >TRACE: clssnmWaitForAcks: Ack
> message type(11), ackCount(4)
> [ CSSD]2008-07-27 16:04:43.731 [5640] >TRACE: clssnmWaitForAcks:
> node(1) is expiring, msg type(11)
> [ CSSD]2008-07-27 16:04:43.731 [5640] >TRACE: clssnmWaitForAcks: done,
> msg type(11)
> [ CSSD]2008-07-27 16:04:43.731 [5640] >TRACE: clssnmDoSyncUpdate:
> Terminating node 1, serv-db01, misstime(60000) state(3)
> [ CSSD]2008-07-27 16:04:43.731 [5640] >TRACE: clssnmSetupAckWait: Ack
> message type (13)

>

> ------------------------------------------------------
> No information was written to the alert logs on any of the nodes.
> We contacted Oracle support and they said it was a network issue, etc.
> But my SA was adamant that it was an Oracle problem. Anyway, I went on my
> vacation. There was a suggestion (the SA had an Oracle contact) that the SGA
> needed to be increased. It was at 800 MB per node. My junior DBA was forced
> to raise it to 2 GB on each node based on the SA's suggestion. Then, all of a
> sudden, from the next day, node eviction stopped.
> I still cannot believe that increasing the SGA has anything to do
> with node eviction. I told my upper management that node eviction has nothing
> to do with the SGA. But the consensus in my IT department is that the SGA
> increase solved the issue. Does anybody think there is any connection between
> an increase in SGA and node eviction? I have read the node eviction papers on
> MetaLink, and they do not mention the SGA at all.
>

> I would really appreciate any help in this regard.
>

> Thank You,
>

> Sat
>
>
>
>



-- 
Andrew W. Kerber

'If at first you don't succeed, don't take up skydiving.'

--
http://www.freelists.org/webpage/oracle-l
Received on Mon Aug 25 2008 - 07:33:54 CDT
