Re: Node Eviction and SGA

From: sathish balasubramaniam <sat0789_at_gmail.com>
Date: Mon, 25 Aug 2008 16:45:59 +0400
Message-ID: <92bacc9f0808250545s38e46b9p1cc7e6ca24dc387f@mail.gmail.com>


Agreed on that front with regard to CPU cycles, but the question remains: why all of a sudden? This system had been working perfectly well for 4 months. Also, OSWatcher and the application/OS logs showed no CPU max utilization during the time frame in which this event occurred.
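
To line up the OSWatcher samples with the event, the exact window can be pulled from the clssnmPollingThread warnings in ocssd.log. What follows is only a rough sketch in Python: the function name `eviction_windows` and the regular expression are illustrative, written against the message format in the excerpt quoted below, and other CSSD versions may log differently. It rejoins wrapped (or mail-quoted) records and, for each warning, adds the "eviction in N seconds" value to the warning timestamp to get the projected eviction time.

```python
import re
from datetime import datetime, timedelta

# Pattern for one polling-thread warning, after wrapped lines are rejoined.
# Assumption: the layout matches the ocssd.log excerpt in this thread.
WARNING = re.compile(
    r"\[\s*CSSD\](\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) \[\d+\] "
    r">WARNING: clssnmPollingThread: node (\S+) \((\d+)\) at (\d+)% "
    r"heartbeat fatal, eviction in (\d+\.\d+) seconds"
)


def eviction_windows(log_text):
    """Return (node, warning_time, percent, projected_eviction_time) tuples."""
    # Drop leading mail-quote markers, then collapse whitespace so that a
    # record wrapped across two lines becomes one contiguous string.
    unquoted = re.sub(r"(?m)^[> \t]+", "", log_text)
    flat = re.sub(r"\s+", " ", unquoted)

    rows = []
    for ts, node, _num, pct, left in WARNING.findall(flat):
        seen = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S.%f")
        # Projected eviction = warning time + remaining seconds in the message.
        projected = seen + timedelta(seconds=float(left))
        rows.append((node, seen, int(pct), projected))
    return rows
```

Run against the excerpt below, the 50% and 75% warnings both project to 16:04:43.730, which agrees with the "Eviction started" trace at 16:04:43.731, so that is the minute to check in the OSWatcher archives on every node.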

On 8/25/08, Andrew Kerber <andrew.kerber_at_gmail.com> wrote:
>
> I would have thought that the problem would have appeared earlier, but here
> is one possible scenario where increasing the SGA might solve the problem
> (and by the way, these days an 800M SGA is pretty small):
>
> The "heartbeat" is really just a response to a ping, i.e., asking "are you
> still there?" and answering "yes, I am here". If the CPU on one server hits
> max utilization due to query processing (hard parses) and is too busy to
> respond, the node could get evicted. Adding space to the SGA will increase
> the cache size, thus reducing the hard parses and giving the CPUs cycles to
> respond to the heartbeat.
>
> On Mon, Aug 25, 2008 at 7:22 AM, sathish balasubramaniam <
> sat0789_at_gmail.com> wrote:
>
>> Hello All,
>> We have a 6-node RAC on 10g Release 2 / Windows 2003 64-bit. It was working
>> well in all respects.
>> About 3 weeks back (3 days before I was to go on vacation) the SA
>> needed to add more power modules, so the entire system (including the SAN)
>> was powered down and then brought back up. The DB machines by themselves
>> have been through a complete reboot before without any issues. This time it
>> was the entire IT system.
>> Two days after that, all of a sudden, we started seeing node
>> eviction issues. Every day one node would get evicted, but the machine would
>> not go down. The typical messages seen were (below is the message from
>> ocssd.log on node 2):
>> ----------------
>> [ CSSD]2008-07-27 16:04:14.605 [5540] >WARNING: clssnmPollingThread:
>> node serv-db01 (1) at 50% heartbeat fatal, eviction in 29.125 seconds
>> [ CSSD]2008-07-27 16:04:29.605 [5540] >WARNING: clssnmPollingThread:
>> node serv-db01 (1) at 75% heartbeat fatal, eviction in 14.125 seconds
>> [ CSSD]2008-07-27 16:04:38.606 [5540] >WARNING: clssnmPollingThread:
>> node serv-db01 (1) at 90% heartbeat fatal, eviction in 5.125 seconds
>> [ CSSD]2008-07-27 16:04:39.606 [5540] >WARNING: clssnmPollingThread:
>> node serv-db01 (1) at 90% heartbeat fatal, eviction in 4.125 seconds
>> [ CSSD]2008-07-27 16:04:40.606 [5540] >TRACE: clssnmPollingThread:
>> node serv-db01 (1) is impending reconfig
>> [ CSSD]2008-07-27 16:04:40.606 [5540] >WARNING: clssnmPollingThread:
>> node serv-db01 (1) at 90% heartbeat fatal, eviction in 3.125 seconds
>> [ CSSD]2008-07-27 16:04:40.606 [5540] >TRACE: clssnmPollingThread:
>> diskTimeout set to (57000)ms impending reconfig status(1)
>> [ CSSD]2008-07-27 16:04:41.606 [5540] >TRACE: clssnmPollingThread:
>> node serv-db01 (1) is impending reconfig
>> [ CSSD]2008-07-27 16:04:41.606 [5540] >WARNING: clssnmPollingThread:
>> node serv-db01 (1) at 90% heartbeat fatal, eviction in 2.125 seconds
>> [ CSSD]2008-07-27 16:04:42.606 [5540] >TRACE: clssnmPollingThread:
>> node serv-db01 (1) is impending reconfig
>> [ CSSD]2008-07-27 16:04:42.606 [5540] >WARNING: clssnmPollingThread:
>> node serv-db01 (1) at 90% heartbeat fatal, eviction in 1.125 seconds
>> [ CSSD]2008-07-27 16:04:43.606 [5540] >TRACE: clssnmPollingThread:
>> node serv-db01 (1) is impending reconfig
>> [ CSSD]2008-07-27 16:04:43.606 [5540] >WARNING: clssnmPollingThread:
>> node serv-db01 (1) at 90% heartbeat fatal, eviction in 0.125 seconds
>> [ CSSD]2008-07-27 16:04:43.731 [5540] >TRACE: clssnmPollingThread:
>> node serv-db01 (1) is impending reconfig
>> [ CSSD]2008-07-27 16:04:43.731 [5540] >TRACE: clssnmPollingThread:
>> Eviction started for node serv-db01 (1), flags 0x000f, state 3, wt4c 0
>> [ CSSD]2008-07-27 16:04:43.731 [5640] >TRACE: clssnmDoSyncUpdate:
>> Initiating sync 8
>> [ CSSD]2008-07-27 16:04:43.731 [5640] >TRACE: clssnmDoSyncUpdate:
>> diskTimeout set to (57000)ms
>> [ CSSD]2008-07-27 16:04:43.731 [5640] >TRACE: clssnmSetupAckWait: Ack
>> message type (11)
>> [ CSSD]2008-07-27 16:04:43.731 [5640] >TRACE: clssnmSetupAckWait:
>> node(1) is ALIVE
>> [ CSSD]2008-07-27 16:04:43.731 [5640] >TRACE: clssnmSetupAckWait:
>> node(2) is ALIVE
>> [ CSSD]2008-07-27 16:04:43.731 [5640] >TRACE: clssnmSetupAckWait:
>> node(3) is ALIVE
>> [ CSSD]2008-07-27 16:04:43.731 [5640] >TRACE: clssnmSetupAckWait:
>> node(4) is ALIVE
>> [ CSSD]2008-07-27 16:04:43.731 [5640] >TRACE: clssnmSetupAckWait:
>> node(5) is ALIVE
>> [ CSSD]2008-07-27 16:04:43.731 [5640] >TRACE: clssnmSendSync:
>> syncSeqNo(8)
>> [ CSSD]2008-07-27 16:04:43.731 [5648] >TRACE: clssnmHandleSync:
>> Acknowledging sync: src[2] srcName[serv-db02] seq[1] sync[8]
>> [ CSSD]2008-07-27 16:04:43.731 [5648] >TRACE: clssnmHandleSync:
>> diskTimeout set to (57000)ms
>> [ CSSD]2008-07-27 16:04:43.731 [4340] >USER: NMEVENT_SUSPEND
>> [00][00][00][3e]
>> [ CSSD]2008-07-27 16:04:43.731 [5640] >TRACE: clssnmWaitForAcks: Ack
>> message type(11), ackCount(4)
>> [ CSSD]2008-07-27 16:04:43.731 [5640] >TRACE: clssnmWaitForAcks:
>> node(1) is expiring, msg type(11)
>> [ CSSD]2008-07-27 16:04:43.731 [5640] >TRACE: clssnmWaitForAcks:
>> done, msg type(11)
>> [ CSSD]2008-07-27 16:04:43.731 [5640] >TRACE: clssnmDoSyncUpdate:
>> Terminating node 1, serv-db01, misstime(60000) state(3)
>> [ CSSD]2008-07-27 16:04:43.731 [5640] >TRACE: clssnmSetupAckWait: Ack
>> message type (13)
>>
>> ------------------------------------------------------
>> No information was written to the alert logs on any of the nodes.
>> We contacted Oracle Support and they said it's a network issue, etc.,
>> but my SA was adamant that it's an Oracle problem. Anyway, I went on my
>> vacation. There was a suggestion (the SA had an Oracle contact) that the SGA
>> needed to be increased. It was at 800 MB per node. My junior DBA was forced
>> to raise it to 2 GB on each node based on the SA's suggestion. Then, all of a
>> sudden, from the next day, node eviction stopped.
>> I still cannot believe that increasing the SGA has anything to do
>> with node eviction. I told my upper management that node eviction has nothing
>> to do with the SGA, but the consensus in my IT dept is that the SGA increase
>> solved the issue. Does anybody think there is any connection between the
>> increase in SGA and node eviction? I have read the node eviction papers on
>> MetaLink and they do not mention the SGA at all.
>>
>> I would really appreciate any help in this regard.
>>
>> Thank You,
>>
>> Sat
>>
>>
>>
>>
>
>
>
> --
> Andrew W. Kerber
>
> 'If at first you dont succeed, dont take up skydiving.'
>
>

--
http://www.freelists.org/webpage/oracle-l
Received on Mon Aug 25 2008 - 07:45:59 CDT
