Re: CRS-1615:voting device hang at 50% fatal, termination in 99620 ms

From: Marko Sutic <marko.sutic_at_gmail.com>
Date: Fri, 26 Aug 2011 10:34:00 +0200
Message-ID: <CAMD6WPe4xyuCpqJK82x3-b181pR_=X1y4kcBpMUwmA98u_LU4g_at_mail.gmail.com>



Hi David,

/var/log/messages is full of all kinds of messages and I cannot tell which ones are important.

I will attach an excerpt of the log file covering the period during the import and when the failure occurred.

If you notice anything odd, please let me know.
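Since it is hard to tell what matters in /var/log/messages, one option is to filter it down to storage- and OCFS2-related lines first. A minimal sketch; the function name, the grep pattern, and the sample log lines below are illustrative, not taken from the real system:

```shell
#!/bin/bash
# Sketch: narrow a /var/log/messages excerpt down to storage/OCFS2-related
# lines. The pattern covers the OCFS2 kernel components (ocfs2, o2net, o2hb,
# o2cb) plus generic I/O, SCSI and multipath errors; extend it as needed.
filter_storage_msgs() {
  grep -iE 'ocfs2|o2net|o2hb|o2cb|scsi|i/o error|multipath|hung_task' "$1"
}

# Demonstration against a small made-up excerpt (illustrative lines only):
cat > /tmp/messages.sample <<'EOF'
Aug 25 10:38:12 l01ora3 kernel: o2net: connection to node l01ora4 (num 2) has been idle for 30.0 seconds
Aug 25 10:38:20 l01ora3 kernel: end_request: I/O error, dev sdc, sector 12345
Aug 25 10:38:25 l01ora3 sshd[1234]: Accepted publickey for oracle
EOF
filter_storage_msgs /tmp/messages.sample
# keeps the o2net and I/O error lines, drops the unrelated sshd line
```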

Regards,
Marko

On Fri, Aug 26, 2011 at 12:14 AM, David Barbour <david.barbour1_at_gmail.com> wrote:

> Anything in /var/log/messages?
>
> On Thu, Aug 25, 2011 at 5:42 AM, Marko Sutic <marko.sutic_at_gmail.com> wrote:
>
>> Freek,
>>
>> you are correct - the heartbeat fatal messages are there due to the missing
>> voting disk.
>>
>> I have another database up and running on the second node, and it uses the
>> same OCFS2 volume for Oracle database files as the first one.
>> That database is running without any errors, so I assume the other OCFS2
>> volumes were accessible at the time of the failure.
>>
>> In this configuration, the 3 voting disk files are located on 3 different
>> LUNs, each on a separate OCFS2 volume. When the failure occurs, two of the
>> three voting devices hang.
>>
>> It is also worth mentioning that nothing else is running on that node
>> except the import.
>>
>>
>> I simply can't figure out why two of the three voting disks hang.
>>
>>
>> Regards,
>> Marko
>>
>>
>> On Thu, Aug 25, 2011 at 11:08 AM, D'Hooge Freek <Freek.DHooge_at_uptime.be> wrote:
>>
>>> Marko,
>>>
>>> I don't know the error timings for the other node, but I think the
>>> heartbeat fatal messages are coming after the first node has terminated due
>>> to the missing voting disk.
>>>
>>> This would indicate that there is no general problem with the voting disk
>>> itself, but that the problem is specific to the first node.
>>> The cause would then be either the connection itself, the load, or an
>>> OCFS2 bug.
>>>
>>> Do you know whether the other OCFS2 volumes were still accessible at the
>>> time of the failure?
>>> Are your voting disks placed on the same luns as your database files or
>>> are they on a separate ocfs2 volume?
>>>
>>> Regards,
>>>
>>>
>>> Freek D'Hooge
>>> Uptime
>>> Oracle Database Administrator
>>> email: freek.dhooge_at_uptime.be
>>> tel +32(0)3 451 23 82
>>> http://www.uptime.be
>>> disclaimer: www.uptime.be/disclaimer
>>> ---
>>> From: Marko Sutic [mailto:marko.sutic_at_gmail.com]
>>> Sent: donderdag 25 augustus 2011 10:51
>>> To: D'Hooge Freek
>>> Cc: oracle-l_at_freelists.org
>>> Subject: Re: CRS-1615:voting device hang at 50% fatal, termination in
>>> 99620 ms
>>>
>>> Errors messages from another node:
>>>
>>> 2011-08-25 10:38:33.563
>>> [cssd(18117)]CRS-1612:node l01ora3 (1) at 50% heartbeat fatal, eviction
>>> in 14.000 seconds
>>> 2011-08-25 10:38:40.558
>>> [cssd(18117)]CRS-1611:node l01ora3 (1) at 75% heartbeat fatal, eviction
>>> in 7.010 seconds
>>> 2011-08-25 10:38:41.560
>>> [cssd(18117)]CRS-1611:node l01ora3 (1) at 75% heartbeat fatal, eviction
>>> in 6.010 seconds
>>> 2011-08-25 10:38:45.558
>>> [cssd(18117)]CRS-1610:node l01ora3 (1) at 90% heartbeat fatal, eviction
>>> in 2.010 seconds
>>> 2011-08-25 10:38:46.560
>>> [cssd(18117)]CRS-1610:node l01ora3 (1) at 90% heartbeat fatal, eviction
>>> in 1.010 seconds
>>> 2011-08-25 10:38:47.562
>>> [cssd(18117)]CRS-1610:node l01ora3 (1) at 90% heartbeat fatal, eviction
>>> in 0.010 seconds
>>> 2011-08-25 10:38:47.574
>>> [cssd(18117)]CRS-1607:CSSD evicting node l01ora3. Details in
>>> /u01/app/crs/log/l01ora4/cssd/ocssd.log.
>>> 2011-08-25 10:39:01.579
>>> [cssd(18117)]CRS-1601:CSSD Reconfiguration complete. Active nodes are
>>> l01ora4 .
>>>
>>>
>>> Regards,
>>> Marko
>>>
>>
>>
>>
>>



--
http://www.freelists.org/webpage/oracle-l


Received on Fri Aug 26 2011 - 03:34:00 CDT
