Re: CRS-1615:voting device hang at 50% fatal, termination in 99620 ms

From: Marko Sutic <marko.sutic_at_gmail.com>
Date: Fri, 26 Aug 2011 18:05:35 +0200
Message-ID: <CAMD6WPeDxKXOJzp69K6wy1VvL=jtzisEoRdaGDNKQeFDfh_A2A_at_mail.gmail.com>



Freek,

there were lots of "tur checker reports path is down" errors.

Also, when I filtered /var/log/messages for "I/O error", there were lots of messages like:
...
Aug 25 22:58:55 l01ora3 kernel: end_request: I/O error, dev sdcn, sector 2367
Aug 25 22:58:55 l01ora3 kernel: end_request: I/O error, dev sdbs, sector 2247
Aug 25 22:58:55 l01ora3 kernel: end_request: I/O error, dev sdbk, sector 5103799
Aug 25 22:58:55 l01ora3 kernel: end_request: I/O error, dev sdbk, sector 6193375
Aug 25 22:58:55 l01ora3 kernel: end_request: I/O error, dev sdci, sector 2367
Aug 25 22:58:56 l01ora3 kernel: end_request: I/O error, dev sdci, sector 9330267
...
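
(In case it is useful, something like the command below counts these errors per device; it assumes the exact message format shown above, so the awk field number may need adjusting for a different syslog layout:)

grep "end_request: I/O error" /var/log/messages \
  | awk '{print $10}' | tr -d ',' \
  | sort | uniq -c | sort -rn      # error count per device (sdbk, sdci, ...)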

The sysadmins replaced the FC switch and rebooted the nodes. It seems that not all paths had been activated, or something like that. (I will ask for a detailed answer next week.)
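
(For reference, the path state can be checked on the nodes with something like the following; the exact wording of the state depends on the device-mapper-multipath version, so take this as a rough sketch:)

multipath -ll    # lists every LUN with its paths and their state;
                 # healthy paths should show as active/ready,
                 # lost paths as failed/faulty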

After that, these errors disappeared and the import has now been running for an hour or two without failing (it is still active).

I've also collected OS statistics and asked Oracle support for help.
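
(The snapshots below look like OSWatcher-style output, with "zzz ***" timestamp markers. A minimal collector producing the same shape of output, with a made-up file name, would be something like:)

while true; do
  echo "zzz ***$(date)"   # timestamp marker, same format as in the snapshots
  vmstat 1 3              # header plus three samples; the first one is the since-boot average
  sleep 7                 # roughly one snapshot every 10 seconds
done >> /tmp/vmstat_$(hostname).dat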

Answer from Oracle support:



Please find my analysis below.

zzz ***Thu Aug 25 22:09:28 CEST 2011
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r b swpd free buff cache si so bi bo in cs us sy id wa st
 6 7 0 59341556 164072 5250912 0 0 107 20 2 32 0 0 99 1 0
 0 7 0 59338752 164072 5250904 0 0 0 0 1017 5378 0 0 87 12 0
 0 7 0 59338628 164072 5250904 0 0 0 8 1016 5168 0 0 87 13 0

++ From the above we can see that there might be some disk problems, because b > r

++ bi and bo having high values indicates that there is a problem with I/O or the storage.

The vmstat "b" column

Any count in the "b" column of vmstat is indicative of threads blocked via sema_p in the biowait state.

This indicates that an I/O call has been made and that Solaris is waiting on a return response.

Note: I/O can be blocked prior to waiting on a return response. In particular, if the number of commands active on a device is greater than or equal to the value of sd_max_throttle, all threads requesting I/O through that device will block prior to waiting for the I/O. This will not be reflected in the "b" column count.

++ No heavy CPU usage

++ Nothing wrong I can find in the top output

++ The sda and sda2 devices have heavy writes for some seconds.

Please check with your system admin why there are I/O issues on the system. If you would like to raise a new SR with storage regarding this, that's fine.
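
(Side note: the explanation above refers to Solaris, but on Linux the "b" column likewise counts processes in uninterruptible sleep (D state). A rough way to see which processes are blocked at a given moment:)

# processes currently in uninterruptible sleep (state D) --
# these are what vmstat's "b" column counts on Linux
ps -eo state,pid,wchan:30,cmd | awk '$1 == "D"'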


My current vmstat statistics are:

....
zzz ***Fri Aug 26 14:18:39 CEST 2011
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r b swpd free buff cache si so bi bo in cs us sy id wa st
 6 5 0 59010308 222532 5545300 0 0 436 90 23 24 1 0 98 2 0
 0 5 0 59008336 222532 5545332 0 0 17 468 1077 4482 0 0 93 6 0
 0 5 0 59008088 222532 5545356 0 0 2 1 1019 4189 0 0 94 6 0
zzz ***Fri Aug 26 14:18:49 CEST 2011
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r b swpd free buff cache si so bi bo in cs us sy id wa st
 5 5 0 59010852 222540 5545516 0 0 436 90 23 24 1 0 98 2 0
 0 5 0 59007824 222544 5545532 0 0 1 369 1132 3625 0 0 93 6 0
 0 5 0 59007576 222544 5545536 0 0 2 1 1021 3219 0 0 94 6 0
zzz ***Fri Aug 26 14:18:59 CEST 2011
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r b swpd free buff cache si so bi bo in cs us sy id wa st
 5 5 0 59009780 222548 5545652 0 0 436 90 23 24 1 0 98 2 0
 0 5 0 59006268 222548 5545640 0 0 0 449 1034 4368 0 0 93 6 0
 0 5 0 59006376 222548 5545640 0 0 19 50 1035 3300 0 0 94 6 0
zzz ***Fri Aug 26 14:19:09 CEST 2011
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r b swpd free buff cache si so bi bo in cs us sy id wa st
 5 6 0 59000056 222580 5545700 0 0 436 90 23 24 1 0 98 2 0
 0 6 0 59001064 222580 5545720 0 0 0 364 1040 5069 0 0 93 6 0
 0 6 0 59006064 222584 5545676 0 0 1 100 1048 4974 0 0 94 6 0
zzz ***Fri Aug 26 14:19:19 CEST 2011
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r b swpd free buff cache si so bi bo in cs us sy id wa st
 5 0 0 59010232 222668 5545780 0 0 436 90 23 24 1 0 98 2 0
 0 7 0 59005660 222672 5545932 0 0 14337 197 1233 3823 2 1 89 8 0
 1 6 0 59005536 222680 5545984 0 0 28 470 1038 3290 0 0 87 13 0
zzz ***Fri Aug 26 14:19:29 CEST 2011
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r b swpd free buff cache si so bi bo in cs us sy id wa st
 6 6 0 58998056 222680 5546032 0 0 436 90 23 24 1 0 98 2 0
 0 7 0 58995104 222680 5545996 0 0 16 216 1030 4447 0 0 87 13 0
 0 6 0 58995488 222680 5546056 0 0 3 469 1046 4248 0 0 87 13 0
....

I hope that this is OK now.

Thank you for your help.

Regards,
Marko

On Fri, Aug 26, 2011 at 5:07 PM, D'Hooge Freek <Freek.DHooge_at_uptime.be> wrote:

> Marco,
>
> Your system is complaining about lost disks and io paths.
> I also see that your multipathing is configured so that it will wait when
> all paths to a disk are lost (queue_if_no_path). When that happens, the
> clusterware will start reporting inaccessible voting disks, while the other
> processes will (appear to) hang.
>
> Can you check if you still get errors like: "tur checker reports path is
> down" or "kernel: end_request: I/O error,...".
> If not, check if they start to appear when you put some load on the io
> subsystem.
>
>
> Regards,
>
>
> Freek D'Hooge
> Uptime
> Oracle Database Administrator
> email: freek.dhooge_at_uptime.be
> tel +32(0)3 451 23 82
> http://www.uptime.be
> disclaimer: www.uptime.be/disclaimer
> ---
> From: oracle-l-bounce_at_freelists.org [mailto:oracle-l-bounce_at_freelists.org]
> On Behalf Of Marko Sutic
> Sent: vrijdag 26 augustus 2011 10:34
> To: David Barbour
> Cc: oracle-l_at_freelists.org
> Subject: Re: CRS-1615:voting device hang at 50% fatal, termination in 99620
> ms
>
> Hi David,
>
> /var/log/messages is stuffed with various messages and I cannot identify
> what is important to look for.
>
> I will attach excerpt of log file from the period during import and when
> failure occurred.
>
> If you notice something odd please let me know.
>
> Regards,
> Marko
>
>
> On Fri, Aug 26, 2011 at 12:14 AM, David Barbour <david.barbour1_at_gmail.com>
> wrote:
> Anything in /var/log/messages?
>
> On Thu, Aug 25, 2011 at 5:42 AM, Marko Sutic <marko.sutic_at_gmail.com>
> wrote:
> Freek,
>
> you are correct - heartbeat fatal messages are there due to the missing
> voting disk.
>
> I have another database up and running on the second node, and this database is
> using the same ocfs2 volume for Oracle database files as the first one.
> This database is running without any errors, so I suppose the other OCFS2
> volumes were accessible at the time of the failure.
>
> In this configuration there are 3 voting disk files, located on 3 different luns
> and separate OCFS2 volumes. When the failure occurs, two of the three voting devices
> hang.
>
> It is also worth mentioning that nothing else is running on that node
> except the import.
>
>
> I simply can't figure out why two of three voting disks hang.
>
>
> Regards,
> Marko
>
>
> On Thu, Aug 25, 2011 at 11:08 AM, D'Hooge Freek <Freek.DHooge_at_uptime.be>
> wrote:
> Marco,
>
> I don't know the error timings for the other node, but I think the
> heartbeat fatal messages are coming after the first node has terminated due
> to the missing voting disk.
>
> This would indicate that there is no general problem with the voting disk
> itself, but that the problem is specific to the first node.
> Either the connection itself or the load or an ocfs2 bug would then be the
> cause of the error.
>
> Do you know if at the time of the failure the other OCFS2 volumes were
> still accessible?
> Are your voting disks placed on the same luns as your database files or are
> they on a separate ocfs2 volume?
>
> Regards,
>
>
> Freek D'Hooge
> Uptime
> Oracle Database Administrator
> email: freek.dhooge_at_uptime.be
> tel +32(0)3 451 23 82
> http://www.uptime.be
> disclaimer: www.uptime.be/disclaimer
> ---
> From: Marko Sutic [mailto:marko.sutic_at_gmail.com]
> Sent: donderdag 25 augustus 2011 10:51
> To: D'Hooge Freek
> Cc: oracle-l_at_freelists.org
> Subject: Re: CRS-1615:voting device hang at 50% fatal, termination in 99620
> ms
>
> Errors messages from another node:
>
> 2011-08-25 10:38:33.563
> [cssd(18117)]CRS-1612:node l01ora3 (1) at 50% heartbeat fatal, eviction in
> 14.000 seconds
> 2011-08-25 10:38:40.558
> [cssd(18117)]CRS-1611:node l01ora3 (1) at 75% heartbeat fatal, eviction in
> 7.010 seconds
> 2011-08-25 10:38:41.560
> [cssd(18117)]CRS-1611:node l01ora3 (1) at 75% heartbeat fatal, eviction in
> 6.010 seconds
> 2011-08-25 10:38:45.558
> [cssd(18117)]CRS-1610:node l01ora3 (1) at 90% heartbeat fatal, eviction in
> 2.010 seconds
> 2011-08-25 10:38:46.560
> [cssd(18117)]CRS-1610:node l01ora3 (1) at 90% heartbeat fatal, eviction in
> 1.010 seconds
> 2011-08-25 10:38:47.562
> [cssd(18117)]CRS-1610:node l01ora3 (1) at 90% heartbeat fatal, eviction in
> 0.010 seconds
> 2011-08-25 10:38:47.574
> [cssd(18117)]CRS-1607:CSSD evicting node l01ora3. Details in
> /u01/app/crs/log/l01ora4/cssd/ocssd.log.
> 2011-08-25 10:39:01.579
> [cssd(18117)]CRS-1601:CSSD Reconfiguration complete. Active nodes are
> l01ora4 .
>
>
> Regards,
> Marko
>
>
>
>
>
>

--
http://www.freelists.org/webpage/oracle-l
Received on Fri Aug 26 2011 - 11:05:35 CDT
