Re: RAC node "has a disk HB, but no network HB" but traceroute reports no problem

From: Justin Mungal <justin_at_n0de.ws>
Date: Wed, 4 Jan 2017 01:13:22 -0600
Message-ID: <CAO9=aUymHt_6h+nB-uMUkEzdNh2fGgpgs8ittYank+cpktk17A_at_mail.gmail.com>



"no network HB" means that the Network Heartbeat is failing for some reason.

This is rather anecdotal, but a RAC that my co-worker is responsible for was evicting nodes in a similar manner (in a similar environment, too) without any evident network problems. His theory was that the heartbeats were failing because the interconnect was not responding fast enough under the existing database activity. He enabled jumbo frames and the problem went away. In other words, the IP stack was busy fragmenting large interconnect messages into standard-size frames and reassembling them on the other side, which slowed the heartbeat responses. Enabling jumbo frames reduced that overhead.

Recommendation for the Real Application Cluster Interconnect and Jumbo Frames (Doc ID 341788.1)
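
If you do try jumbo frames, it helps to confirm that large packets actually pass unfragmented between the private addresses. Below is a minimal sketch of how the test command could be built. It assumes Linux iputils ping (where -M do forbids fragmentation and -s sets the ICMP payload size) and an MTU of 9000, which is only a common choice and not something taken from your configuration. The address is the d1prpcrndb1aic1 one from the traceroute output; the function names are made up for illustration.

```python
import shlex

IP_HEADER = 20    # bytes, IPv4 header without options
ICMP_HEADER = 8   # bytes, ICMP echo header


def icmp_payload_for_mtu(mtu: int) -> int:
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    if mtu <= IP_HEADER + ICMP_HEADER:
        raise ValueError(f"MTU {mtu} is too small to carry an ICMP echo")
    return mtu - IP_HEADER - ICMP_HEADER


def df_ping_command(host: str, mtu: int, count: int = 3) -> str:
    """Build a ping command that sets Don't Fragment and fills one full frame."""
    payload = icmp_payload_for_mtu(mtu)
    parts = ["ping", "-M", "do", "-s", str(payload), "-c", str(count),
             shlex.quote(host)]
    return " ".join(parts)


if __name__ == "__main__":
    # 1500 is the standard Ethernet MTU; 9000 is an assumed jumbo-frame MTU.
    for mtu in (1500, 9000):
        print(df_ping_command("10.114.21.2", mtu))
```

Note that the OSWatcher traceroute probes are only 60 bytes, so they would not exercise this path at all. If the full-size ping fails while the small one succeeds, some hop on the interconnect is probably not passing frames of that size.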

You can also investigate your timeout settings and adjust them, but Oracle generally doesn't recommend this and will probably just tell you to install 11.2.0.4 instead.

CSS Timeout Computation in Oracle Clusterware (Doc ID 294430.1)
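
To see which timeout is currently in effect, you can work it out from the ocssd.log lines you posted. This is a rough sketch under my own reading of the log, not anything from Oracle's documentation: I am assuming that misstime is in milliseconds and that misstime plus the "removal in" value adds up to the network heartbeat timeout (misscount). The function name and regular expressions are just for illustration.

```python
import re

# Matches e.g. "at 50% heartbeat fatal, removal in 14.760 seconds"
POLL_RE = re.compile(r"heartbeat fatal, removal in (?P<remaining>\d+\.\d+) seconds")
# Matches e.g. "misstime 15240" (assumed to be milliseconds)
MISS_RE = re.compile(r"misstime (?P<misstime>\d+)")


def implied_misscount_ms(poll_line: str, reconfig_line: str) -> int:
    """Add time already missed to time remaining before removal."""
    poll = POLL_RE.search(poll_line)
    miss = MISS_RE.search(reconfig_line)
    if poll is None or miss is None:
        raise ValueError("log lines do not look like clssnmPollingThread output")
    remaining_ms = round(float(poll["remaining"]) * 1000)
    return int(miss["misstime"]) + remaining_ms


if __name__ == "__main__":
    # The two clssnmPollingThread lines from 02:03:06.307, unwrapped.
    poll_line = ("clssnmPollingThread: node d1prpcrndb1a (1) at 50% heartbeat "
                 "fatal, removal in 14.760 seconds")
    reconfig_line = ("clssnmPollingThread: node d1prpcrndb1a (1) is impending "
                     "reconfig, flag 2493454, misstime 15240")
    print(implied_misscount_ms(poll_line, reconfig_line), "ms")
```

With your lines this gives 15240 + 14760 = 30000 ms. If my reading is right, that matches the 30-second misscount I believe is the default on Linux, which would mean the network heartbeat had already been missing for about 15 seconds when the 50% warning was logged.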

On Tue, Jan 3, 2017 at 4:20 PM, Yong Huang <dmarc-noreply_at_freelists.org> wrote:

> Oracle and GI (grid infrastructure) 11.2.0.3 on 64-bit Red Hat Linux 6.6.
> Cisco UCS.
>
> Node 2 of a 2-node RAC crashed. Log ocssd.log shows:
>
> 2016-12-18 02:03:06.307: [ CSSD][499648256]clssnmPollingThread: node
> d1prpcrndb1a (1) at 50% heartbeat fatal, removal in 14.760 seconds
> 2016-12-18 02:03:06.307: [ CSSD][499648256]clssnmPollingThread: node
> d1prpcrndb1a (1) is impending reconfig, flag 2493454, misstime 15240
> 2016-12-18 02:03:06.307: [ CSSD][499648256]clssnmPollingThread: local
> diskTimeout set to 27000 ms, remote disk timeout set to 27000, impending
> reconfig status(1)
> 2016-12-18 02:03:06.307: [ CSSD][510686976]clssnmvDHBValidateNcopy: node
> 1, d1prpcrndb1a, has a disk HB, but no network HB, DHB has rcfg 306434975,
> wrtcnt, 197140394, LATS 4040636964, lastSeqNo 185041690, uniqueness
> 1468029747, timestamp 1482048185/1112586906
> ...[some lines snipped here]...
> 2016-12-18 02:03:28.094: [ CSSD][510686976]clssnmvDHBValidateNcopy: node
> 1, d1prpcrndb1a, has a disk HB, but no network HB, DHB has rcfg 306434975,
> wrtcnt, 197140475, LATS 4040658754, lastSeqNo 197140472, uniqueness
> 1468029747, timestamp 1482048207/1112608986
>
> We installed Oracle's OSWatcher and enabled traceroute for the private
> network, which shows no error during the time:
>
> zzz ***Sun Dec 18 02:03:28 CST 2016
> traceroute to dcprpcrndb1bic1 (10.114.21.3), 30 hops max, 60 byte packets
> 1 dcprpcrndb1bic1 (10.114.21.3) 0.020 ms 0.008 ms 0.004 ms
> traceroute to dcprpcrndb1bic2 (10.114.21.67), 30 hops max, 60 byte packets
> 1 dcprpcrndb1bic2 (10.114.21.67) 0.020 ms 0.006 ms 0.004 ms
> traceroute to d1prpcrndb1aic1 (10.114.21.2), 30 hops max, 60 byte packets
> 1 d1prpcrndb1aic1 (10.114.21.2) 0.262 ms 0.259 ms 0.255 ms
> traceroute to d1prpcrndb1aic2 (10.114.21.66), 30 hops max, 60 byte packets
> 1 d1prpcrndb1aic2 (10.114.21.66) 0.135 ms 0.123 ms 0.110 ms
>
> If traceroute never reports a problem, what does "no network HB" in
> ocssd.log mean? At 02:03:28, we see both "no network HB" and successful
> traceroute pings. This is not the first time we have had this problem. The
> network team never finds any issue, consistent with the traceroute report.
>
> OSWatcher traceroute has only basic options:
> traceroute -r -F <private network IP>
> where -r means "Bypass the normal routing tables and send directly to a
> host on an attached network". -F means "Do not fragment probe packets".
>
> /var/log/messages reports no problem at the time. It only starts to show
> problems after the cluster already decides on eviction.
>
> Yong Huang
> --
>
http://www.freelists.org/webpage/oracle-l

Received on Wed Jan 04 2017 - 08:13:22 CET
