Grid Infrastructure and RAC: the cluster interconnect

A necessary part of a cluster is the cluster interconnect: the private network between the cluster nodes. This component is critical for both reliability and scalability, and configuring it for fault tolerance and capacity is a matter of burning importance for the DBA and their System Administrator (not to mention the end users). One question is: should this be managed by the operating system, or by Oracle?

The interconnect is used by the clusterware for the messaging necessary to maintain the integrity of the cluster, and also by database instances for global cache management. The clusterware traffic most critical for reliability is the messaging between the CSS (Cluster Synchronization Services) daemons on each node. These messages allow the nodes to discover each other and build up the topology of the cluster. Any failure of the interconnect will cause the nodes to lose contact with each other, and a node eviction will follow: at best, the cluster services will shut down along with all managed resources; at worst, a node will reboot. The traffic most critical for performance and scalability is cache fusion: the copying of block images between instances as part of global cache management, plus the messaging needed to manage global enqueues. If the bandwidth is not adequate for the level and distribution of activity, sessions will hang on various wait events to do with the global cache and global enqueue services.
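
A quick way of seeing whether the interconnect is hurting the database is to look at the Cluster wait class. This is a minimal sketch (assuming SQL*Plus is on the path and operating system authentication as SYSDBA); the events you see, and the times that matter, will vary with the workload:

sqlplus -s "/ as sysdba" <<'EOF'
set pagesize 50 linesize 120
column event format a40
-- The Cluster wait class covers the global cache (gc) and global enqueue
-- waits; large times here are the classic symptom of an undersized or
-- misbehaving interconnect
select inst_id, event, total_waits,
       round(time_waited_micro/1e6,1) as seconds_waited
from   gv$system_event
where  wait_class = 'Cluster'
order  by time_waited_micro desc;
exit
EOF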

The straightforward answer to scalability and reliability problems on the interconnect is to install a second (or a third, or fourth, or fifth....) network link. If each link is (for example) a 10Gbit Ethernet card, a second such card will double the carrying capacity of the interconnect to 20Gbit per second, and halve the likelihood of a complete interconnect failure (assuming, of course, that the switch has adequate capacity and is itself not liable to failure). No problem, except that a decision must be made on how to manage multiple network interface cards (NICs).

Traditionally, network fault tolerance has been managed by the system administrator using network link aggregation. This has been available with Linux (where it is usually referred to as "NIC bonding") since the 2.x kernel, and with Windows (usually known as "NIC teaming") since Server 2012 (I think those releases are correct). This is a Linux example, which bonds interfaces eth1 and eth2 into a single interface named bond0:

modprobe bonding mode=0 miimon=100 # load the bonding module
ifconfig eth1 down	# bring down the eth1 interface
ifconfig eth2 down	# bring down the eth2 interface
ifconfig bond0 hw ether 02:11:22:33:44:55	# give bond0 (created when the bonding module loaded) a locally administered MAC address
ifconfig bond0 192.168.55.55 up	# bring up bond0 with an IP address
ifenslave bond0 eth1	# put eth1 into slave mode for bond0
ifenslave bond0 eth2	# put eth2 into slave mode for bond0

From now on, all traffic to 192.168.55.55 on the logical NIC will be balanced by Linux across the two physical NICs, and if either fails all traffic will be directed to the survivor.
Two points to note:
First, the use of mode=0, which enables round-robin load balancing across the slaves. A commonly used alternative is mode=1 (active-backup), where the second NIC is brought into use only on failure of the first.
Second, the MAC address. This should come from the range the IEEE sets aside for locally administered addresses, such as anything beginning with 02.
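
To confirm which bonding mode is in effect and that both slaves are healthy, the kernel reports the state of the bond under /proc. A quick check (output abbreviated, and it will vary with kernel version):

cat /proc/net/bonding/bond0
# Bonding Mode: load balancing (round-robin)
# MII Polling Interval (ms): 100
# Slave Interface: eth1
# MII Status: up
# Slave Interface: eth2
# MII Status: up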

To use link aggregation for the cluster interconnect, pre-create the bonded interface, and select it as the private network at install time.
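
Note that the ifconfig/ifenslave commands shown above do not survive a reboot, so to have the bond available for the installer it needs to be made persistent. On a RHEL-family system that is usually done with ifcfg files, roughly along these lines (the device names and addresses are the ones from the example; adapt them to your own network):

# /etc/sysconfig/network-scripts/ifcfg-bond0
DEVICE=bond0
IPADDR=192.168.55.55
NETMASK=255.255.255.0
BOOTPROTO=none
ONBOOT=yes
BONDING_OPTS="mode=0 miimon=100"

# /etc/sysconfig/network-scripts/ifcfg-eth1 (ifcfg-eth2 is the same, with DEVICE=eth2)
DEVICE=eth1
MASTER=bond0
SLAVE=yes
BOOTPROTO=none
ONBOOT=yes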

From Grid Infrastructure release 11.2.0.2, there is an alternative to link aggregation: let Oracle manage it. Rather than bonding the interconnect NICs, leave them both configured with static fixed addresses and, during the installation dialogue, specify both interfaces as private. Grid Infrastructure will assign a virtual IP address (on the 169.254.x.x link-local subnet) to each NIC, and use its HAIP (Highly Available IP) mechanism to balance traffic across them and manage failover if one fails. The result is equivalent to bonding mode 0; there is not (as far as I know) an active-passive option.
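
To see what Grid Infrastructure has done with the interfaces you nominated, something along these lines should show the picture (assuming $GRID_HOME points at the Grid Infrastructure home; the output shown in comments is illustrative):

# Interfaces registered with the cluster, and their roles
$GRID_HOME/bin/oifcfg getif
# eth1  10.1.1.0  global  cluster_interconnect
# eth2  10.1.1.0  global  cluster_interconnect

# The HAIP resource itself is an "init" resource, hence the -init flag
$GRID_HOME/bin/crsctl stat res ora.cluster_interconnect.haip -init

# From the database side, gv$cluster_interconnects shows the 169.254.x.x
# addresses each instance is actually using
sqlplus -s "/ as sysdba" <<'EOF'
select inst_id, name, ip_address from gv$cluster_interconnects;
exit
EOF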

Which technique should you use? Advice from Uncle Oracle is clear: use HAIP, do not use bonding any more. However, HAIP is limited to four NICs. What if you have more than that? Or what if you want a guarantee of maintained performance in the event of failure of any one NIC? In that case, combine the techniques. Say you have eight NICs: use link aggregation to bond them in pairs (using mode 0 or 1, depending on whether you want maximum bandwidth or guaranteed bandwidth) and when installing Grid Infrastructure select the four bonded NICs for the private network. Easy: the best of both worlds.
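
If the bonded interfaces are being added to an existing cluster rather than selected at install time, they can be registered with oifcfg. A sketch, assuming $GRID_HOME is the Grid Infrastructure home and four bonds on a 10.1.1.0 private subnet:

# Register each bonded interface as a cluster interconnect; HAIP will then
# spread traffic across them (device names and subnet are illustrative)
$GRID_HOME/bin/oifcfg setif -global bond0/10.1.1.0:cluster_interconnect
$GRID_HOME/bin/oifcfg setif -global bond1/10.1.1.0:cluster_interconnect
$GRID_HOME/bin/oifcfg setif -global bond2/10.1.1.0:cluster_interconnect
$GRID_HOME/bin/oifcfg setif -global bond3/10.1.1.0:cluster_interconnect

# Confirm the registration
$GRID_HOME/bin/oifcfg getif

Changes to the interconnect definition normally take effect only after the clusterware has been restarted on each node.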

--
John Watson
Oracle Certified Master DBA
http://skillbuilders.com

----------------------------------------------------------------------------------------------------
Some more detail if anyone is interested:
First, an example of setting up bonding on Linux (note the figures for RX and TX packets on the bond0, eth1, and eth2 NICs, and that bond0 is running as MASTER while eth1 and eth2 are running as SLAVE).
Second, an example of HAIP balancing traffic across two interfaces (note the virtual addresses on eth1 and eth2, and that the RX and TX figures are roughly equal for each).

[root@gold1 ~]# modprobe bonding mode=0 miimon=100
[root@gold1 ~]# ifconfig eth1 down
[root@gold1 ~]# ifconfig eth2 down
[root@gold1 ~]# ifconfig bond0 hw ether 02:11:22:33:44:55
[root@gold1 ~]# ifconfig bond0 192.168.55.55 up
[root@gold1 ~]# ifenslave bond0 eth1
[root@gold1 ~]# ifenslave bond0 eth2
[root@gold1 ~]# ifconfig
bond0     Link encap:Ethernet  HWaddr 02:11:22:33:44:55
          inet addr:192.168.55.55  Bcast:192.168.55.255  Mask:255.255.255.0
          inet6 addr: fe80::11:22ff:fe33:4455/64 Scope:Link
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1
          RX packets:75 errors:1 dropped:0 overruns:0 frame:0
          TX packets:33 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:12813 (12.5 KiB)  TX bytes:4652 (4.5 KiB)

eth0      Link encap:Ethernet  HWaddr 08:00:27:C9:68:BC
          inet addr:192.168.56.11  Bcast:192.168.56.255  Mask:255.255.255.0
          inet6 addr: fe80::a00:27ff:fec9:68bc/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:397 errors:0 dropped:0 overruns:0 frame:0
          TX packets:297 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:40454 (39.5 KiB)  TX bytes:50544 (49.3 KiB)
          Interrupt:19 Base address:0xd000

eth1      Link encap:Ethernet  HWaddr 02:11:22:33:44:55
          UP BROADCAST RUNNING SLAVE MULTICAST  MTU:1500  Metric:1
          RX packets:25 errors:0 dropped:0 overruns:0 frame:0
          TX packets:28 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:3448 (3.3 KiB)  TX bytes:4149 (4.0 KiB)
          Interrupt:16 Base address:0xd040

eth2      Link encap:Ethernet  HWaddr 02:11:22:33:44:55
          UP BROADCAST RUNNING SLAVE MULTICAST  MTU:1500  Metric:1
          RX packets:50 errors:1 dropped:0 overruns:0 frame:0
          TX packets:5 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:9365 (9.1 KiB)  TX bytes:503 (503.0 b)
          Interrupt:17 Base address:0xd060

[oracle@iron2 ~]$ /sbin/ifconfig
eth0      Link encap:Ethernet  HWaddr 08:00:27:73:C4:39
          inet addr:192.168.56.32  Bcast:192.168.56.255  Mask:255.255.255.0
          inet6 addr: fe80::a00:27ff:fe73:c439/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:854 errors:1 dropped:0 overruns:0 frame:0
          TX packets:665 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:167787 (163.8 KiB)  TX bytes:101911 (99.5 KiB)
          Interrupt:19 Base address:0xd000

eth0:1    Link encap:Ethernet  HWaddr 08:00:27:73:C4:39
          inet addr:192.168.56.140  Bcast:192.168.56.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          Interrupt:19 Base address:0xd000

eth0:2    Link encap:Ethernet  HWaddr 08:00:27:73:C4:39
          inet addr:192.168.56.142  Bcast:192.168.56.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          Interrupt:19 Base address:0xd000

eth1      Link encap:Ethernet  HWaddr 08:00:27:2B:D8:40
          inet addr:10.1.1.31  Bcast:10.1.1.255  Mask:255.255.255.0
          inet6 addr: fe80::a00:27ff:fe2b:d840/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:73088 errors:36 dropped:0 overruns:0 frame:0
          TX packets:58459 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:51018384 (48.6 MiB)  TX bytes:32203904 (30.7 MiB)
          Interrupt:16 Base address:0xd040

eth1:1    Link encap:Ethernet  HWaddr 08:00:27:2B:D8:40
          inet addr:169.254.141.188  Bcast:169.254.255.255  Mask:255.255.0.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          Interrupt:16 Base address:0xd040

eth2      Link encap:Ethernet  HWaddr 08:00:27:C0:B0:73
          inet addr:10.1.1.32  Bcast:10.1.1.255  Mask:255.255.255.0
          inet6 addr: fe80::a00:27ff:fec0:b073/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:71786 errors:0 dropped:0 overruns:0 frame:0
          TX packets:57078 errors:2 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:50324011 (47.9 MiB)  TX bytes:31197441 (29.7 MiB)
          Interrupt:17 Base address:0xd060

eth2:1    Link encap:Ethernet  HWaddr 08:00:27:C0:B0:73
          inet addr:169.254.100.187  Bcast:169.254.255.255  Mask:255.255.0.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          Interrupt:17 Base address:0xd060