Keepalived

為什麼我在keepalived中遇到了腦裂問題?

  • October 26, 2020

當我啟動我的 BACKUP keepalived 實例時,它還假定 MASTER 狀態,如下所示:

Mar 28 02:38:05 localhost Keepalived_vrrp[23527]: VRRP_Instance(VI_01) Entering BACKUP STATE
Mar 28 02:38:05 localhost Keepalived_vrrp[23527]: VRRP sockpool: [ifindex(2), proto(112), unicast(1), fd(10,11)]
Mar 28 02:38:05 localhost Keepalived_vrrp[23527]: VRRP_Script(check_haproxy) succeeded
Mar 28 02:38:17 localhost Keepalived_vrrp[23527]: VRRP_Instance(VI_01) Transition to MASTER STATE
Mar 28 02:38:21 localhost Keepalived_vrrp[23527]: VRRP_Instance(VI_01) Entering MASTER STATE

主配置:

# Script used to check if HAProxy is running
vrrp_script check_haproxy {
script "/usr/sbin/pidof haproxy"
interval 2
}
# Virtual interface
# The priority specifies the order in which the assigned interface to take over in a failover
vrrp_instance VI_01 {
state MASTER
interface eth0
advert_int 4
unicast_src_ip 10.1.2.50
unicast_peer {
       10.1.2.51
   }
virtual_router_id 51
priority 150
# The virtual ip address shared between the two loadbalancers
virtual_ipaddress {
   10.1.2.100
}
track_script {
check_haproxy
}

備份配置:

# Script used to check if HAProxy is running
vrrp_script check_haproxy {
script "/usr/sbin/pidof haproxy"
interval 2
}
# Virtual interface
# The priority specifies the order in which the assigned interface to take over in a failover
vrrp_instance VI_01 {
state BACKUP
advert_int 4
interface eth0
unicast_src_ip 10.1.2.51
unicast_peer {
       10.1.2.50
   }
virtual_router_id 51
priority 100
# The virtual ip address shared between the two loadbalancers
virtual_ipaddress {
   10.1.2.100
}
track_script {
check_haproxy
}
}

然後我繼續檢查這兩個實例是否正在相互交談:

掌握

$ tcpdump -i eth0 'ip proto 112'
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
02:48:33.557462 IP host1.novalocal > 10.1.2.51: VRRPv2, Advertisement, vrid 51, prio 101, authtype none, intvl 4s, length 20
02:48:37.558487 IP host1.novalocal > 10.1.2.51: VRRPv2, Advertisement, vrid 51, prio 101, authtype none, intvl 4s, length 20
02:48:41.559496 IP host1.novalocal > 10.1.2.51: VRRPv2, Advertisement, vrid 51, prio 101, authtype none, intvl 4s, length 20

備份

$ tcpdump -i eth0 'ip proto 112'
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
02:49:38.269751 IP host2.novalocal > 10.1.2.50: VRRPv2, Advertisement, vrid 51, prio 100, authtype none, intvl 1s, length 20
02:49:39.270461 IP host2.novalocal > 10.1.2.50: VRRPv2, Advertisement, vrid 51, prio 100, authtype none, intvl 1s, length 20
02:49:40.271197 IP host2.novalocal > 10.1.2.50: VRRPv2, Advertisement, vrid 51, prio 100, authtype none, intvl 1s, length 20

關於為什麼 BACKUP 實例無法辨識 MASTER 的任何提示?

更新1:

iptables 結果:

掌握

Chain INPUT (policy ACCEPT)
target     prot opt source               destination

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination

備份

Chain INPUT (policy ACCEPT)
target     prot opt source               destination

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination

解決方案

原來是防火牆問題。我能夠通過tcpdump在目標主機上執行以驗證收到的廣告來驗證這一點。修復防火牆問題後,我現在得到了以前不存在的 vrrp 廣告。以下是在備份主機上執行的:

tcpdump -i eth0 src host 10.1.2.50
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
01:06:42.709813 IP 10.1.2.50 > sntstsvmrla2a02.novalocal: VRRPv2, Advertisement, vrid 51, prio 101, authtype none, intvl 1s, length 20
01:06:43.709901 IP 10.1.2.50 > sntstsvmrla2a02.novalocal: VRRPv2, Advertisement, vrid 51, prio 101, authtype none, intvl 1s, length 20

正如您的 tcpdump 所示,兩個系統都嘗試相互交談,但沒有收到任何答案。所以兩者都認為另一個系統已關閉,並且備份完成了它的用途。您需要找出阻礙通信的原因。

僅出於我的環境(作為通用用途有問題),當 SPT 算法重新計算路由(最多 30 秒)時,我的大腦分裂僅用於瞬時核心交換機故障轉移,這可能在核心交換機韌體升級期間發生。

#MASTER
   track_script {
       chk_haproxy # 20 points
   }

#BACKUP
       track_script {
           chk_haproxy             # 10 points
           chk_ping_core_switch    # 10 points
           # if not core switch -> brain splitted
       }

在核心交換機恢復正常之前,使用此配置的備份節點沒有資格成為主節點。

引用自:https://serverfault.com/questions/904933