Centos

CentOS6.5上HA不斷切換

  • October 4, 2017

我在 HA 組中的 centOS 上有 2 台伺服器 Asterisk 與 Peacemaker。集群處於主動被動模式並配置了資源組。問題是從一段時間以來伺服器從主伺服器切換到輔助伺服器,對我來說沒有明顯的原因。日誌以:

PBX1 attrd[2503]:   notice: attrd_perform_update: Sent update 327: pingd=0

隨後停止資源組和

Oct  3 19:30:39 PBX1 Filesystem(DrbdFS)[14168]: INFO: Trying to unmount /drbd
Oct  3 19:30:39 PBX1 Filesystem(DrbdFS)[14168]: INFO: unmounted /drbd successfully
Oct  3 19:30:39 PBX1 crmd[2505]:   notice: process_lrm_event: LRM operation DrbdFS_stop_0 (call=2046, rc=0, cib-update=2735, confirmed=true) ok
Oct  3 19:30:39 PBX1 crmd[2505]:   notice: te_rsc_command: Initiating action 32: demote drbd_drbd0_demote_0 on pbx1(local)
Oct  3 19:30:39 PBX1 kernel: block drbd0: role( Primary -> Secondary ) 
Oct  3 19:30:39 PBX1 kernel: block drbd0: bitmap WRITE of 0 pages took 0 jiffies
Oct  3 19:30:39 PBX1 kernel: block drbd0: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
Oct  3 19:30:39 PBX1 crmd[2505]:   notice: process_lrm_event: LRM operation drbd_drbd0_demote_0 (call=2050, rc=0, cib-update=2736, confirmed=true) ok
Oct  3 19:30:39 PBX1 crmd[2505]:   notice: te_rsc_command: Initiating action 159: notify drbd_drbd0_post_notify_demote_0 on pbx2
Oct  3 19:30:39 PBX1 crmd[2505]:   notice: te_rsc_command: Initiating action 161: notify drbd_drbd0_post_notify_demote_0 on pbx1 (local)
Oct  3 19:30:39 PBX1 crmd[2505]:   notice: process_lrm_event: LRM operation drbd_drbd0_notify_0 (call=2053, rc=0, cib-update=0, confirmed=true) ok
Oct  3 19:30:39 PBX1 crmd[2505]:   notice: te_rsc_command: Initiating action 156: notify drbd_drbd0_pre_notify_stop_0 on pbx2
Oct  3 19:30:39 PBX1 crmd[2505]:   notice: te_rsc_command: Initiating action 157: notify drbd_drbd0_pre_notify_stop_0 on pbx1 (local)
Oct  3 19:30:39 PBX1 crmd[2505]:   notice: process_lrm_event: LRM operation drbd_drbd0_notify_0 (call=2056, rc=0, cib-update=0, confirmed=true) ok
Oct  3 19:30:39 PBX1 crmd[2505]:   notice: te_rsc_command: Initiating action 31: stop drbd_drbd0_stop_0 on pbx2
Oct  3 19:30:39 PBX1 crmd[2505]:   notice: te_rsc_command: Initiating action 33: stop drbd_drbd0_stop_0 on pbx1 (local)
Oct  3 19:30:39 PBX1 kernel: block drbd0: peer( Secondary -> Unknown ) conn( Connected -> Disconnecting ) pdsk( UpToDate -> DUnknown ) 
Oct  3 19:30:39 PBX1 kernel: block drbd0: asender terminated
Oct  3 19:30:39 PBX1 kernel: block drbd0: Terminating drbd0_asender
Oct  3 19:30:39 PBX1 kernel: block drbd0: Connection closed
Oct  3 19:30:39 PBX1 kernel: block drbd0: conn( Disconnecting -> StandAlone ) 
Oct  3 19:30:39 PBX1 kernel: block drbd0: receiver terminated
Oct  3 19:30:39 PBX1 kernel: block drbd0: Terminating drbd0_receiver
Oct  3 19:30:39 PBX1 kernel: block drbd0: disk( UpToDate -> Failed ) 
Oct  3 19:30:39 PBX1 kernel: block drbd0: bitmap WRITE of 0 pages took 0 jiffies
Oct  3 19:30:39 PBX1 kernel: block drbd0: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
Oct  3 19:30:39 PBX1 kernel: block drbd0: disk( Failed -> Diskless ) 
Oct  3 19:30:39 PBX1 kernel: block drbd0: drbd_bm_resize called with capacity == 0
Oct  3 19:30:39 PBX1 kernel: block drbd0: worker terminated
Oct  3 19:30:39 PBX1 kernel: block drbd0: Terminating drbd0_worker
Oct  3 19:30:39 PBX1 attrd[2503]:   notice: attrd_trigger_update: Sending flush op to all hosts for: master-drbd_drbd0 (<null>)
Oct  3 19:30:39 PBX1 attrd[2503]:   notice: attrd_perform_update: Sent delete 329: node=pbx1, attr=master-drbd_drbd0, id=<n/a>, set=(null), section=status
Oct  3 19:30:39 PBX1 crmd[2505]:   notice: process_lrm_event: LRM operation drbd_drbd0_stop_0 (call=2059, rc=0, cib-update=2737, confirmed=true) ok
Oct  3 19:30:39 PBX1 crmd[2505]:   notice: run_graph: Transition 2112 (Complete=63, Pending=0, Fired=0, Skipped=1, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-3294.bz2): Stopped
Oct  3 19:30:39 PBX1 pengine[2504]:   notice: unpack_config: On loss of CCM Quorum: Ignore
Oct  3 19:30:39 PBX1 pengine[2504]:   notice: process_pe_message: Calculated Transition 2113: /var/lib/pacemaker/pengine/pe-input-3295.bz2
Oct  3 19:30:39 PBX1 crmd[2505]:   notice: run_graph: Transition 2113 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-3295.bz2): Complete
Oct  3 19:30:39 PBX1 crmd[2505]:   notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
Oct  3 19:30:41 PBX1 attrd[2503]:   notice: attrd_perform_update: Sent update 333: pingd=1000
Oct  3 19:30:41 PBX1 crmd[2505]:   notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Oct  3 19:30:41 PBX1 pengine[2504]:   notice: unpack_config: On loss of CCM Quorum: Ignore
Oct  3 19:30:41 PBX1 pengine[2504]:   notice: LogActions: Start   drbd_drbd0:0#011(pbx2)
Oct  3 19:30:41 PBX1 pengine[2504]:   notice: process_pe_message: Calculated Transition 2114: /var/lib/pacemaker/pengine/pe-input-3296.bz2
Oct  3 19:30:41 PBX1 pengine[2504]:   notice: unpack_config: On loss of CCM Quorum: Ignore
Oct  3 19:30:41 PBX1 pengine[2504]:   notice: LogActions: Start   drbd_drbd0:0#011(pbx1)
Oct  3 19:30:41 PBX1 pengine[2504]:   notice: LogActions: Start   drbd_drbd0:1#011(pbx2)
Oct  3 19:30:41 PBX1 pengine[2504]:   notice: process_pe_message: Calculated Transition 2115: /var/lib/pacemaker/pengine/pe-input-3297.bz2
Oct  3 19:30:41 PBX1 crmd[2505]:   notice: te_rsc_command: Initiating action 7: start drbd_drbd0_start_0 on pbx1 (local)
Oct  3 19:30:41 PBX1 crmd[2505]:   notice: te_rsc_command: Initiating action 9: start drbd_drbd0:1_start_0 on pbx2
Oct  3 19:30:41 PBX1 kernel: block drbd0: Starting worker thread (from cqueue [2046])
Oct  3 19:30:41 PBX1 kernel: block drbd0: disk( Diskless -> Attaching ) 
Oct  3 19:30:41 PBX1 kernel: block drbd0: Found 4 transactions (192 active extents) in activity log.
Oct  3 19:30:41 PBX1 kernel: block drbd0: Method to ensure write ordering: flush
Oct  3 19:30:41 PBX1 kernel: block drbd0: drbd_bm_resize called with capacity == 28411864
Oct  3 19:30:41 PBX1 kernel: block drbd0: resync bitmap: bits=3551483 words=55492 pages=109
Oct  3 19:30:41 PBX1 kernel: block drbd0: size = 14 GB (14205932 KB)
Oct  3 19:30:41 PBX1 kernel: block drbd0: bitmap READ of 109 pages took 13 jiffies
Oct  3 19:30:41 PBX1 kernel: block drbd0: recounting of set bits took additional 1 jiffies
Oct  3 19:30:41 PBX1 kernel: block drbd0: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
Oct  3 19:30:41 PBX1 kernel: block drbd0: disk( Attaching -> UpToDate ) 
Oct  3 19:30:41 PBX1 kernel: block drbd0: attached to UUIDs 760BE94B0D6CDF8A:0000000000000000:819257FD75678FC9:819157FD75678FC9
Oct  3 19:30:41 PBX1 kernel: block drbd0: conn( StandAlone -> Unconnected ) 
Oct  3 19:30:41 PBX1 kernel: block drbd0: Starting receiver thread (from drbd0_worker [14408])
Oct  3 19:30:41 PBX1 kernel: block drbd0: receiver (re)started
Oct  3 19:30:41 PBX1 kernel: block drbd0: conn( Unconnected -> WFConnection ) 
Oct  3 19:30:41 PBX1 attrd[2503]:   notice: attrd_trigger_update: Sending flush op to all hosts for: master-drbd_drbd0 (1000)
Oct  3 19:30:41 PBX1 attrd[2503]:   notice: attrd_perform_update: Sent update 336: master-drbd_drbd0=1000
Oct  3 19:30:41 PBX1 attrd[2503]:   notice: attrd_perform_update: Sent update 338: master-drbd_drbd0=1000
Oct  3 19:30:41 PBX1 crmd[2505]:   notice: process_lrm_event: LRM operation drbd_drbd0_start_0 (call=2062, rc=0, cib-update=2741, confirmed=true) ok
Oct  3 19:30:41 PBX1 crmd[2505]:   notice: te_rsc_command: Initiating action 106: notify drbd_drbd0_post_notify_start_0 on pbx1 (local)
Oct  3 19:30:41 PBX1 crmd[2505]:   notice: te_rsc_command: Initiating action 107: notify drbd_drbd0:1_post_notify_start_0 on pbx2
Oct  3 19:30:41 PBX1 crmd[2505]:   notice: process_lrm_event: LRM operation drbd_drbd0_notify_0 (call=2065, rc=0, cib-update=0, confirmed=true) ok
Oct  3 19:30:41 PBX1 crmd[2505]:   notice: run_graph: Transition 2115 (Complete=10, Pending=0, Fired=0, Skipped=2, Incompl  

從 pcs config ping 的部分是

    Attributes: drbd_resource=drbd0
  Operations: monitor interval=10s (drbd_drbd0-monitor-interval-10s)
Clone: Connectivity
 Resource: p_ping (class=ocf provider=pacemaker type=ping)
  Attributes: host_list="10.66.9.4 10.66.9.5 10.66.9.11 10.66.9.252 10.66.9.253 10.66.9.254" multiplier=1000 dampen=5s
  Operations: monitor interval=1s (p_ping-monitor-interval-1s)

重要資訊是這些地址中只有 10.66.9.11 還活著(我計劃清除無效的地址。但這是另一回事),因為一年前發生了重新設計的女巫,但從一兩個月開始就可以看到問題max.Could it from that or please say if you have any other想法。 PS 電纜很好,根據為監控配置的仙人掌女巫,它們不會失去 icmp 到 10.66.9.11 介面

Oct  3 19:18:31 PBX1 asterisk(asterisk)[1338]: INFO: 0 active channels 0 active calls 5999 calls processed
Oct  3 19:19:01 PBX1 asterisk(asterisk)[1960]: INFO: 0 active channels 0 active calls 5999 calls processed
Oct  3 19:19:14 PBX1 Xinetd(Xinetd)[2200]: INFO: checking "disable" in /etc/xinetd.d/tftp
Oct  3 19:19:32 PBX1 asterisk(asterisk)[2347]: INFO: 0 active channels 0 active calls 5999 calls processed
Oct  3 19:20:02 PBX1 asterisk(asterisk)[2994]: INFO: 0 active channels 0 active calls 5999 calls processed
Oct  3 19:20:28 PBX1 dhcpd: isc-dhcpd-4.1.1-P1
Oct  3 19:20:32 PBX1 asterisk(asterisk)[3369]: INFO: 0 active channels 0 active calls 5999 calls processed
Oct  3 19:21:02 PBX1 asterisk(asterisk)[3970]: INFO: 0 active channels 0 active calls 5999 calls processed
Oct  3 19:21:08 PBX1 crmd[2505]:   notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED origin=crm_timer_popped ]
Oct  3 19:21:08 PBX1 pengine[2504]:   notice: unpack_config: On loss of CCM Quorum: Ignore
Oct  3 19:21:08 PBX1 pengine[2504]:   notice: process_pe_message: Calculated Transition 2110: /var/lib/pacemaker/pengine/pe-input-3292.bz2
Oct  3 19:21:08 PBX1 crmd[2505]:   notice: run_graph: Transition 2110 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-3292.bz2): Complete
Oct  3 19:21:08 PBX1 crmd[2505]:   notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
Oct  3 19:21:14 PBX1 Xinetd(Xinetd)[4176]: INFO: checking "disable" in /etc/xinetd.d/tftp
Oct  3 19:21:32 PBX1 asterisk(asterisk)[4318]: INFO: 0 active channels 0 active calls 5999 calls processed
Oct  3 19:22:02 PBX1 asterisk(asterisk)[4989]: INFO: 0 active channels 0 active calls 5999 calls processed
Oct  3 19:22:28 PBX1 dhcpd: isc-dhcpd-4.1.1-P1
Oct  3 19:22:32 PBX1 asterisk(asterisk)[5352]: INFO: 0 active channels 0 active calls 5999 calls processed
Oct  3 19:23:02 PBX1 asterisk(asterisk)[5939]: INFO: 0 active channels 0 active calls 5999 calls processed
Oct  3 19:23:14 PBX1 Xinetd(Xinetd)[6160]: INFO: checking "disable" in /etc/xinetd.d/tftp
Oct  3 19:23:32 PBX1 asterisk(asterisk)[6302]: INFO: 0 active channels 0 active calls 5999 calls processed
Oct  3 19:24:02 PBX1 asterisk(asterisk)[6933]: INFO: 0 active channels 0 active calls 5999 calls processed
Oct  3 19:24:28 PBX1 dhcpd: isc-dhcpd-4.1.1-P1
Oct  3 19:24:32 PBX1 asterisk(asterisk)[7276]: INFO: 0 active channels 0 active calls 5999 calls processed
Oct  3 19:25:02 PBX1 asterisk(asterisk)[7895]: INFO: 0 active channels 0 active calls 5999 calls processed
Oct  3 19:25:14 PBX1 Xinetd(Xinetd)[8118]: INFO: checking "disable" in /etc/xinetd.d/tftp
Oct  3 19:25:32 PBX1 asterisk(asterisk)[8259]: INFO: 0 active channels 0 active calls 5999 calls processed
Oct  3 19:25:37 PBX1 named[27202]: error (network unreachable) resolving 'ns.isc.afilias-nst.info/A/IN': 2a01:8840:6::1#53
Oct  3 19:25:37 PBX1 named[27202]: error (network unreachable) resolving 'ns.isc.afilias-nst.info/AAAA/IN': 2a01:8840:6::1#53
Oct  3 19:25:37 PBX1 named[27202]: error (network unreachable) resolving 'ns1.isc.ultradns.net/A/IN': 2001:dcd:4::2#53
Oct  3 19:25:37 PBX1 named[27202]: error (network unreachable) resolving 'ns2.isc.ultradns.net/A/IN': 2001:dcd:4::2#53
Oct  3 19:25:37 PBX1 named[27202]: error (network unreachable) resolving 'ns1.isc.ultradns.net/AAAA/IN': 2001:dcd:4::2#53
Oct  3 19:25:37 PBX1 named[27202]: error (network unreachable) resolving 'ns2.isc.ultradns.net/AAAA/IN': 2001:dcd:4::2#53
Oct  3 19:26:02 PBX1 asterisk(asterisk)[8873]: INFO: 0 active channels 0 active calls 5999 calls processed
Oct  3 19:26:28 PBX1 dhcpd: isc-dhcpd-4.1.1-P1
Oct  3 19:26:32 PBX1 asterisk(asterisk)[9233]: INFO: 0 active channels 0 active calls 5999 calls processed
Oct  3 19:27:02 PBX1 asterisk(asterisk)[9841]: INFO: 0 active channels 0 active calls 5999 calls processed
Oct  3 19:27:14 PBX1 Xinetd(Xinetd)[10088]: INFO: checking "disable" in /etc/xinetd.d/tftp
Oct  3 19:27:32 PBX1 asterisk(asterisk)[10230]: INFO: 0 active channels 0 active calls 5999 calls processed
Oct  3 19:28:02 PBX1 asterisk(asterisk)[10862]: INFO: 0 active channels 0 active calls 5999 calls processed
Oct  3 19:28:28 PBX1 dhcpd: isc-dhcpd-4.1.1-P1
Oct  3 19:28:32 PBX1 asterisk(asterisk)[11224]: INFO: 0 active channels 0 active calls 5999 calls processed
Oct  3 19:29:03 PBX1 asterisk(asterisk)[11810]: INFO: 0 active channels 0 active calls 5999 calls processed
Oct  3 19:29:14 PBX1 Xinetd(Xinetd)[12031]: INFO: checking "disable" in /etc/xinetd.d/tftp
Oct  3 19:29:33 PBX1 asterisk(asterisk)[12174]: INFO: 0 active channels 0 active calls 5999 calls processed
Oct  3 19:30:03 PBX1 asterisk(asterisk)[12818]: INFO: 0 active channels 0 active calls 5999 calls processed
Oct  3 19:30:23 PBX1 attrd[2503]:   notice: attrd_perform_update: Sent update 327: pingd=0
Oct  3 19:30:23 PBX1 crmd[2505]:   notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Oct  3 19:30:23 PBX1 pengine[2504]:   notice: unpack_config: On loss of CCM Quorum: Ignore
Oct  3 19:30:23 PBX1 pengine[2504]:   notice: LogActions: Stop    drbd_drbd0:0#011(pbx2..bg)
Oct  3 19:30:23 PBX1 pengine[2504]:   notice: process_pe_message: Calculated Transition 2111: /var/lib/pacemaker/pengine/pe-input-3293.bz2
Oct  3 19:30:23 PBX1 pengine[2504]:   notice: unpack_config: On loss of CCM Quorum: Ignore
Oct  3 19:30:23 PBX1 pengine[2504]:   notice: LogActions: Stop    drbd_drbd0:0#011(pbx2..bg)
Oct  3 19:30:23 PBX1 pengine[2504]:   notice: LogActions: Demote  drbd_drbd0:1#011(Master -> Stopped pbx1)
Oct  3 19:30:23 PBX1 pengine[2504]:   notice: LogActions: Stop    ClusPBXIP#011(pbx1)
Oct  3 19:30:23 PBX1 pengine[2504]:   notice: LogActions: Stop    ClusNetIP#011(pbx1)
Oct  3 19:30:23 PBX1 pengine[2504]:   notice: LogActions: Stop    ClusPBXIP-R#011(pbx1)
Oct  3 19:30:23 PBX1 pengine[2504]:   notice: LogActions: Stop    ClusPBXIP-R2#011(pbx1)
Oct  3 19:30:23 PBX1 pengine[2504]:   notice: LogActions: Stop    DSKPBXIP-R#011(pbx1)
Oct  3 19:30:23 PBX1 pengine[2504]:   notice: LogActions: Stop    Fixr_ClusPBXIP#011(pbx1)
Oct  3 19:30:23 PBX1 pengine[2504]:   notice: LogActions: Stop    DSKPBXIP-R5#011(pbx1)
Oct  3 19:30:23 PBX1 pengine[2504]:   notice: LogActions: Stop    DSKPBXIP-R6#011(pbx1)
Oct  3 19:30:23 PBX1 pengine[2504]:   notice: LogActions: Stop    DSKPBXIP-R7#011(pbx1)

PBX1 attrd[2503]: notice: attrd_perform_update: Sent update 327: pingd=0那種把它送人。這裡 pingd 得分為 0(無 ping)。然後它顯示它停止了所有資源(10 月 3 日 19:30:23)。

後來我們看到 pingd 分數達到 1000,然後服務再次重新啟動:PBX1 attrd[2503]: notice: attrd_perform_update: Sent update 333: pingd=1000(10 月 3 日 19:30:41)。

我沒有完整的 Pacemaker CIB,但我懷疑您有一些位置/託管規則說資源無法在 pingd 分數等於 0 的地方執行。因此,當您的 pingd 分數達到零時,停止集群範圍內的資源,然後在 ping 時重新啟動再次返回。雖然資源可能會在對等節點上啟動,但我不認為這是故障轉移問題,而是由於 pingd 資源和規則而導致服務停止和重新啟動。

雖然您確實提到仙人掌中的 ping 沒有問題:

PS 電纜很好,根據為監控配置的仙人掌女巫,它們不會失去 icmp 到 10.66.9.11 介面

仙人掌多久監控一次?如果只有 5 秒的 ping 沒有返回,Pacemaker 就會做出反應。此外,仙人掌數據可能只是從仙人掌伺服器到 10.66.9.11 伺服器的 ping。這並不一定意味著 PBX 集群和 10.66.9.11 系統之間沒有問題。

我建議要麼向 host_list 添加更多 ping 節點,增加 p_ping 的監視間隔和抑制時間,或者完全取消 p_ping 資源和規則。

引用自:https://serverfault.com/questions/876829