CentOS6.5上HA不斷切換
我在 HA 組中的 centOS 上有 2 台伺服器 Asterisk 與 Peacemaker。集群處於主動被動模式並配置了資源組。問題是從一段時間以來伺服器從主伺服器切換到輔助伺服器,對我來說沒有明顯的原因。日誌以:
PBX1 attrd[2503]: notice: attrd_perform_update: Sent update 327: pingd=0
隨後停止資源組和
Oct 3 19:30:39 PBX1 Filesystem(DrbdFS)[14168]: INFO: Trying to unmount /drbd Oct 3 19:30:39 PBX1 Filesystem(DrbdFS)[14168]: INFO: unmounted /drbd successfully Oct 3 19:30:39 PBX1 crmd[2505]: notice: process_lrm_event: LRM operation DrbdFS_stop_0 (call=2046, rc=0, cib-update=2735, confirmed=true) ok Oct 3 19:30:39 PBX1 crmd[2505]: notice: te_rsc_command: Initiating action 32: demote drbd_drbd0_demote_0 on pbx1(local) Oct 3 19:30:39 PBX1 kernel: block drbd0: role( Primary -> Secondary ) Oct 3 19:30:39 PBX1 kernel: block drbd0: bitmap WRITE of 0 pages took 0 jiffies Oct 3 19:30:39 PBX1 kernel: block drbd0: 0 KB (0 bits) marked out-of-sync by on disk bit-map. Oct 3 19:30:39 PBX1 crmd[2505]: notice: process_lrm_event: LRM operation drbd_drbd0_demote_0 (call=2050, rc=0, cib-update=2736, confirmed=true) ok Oct 3 19:30:39 PBX1 crmd[2505]: notice: te_rsc_command: Initiating action 159: notify drbd_drbd0_post_notify_demote_0 on pbx2 Oct 3 19:30:39 PBX1 crmd[2505]: notice: te_rsc_command: Initiating action 161: notify drbd_drbd0_post_notify_demote_0 on pbx1 (local) Oct 3 19:30:39 PBX1 crmd[2505]: notice: process_lrm_event: LRM operation drbd_drbd0_notify_0 (call=2053, rc=0, cib-update=0, confirmed=true) ok Oct 3 19:30:39 PBX1 crmd[2505]: notice: te_rsc_command: Initiating action 156: notify drbd_drbd0_pre_notify_stop_0 on pbx2 Oct 3 19:30:39 PBX1 crmd[2505]: notice: te_rsc_command: Initiating action 157: notify drbd_drbd0_pre_notify_stop_0 on pbx1 (local) Oct 3 19:30:39 PBX1 crmd[2505]: notice: process_lrm_event: LRM operation drbd_drbd0_notify_0 (call=2056, rc=0, cib-update=0, confirmed=true) ok Oct 3 19:30:39 PBX1 crmd[2505]: notice: te_rsc_command: Initiating action 31: stop drbd_drbd0_stop_0 on pbx2 Oct 3 19:30:39 PBX1 crmd[2505]: notice: te_rsc_command: Initiating action 33: stop drbd_drbd0_stop_0 on pbx1 (local) Oct 3 19:30:39 PBX1 kernel: block drbd0: peer( Secondary -> Unknown ) conn( Connected -> Disconnecting ) pdsk( UpToDate -> DUnknown ) Oct 3 19:30:39 PBX1 kernel: block drbd0: asender terminated Oct 3 19:30:39 PBX1 kernel: block drbd0: Terminating drbd0_asender Oct 3 19:30:39 PBX1 kernel: block drbd0: Connection closed Oct 3 19:30:39 PBX1 kernel: block drbd0: conn( Disconnecting -> StandAlone ) Oct 3 19:30:39 PBX1 kernel: block drbd0: receiver terminated Oct 3 19:30:39 PBX1 kernel: block drbd0: Terminating drbd0_receiver Oct 3 19:30:39 PBX1 kernel: block drbd0: disk( UpToDate -> Failed ) Oct 3 19:30:39 PBX1 kernel: block drbd0: bitmap WRITE of 0 pages took 0 jiffies Oct 3 19:30:39 PBX1 kernel: block drbd0: 0 KB (0 bits) marked out-of-sync by on disk bit-map. Oct 3 19:30:39 PBX1 kernel: block drbd0: disk( Failed -> Diskless ) Oct 3 19:30:39 PBX1 kernel: block drbd0: drbd_bm_resize called with capacity == 0 Oct 3 19:30:39 PBX1 kernel: block drbd0: worker terminated Oct 3 19:30:39 PBX1 kernel: block drbd0: Terminating drbd0_worker Oct 3 19:30:39 PBX1 attrd[2503]: notice: attrd_trigger_update: Sending flush op to all hosts for: master-drbd_drbd0 (<null>) Oct 3 19:30:39 PBX1 attrd[2503]: notice: attrd_perform_update: Sent delete 329: node=pbx1, attr=master-drbd_drbd0, id=<n/a>, set=(null), section=status Oct 3 19:30:39 PBX1 crmd[2505]: notice: process_lrm_event: LRM operation drbd_drbd0_stop_0 (call=2059, rc=0, cib-update=2737, confirmed=true) ok Oct 3 19:30:39 PBX1 crmd[2505]: notice: run_graph: Transition 2112 (Complete=63, Pending=0, Fired=0, Skipped=1, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-3294.bz2): Stopped Oct 3 19:30:39 PBX1 pengine[2504]: notice: unpack_config: On loss of CCM Quorum: Ignore Oct 3 19:30:39 PBX1 pengine[2504]: notice: process_pe_message: Calculated Transition 2113: /var/lib/pacemaker/pengine/pe-input-3295.bz2 Oct 3 19:30:39 PBX1 crmd[2505]: notice: run_graph: Transition 2113 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-3295.bz2): Complete Oct 3 19:30:39 PBX1 crmd[2505]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ] Oct 3 19:30:41 PBX1 attrd[2503]: notice: attrd_perform_update: Sent update 333: pingd=1000 Oct 3 19:30:41 PBX1 crmd[2505]: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ] Oct 3 19:30:41 PBX1 pengine[2504]: notice: unpack_config: On loss of CCM Quorum: Ignore Oct 3 19:30:41 PBX1 pengine[2504]: notice: LogActions: Start drbd_drbd0:0#011(pbx2) Oct 3 19:30:41 PBX1 pengine[2504]: notice: process_pe_message: Calculated Transition 2114: /var/lib/pacemaker/pengine/pe-input-3296.bz2 Oct 3 19:30:41 PBX1 pengine[2504]: notice: unpack_config: On loss of CCM Quorum: Ignore Oct 3 19:30:41 PBX1 pengine[2504]: notice: LogActions: Start drbd_drbd0:0#011(pbx1) Oct 3 19:30:41 PBX1 pengine[2504]: notice: LogActions: Start drbd_drbd0:1#011(pbx2) Oct 3 19:30:41 PBX1 pengine[2504]: notice: process_pe_message: Calculated Transition 2115: /var/lib/pacemaker/pengine/pe-input-3297.bz2 Oct 3 19:30:41 PBX1 crmd[2505]: notice: te_rsc_command: Initiating action 7: start drbd_drbd0_start_0 on pbx1 (local) Oct 3 19:30:41 PBX1 crmd[2505]: notice: te_rsc_command: Initiating action 9: start drbd_drbd0:1_start_0 on pbx2 Oct 3 19:30:41 PBX1 kernel: block drbd0: Starting worker thread (from cqueue [2046]) Oct 3 19:30:41 PBX1 kernel: block drbd0: disk( Diskless -> Attaching ) Oct 3 19:30:41 PBX1 kernel: block drbd0: Found 4 transactions (192 active extents) in activity log. Oct 3 19:30:41 PBX1 kernel: block drbd0: Method to ensure write ordering: flush Oct 3 19:30:41 PBX1 kernel: block drbd0: drbd_bm_resize called with capacity == 28411864 Oct 3 19:30:41 PBX1 kernel: block drbd0: resync bitmap: bits=3551483 words=55492 pages=109 Oct 3 19:30:41 PBX1 kernel: block drbd0: size = 14 GB (14205932 KB) Oct 3 19:30:41 PBX1 kernel: block drbd0: bitmap READ of 109 pages took 13 jiffies Oct 3 19:30:41 PBX1 kernel: block drbd0: recounting of set bits took additional 1 jiffies Oct 3 19:30:41 PBX1 kernel: block drbd0: 0 KB (0 bits) marked out-of-sync by on disk bit-map. Oct 3 19:30:41 PBX1 kernel: block drbd0: disk( Attaching -> UpToDate ) Oct 3 19:30:41 PBX1 kernel: block drbd0: attached to UUIDs 760BE94B0D6CDF8A:0000000000000000:819257FD75678FC9:819157FD75678FC9 Oct 3 19:30:41 PBX1 kernel: block drbd0: conn( StandAlone -> Unconnected ) Oct 3 19:30:41 PBX1 kernel: block drbd0: Starting receiver thread (from drbd0_worker [14408]) Oct 3 19:30:41 PBX1 kernel: block drbd0: receiver (re)started Oct 3 19:30:41 PBX1 kernel: block drbd0: conn( Unconnected -> WFConnection ) Oct 3 19:30:41 PBX1 attrd[2503]: notice: attrd_trigger_update: Sending flush op to all hosts for: master-drbd_drbd0 (1000) Oct 3 19:30:41 PBX1 attrd[2503]: notice: attrd_perform_update: Sent update 336: master-drbd_drbd0=1000 Oct 3 19:30:41 PBX1 attrd[2503]: notice: attrd_perform_update: Sent update 338: master-drbd_drbd0=1000 Oct 3 19:30:41 PBX1 crmd[2505]: notice: process_lrm_event: LRM operation drbd_drbd0_start_0 (call=2062, rc=0, cib-update=2741, confirmed=true) ok Oct 3 19:30:41 PBX1 crmd[2505]: notice: te_rsc_command: Initiating action 106: notify drbd_drbd0_post_notify_start_0 on pbx1 (local) Oct 3 19:30:41 PBX1 crmd[2505]: notice: te_rsc_command: Initiating action 107: notify drbd_drbd0:1_post_notify_start_0 on pbx2 Oct 3 19:30:41 PBX1 crmd[2505]: notice: process_lrm_event: LRM operation drbd_drbd0_notify_0 (call=2065, rc=0, cib-update=0, confirmed=true) ok Oct 3 19:30:41 PBX1 crmd[2505]: notice: run_graph: Transition 2115 (Complete=10, Pending=0, Fired=0, Skipped=2, Incompl
從 pcs config ping 的部分是
Attributes: drbd_resource=drbd0 Operations: monitor interval=10s (drbd_drbd0-monitor-interval-10s) Clone: Connectivity Resource: p_ping (class=ocf provider=pacemaker type=ping) Attributes: host_list="10.66.9.4 10.66.9.5 10.66.9.11 10.66.9.252 10.66.9.253 10.66.9.254" multiplier=1000 dampen=5s Operations: monitor interval=1s (p_ping-monitor-interval-1s)
重要資訊是這些地址中只有 10.66.9.11 還活著(我計劃清除無效的地址。但這是另一回事),因為一年前發生了重新設計的女巫,但從一兩個月開始就可以看到問題max.Could it from that or please say if you have any other想法。 PS 電纜很好,根據為監控配置的仙人掌女巫,它們不會失去 icmp 到 10.66.9.11 介面
Oct 3 19:18:31 PBX1 asterisk(asterisk)[1338]: INFO: 0 active channels 0 active calls 5999 calls processed Oct 3 19:19:01 PBX1 asterisk(asterisk)[1960]: INFO: 0 active channels 0 active calls 5999 calls processed Oct 3 19:19:14 PBX1 Xinetd(Xinetd)[2200]: INFO: checking "disable" in /etc/xinetd.d/tftp Oct 3 19:19:32 PBX1 asterisk(asterisk)[2347]: INFO: 0 active channels 0 active calls 5999 calls processed Oct 3 19:20:02 PBX1 asterisk(asterisk)[2994]: INFO: 0 active channels 0 active calls 5999 calls processed Oct 3 19:20:28 PBX1 dhcpd: isc-dhcpd-4.1.1-P1 Oct 3 19:20:32 PBX1 asterisk(asterisk)[3369]: INFO: 0 active channels 0 active calls 5999 calls processed Oct 3 19:21:02 PBX1 asterisk(asterisk)[3970]: INFO: 0 active channels 0 active calls 5999 calls processed Oct 3 19:21:08 PBX1 crmd[2505]: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED origin=crm_timer_popped ] Oct 3 19:21:08 PBX1 pengine[2504]: notice: unpack_config: On loss of CCM Quorum: Ignore Oct 3 19:21:08 PBX1 pengine[2504]: notice: process_pe_message: Calculated Transition 2110: /var/lib/pacemaker/pengine/pe-input-3292.bz2 Oct 3 19:21:08 PBX1 crmd[2505]: notice: run_graph: Transition 2110 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-3292.bz2): Complete Oct 3 19:21:08 PBX1 crmd[2505]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ] Oct 3 19:21:14 PBX1 Xinetd(Xinetd)[4176]: INFO: checking "disable" in /etc/xinetd.d/tftp Oct 3 19:21:32 PBX1 asterisk(asterisk)[4318]: INFO: 0 active channels 0 active calls 5999 calls processed Oct 3 19:22:02 PBX1 asterisk(asterisk)[4989]: INFO: 0 active channels 0 active calls 5999 calls processed Oct 3 19:22:28 PBX1 dhcpd: isc-dhcpd-4.1.1-P1 Oct 3 19:22:32 PBX1 asterisk(asterisk)[5352]: INFO: 0 active channels 0 active calls 5999 calls processed Oct 3 19:23:02 PBX1 asterisk(asterisk)[5939]: INFO: 0 active channels 0 active calls 5999 calls processed Oct 3 19:23:14 PBX1 Xinetd(Xinetd)[6160]: INFO: checking "disable" in /etc/xinetd.d/tftp Oct 3 19:23:32 PBX1 asterisk(asterisk)[6302]: INFO: 0 active channels 0 active calls 5999 calls processed Oct 3 19:24:02 PBX1 asterisk(asterisk)[6933]: INFO: 0 active channels 0 active calls 5999 calls processed Oct 3 19:24:28 PBX1 dhcpd: isc-dhcpd-4.1.1-P1 Oct 3 19:24:32 PBX1 asterisk(asterisk)[7276]: INFO: 0 active channels 0 active calls 5999 calls processed Oct 3 19:25:02 PBX1 asterisk(asterisk)[7895]: INFO: 0 active channels 0 active calls 5999 calls processed Oct 3 19:25:14 PBX1 Xinetd(Xinetd)[8118]: INFO: checking "disable" in /etc/xinetd.d/tftp Oct 3 19:25:32 PBX1 asterisk(asterisk)[8259]: INFO: 0 active channels 0 active calls 5999 calls processed Oct 3 19:25:37 PBX1 named[27202]: error (network unreachable) resolving 'ns.isc.afilias-nst.info/A/IN': 2a01:8840:6::1#53 Oct 3 19:25:37 PBX1 named[27202]: error (network unreachable) resolving 'ns.isc.afilias-nst.info/AAAA/IN': 2a01:8840:6::1#53 Oct 3 19:25:37 PBX1 named[27202]: error (network unreachable) resolving 'ns1.isc.ultradns.net/A/IN': 2001:dcd:4::2#53 Oct 3 19:25:37 PBX1 named[27202]: error (network unreachable) resolving 'ns2.isc.ultradns.net/A/IN': 2001:dcd:4::2#53 Oct 3 19:25:37 PBX1 named[27202]: error (network unreachable) resolving 'ns1.isc.ultradns.net/AAAA/IN': 2001:dcd:4::2#53 Oct 3 19:25:37 PBX1 named[27202]: error (network unreachable) resolving 'ns2.isc.ultradns.net/AAAA/IN': 2001:dcd:4::2#53 Oct 3 19:26:02 PBX1 asterisk(asterisk)[8873]: INFO: 0 active channels 0 active calls 5999 calls processed Oct 3 19:26:28 PBX1 dhcpd: isc-dhcpd-4.1.1-P1 Oct 3 19:26:32 PBX1 asterisk(asterisk)[9233]: INFO: 0 active channels 0 active calls 5999 calls processed Oct 3 19:27:02 PBX1 asterisk(asterisk)[9841]: INFO: 0 active channels 0 active calls 5999 calls processed Oct 3 19:27:14 PBX1 Xinetd(Xinetd)[10088]: INFO: checking "disable" in /etc/xinetd.d/tftp Oct 3 19:27:32 PBX1 asterisk(asterisk)[10230]: INFO: 0 active channels 0 active calls 5999 calls processed Oct 3 19:28:02 PBX1 asterisk(asterisk)[10862]: INFO: 0 active channels 0 active calls 5999 calls processed Oct 3 19:28:28 PBX1 dhcpd: isc-dhcpd-4.1.1-P1 Oct 3 19:28:32 PBX1 asterisk(asterisk)[11224]: INFO: 0 active channels 0 active calls 5999 calls processed Oct 3 19:29:03 PBX1 asterisk(asterisk)[11810]: INFO: 0 active channels 0 active calls 5999 calls processed Oct 3 19:29:14 PBX1 Xinetd(Xinetd)[12031]: INFO: checking "disable" in /etc/xinetd.d/tftp Oct 3 19:29:33 PBX1 asterisk(asterisk)[12174]: INFO: 0 active channels 0 active calls 5999 calls processed Oct 3 19:30:03 PBX1 asterisk(asterisk)[12818]: INFO: 0 active channels 0 active calls 5999 calls processed Oct 3 19:30:23 PBX1 attrd[2503]: notice: attrd_perform_update: Sent update 327: pingd=0 Oct 3 19:30:23 PBX1 crmd[2505]: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ] Oct 3 19:30:23 PBX1 pengine[2504]: notice: unpack_config: On loss of CCM Quorum: Ignore Oct 3 19:30:23 PBX1 pengine[2504]: notice: LogActions: Stop drbd_drbd0:0#011(pbx2..bg) Oct 3 19:30:23 PBX1 pengine[2504]: notice: process_pe_message: Calculated Transition 2111: /var/lib/pacemaker/pengine/pe-input-3293.bz2 Oct 3 19:30:23 PBX1 pengine[2504]: notice: unpack_config: On loss of CCM Quorum: Ignore Oct 3 19:30:23 PBX1 pengine[2504]: notice: LogActions: Stop drbd_drbd0:0#011(pbx2..bg) Oct 3 19:30:23 PBX1 pengine[2504]: notice: LogActions: Demote drbd_drbd0:1#011(Master -> Stopped pbx1) Oct 3 19:30:23 PBX1 pengine[2504]: notice: LogActions: Stop ClusPBXIP#011(pbx1) Oct 3 19:30:23 PBX1 pengine[2504]: notice: LogActions: Stop ClusNetIP#011(pbx1) Oct 3 19:30:23 PBX1 pengine[2504]: notice: LogActions: Stop ClusPBXIP-R#011(pbx1) Oct 3 19:30:23 PBX1 pengine[2504]: notice: LogActions: Stop ClusPBXIP-R2#011(pbx1) Oct 3 19:30:23 PBX1 pengine[2504]: notice: LogActions: Stop DSKPBXIP-R#011(pbx1) Oct 3 19:30:23 PBX1 pengine[2504]: notice: LogActions: Stop Fixr_ClusPBXIP#011(pbx1) Oct 3 19:30:23 PBX1 pengine[2504]: notice: LogActions: Stop DSKPBXIP-R5#011(pbx1) Oct 3 19:30:23 PBX1 pengine[2504]: notice: LogActions: Stop DSKPBXIP-R6#011(pbx1) Oct 3 19:30:23 PBX1 pengine[2504]: notice: LogActions: Stop DSKPBXIP-R7#011(pbx1)
PBX1 attrd[2503]: notice: attrd_perform_update: Sent update 327: pingd=0
那種把它送人。這裡 pingd 得分為 0(無 ping)。然後它顯示它停止了所有資源(10 月 3 日 19:30:23)。後來我們看到 pingd 分數達到 1000,然後服務再次重新啟動:
PBX1 attrd[2503]: notice: attrd_perform_update: Sent update 333: pingd=1000
(10 月 3 日 19:30:41)。我沒有完整的 Pacemaker CIB,但我懷疑您有一些位置/託管規則說資源無法在 pingd 分數等於 0 的地方執行。因此,當您的 pingd 分數達到零時,停止集群範圍內的資源,然後在 ping 時重新啟動再次返回。雖然資源可能會在對等節點上啟動,但我不認為這是故障轉移問題,而是由於 pingd 資源和規則而導致服務停止和重新啟動。
雖然您確實提到仙人掌中的 ping 沒有問題:
PS 電纜很好,根據為監控配置的仙人掌女巫,它們不會失去 icmp 到 10.66.9.11 介面
仙人掌多久監控一次?如果只有 5 秒的 ping 沒有返回,Pacemaker 就會做出反應。此外,仙人掌數據可能只是從仙人掌伺服器到 10.66.9.11 伺服器的 ping。這並不一定意味著 PBX 集群和 10.66.9.11 系統之間沒有問題。
我建議要麼向 host_list 添加更多 ping 節點,增加 p_ping 的監視間隔和抑制時間,或者完全取消 p_ping 資源和規則。