Pacemaker won't start because of a duplicate node, but I can't remove the duplicate node because Pacemaker won't start
OK! I'm really new to pacemaker/corosync, like 1-day new.
Software: Ubuntu 18.04 LTS and the versions associated with that distro.
pacemaker: 1.1.18
corosync: 2.4.3
I accidentally removed the nodes from my entire test cluster (3 nodes).
When I tried to restore everything using the pcsd GUI, it failed because the nodes had been "purged". Cool. So.
I grabbed the last copy of corosync.conf from my "main" node and copied it to the other two nodes. I fixed the bindnetaddr in each conf and ran pcs cluster start on my "main" node. One of the nodes failed to start. I looked at the pacemaker status on that node and got the following exception:

Dec 18 06:33:56 region-ctrl-2 crmd[1049]: crit: Nodes 1084777441 and 2 share the same name 'region-ctrl-2': shutting down
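For reference, a minimal way to repeat that status check on the failing node, assuming the stock systemd units shipped with the Ubuntu packages:

systemctl status corosync pacemaker                    # unit state on the node that will not start
journalctl -u pacemaker -b --no-pager | tail -n 50     # recent pacemaker messages, including the crit above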
I tried running crm_node -R --force 1084777441 on the machine where pacemaker won't start, but of course pacemaker isn't running, so I get a crmd: connection refused (111) error. So I ran the same command on one of the healthy nodes, and it reported no errors, but the node never goes away, and pacemaker on the affected machine keeps showing the same error.
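For completeness, the removal attempt from a healthy node looked roughly like this (crm_node ships with pacemaker on 18.04; the -l call is only there to see the node list before and after):

crm_node -l                      # list the nodes pacemaker currently knows about
crm_node -R 1084777441 --force   # ask pacemaker to forget the phantom node id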
So, I decided to tear the entire cluster down yet again. I purged all of the packages from the machines. I reinstalled everything fresh. I copied corosync.conf over and fixed it on each machine. I recreated the cluster. I got the exact same bloody error. So this node named 1084777441 is not a machine I created; it's one the cluster created for me. Earlier in the day I had realized I was using IP addresses in corosync.conf instead of names. I fixed /etc/hosts on the machines and removed the IP addresses from the corosync config, which is how I inadvertently deleted my entire cluster in the first place (I deleted the nodes that were identified by IP address). Here is my corosync.conf:
totem {
    version: 2
    cluster_name: maas-cluster
    token: 3000
    token_retransmits_before_loss_const: 10
    clear_node_high_bit: yes
    crypto_cipher: none
    crypto_hash: none

    interface {
        ringnumber: 0
        bindnetaddr: 192.168.99.225
        mcastport: 5405
        ttl: 1
    }
}

logging {
    fileline: off
    to_stderr: no
    to_logfile: no
    to_syslog: yes
    syslog_facility: daemon
    debug: off
    timestamp: on

    logger_subsys {
        subsys: QUORUM
        debug: off
    }
}

quorum {
    provider: corosync_votequorum
    expected_votes: 3
    two_node: 1
}

nodelist {
    node {
        ring0_addr: postgres-sb
        nodeid: 3
    }

    node {
        ring0_addr: region-ctrl-2
        nodeid: 2
    }

    node {
        ring0_addr: region-ctrl-1
        nodeid: 1
    }
}
The only thing that differs in this conf between the nodes is the bindnetaddr. There seems to be a chicken-and-egg problem here, unless there is some way I don't know about to delete a node from a flat-file or sqlite database somewhere, or some other more authoritative way to remove a node from the cluster.
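For what it's worth, pacemaker does keep node entries in the CIB, which is the closest thing to such a database that I know of; a sketch of how to inspect it (cibadmin ships with pacemaker; whether hand-editing or clearing the on-disk copy is safe in this state is an assumption I have not confirmed):

cibadmin --query --scope nodes   # node entries pacemaker has recorded in the CIB
ls /var/lib/pacemaker/cib/       # on-disk CIB copies; only touch these with pacemaker stopped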
ADDITIONAL

I've made sure /etc/hosts on each machine matches the hostnames. I forgot to mention that.

127.0.0.1 localhost
127.0.1.1 postgres

192.168.99.224 postgres-sb
192.168.99.223 region-ctrl-1
192.168.99.225 region-ctrl-2

192.168.7.224 postgres-sb
192.168.7.223 region-ctrl-1
192.168.7.225 region-ctrl-2

# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
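A quick way to confirm the names resolve the way corosync expects on each box (plain libc resolution, nothing cluster-specific):

uname -n                                              # should print the node's own name, e.g. region-ctrl-2
getent hosts postgres-sb region-ctrl-1 region-ctrl-2  # should return the addresses from /etc/hosts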
I decided to try starting from scratch. I apt remove --purge'd corosync*, pacemaker*, crmsh, and pcs. I rm -rf'd /etc/corosync. I kept a copy of corosync.conf from each machine. I reinstalled everything on every machine, copied the saved corosync.conf back to /etc/corosync/, and restarted corosync on all machines. I still get the same error. This has to be a bug in one of the components!
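Spelled out as shell, the sequence on each machine was roughly the following (the backup path is arbitrary):

cp /etc/corosync/corosync.conf /root/corosync.conf.save   # keep a copy before purging
apt remove --purge corosync* pacemaker* crmsh pcs
rm -rf /etc/corosync
apt install corosync pacemaker crmsh pcs
cp /root/corosync.conf.save /etc/corosync/corosync.conf
systemctl restart corosync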
So, it appears that crm_get_peer fails to recognize that the host named region-ctrl-2 is the one defined in corosync.conf, and node 2 then gets auto-assigned the ID 1084777441. That makes no sense to me. The machine's hostname is region-ctrl-2, set in /etc/hostname and /etc/hosts and verified with uname -n. corosync.conf explicitly assigns an ID to the machine named region-ctrl-2, but something apparently fails to recognize the assignment from corosync and instead gives this host the non-random ID 1084777441. How do I fix this?

LOGS
info: crm_log_init: Changed active directory to /var/lib/pacemaker/cores
info: get_cluster_type: Detected an active 'corosync' cluster
info: qb_ipcs_us_publish: server name: pacemakerd
info: pcmk__ipc_is_authentic_process_active: Could not connect to lrmd IPC: Connection refused
info: pcmk__ipc_is_authentic_process_active: Could not connect to cib_ro IPC: Connection refused
info: pcmk__ipc_is_authentic_process_active: Could not connect to crmd IPC: Connection refused
info: pcmk__ipc_is_authentic_process_active: Could not connect to attrd IPC: Connection refused
info: pcmk__ipc_is_authentic_process_active: Could not connect to pengine IPC: Connection refused
info: pcmk__ipc_is_authentic_process_active: Could not connect to stonith-ng IPC: Connection refused
info: corosync_node_name: Unable to get node name for nodeid 1084777441
notice: get_node_name: Could not obtain a node name for corosync nodeid 1084777441
info: crm_get_peer: Created entry ea4ec23e-e676-4798-9b8b-00af39d3bb3d/0x5555f74984d0 for node (null)/1084777441 (1 total)
info: crm_get_peer: Node 1084777441 has uuid 1084777441
info: crm_update_peer_proc: cluster_connect_cpg: Node (null)[1084777441] - corosync-cpg is now online
notice: cluster_connect_quorum: Quorum acquired
info: crm_get_peer: Created entry 882c0feb-d546-44b7-955f-4c8a844a0db1/0x5555f7499fd0 for node postgres-sb/3 (2 total)
info: crm_get_peer: Node 3 is now known as postgres-sb
info: crm_get_peer: Node 3 has uuid 3
info: crm_get_peer: Created entry 4e6a6b1e-d687-4527-bffc-5d701ff60a66/0x5555f749a6f0 for node region-ctrl-2/2 (3 total)
info: crm_get_peer: Node 2 is now known as region-ctrl-2
info: crm_get_peer: Node 2 has uuid 2
info: crm_get_peer: Created entry 5532a3cc-2577-4764-b9ee-770d437ccec0/0x5555f749a0a0 for node region-ctrl-1/1 (4 total)
info: crm_get_peer: Node 1 is now known as region-ctrl-1
info: crm_get_peer: Node 1 has uuid 1
info: corosync_node_name: Unable to get node name for nodeid 1084777441
notice: get_node_name: Defaulting to uname -n for the local corosync node name
warning: crm_find_peer: Node 1084777441 and 2 share the same name: 'region-ctrl-2'
info: crm_get_peer: Node 1084777441 is now known as region-ctrl-2
info: pcmk_quorum_notification: Quorum retained | membership=32 members=3
notice: crm_update_peer_state_iter: Node region-ctrl-1 state is now member | nodeid=1 previous=unknown source=pcmk_quorum_notification
notice: crm_update_peer_state_iter: Node postgres-sb state is now member | nodeid=3 previous=unknown source=pcmk_quorum_notification
notice: crm_update_peer_state_iter: Node region-ctrl-2 state is now member | nodeid=1084777441 previous=unknown source=pcmk_quorum_notification
info: crm_reap_unseen_nodes: State of node region-ctrl-2[2] is still unknown
info: pcmk_cpg_membership: Node 1084777441 joined group pacemakerd (counter=0.0, pid=32765, unchecked for rivals)
info: pcmk_cpg_membership: Node 1 still member of group pacemakerd (peer=region-ctrl-1:900, counter=0.0, at least once)
info: crm_update_peer_proc: pcmk_cpg_membership: Node region-ctrl-1[1] - corosync-cpg is now online
info: pcmk_cpg_membership: Node 3 still member of group pacemakerd (peer=postgres-sb:976, counter=0.1, at least once)
info: crm_update_peer_proc: pcmk_cpg_membership: Node postgres-sb[3] - corosync-cpg is now online
info: pcmk_cpg_membership: Node 1084777441 still member of group pacemakerd (peer=region-ctrl-2:3016, counter=0.2, at least once)
pengine: info: crm_log_init: Changed active directory to /var/lib/pacemaker/cores
lrmd: info: crm_log_init: Changed active directory to /var/lib/pacemaker/cores
lrmd: info: qb_ipcs_us_publish: server name: lrmd
pengine: info: qb_ipcs_us_publish: server name: pengine
cib: info: crm_log_init: Changed active directory to /var/lib/pacemaker/cores
attrd: info: crm_log_init: Changed active directory to /var/lib/pacemaker/cores
attrd: info: get_cluster_type: Verifying cluster type: 'corosync'
attrd: info: get_cluster_type: Assuming an active 'corosync' cluster
info: crm_log_init: Changed active directory to /var/lib/pacemaker/cores
attrd: notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
cib: info: get_cluster_type: Verifying cluster type: 'corosync'
cib: info: get_cluster_type: Assuming an active 'corosync' cluster
info: get_cluster_type: Verifying cluster type: 'corosync'
info: get_cluster_type: Assuming an active 'corosync' cluster
notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
attrd: info: corosync_node_name: Unable to get node name for nodeid 1084777441
cib: info: validate_with_relaxng: Creating RNG parser context
crmd: info: crm_log_init: Changed active directory to /var/lib/pacemaker/cores
crmd: info: get_cluster_type: Verifying cluster type: 'corosync'
crmd: info: get_cluster_type: Assuming an active 'corosync' cluster
crmd: info: do_log: Input I_STARTUP received in state S_STARTING from crmd_init
attrd: notice: get_node_name: Could not obtain a node name for corosync nodeid 1084777441
attrd: info: crm_get_peer: Created entry af5c62c9-21c5-4428-9504-ea72a92de7eb/0x560870420e90 for node (null)/1084777441 (1 total)
attrd: info: crm_get_peer: Node 1084777441 has uuid 1084777441
attrd: info: crm_update_peer_proc: cluster_connect_cpg: Node (null)[1084777441] - corosync-cpg is now online
attrd: notice: crm_update_peer_state_iter: Node (null) state is now member | nodeid=1084777441 previous=unknown source=crm_update_peer_proc
attrd: info: init_cs_connection_once: Connection to 'corosync': established
info: corosync_node_name: Unable to get node name for nodeid 1084777441
notice: get_node_name: Could not obtain a node name for corosync nodeid 1084777441
info: crm_get_peer: Created entry 5bcb51ae-0015-4652-b036-b92cf4f1d990/0x55f583634700 for node (null)/1084777441 (1 total)
info: crm_get_peer: Node 1084777441 has uuid 1084777441
info: crm_update_peer_proc: cluster_connect_cpg: Node (null)[1084777441] - corosync-cpg is now online
notice: crm_update_peer_state_iter: Node (null) state is now member | nodeid=1084777441 previous=unknown source=crm_update_peer_proc
attrd: info: corosync_node_name: Unable to get node name for nodeid 1084777441
attrd: notice: get_node_name: Defaulting to uname -n for the local corosync node name
attrd: info: crm_get_peer: Node 1084777441 is now known as region-ctrl-2
info: corosync_node_name: Unable to get node name for nodeid 1084777441
notice: get_node_name: Defaulting to uname -n for the local corosync node name
info: init_cs_connection_once: Connection to 'corosync': established
info: corosync_node_name: Unable to get node name for nodeid 1084777441
notice: get_node_name: Defaulting to uname -n for the local corosync node name
info: crm_get_peer: Node 1084777441 is now known as region-ctrl-2
cib: notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
cib: info: corosync_node_name: Unable to get node name for nodeid 1084777441
cib: notice: get_node_name: Could not obtain a node name for corosync nodeid 1084777441
cib: info: crm_get_peer: Created entry a6ced2c1-9d51-445d-9411-2fb19deab861/0x55848365a150 for node (null)/1084777441 (1 total)
cib: info: crm_get_peer: Node 1084777441 has uuid 1084777441
cib: info: crm_update_peer_proc: cluster_connect_cpg: Node (null)[1084777441] - corosync-cpg is now online
cib: notice: crm_update_peer_state_iter: Node (null) state is now member | nodeid=1084777441 previous=unknown source=crm_update_peer_proc
cib: info: init_cs_connection_once: Connection to 'corosync': established
cib: info: corosync_node_name: Unable to get node name for nodeid 1084777441
cib: notice: get_node_name: Defaulting to uname -n for the local corosync node name
cib: info: crm_get_peer: Node 1084777441 is now known as region-ctrl-2
cib: info: qb_ipcs_us_publish: server name: cib_ro
cib: info: qb_ipcs_us_publish: server name: cib_rw
cib: info: qb_ipcs_us_publish: server name: cib_shm
cib: info: pcmk_cpg_membership: Node 1084777441 joined group cib (counter=0.0, pid=0, unchecked for rivals)
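Since the log shows pacemaker falling back to uname -n for nodeid 1084777441, a useful cross-check is to ask corosync directly how it has mapped the local node (corosync-cmapctl and corosync-cfgtool ship with corosync; the grep pattern is only illustrative):

corosync-cmapctl | grep -E 'nodelist|members'   # nodelist keys and runtime member ids corosync is using
corosync-cfgtool -s                             # ring status and the local node id
crm_node -n                                     # the name pacemaker resolves for the local node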
After working with the clusterlabs folks, I found the fix for this problem. The fix was to add a transport: udpu directive to the totem section of /etc/corosync/corosync.conf and to make sure all of the nodes are properly listed in the nodelist directive. If nodes are referenced by name only, you need to make sure the names resolve correctly, which is typically handled in /etc/hosts. After fixing corosync.conf, restart the entire cluster. In my case, the following is the fixed corosync.conf:
totem {
    version: 2
    cluster_name: maas-cluster
    token: 3000
    token_retransmits_before_loss_const: 10
    clear_node_high_bit: yes
    crypto_cipher: none
    crypto_hash: none
    transport: udpu

    interface {
        ringnumber: 0
        bindnetaddr: 192.168.99.0
        mcastport: 5405
        ttl: 1
    }
}

logging {
    fileline: off
    to_stderr: no
    to_logfile: no
    to_syslog: yes
    syslog_facility: daemon
    debug: off
    timestamp: on

    logger_subsys {
        subsys: QUORUM
        debug: off
    }
}

quorum {
    provider: corosync_votequorum
    expected_votes: 3
    two_node: 1
}

nodelist {
    node {
        ring0_addr: postgres-sb
        nodeid: 3
    }

    node {
        ring0_addr: region-ctrl-2
        nodeid: 2
    }

    node {
        ring0_addr: region-ctrl-1
        nodeid: 1
    }
}
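To confirm the fix took, something like the following can be run once the updated corosync.conf is in place on every node (all three tools come from the packages already installed above):

pcs cluster start --all            # bring the whole cluster back up
corosync-cmapctl | grep nodelist   # node ids should now match the nodelist: 1, 2 and 3
crm_node -l                        # no phantom 1084777441 entry any more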