Pacemaker won't start because of a duplicate node, but I can't remove the duplicate node because Pacemaker won't start
OK! I'm really new to pacemaker/corosync, like 1-day new.
Software: Ubuntu 18.04 LTS and the versions associated with that distro.
pacemaker: 1.1.18
corosync: 2.4.3
I accidentally removed the nodes from my entire test cluster (3 nodes).
When I tried to restore everything using the pcsd GUI, it failed because the nodes had been "purged". Cool. So.
I grabbed the last copy of corosync.conf from my "main" node and copied it to the other two nodes. I fixed the bindnetaddr in each conf and ran pcs cluster start on my "main" node. One of the nodes failed to start. I looked at the pacemaker status on that node and got the following exception:

Dec 18 06:33:56 region-ctrl-2 crmd[1049]: crit: Nodes 1084777441 and 2 share the same name 'region-ctrl-2': shutting down
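For reference, a minimal way to repeat that status check on the failing node, assuming the stock systemd units shipped with the Ubuntu packages:

systemctl status corosync pacemaker                    # unit state on the node that will not start
journalctl -u pacemaker -b --no-pager | tail -n 50     # recent pacemaker messages, including the crit above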
I tried running crm_node -R --force 1084777441 on the machine where pacemaker won't start, but of course pacemaker isn't running, so I get a crmd: connection refused (111) error. So I ran the same command on one of the healthy nodes, and it reported no errors, but the node never goes away, and pacemaker on the affected machine keeps showing the same error.
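For completeness, the removal attempt from a healthy node looked roughly like this (crm_node ships with pacemaker on 18.04; the -l call is only there to see the node list before and after):

crm_node -l                      # list the nodes pacemaker currently knows about
crm_node -R 1084777441 --force   # ask pacemaker to forget the phantom node id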
So, I decided to tear the entire cluster down yet again. I purged all of the packages from the machines. I reinstalled everything fresh. I copied corosync.conf over and fixed it on each machine. I recreated the cluster. I got the exact same bloody error. So this node named 1084777441 is not a machine I created; it's one the cluster created for me. Earlier in the day I had realized I was using IP addresses in corosync.conf instead of names. I fixed /etc/hosts on the machines and removed the IP addresses from the corosync config, which is how I inadvertently deleted my entire cluster in the first place (I deleted the nodes that were identified by IP address). Here is my corosync.conf:
totem {
    version: 2
    cluster_name: maas-cluster
    token: 3000
    token_retransmits_before_loss_const: 10
    clear_node_high_bit: yes
    crypto_cipher: none
    crypto_hash: none

    interface {
        ringnumber: 0
        bindnetaddr: 192.168.99.225
        mcastport: 5405
        ttl: 1
    }
}

logging {
    fileline: off
    to_stderr: no
    to_logfile: no
    to_syslog: yes
    syslog_facility: daemon
    debug: off
    timestamp: on

    logger_subsys {
        subsys: QUORUM
        debug: off
    }
}

quorum {
    provider: corosync_votequorum
    expected_votes: 3
    two_node: 1
}

nodelist {
    node {
        ring0_addr: postgres-sb
        nodeid: 3
    }

    node {
        ring0_addr: region-ctrl-2
        nodeid: 2
    }

    node {
        ring0_addr: region-ctrl-1
        nodeid: 1
    }
}
The only thing that differs in this conf between the nodes is the bindnetaddr. There seems to be a chicken-and-egg problem here, unless there is some way I don't know about to delete a node from a flat-file or sqlite database somewhere, or some other more authoritative way to remove a node from the cluster.
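For what it's worth, pacemaker does keep node entries in the CIB, which is the closest thing to such a database that I know of; a sketch of how to inspect it (cibadmin ships with pacemaker; whether hand-editing or clearing the on-disk copy is safe in this state is an assumption I have not confirmed):

cibadmin --query --scope nodes   # node entries pacemaker has recorded in the CIB
ls /var/lib/pacemaker/cib/       # on-disk CIB copies; only touch these with pacemaker stopped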
ADDITIONAL

I've made sure /etc/hosts on each machine matches the hostnames. I forgot to mention that.

127.0.0.1 localhost
127.0.1.1 postgres

192.168.99.224 postgres-sb
192.168.99.223 region-ctrl-1
192.168.99.225 region-ctrl-2

192.168.7.224 postgres-sb
192.168.7.223 region-ctrl-1
192.168.7.225 region-ctrl-2

# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
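A quick way to confirm the names resolve the way corosync expects on each box (plain libc resolution, nothing cluster-specific):

uname -n                                              # should print the node's own name, e.g. region-ctrl-2
getent hosts postgres-sb region-ctrl-1 region-ctrl-2  # should return the addresses from /etc/hosts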
I decided to try starting from scratch. I apt remove --purge'd corosync*, pacemaker*, crmsh, and pcs. I rm -rf'd /etc/corosync. I kept a copy of corosync.conf from each machine. I reinstalled everything on every machine, copied the saved corosync.conf back to /etc/corosync/, and restarted corosync on all machines. I still get the same error. This has to be a bug in one of the components!
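Spelled out as shell, the sequence on each machine was roughly the following (the backup path is arbitrary):

cp /etc/corosync/corosync.conf /root/corosync.conf.save   # keep a copy before purging
apt remove --purge corosync* pacemaker* crmsh pcs
rm -rf /etc/corosync
apt install corosync pacemaker crmsh pcs
cp /root/corosync.conf.save /etc/corosync/corosync.conf
systemctl restart corosync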
So, it appears that crm_get_peer fails to recognize that the host named region-ctrl-2 is the one defined in corosync.conf, and node 2 then gets auto-assigned the ID 1084777441. That makes no sense to me. The machine's hostname is region-ctrl-2, set in /etc/hostname and /etc/hosts and verified with uname -n. corosync.conf explicitly assigns an ID to the machine named region-ctrl-2, but something apparently fails to recognize the assignment from corosync and instead gives this host the non-random ID 1084777441. How do I fix this?

LOGS
info: crm_log_init: Changed active directory to /var/lib/pacemaker/cores
info: get_cluster_type: Detected an active 'corosync' cluster
info: qb_ipcs_us_publish: server name: pacemakerd
info: pcmk__ipc_is_authentic_process_active: Could not connect to lrmd IPC: Connection refused
info: pcmk__ipc_is_authentic_process_active: Could not connect to cib_ro IPC: Connection refused
info: pcmk__ipc_is_authentic_process_active: Could not connect to crmd IPC: Connection refused
info: pcmk__ipc_is_authentic_process_active: Could not connect to attrd IPC: Connection refused
info: pcmk__ipc_is_authentic_process_active: Could not connect to pengine IPC: Connection refused
info: pcmk__ipc_is_authentic_process_active: Could not connect to stonith-ng IPC: Connection refused
info: corosync_node_name: Unable to get node name for nodeid 1084777441
notice: get_node_name: Could not obtain a node name for corosync nodeid 1084777441
info: crm_get_peer: Created entry ea4ec23e-e676-4798-9b8b-00af39d3bb3d/0x5555f74984d0 for node (null)/1084777441 (1 total)
info: crm_get_peer: Node 1084777441 has uuid 1084777441
info: crm_update_peer_proc: cluster_connect_cpg: Node (null)[1084777441] - corosync-cpg is now online
notice: cluster_connect_quorum: Quorum acquired
info: crm_get_peer: Created entry 882c0feb-d546-44b7-955f-4c8a844a0db1/0x5555f7499fd0 for node postgres-sb/3 (2 total)
info: crm_get_peer: Node 3 is now known as postgres-sb
info: crm_get_peer: Node 3 has uuid 3
info: crm_get_peer: Created entry 4e6a6b1e-d687-4527-bffc-5d701ff60a66/0x5555f749a6f0 for node region-ctrl-2/2 (3 total)
info: crm_get_peer: Node 2 is now known as region-ctrl-2
info: crm_get_peer: Node 2 has uuid 2
info: crm_get_peer: Created entry 5532a3cc-2577-4764-b9ee-770d437ccec0/0x5555f749a0a0 for node region-ctrl-1/1 (4 total)
info: crm_get_peer: Node 1 is now known as region-ctrl-1
info: crm_get_peer: Node 1 has uuid 1
info: corosync_node_name: Unable to get node name for nodeid 1084777441
notice: get_node_name: Defaulting to uname -n for the local corosync node name
warning: crm_find_peer: Node 1084777441 and 2 share the same name: 'region-ctrl-2'
info: crm_get_peer: Node 1084777441 is now known as region-ctrl-2
info: pcmk_quorum_notification: Quorum retained | membership=32 members=3
notice: crm_update_peer_state_iter: Node region-ctrl-1 state is now member | nodeid=1 previous=unknown source=pcmk_quorum_notification
notice: crm_update_peer_state_iter: Node postgres-sb state is now member | nodeid=3 previous=unknown source=pcmk_quorum_notification
notice: crm_update_peer_state_iter: Node region-ctrl-2 state is now member | nodeid=1084777441 previous=unknown source=pcmk_quorum_notification
info: crm_reap_unseen_nodes: State of node region-ctrl-2[2] is still unknown
info: pcmk_cpg_membership: Node 1084777441 joined group pacemakerd (counter=0.0, pid=32765, unchecked for rivals)
info: pcmk_cpg_membership: Node 1 still member of group pacemakerd (peer=region-ctrl-1:900, counter=0.0, at least once)
info: crm_update_peer_proc: pcmk_cpg_membership: Node region-ctrl-1[1] - corosync-cpg is now online
info: pcmk_cpg_membership: Node 3 still member of group pacemakerd (peer=postgres-sb:976, counter=0.1, at least once)
info: crm_update_peer_proc: pcmk_cpg_membership: Node postgres-sb[3] - corosync-cpg is now online
info: pcmk_cpg_membership: Node 1084777441 still member of group pacemakerd (peer=region-ctrl-2:3016, counter=0.2, at least once)
pengine: info: crm_log_init: Changed active directory to /var/lib/pacemaker/cores
lrmd: info: crm_log_init: Changed active directory to /var/lib/pacemaker/cores
lrmd: info: qb_ipcs_us_publish: server name: lrmd
pengine: info: qb_ipcs_us_publish: server name: pengine
cib: info: crm_log_init: Changed active directory to /var/lib/pacemaker/cores
attrd: info: crm_log_init: Changed active directory to /var/lib/pacemaker/cores
attrd: info: get_cluster_type: Verifying cluster type: 'corosync'
attrd: info: get_cluster_type: Assuming an active 'corosync' cluster
info: crm_log_init: Changed active directory to /var/lib/pacemaker/cores
attrd: notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
cib: info: get_cluster_type: Verifying cluster type: 'corosync'
cib: info: get_cluster_type: Assuming an active 'corosync' cluster
info: get_cluster_type: Verifying cluster type: 'corosync'
info: get_cluster_type: Assuming an active 'corosync' cluster
notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
attrd: info: corosync_node_name: Unable to get node name for nodeid 1084777441
cib: info: validate_with_relaxng: Creating RNG parser context
crmd: info: crm_log_init: Changed active directory to /var/lib/pacemaker/cores
crmd: info: get_cluster_type: Verifying cluster type: 'corosync'
crmd: info: get_cluster_type: Assuming an active 'corosync' cluster
crmd: info: do_log: Input I_STARTUP received in state S_STARTING from crmd_init
attrd: notice: get_node_name: Could not obtain a node name for corosync nodeid 1084777441
attrd: info: crm_get_peer: Created entry af5c62c9-21c5-4428-9504-ea72a92de7eb/0x560870420e90 for node (null)/1084777441 (1 total)
attrd: info: crm_get_peer: Node 1084777441 has uuid 1084777441
attrd: info: crm_update_peer_proc: cluster_connect_cpg: Node (null)[1084777441] - corosync-cpg is now online
attrd: notice: crm_update_peer_state_iter: Node (null) state is now member | nodeid=1084777441 previous=unknown source=crm_update_peer_proc
attrd: info: init_cs_connection_once: Connection to 'corosync': established
info: corosync_node_name: Unable to get node name for nodeid 1084777441
notice: get_node_name: Could not obtain a node name for corosync nodeid 1084777441
info: crm_get_peer: Created entry 5bcb51ae-0015-4652-b036-b92cf4f1d990/0x55f583634700 for node (null)/1084777441 (1 total)
info: crm_get_peer: Node 1084777441 has uuid 1084777441
info: crm_update_peer_proc: cluster_connect_cpg: Node (null)[1084777441] - corosync-cpg is now online
notice: crm_update_peer_state_iter: Node (null) state is now member | nodeid=1084777441 previous=unknown source=crm_update_peer_proc
attrd: info: corosync_node_name: Unable to get node name for nodeid 1084777441
attrd: notice: get_node_name: Defaulting to uname -n for the local corosync node name
attrd: info: crm_get_peer: Node 1084777441 is now known as region-ctrl-2
info: corosync_node_name: Unable to get node name for nodeid 1084777441
notice: get_node_name: Defaulting to uname -n for the local corosync node name
info: init_cs_connection_once: Connection to 'corosync': established
info: corosync_node_name: Unable to get node name for nodeid 1084777441
notice: get_node_name: Defaulting to uname -n for the local corosync node name
info: crm_get_peer: Node 1084777441 is now known as region-ctrl-2
cib: notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
cib: info: corosync_node_name: Unable to get node name for nodeid 1084777441
cib: notice: get_node_name: Could not obtain a node name for corosync nodeid 1084777441
cib: info: crm_get_peer: Created entry a6ced2c1-9d51-445d-9411-2fb19deab861/0x55848365a150 for node (null)/1084777441 (1 total)
cib: info: crm_get_peer: Node 1084777441 has uuid 1084777441
cib: info: crm_update_peer_proc: cluster_connect_cpg: Node (null)[1084777441] - corosync-cpg is now online
cib: notice: crm_update_peer_state_iter: Node (null) state is now member | nodeid=1084777441 previous=unknown source=crm_update_peer_proc
cib: info: init_cs_connection_once: Connection to 'corosync': established
cib: info: corosync_node_name: Unable to get node name for nodeid 1084777441
cib: notice: get_node_name: Defaulting to uname -n for the local corosync node name
cib: info: crm_get_peer: Node 1084777441 is now known as region-ctrl-2
cib: info: qb_ipcs_us_publish: server name: cib_ro
cib: info: qb_ipcs_us_publish: server name: cib_rw
cib: info: qb_ipcs_us_publish: server name: cib_shm
cib: info: pcmk_cpg_membership: Node 1084777441 joined group cib (counter=0.0, pid=0, unchecked for rivals)
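Since the log shows pacemaker falling back to uname -n for nodeid 1084777441, a useful cross-check is to ask corosync directly how it has mapped the local node (corosync-cmapctl and corosync-cfgtool ship with corosync; the grep pattern is only illustrative):

corosync-cmapctl | grep -E 'nodelist|members'   # nodelist keys and runtime member ids corosync is using
corosync-cfgtool -s                             # ring status and the local node id
crm_node -n                                     # the name pacemaker resolves for the local node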
After working with the clusterlabs folks, I found the fix for this problem. The fix was to add a transport: udpu directive to the totem section of /etc/corosync/corosync.conf and to make sure all of the nodes are properly listed in the nodelist directive. If nodes are referenced by name only, you need to make sure the names resolve correctly, which is typically handled in /etc/hosts. After fixing corosync.conf, restart the entire cluster. In my case, the following is the fixed corosync.conf:
totem {
    version: 2
    cluster_name: maas-cluster
    token: 3000
    token_retransmits_before_loss_const: 10
    clear_node_high_bit: yes
    crypto_cipher: none
    crypto_hash: none
    transport: udpu

    interface {
        ringnumber: 0
        bindnetaddr: 192.168.99.0
        mcastport: 5405
        ttl: 1
    }
}

logging {
    fileline: off
    to_stderr: no
    to_logfile: no
    to_syslog: yes
    syslog_facility: daemon
    debug: off
    timestamp: on

    logger_subsys {
        subsys: QUORUM
        debug: off
    }
}

quorum {
    provider: corosync_votequorum
    expected_votes: 3
    two_node: 1
}

nodelist {
    node {
        ring0_addr: postgres-sb
        nodeid: 3
    }

    node {
        ring0_addr: region-ctrl-2
        nodeid: 2
    }

    node {
        ring0_addr: region-ctrl-1
        nodeid: 1
    }
}
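To confirm the fix took, something like the following can be run once the updated corosync.conf is in place on every node (all three tools come from the packages already installed above):

pcs cluster start --all            # bring the whole cluster back up
corosync-cmapctl | grep nodelist   # node ids should now match the nodelist: 1, 2 and 3
crm_node -l                        # no phantom 1084777441 entry any more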