Resources are UNCLEAN when I unplug the preferred node

July 29, 2019

I am a beginner at Linux network configuration.
I have configured Pacemaker + Corosync + STONITH (via ssh) + DRBD + nginx on 3 Linux nodes.
pcs status:
3 nodes configured
7 resources configured

Online: [ main-node second-node third-node ]

Full list of resources:

ClusterIP      (ocf::heartbeat:IPaddr2):       Started main-node
WebSite        (ocf::heartbeat:nginx): Started main-node
Master/Slave Set: WebDataClone [WebData]
    Masters: [ main-node ]
    Slaves: [ second-node third-node ]
WebFS  (ocf::heartbeat:Filesystem):    Started main-node
ssh-fencing    (stonith:ssh):  Started third-node

Daemon Status:
 corosync: active/enabled
 pacemaker: active/enabled
 pcsd: active/enabled
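I tested STONITH on these machines simply by unplugging the network cable. It works fine: when the cable is plugged in again, STONITH kills the machine that was unplugged, and all the other machines keep running the cluster.
The problem appears when I unplug the machine that is preferred for serving the website resources. The pcs status on the other machines, which are still plugged in, then looks like this: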
3 nodes configured
7 resources configured

Node main-node: UNCLEAN (offline)
Online: [ second-node third-node ]

Full list of resources:

ClusterIP      (ocf::heartbeat:IPaddr2):       Started main-node (UNCLEAN)
WebSite        (ocf::heartbeat:nginx): Started main-node (UNCLEAN)
Master/Slave Set: WebDataClone [WebData]
    WebData    (ocf::linbit:drbd):     Master main-node (UNCLEAN)
    Slaves: [ second-node third-node ]
WebFS  (ocf::heartbeat:Filesystem):    Started main-node (UNCLEAN)
ssh-fencing    (stonith:ssh):  Started third-node

Daemon Status:
 corosync: active/enabled
 pacemaker: active/enabled
 pcsd: active/enabled
And the website is down. Why is that? Shouldn't the other nodes take over the resources?

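SSH STONITH is not real fencing and should not be used in production, unless you accept that it can leave you hanging in certain types of failure, as you have seen in your testing.
When you unplug a node's network cable, the cluster tries to STONITH the node that disappeared from the cluster/network. The SSH STONITH agent uses the same network you unplugged to try to shut down the lost node, so it cannot do that until the network is restored (plugged back in). Because the cluster takes no action (failover) until the STONITH agent has successfully shut down the lost node, you end up with UNCLEAN (hung) services.
If you powered off the main node you would have the same problem, because you cannot SSH into a system that has no power.
In short, this is expected behavior when using SSH STONITH, and a proper fencing device is needed to recover from the scenario you are testing.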
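
The reasoning above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not Pacemaker's actual logic: the `Node` class and the functions `ssh_fence`, `out_of_band_fence`, and `try_failover` are hypothetical names invented for illustration. It shows that a fence method which depends on the failed network cannot confirm the shutdown, so recovery stays blocked, whereas a method with an independent path to the node could.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical, simplified view of a cluster node."""
    name: str
    powered: bool = True
    network_up: bool = True


def ssh_fence(target: Node) -> bool:
    """Toy stand-in for an SSH-based fence agent.

    It can only act if it can log in to the target, which needs the
    target to be powered on and reachable over the cluster network.
    """
    if not (target.powered and target.network_up):
        return False  # cannot log in, so the shutdown cannot be confirmed
    target.powered = False  # simulated shutdown over SSH
    return True


def out_of_band_fence(target: Node) -> bool:
    """Toy stand-in for a fence device with its own path to the node
    (assumption: for example a power switch), independent of the
    cluster network."""
    target.powered = False
    return True


def try_failover(lost: Node, fence) -> str:
    """Recover resources only after fencing of the lost node is confirmed."""
    if fence(lost):
        return "fenced: resources can be recovered on the surviving nodes"
    return "UNCLEAN: fencing unconfirmed, resources stay blocked"


if __name__ == "__main__":
    main_node = Node("main-node", network_up=False)  # cable unplugged
    print(try_failover(main_node, ssh_fence))
    main_node.network_up = True                      # cable plugged back in
    print(try_failover(main_node, ssh_fence))
```

In this model the first call reports the UNCLEAN state and the second one succeeds, which mirrors what the question observed: the unplugged machine is only killed once the cable is plugged back in.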

Quoted from: https://serverfault.com/questions/977004
