Pacemaker failure-timeout does not reset the fail count

October 5, 2016

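I am using Pacemaker 1.1.13 and Corosync 2.3.4 on CentOS 7.
I have a problem with my master/slave resource. The resource has these meta attributes:
migration-threshold=1
failure-timeout=10s
However, when the resource fails, there is only one attempt to start it. The documentation says that failure-timeout=10s should reset the fail count every 10 seconds, but that does not happen, so the resource never starts.
Do you know this problem? Maybe I am doing something wrong? My "pcs status" output is below: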
Cluster Name: webcluster
Corosync Nodes:
10.121.100.101 10.121.100.102
Pacemaker Nodes:
pm-node1 pm-node2

Resources:
Master: Services-master
 Meta Attrs: failure-timeout=10s
 Group: Services
  Meta Attrs: migration-threshold=1
  Resource: Test (class=ocf provider=scooty type=test)
   Operations: start interval=0s timeout=20 (Test-start-interval-0s)
               stop interval=0s timeout=20 (Test-stop-interval-0s)
               monitor interval=10 role=Master timeout=20 (Test-monitor-interval-10)
               monitor interval=11 role=Slave timeout=20 (Test-monitor-interval-11)

Stonith Devices:
Fencing Levels:

Location Constraints:
Ordering Constraints:
Colocation Constraints:

Resources Defaults:
migration-threshold: 1
failure-timeout: 10
Operations Defaults:
No defaults set

Cluster Properties:
cluster-infrastructure: corosync
cluster-name: webcluster
dc-version: 1.1.13-10.el7_2.4-44eb2dd
have-watchdog: false
last-lrm-refresh: 1475145002
no-quorum-policy: ignore
start-failure-is-fatal: false
stonith-enabled: false
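
For reference, the behaviour the question expects from migration-threshold=1 and failure-timeout=10s can be modelled roughly as below. This is only a simplified sketch in Python, not Pacemaker's actual implementation: the FailTracker class and its method names are hypothetical, and it assumes the fail count is dropped once failure-timeout seconds pass without a new failure.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FailTracker:
    """Toy model of one resource's fail count on one node (hypothetical)."""
    migration_threshold: int = 1      # failures allowed before the node is avoided
    failure_timeout: float = 10.0     # seconds after which old failures are forgotten
    fail_count: int = 0
    last_failure: Optional[float] = None

    def expire(self, now: float) -> None:
        # Assumption: the count resets once failure_timeout has elapsed
        # since the most recent failure.
        if self.last_failure is not None and now - self.last_failure >= self.failure_timeout:
            self.fail_count = 0
            self.last_failure = None

    def record_failure(self, now: float) -> None:
        self.expire(now)
        self.fail_count += 1
        self.last_failure = now

    def may_run_here(self, now: float) -> bool:
        # The node is eligible again only while the count is below the threshold.
        self.expire(now)
        return self.fail_count < self.migration_threshold


tracker = FailTracker()
tracker.record_failure(now=0.0)
print(tracker.may_run_here(now=5.0))    # still within the timeout window
print(tracker.may_run_here(now=10.0))   # timeout elapsed, count forgotten
```

Under this model the resource would become eligible to start again 10 seconds after the failure, which is the behaviour the question reports not seeing.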

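Depending on the type of failure, failure-timeout may not be enough to clear it. Failed start and stop operations are considered "fatal" and are not cleared automatically by failure-timeout.
If you are running into failed start operations, you can set the cluster property start-failure-is-fatal=false. Fencing/STONITH devices are the only way to recover from stop failures.
Hope that helps.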
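
As a rough illustration, the property change suggested above, plus a manual inspection and reset of the fail count, could be scripted along these lines. This is a sketch under assumptions: the helper names pcs_commands and run_all are hypothetical, the resource name Test is taken from the configuration in the question, and the manual cleanup step is not part of the answer itself. The pcs subcommands are those I would expect from pcs on CentOS 7 and are worth checking against `pcs resource --help` on your version.

```python
import shutil
import subprocess
from typing import List


def pcs_commands(resource: str) -> List[List[str]]:
    """Build the pcs command lines for the given resource (hypothetical helper)."""
    return [
        # Let start failures count towards migration-threshold instead of being fatal.
        ["pcs", "property", "set", "start-failure-is-fatal=false"],
        # Show the current fail count for the resource.
        ["pcs", "resource", "failcount", "show", resource],
        # Manually clear the failure history (assumption: not mentioned in the answer).
        ["pcs", "resource", "cleanup", resource],
    ]


def run_all(commands: List[List[str]], dry_run: bool = True) -> None:
    """Print the commands, or run them when dry_run is False and pcs is installed."""
    for argv in commands:
        if dry_run or shutil.which("pcs") is None:
            print(" ".join(argv))
        else:
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    run_all(pcs_commands("Test"))
```

With the default dry_run=True the script only prints the commands, so they can be reviewed before anything is changed on the cluster.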

Source: https://serverfault.com/questions/806093
