Cluster
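PCSD simple master/slave setup does not fail over the master
I am trying to write a simple Pacemaker master/slave system. I created an agent whose metadata is as follows: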
elm_meta_data() {
    cat <<EOF
<?xml version="1.0"?>
<!DOCTYPE resource-agent SYSTEM "ra-api-1.dtd">
<resource-agent name="elm-agent">
  <version>0.1</version>
  <longdesc lang="en">
    Resource agent for ELM high availability clusters.
  </longdesc>
  <shortdesc>
    Resource agent for ELM
  </shortdesc>
  <parameters>
    <parameter name="datadir" unique="0" required="1">
      <longdesc lang="en">
        Data directory
      </longdesc>
      <shortdesc lang="en">Data directory</shortdesc>
      <content type="string"/>
    </parameter>
  </parameters>
  <actions>
    <action name="start" timeout="35" />
    <action name="stop" timeout="35" />
    <action name="monitor" timeout="35" interval="10" depth="0" />
    <action name="monitor" timeout="35" interval="10" depth="0" role="Master" />
    <action name="monitor" timeout="35" interval="11" depth="0" role="Slave" />
    <action name="reload" timeout="70" />
    <action name="meta-data" timeout="5" />
    <action name="promote" timeout="20" />
    <action name="demote" timeout="20" />
    <action name="validate-all" timeout="20" />
    <action name="notify" timeout="20" />
  </actions>
</resource-agent>
EOF
}
My monitor, promote, and demote functions are:
elm_monitor() {
    local elm_running
    local worker_running
    local is_master

    elm_running=0
    worker_running=0
    is_master=0

    if [ -e "${OCF_RESKEY_datadir}/master.conf" ]; then
        is_master=1
    fi

    if [ "$(docker ps -q -f name=elm_web)" ]; then
        elm_running=1
    fi
    if [ "$(docker ps -q -f name=elm_worker)" ]; then
        worker_running=1
    fi

    if [ $elm_running -ne $worker_running ]; then
        if [ $is_master -eq 1 ]; then
            exit $OCF_FAILED_MASTER
        fi
        exit $OCF_ERR_GENERIC
    fi

    if [ $elm_running -eq 0 ]; then
        return $OCF_NOT_RUNNING
    fi

    ...

    if [ $is_master -eq 1 ]; then
        exit $OCF_FAILED_MASTER
    fi
    exit $OCF_ERR_GENERIC
}

elm_promote() {
    touch ${OCF_RESKEY_datadir}/master.conf
    return $OCF_SUCCESS
}

elm_demote() {
    rm ${OCF_RESKEY_datadir}/master.conf
    return $OCF_SUCCESS
}
If I configure the cluster with the following cib commands, I end up with three slaves and no master:
sudo pcs cluster cib cluster1.xml
sudo pcs -f cluster1.xml resource create elmd ocf:a10:elm \
    datadir="/etc/a10/elm" \
    op start timeout=90s \
    op stop timeout=90s \
    op promote timeout=60s \
    op demote timeout=60s \
    op monitor interval=15s timeout=35s role="Master" \
    op monitor interval=16s timeout=35s role="Slave" \
    op notify timeout=60s
sudo pcs -f cluster1.xml resource master elm-ha elmd notify=true
sudo pcs -f cluster1.xml resource create ClusterIP ocf:heartbeat:IPaddr2 ip=$vip cidr_netmask=$net_mask op monitor interval=10s
sudo pcs -f cluster1.xml constraint colocation add ClusterIP with master elm-ha INFINITY
sudo pcs -f cluster1.xml constraint order promote elm-ha then start ClusterIP symmetrical=false kind=Mandatory
sudo pcs -f cluster1.xml constraint order demote elm-ha then stop ClusterIP symmetrical=false kind=Mandatory
sudo pcs cluster cib-push cluster1.xml

ubuntu@elm1:~$ sudo pcs status
...
 elm_proxmox_fence100 (stonith:fence_pve): Started elm1
 elm_proxmox_fence101 (stonith:fence_pve): Started elm2
 elm_proxmox_fence103 (stonith:fence_pve): Started elm3
 Master/Slave Set: elm-ha [elmd]
     Slaves: [ elm1 elm2 elm3 ]
 ClusterIP (ocf::heartbeat:IPaddr2): Stopped
Whereas if I add the following command to the cib, I get a master/slave setup:
sudo pcs -f cluster1.xml constraint location elm-ha rule role=master \#uname eq $(hostname)

 Master/Slave Set: elm-ha [elmd]
     Masters: [ elm1 ]
     Slaves: [ elm2 elm3 ]
 ClusterIP (ocf::heartbeat:IPaddr2): Started elm1
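But with this last version, the master seems to be pinned to elm1. When I test a failure by stopping the corosync service on the master, I end up with two slaves and the former master in the stopped state, and no node is promoted. My guess is that the location rule forces Pacemaker to keep the master on elm1.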
 Master/Slave Set: elm-ha [elmd]
     Slaves: [ elm2 elm3 ]
     Stopped: [ elm1 ]
 ClusterIP (ocf::heartbeat:IPaddr2): Stopped
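How do I configure this so that, when I push my cib commands, the cluster picks a master and fails over when that master goes down? Does my agent need to do something differently?
I finally found the answer in the documentation: I was not setting the master preference attribute in monitor(). Without it, no node has a promotion score, so Pacemaker has no candidate to promote. Adding this call fixed it: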
crm_master -l reboot -v 100
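For anyone wiring this into their own agent, here is a minimal sketch of one way to do it. The helper name elm_update_master_score is hypothetical and not part of my agent; the idea is to call it from monitor with the result code that is about to be returned. It assumes the standard OCF return codes (0 for success, 8 for running as master) and uses crm_master -D to withdraw the preference when the instance is not healthy, which is my own assumption about sensible behaviour rather than something the documentation requires.

```shell
#!/bin/sh
# Hypothetical helper (not in the original agent): publish or clear this
# node's promotion score based on the result code monitor is about to return.

# Fall back to the standard OCF values if ocf-shellfuncs was not sourced.
: "${OCF_SUCCESS:=0}"
: "${OCF_RUNNING_MASTER:=8}"

elm_update_master_score() {
    rc=$1
    case "$rc" in
        "$OCF_SUCCESS"|"$OCF_RUNNING_MASTER")
            # Healthy instance: advertise this node as a promotion candidate.
            crm_master -l reboot -v 100
            ;;
        *)
            # Stopped or failed (assumption): withdraw the preference so the
            # node is no longer considered for promotion.
            crm_master -l reboot -D
            ;;
    esac
}
```

With every healthy node advertising the same score, Pacemaker is free to choose any of them as master, so the location rule that pinned the master to elm1 should no longer be needed.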