存在故障節點時,心跳無法從冷啟動成功啟動資源
我目前有兩個執行 Heartbeat 和 DRBD 的 ubuntu 伺服器。伺服器通過 eth1 上的 1000Mbps 交叉電纜直接連接,並可以訪問 eth0 上的 IP 攝影機 LAN。
現在,假設一個節點已關閉,而剩餘的功能節點在關閉後正在啟動。仍在執行的節點不會啟動 heartbeat 並從冷啟動提供對 drbd 資源的訪問。我必須手動重新啟動 heartbeat
sudo service heartbeat restart
才能讓一切正常執行。當只有一台伺服器存在時,如何讓它從冷啟動開始正常?
這是
ha.cf
:debugfile /var/log/ha-debug logfile /var/log/ha-log logfacility none keepalive 2 deadtime 10 warntime 7 initdead 60 ucast eth1 192.168.2.2 ucast eth0 10.1.10.201 node EMserver1 node EMserver2 respawn hacluster /usr/lib/heartbeat/ipfail ping 10.1.10.22 10.1.10.21 10.1.10.11 auto_failback off
系統日誌中的一些材料:
harc[4604]: 2012/11/27_13:54:49 info: Running /etc/ha.d//rc.d/status status mach_down[4632]: 2012/11/27_13:54:49 info: /usr/share/heartbeat/mach_down: nice_failback: foreign resources acquired mach_down[4632]: 2012/11/27_13:54:49 info: mach_down takeover complete for node emserver2. Nov 27 13:54:49 EMserver1 heartbeat: [4586]: info: Initial resource acquisition complete (T_RESOURCES(us)) Nov 27 13:54:49 EMserver1 heartbeat: [4586]: info: mach_down takeover complete. IPaddr[4679]: 2012/11/27_13:54:49 INFO: Resource is stopped Nov 27 13:54:49 EMserver1 heartbeat: [4605]: info: Local Resource acquisition completed. harc[4713]: 2012/11/27_13:54:49 info: Running /etc/ha.d//rc.d/ip-request-resp ip-request-resp ip-request-resp[4713]: 2012/11/27_13:54:49 received ip-request-resp IPaddr::10.1.10.254 OK yes ResourceManager[4732]: 2012/11/27_13:54:50 info: Acquiring resource group: emserver1 IPaddr::10.1.10.254 drbddisk::r0 Filesystem::/dev/drbd1::/shr::ext4 nfs-kernel-server IPaddr[4759]: 2012/11/27_13:54:50 INFO: Resource is stopped ResourceManager[4732]: 2012/11/27_13:54:50 info: Running /etc/ha.d/resource.d/IPaddr 10.1.10.254 start IPaddr[4816]: 2012/11/27_13:54:50 INFO: Using calculated nic for 10.1.10.254: eth0 IPaddr[4816]: 2012/11/27_13:54:50 INFO: Using calculated netmask for 10.1.10.254: 255.255.255.0 IPaddr[4816]: 2012/11/27_13:54:50 INFO: eval ifconfig eth0:0 10.1.10.254 netmask 255.255.255.0 broadcast 10.1.10.255 IPaddr[4804]: 2012/11/27_13:54:50 INFO: Success ResourceManager[4732]: 2012/11/27_13:54:50 info: Running /etc/ha.d/resource.d/drbddisk r0 start Filesystem[4965]: 2012/11/27_13:54:50 INFO: Resource is stopped ResourceManager[4732]: 2012/11/27_13:54:50 info: Running /etc/ha.d/resource.d/Filesystem /dev/drbd1 /shr ext4 start Filesystem[5039]: 2012/11/27_13:54:50 INFO: Running start for /dev/drbd1 on /shr Filesystem[5033]: 2012/11/27_13:54:51 INFO: Success ResourceManager[4732]: 2012/11/27_13:54:51 info: Running /etc/init.d/nfs-kernel-server start Nov 27 13:55:00 EMserver1 heartbeat: [4586]: info: Local Resource acquisition completed. (none) Nov 27 13:55:00 EMserver1 heartbeat: [4586]: info: local resource transition completed. Nov 27 13:57:46 EMserver1 heartbeat: [4586]: info: Heartbeat shutdown in progress. (4586) Nov 27 13:57:46 EMserver1 heartbeat: [5286]: info: Giving up all HA resources. ResourceManager[5301]: 2012/11/27_13:57:46 info: Releasing resource group: emserver1 IPaddr::10.1.10.254 drbddisk::r0 Filesystem::/dev/drbd1::/shr::ext4 nfs-kernel-server ResourceManager[5301]: 2012/11/27_13:57:46 info: Running /etc/init.d/nfs-kernel-server stop ResourceManager[5301]: 2012/11/27_13:57:46 info: Running /etc/ha.d/resource.d/Filesystem /dev/drbd1 /shr ext4 stop Filesystem[5372]: 2012/11/27_13:57:46 INFO: Running stop for /dev/drbd1 on /shr Filesystem[5372]: 2012/11/27_13:57:47 INFO: Trying to unmount /shr Filesystem[5372]: 2012/11/27_13:57:47 INFO: unmounted /shr successfully Filesystem[5366]: 2012/11/27_13:57:47 INFO: Success ResourceManager[5301]: 2012/11/27_13:57:47 info: Running /etc/ha.d/resource.d/drbddisk r0 stop ResourceManager[5301]: 2012/11/27_13:57:47 info: Running /etc/ha.d/resource.d/IPaddr 10.1.10.254 stop IPaddr[5509]: 2012/11/27_13:57:47 INFO: ifconfig eth0:0 down IPaddr[5497]: 2012/11/27_13:57:47 INFO: Success Nov 27 13:57:47 EMserver1 heartbeat: [5286]: info: All HA resources relinquished. Nov 27 13:57:48 EMserver1 heartbeat: [4586]: info: killing /usr/lib/heartbeat/ipfail process group 4603 with signal 15 Nov 27 13:57:49 EMserver1 heartbeat: [4586]: info: killing HBFIFO process 4589 with signal 15 Nov 27 13:57:49 EMserver1 heartbeat: [4586]: info: killing HBWRITE process 4590 with signal 15 Nov 27 13:57:49 EMserver1 heartbeat: [4586]: info: killing HBREAD process 4591 with signal 15 Nov 27 13:57:49 EMserver1 heartbeat: [4586]: info: killing HBWRITE process 4592 with signal 15 Nov 27 13:57:49 EMserver1 heartbeat: [4586]: info: killing HBREAD process 4593 with signal 15 Nov 27 13:57:49 EMserver1 heartbeat: [4586]: info: killing HBWRITE process 4594 with signal 15 Nov 27 13:57:49 EMserver1 heartbeat: [4586]: info: killing HBREAD process 4595 with signal 15 Nov 27 13:57:49 EMserver1 heartbeat: [4586]: info: killing HBWRITE process 4596 with signal 15 Nov 27 13:57:49 EMserver1 heartbeat: [4586]: info: killing HBREAD process 4597 with signal 15 Nov 27 13:57:49 EMserver1 heartbeat: [4586]: info: killing HBWRITE process 4598 with signal 15 Nov 27 13:57:49 EMserver1 heartbeat: [4586]: info: killing HBREAD process 4599 with signal 15 Nov 27 13:57:49 EMserver1 heartbeat: [4586]: info: Core process 4589 exited. 11 remaining Nov 27 13:57:49 EMserver1 heartbeat: [4586]: info: Core process 4596 exited. 10 remaining Nov 27 13:57:49 EMserver1 heartbeat: [4586]: info: Core process 4598 exited. 9 remaining Nov 27 13:57:49 EMserver1 heartbeat: [4586]: info: Core process 4590 exited. 8 remaining Nov 27 13:57:49 EMserver1 heartbeat: [4586]: info: Core process 4595 exited. 7 remaining Nov 27 13:57:49 EMserver1 heartbeat: [4586]: info: Core process 4591 exited. 6 remaining Nov 27 13:57:49 EMserver1 heartbeat: [4586]: info: Core process 4592 exited. 5 remaining Nov 27 13:57:49 EMserver1 heartbeat: [4586]: info: Core process 4593 exited. 4 remaining Nov 27 13:57:49 EMserver1 heartbeat: [4586]: info: Core process 4597 exited. 3 remaining Nov 27 13:57:49 EMserver1 heartbeat: [4586]: info: Core process 4594 exited. 2 remaining Nov 27 13:57:49 EMserver1 heartbeat: [4586]: info: Core process 4599 exited. 1 remaining Nov 27 13:57:49 EMserver1 heartbeat: [4586]: info: emserver1 Heartbeat shutdown complete.
這是日誌中的更多內容
ResourceManager[2576]: 2012/11/28_16:32:42 info: Acquiring resource group: emserver1 IPaddr::10.1.10.254 drbddisk::r0 Filesystem::/dev/drbd1::/shr::ext4 nfs-kernel-server IPaddr[2602]: 2012/11/28_16:32:42 INFO: Running OK Filesystem[2653]: 2012/11/28_16:32:43 INFO: Running OK Nov 28 16:32:52 EMserver1 heartbeat: [1695]: WARN: node emserver2: is dead Nov 28 16:32:52 EMserver1 heartbeat: [1695]: info: Dead node emserver2 gave up resources. Nov 28 16:32:52 EMserver1 ipfail: [1807]: info: Status update: Node emserver2 now has status dead Nov 28 16:32:52 EMserver1 heartbeat: [1695]: info: Link emserver2:eth1 dead. Nov 28 16:32:53 EMserver1 ipfail: [1807]: info: NS: We are still alive! Nov 28 16:32:53 EMserver1 ipfail: [1807]: info: Link Status update: Link emserver2/eth1 now has status dead Nov 28 16:32:55 EMserver1 ipfail: [1807]: info: Asking other side for ping node count. Nov 28 16:32:55 EMserver1 ipfail: [1807]: info: Checking remote count of ping nodes. Nov 28 16:32:57 EMserver1 heartbeat: [1695]: info: Heartbeat shutdown in progress. (1695) Nov 28 16:32:57 EMserver1 heartbeat: [2734]: info: Giving up all HA resources. ResourceManager[2751]: 2012/11/28_16:32:57 info: Releasing resource group: emserver1 IPaddr::10.1.10.254 drbddisk::r0 Filesystem::/dev/drbd1::/shr::ext4 nfs-kernel-server ResourceManager[2751]: 2012/11/28_16:32:57 info: Running /etc/init.d/nfs-kernel-server stop ResourceManager[2751]: 2012/11/28_16:32:57 info: Running /etc/ha.d/resource.d/Filesystem /dev/drbd1 /shr ext4 stop Filesystem[2829]: 2012/11/28_16:32:57 INFO: Running stop for /dev/drbd1 on /shr Filesystem[2829]: 2012/11/28_16:32:57 INFO: Trying to unmount /shr Filesystem[2829]: 2012/11/28_16:32:58 INFO: unmounted /shr successfully Filesystem[2823]: 2012/11/28_16:32:58 INFO: Success ResourceManager[2751]: 2012/11/28_16:32:58 info: Running /etc/ha.d/resource.d/drbddisk r0 stop ResourceManager[2751]: 2012/11/28_16:32:58 info: Running /etc/ha.d/resource.d/IPaddr 10.1.10.254 stop IPaddr[2971]: 2012/11/28_16:32:58 INFO: ifconfig eth0:0 down IPaddr[2958]: 2012/11/28_16:32:58 INFO: Success Nov 28 16:32:58 EMserver1 heartbeat: [2734]: info: All HA resources relinquished. Nov 28 16:32:59 EMserver1 heartbeat: [1695]: info: killing /usr/lib/heartbeat/ipfail process group 1807 with signal 15 Nov 28 16:33:01 EMserver1 heartbeat: [1695]: info: killing HBFIFO process 1777 with signal 15 Nov 28 16:33:01 EMserver1 heartbeat: [1695]: info: killing HBWRITE process 1778 with signal 15 Nov 28 16:33:01 EMserver1 heartbeat: [1695]: info: killing HBREAD process 1779 with signal 15 Nov 28 16:33:01 EMserver1 heartbeat: [1695]: info: killing HBWRITE process 1780 with signal 15 Nov 28 16:33:01 EMserver1 heartbeat: [1695]: info: killing HBREAD process 1781 with signal 15 Nov 28 16:33:01 EMserver1 heartbeat: [1695]: info: killing HBWRITE process 1782 with signal 15 Nov 28 16:33:01 EMserver1 heartbeat: [1695]: info: killing HBREAD process 1783 with signal 15 Nov 28 16:33:01 EMserver1 heartbeat: [1695]: info: killing HBWRITE process 1784 with signal 15 Nov 28 16:33:01 EMserver1 heartbeat: [1695]: info: killing HBREAD process 1785 with signal 15 Nov 28 16:33:01 EMserver1 heartbeat: [1695]: info: killing HBWRITE process 1786 with signal 15 Nov 28 16:33:01 EMserver1 heartbeat: [1695]: info: killing HBREAD process 1787 with signal 15 Nov 28 16:33:01 EMserver1 heartbeat: [1695]: info: Core process 1778 exited. 11 remaining Nov 28 16:33:01 EMserver1 heartbeat: [1695]: info: Core process 1779 exited. 10 remaining Nov 28 16:33:01 EMserver1 heartbeat: [1695]: info: Core process 1780 exited. 9 remaining Nov 28 16:33:01 EMserver1 heartbeat: [1695]: info: Core process 1781 exited. 8 remaining Nov 28 16:33:01 EMserver1 heartbeat: [1695]: info: Core process 1782 exited. 7 remaining Nov 28 16:33:01 EMserver1 heartbeat: [1695]: info: Core process 1783 exited. 6 remaining Nov 28 16:33:01 EMserver1 heartbeat: [1695]: info: Core process 1784 exited. 5 remaining Nov 28 16:33:01 EMserver1 heartbeat: [1695]: info: Core process 1785 exited. 4 remaining Nov 28 16:33:01 EMserver1 heartbeat: [1695]: info: Core process 1786 exited. 3 remaining Nov 28 16:33:01 EMserver1 heartbeat: [1695]: info: Core process 1787 exited. 2 remaining Nov 28 16:33:01 EMserver1 heartbeat: [1695]: info: Core process 1777 exited. 1 remaining Nov 28 16:33:01 EMserver1 heartbeat: [1695]: info: emserver1 Heartbeat shutdown complete.
如果此時我重新啟動心跳…資源心跳控制項將正常啟動…請幫助!
正如評論中的命令,這裡是我在回答形式中的初步建議。
如果您執行足夠新的 Ubuntu 版本(10.04 或更高版本),您應該完全放棄 Heartbeat 並改用 Corosync 和 Pacemaker。雖然這不是您特定問題的答案或解決方案,但正如您所見,它會使您的問題簡單地消失。
Linux Journal 上寫了一篇關於 Pacemaker的精彩介紹性文章, Cluster from Scratch是啟動和執行 Corosync 和 Pacemaker 的一個很好的起點。對於 Ubuntu 10.04,請使用 PPA 作為官方穩定的反向移植集群堆棧:https ://launchpad.net/~ubuntu-ha-maintainers/+archive/ppa
舊的 Heartbeat 堆棧有許多概念上的缺陷,並且已經過時了這麼久,現在不值得嘗試調試如果您今天使用它必然會出現的問題。Corosync 和 Pacemaker 正是出於這個原因而創建的,它們的工作原理比 Heartbeat 好得多,而 Heartbeat 甚至不是資源管理器。