Linux

mount.ocfs2: Transport endpoint is not connected while mounting…?

  • December 26, 2012

I've replaced the dead node of my OCFS2 cluster running in dual-primary mode (on top of DRBD). All the steps worked fine:

/proc/drbd

version: 8.3.13 (api:88/proto:86-96)
GIT-hash: 83ca112086600faacab2f157bc5a9324f7bd7f77 build by mockbuild@builder10.centos.org, 2012-05-07 11:56:36

1: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r-----
   ns:81 nr:407832 dw:106657970 dr:266340 al:179 bm:6551 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
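For context, dual-primary mode in DRBD 8.3 is enabled with settings roughly like the following; a minimal drbd.conf sketch, where the resource name r1 is an assumption and the per-host sections are omitted:

resource r1 {
    net {
        allow-two-primaries;      # both nodes may be Primary at the same time
    }
    startup {
        become-primary-on both;   # promote both nodes when DRBD starts
    }
    # per-host "on <hostname> { device/disk/address/meta-disk ... }" sections omitted
}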

Everything looked good until I tried to mount the volume:

mount -t ocfs2 /dev/drbd1 /data/webroot/
mount.ocfs2: Transport endpoint is not connected while mounting /dev/drbd1 on /data/webroot/. Check 'dmesg' for more information on this error.

/var/log/kern.log

kernel: (o2net,11427,1):o2net_connect_expired:1664 ERROR: no connection established with node 0 after 30.0 seconds, giving up and returning errors.
kernel: (mount.ocfs2,12037,1):dlm_request_join:1036 ERROR: status = -107
kernel: (mount.ocfs2,12037,1):dlm_try_to_join_domain:1210 ERROR: status = -107
kernel: (mount.ocfs2,12037,1):dlm_join_domain:1488 ERROR: status = -107
kernel: (mount.ocfs2,12037,1):dlm_register_domain:1754 ERROR: status = -107
kernel: (mount.ocfs2,12037,1):ocfs2_dlm_init:2808 ERROR: status = -107
kernel: (mount.ocfs2,12037,1):ocfs2_mount_volume:1447 ERROR: status = -107
kernel: ocfs2: Unmounting device (147,1) on (node 1)

And here is the kernel log on node 0 (192.168.3.145); note that status -107 in these logs is -ENOTCONN, the kernel's code for "Transport endpoint is not connected":

kernel: : (swapper,0,7):o2net_listen_data_ready:1894 bytes: 0
kernel: : (o2net,4024,3):o2net_accept_one:1800 attempt to connect from unknown node at 192.168.2.93:43868
kernel: : (o2net,4024,3):o2net_connect_expired:1664 ERROR: no connection established with node 1 after 30.0 seconds, giving up and returning errors.
kernel: : (o2net,4024,3):o2net_set_nn_state:478 node 1 sc: 0000000000000000 -> 0000000000000000, valid 0 -> 0, err 0 -> -107

I've made sure that /etc/ocfs2/cluster.conf is identical on both nodes:

/etc/ocfs2/cluster.conf

node:
   ip_port = 7777
   ip_address = 192.168.3.145
   number = 0
   name = SVR233NTC-3145.localdomain
   cluster = cpc

node:
   ip_port = 7777
   ip_address = 192.168.2.93
   number = 1
   name = SVR022-293.localdomain
   cluster = cpc

cluster:
   node_count = 2
   name = cpc

They can connect to each other just fine:

# nc -z 192.168.3.145 7777
Connection to 192.168.3.145 7777 port [tcp/cbt] succeeded!

but the O2CB heartbeat is not active on the new node (192.168.2.93):

/etc/init.d/o2cb status

Driver for "configfs": Loaded
Filesystem "configfs": Mounted
Driver for "ocfs2_dlmfs": Loaded
Filesystem "ocfs2_dlmfs": Mounted
Checking O2CB cluster cpc: Online
  Heartbeat dead threshold = 31
  Network idle timeout: 30000
  Network keepalive delay: 2000
  Network reconnect delay: 2000
Checking O2CB heartbeat: Not active

Here is what tcpdump captured on node 0 while ocfs2 was being started on node 1:

 1   0.000000 192.168.2.93 -> 192.168.3.145 TCP 70 55274 > cbt [SYN] Seq=0 Win=5840 Len=0 MSS=1460 TSval=690432180 TSecr=0
 2   0.000008 192.168.3.145 -> 192.168.2.93 TCP 70 cbt > 55274 [SYN, ACK] Seq=0 Ack=1 Win=5792 Len=0 MSS=1460 TSval=707657223 TSecr=690432180
 3   0.000223 192.168.2.93 -> 192.168.3.145 TCP 66 55274 > cbt [ACK] Seq=1 Ack=1 Win=5840 Len=0 TSval=690432181 TSecr=707657223
 4   0.000286 192.168.2.93 -> 192.168.3.145 TCP 98 55274 > cbt [PSH, ACK] Seq=1 Ack=1 Win=5840 Len=32 TSval=690432181 TSecr=707657223
 5   0.000292 192.168.3.145 -> 192.168.2.93 TCP 66 cbt > 55274 [ACK] Seq=1 Ack=33 Win=5792 Len=0 TSval=707657223 TSecr=690432181
 6   0.000324 192.168.3.145 -> 192.168.2.93 TCP 66 cbt > 55274 [RST, ACK] Seq=1 Ack=33 Win=5792 Len=0 TSval=707657223 TSecr=690432181

The RST flag is sent on every 6th packet.
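So the TCP handshake completes and node 1 even delivers its 32-byte o2net handshake (packet 4, Len=32), but node 0 tears the connection down because it does not recognize 192.168.2.93 as a configured node, matching the "attempt to connect from unknown node" message above. For reference, this kind of capture can be reproduced with something along these lines (a sketch; the interface name is an assumption):

# tcpdump -nn -i eth0 'tcp port 7777'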

What else can I do to debug this?

PS:

OCFS2 versions on node 0:

  • ocfs2-tools-1.4.4-1.el5
  • ocfs2-2.6.18-274.12.1.el5-1.4.7-1.el5

OCFS2 versions on node 1:

  • ocfs2-tools-1.4.4-1.el5
  • ocfs2-2.6.18-308.el5-1.4.7-1.el5

Update 1 - Sun Dec 23 18:15:07 ICT 2012

Are both nodes on the same LAN segment? No routers or the like in between?

No, they are 2 VMware servers on different subnets.

Oh, while I remember: are hostnames/DNS all set up and working correctly?

Sure, I've added each node's hostname and IP address to /etc/hosts:

192.168.2.93    SVR022-293.localdomain
192.168.3.145   SVR233NTC-3145.localdomain

and they can connect to each other by hostname:

# nc -z SVR022-293.localdomain 7777
Connection to SVR022-293.localdomain 7777 port [tcp/cbt] succeeded!

# nc -z SVR233NTC-3145.localdomain 7777
Connection to SVR233NTC-3145.localdomain 7777 port [tcp/cbt] succeeded!

Update 2 - Mon Dec 24 18:32:15 ICT 2012

Found a clue: my colleague manually edited the /etc/ocfs2/cluster.conf file while the cluster was running. So the dead node's information is still kept in /sys/kernel/config/cluster/:

# ls -l /sys/kernel/config/cluster/cpc/node/
total 0
drwxr-xr-x 2 root root 0 Dec 24 18:21 SVR150-4107.localdomain
drwxr-xr-x 2 root root 0 Dec 24 18:21 SVR233NTC-3145.localdomain

(The dead node is SVR150-4107.localdomain in this case.)
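To compare the running cluster's view against cluster.conf, the per-node attributes in configfs can be dumped directly. A small sketch, assuming the standard o2nm attribute files num, ipv4_address and ipv4_port:

# Print what the live cluster believes about each node (cluster "cpc");
# stale entries stand out when compared with /etc/ocfs2/cluster.conf.
for n in /sys/kernel/config/cluster/cpc/node/*; do
    echo "== ${n##*/}"
    cat "$n/num" "$n/ipv4_address" "$n/ipv4_port"
done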

I went to stop the cluster in order to remove the dead node, but got the following error:

# /etc/init.d/o2cb stop
Stopping O2CB cluster cpc: Failed
Unable to stop cluster as heartbeat region still active

I'm sure the ocfs2 service has already been stopped:

# mounted.ocfs2 -f
Device                FS     Nodes
/dev/sdb              ocfs2  Not mounted
/dev/drbd1            ocfs2  Not mounted

and there are no references left:

# ocfs2_hb_ctl -I -u 12963EAF4E16484DB81ECB0251177C26
12963EAF4E16484DB81ECB0251177C26: 0 refs

I also unloaded the ocfs2 kernel module just to be sure:

# ps -ef | grep [o]cfs2
root     12513    43  0 18:25 ?        00:00:00 [ocfs2_wq]

# modprobe -r ocfs2
# ps -ef | grep [o]cfs2
# lsof | grep ocfs2

But nothing changed:

# /etc/init.d/o2cb offline
Stopping O2CB cluster cpc: Failed
Unable to stop cluster as heartbeat region still active

So the last question is: how can I remove the dead node's information without rebooting?
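In principle, since o2cb registers each node by creating one of these configfs directories, a stale entry can be removed again with a plain rmdir once nothing references it; a sketch, using the dead node's name from the listing above:

# rmdir /sys/kernel/config/cluster/cpc/node/SVR150-4107.localdomain

(configfs will refuse the rmdir while the node is still referenced.)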


Update 3 - Mon Dec 24 22:41:51 ICT 2012

Here are all the currently running heartbeat regions:

# ls -l /sys/kernel/config/cluster/cpc/heartbeat/ | grep '^d'
drwxr-xr-x 2 root root    0 Dec 24 22:18 72EF09EA3D0D4F51BDC00B47432B1EB2

The reference count of this heartbeat region:

# ocfs2_hb_ctl -I -u 72EF09EA3D0D4F51BDC00B47432B1EB2
72EF09EA3D0D4F51BDC00B47432B1EB2: 7 refs

Trying to kill it:

# ocfs2_hb_ctl -K -u 72EF09EA3D0D4F51BDC00B47432B1EB2
ocfs2_hb_ctl: File not found by ocfs2_lookup while stopping heartbeat

Any ideas?

Oh yeah! Problem solved.

Pay attention to the UUIDs:

# mounted.ocfs2 -d
Device                FS     Stack  UUID                              Label
/dev/sdb              ocfs2  o2cb   12963EAF4E16484DB81ECB0251177C26  ocfs2_drbd1
/dev/drbd1            ocfs2  o2cb   12963EAF4E16484DB81ECB0251177C26  ocfs2_drbd1

but:

# ls -l /sys/kernel/config/cluster/cpc/heartbeat/
drwxr-xr-x 2 root root    0 Dec 24 22:53 72EF09EA3D0D4F51BDC00B47432B1EB2

This is probably because I "accidentally" force-reformatted the OCFS2 volume. The problem I was facing is similar to one reported on the Ocfs2-users mailing list.

It is also the cause of the following error:

ocfs2_hb_ctl: File not found by ocfs2_lookup while stopping heartbeat

because ocfs2_hb_ctl can no longer find any device with UUID 72EF09EA3D0D4F51BDC00B47432B1EB2 in /proc/partitions (the reformat gave the volume a new UUID).

Then an idea occurred to me: can I change the UUID of an OCFS2 volume?

Looking at the tunefs.ocfs2 man page:

Usage: tunefs.ocfs2 [options] <device> [new-size]
      tunefs.ocfs2 -h|--help
      tunefs.ocfs2 -V|--version
[options] can be any mix of:
       -U|--uuid-reset[=new-uuid]

So I ran the following command, setting the UUID back to the one the stale heartbeat region expects:

# tunefs.ocfs2 --uuid-reset=72EF09EA3D0D4F51BDC00B47432B1EB2 /dev/drbd1
WARNING!!! OCFS2 uses the UUID to uniquely identify a file system. 
Having two OCFS2 file systems with the same UUID could, in the least, 
cause erratic behavior, and if unlucky, cause file system damage. 
Please choose the UUID with care.
Update the UUID ?yes

Verify:

# tunefs.ocfs2 -Q "%U\n" /dev/drbd1 
72EF09EA3D0D4F51BDC00B47432B1EB2

Try to kill the heartbeat region again and see what happens:

# ocfs2_hb_ctl -K -u 72EF09EA3D0D4F51BDC00B47432B1EB2
# ocfs2_hb_ctl -I -u 72EF09EA3D0D4F51BDC00B47432B1EB2
72EF09EA3D0D4F51BDC00B47432B1EB2: 6 refs

Keep killing it until I see 0 refs.
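The repeated kill can be scripted as a small loop, for instance (a sketch; the UUID is the one reset above):

# Keep calling ocfs2_hb_ctl -K until the region reports 0 refs.
UUID=72EF09EA3D0D4F51BDC00B47432B1EB2
while ! ocfs2_hb_ctl -I -u "$UUID" | grep -q ': 0 refs'; do
    ocfs2_hb_ctl -K -u "$UUID"
done

Once the count reaches zero, take the cluster offline: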

# /etc/init.d/o2cb offline cpc
Stopping O2CB cluster cpc: OK

and stop it:

# /etc/init.d/o2cb stop
Stopping O2CB cluster cpc: OK
Unloading module "ocfs2": OK
Unmounting ocfs2_dlmfs filesystem: OK
Unloading module "ocfs2_dlmfs": OK
Unmounting configfs filesystem: OK
Unloading module "configfs": OK

Restart it to see whether the node information has been updated:

# /etc/init.d/o2cb start
Loading filesystem "configfs": OK
Mounting configfs filesystem at /sys/kernel/config: OK
Loading filesystem "ocfs2_dlmfs": OK
Mounting ocfs2_dlmfs filesystem at /dlm: OK
Starting O2CB cluster cpc: OK

# ls -l /sys/kernel/config/cluster/cpc/node/
total 0
drwxr-xr-x 2 root root 0 Dec 26 19:02 SVR022-293.localdomain
drwxr-xr-x 2 root root 0 Dec 26 19:02 SVR233NTC-3145.localdomain

OK, on the peer node (192.168.2.93), try to start OCFS2:

# /etc/init.d/ocfs2 start
Starting Oracle Cluster File System (OCFS2)                [  OK  ]
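With the cluster membership now correct on both nodes, the mount that failed at the top of this post can be retried (same command as before):

# mount -t ocfs2 /dev/drbd1 /data/webroot/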

Thanks to Sunil Mushran, because his post helped me solve this problem.

The lessons learned are:

  1. IP addresses, ports, … can only be changed while the cluster is offline. See the FAQ (and the sketch after this list).
  2. Never force-reformat an OCFS2 volume.
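For lesson 1, a minimal sketch of the safe procedure, run on every node (assuming the same cluster name cpc and init scripts used throughout this post):

# Take the cluster offline before touching the configuration.
/etc/init.d/o2cb offline cpc
# Edit /etc/ocfs2/cluster.conf, keeping it identical on all nodes.
vi /etc/ocfs2/cluster.conf
# Bring the cluster back online.
/etc/init.d/o2cb online cpc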

Source: https://serverfault.com/questions/459957