Drbd

Ganeti disks degraded, DRBD cs:NetworkFailure

  • June 3, 2016

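I have an instance on Ganeti with 2 disks, and both disks are degraded (possibly because of a connection problem?). Until this morning, the instance had been working fine for years.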

On my master node:

$ gnt-instance info myinstance
...
  -disk/0
     on primary:   /dev/drbd4 (147:4) in sync, status *DEGRADED*
     on secondary: /dev/drbd4 (147:4) in sync, status *DEGRADED*
     child devices:
       - child 0: lvm, size 20.0G
         logical_id:   kvmvg/299a0bdf-1acb-4bcd-ac43-eb02b0928757.disk0_data
         on primary:   /dev/kvmvg/299a0bdf-1acb-4bcd-ac43-eb02b0928757.disk0_data (254:10)
         on secondary: /dev/kvmvg/299a0bdf-1acb-4bcd-ac43-eb02b0928757.disk0_data (254:8)
       - child 1: lvm, size 128M
         logical_id:   kvmvg/299a0bdf-1acb-4bcd-ac43-eb02b0928757.disk0_meta
         on primary:   /dev/kvmvg/299a0bdf-1acb-4bcd-ac43-eb02b0928757.disk0_meta (254:11)
         on secondary: /dev/kvmvg/299a0bdf-1acb-4bcd-ac43-eb02b0928757.disk0_meta (254:9)

...

On the primary node:

$ cat /proc/drbd
4: cs:NetworkFailure ro:Primary/Unknown ds:UpToDate/DUnknown C r----
   ns:678399926 nr:0 dw:678315292 dr:25942012 al:22230 bm:16189 lo:0 pe:196 ua:0 ap:195 ep:1 wo:b oos:0

On the secondary node:

$ cat /proc/drbd
4: cs:WFConnection ro:Secondary/Unknown ds:UpToDate/DUnknown C r----
   ns:0 nr:678340009 dw:678340009 dr:0 al:0 bm:14884 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0

I cannot reboot or shut down the instance (the operation times out).

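I don't think this is a split-brain problem, because neither side reports "StandAlone": the primary node shows "Primary/Unknown" and the secondary node shows "Secondary/Unknown".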
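The reasoning above can be sketched as a small check over the two `/proc/drbd` outputs. This is only an illustration, assuming the DRBD 8.3-style status line shown in this post; the function names `parse_drbd_status` and `looks_like_split_brain` are hypothetical, and the "StandAlone on either side" rule is a heuristic that mirrors the argument here, not DRBD's own split-brain detection.

```python
import re

# Matches a DRBD 8.3-style device line from /proc/drbd, for example:
# "4: cs:NetworkFailure ro:Primary/Unknown ds:UpToDate/DUnknown C r----"
# The ro: and ds: fields are optional so "4: cs:Unconfigured" also matches.
STATUS_RE = re.compile(
    r"^\s*(?P<minor>\d+):\s+cs:(?P<cs>\S+)"
    r"(?:\s+ro:(?P<ro>\S+))?(?:\s+ds:(?P<ds>\S+))?"
)


def parse_drbd_status(text):
    """Return {minor: {"cs": ..., "ro": ..., "ds": ...}} for each device line."""
    devices = {}
    for line in text.splitlines():
        match = STATUS_RE.match(line)
        if match:
            devices[int(match.group("minor"))] = {
                "cs": match.group("cs"),
                "ro": match.group("ro"),
                "ds": match.group("ds"),
            }
    return devices


def looks_like_split_brain(primary_text, secondary_text, minor):
    """Heuristic (assumption): suspect split brain only if a side is StandAlone."""
    states = [
        parse_drbd_status(text).get(minor, {}).get("cs")
        for text in (primary_text, secondary_text)
    ]
    return "StandAlone" in states


if __name__ == "__main__":
    primary = "4: cs:NetworkFailure ro:Primary/Unknown ds:UpToDate/DUnknown C r----"
    secondary = "4: cs:WFConnection ro:Secondary/Unknown ds:UpToDate/DUnknown C r----"
    print(parse_drbd_status(primary)[4])
    print(looks_like_split_brain(primary, secondary, 4))
```

With the two outputs quoted above, neither side is StandAlone, so the check returns False. The kernel log on each node would still be the more reliable place to confirm whether DRBD itself reported a split brain.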

I tried running "drbdadm connect all" on the secondary node, but it had no effect.

I tried to replace the disks, but that failed:

gnt-instance replace-disks -s myinstance
Thu Jun  2 11:32:00 2016 Replacing disk(s) 0, 1 for myinstancel
Thu Jun  2 11:36:00 2016  - WARNING: Could not prepare block device disk/1 on node primaryNode (is_primary=False, pass=1): Error while assembling disk: drbd5: cannot activate, unknown or unhandled reason
Thu Jun  2 11:38:01 2016  - WARNING: Could not prepare block device disk/0 on node primaryNode (is_primary=True, pass=2): Error while assembling disk: drbd4: cannot activate, unknown or unhandled reason
Thu Jun  2 11:40:02 2016  - WARNING: Could not prepare block device disk/1 on node primaryNode (is_primary=True, pass=2): Error while assembling disk: drbd5: cannot activate, unknown or unhandled reason
Failure: command execution error:
Disk consistency error

Now it looks like this:

$ gnt-instance info myinstance
...
   -disk/0 
     on primary:   /dev/drbd4 (147:4) in sync, status *DEGRADED*
     (no more secondary)
     child devices:
       - child 0: lvm, size 20.0G
         logical_id:   kvmvg/299a0bdf-1acb-4bcd-ac43-eb02b0928757.disk0_data
         on primary:   /dev/kvmvg/299a0bdf-1acb-4bcd-ac43-eb02b0928757.disk0_data (254:10)
         on secondary: /dev/kvmvg/299a0bdf-1acb-4bcd-ac43-eb02b0928757.disk0_data (254:8)
       - child 1: lvm, size 128M
         logical_id:   kvmvg/299a0bdf-1acb-4bcd-ac43-eb02b0928757.disk0_meta
         on primary:   /dev/kvmvg/299a0bdf-1acb-4bcd-ac43-eb02b0928757.disk0_meta (254:11)
         on secondary: /dev/kvmvg/299a0bdf-1acb-4bcd-ac43-eb02b0928757.disk0_meta (254:9)

On the primary node:

$ cat /proc/drbd
4: cs:NetworkFailure ro:Primary/Unknown ds:UpToDate/DUnknown C r----
   ns:678399926 nr:0 dw:678315292 dr:25942012 al:22230 bm:16189 lo:0 pe:196 ua:0 ap:195 ep:1 wo:b oos:0

On the secondary node:

$ cat /proc/drbd
...
4: cs:Unconfigured
5: cs:Unconfigured

Any idea how to fix this?

DRBD version: 8.3.7

Ganeti version: 2.4.5

Operating system: Debian 6.0

After investigating a bit, I found a kvm zombie process on the primary node:

PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                      
17520 root    20   0     0    0    0 Z  613  0.0  13922:24 kvm <defunct> 

I don't know how to get rid of it properly.
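For context, a zombie has already exited and cannot be killed again; it only disappears once its parent (or init, after the parent exits) collects its exit status. A first step is therefore to find which parent is holding it. The sketch below lists zombie processes and their parent PIDs by reading `/proc/<pid>/stat` on Linux. It is a minimal illustration, and the helper names `parse_stat` and `find_zombies` are hypothetical.

```python
import os


def parse_stat(stat_text):
    """Parse the contents of /proc/<pid>/stat into (pid, comm, state, ppid).

    The command name is wrapped in parentheses and may itself contain
    spaces or parentheses, so split around the last ')'.
    """
    head, _, tail = stat_text.rpartition(")")
    pid_str, _, comm = head.partition("(")
    fields = tail.split()
    return int(pid_str), comm, fields[0], int(fields[1])


def find_zombies(proc_root="/proc"):
    """Return a list of (pid, comm, ppid) for processes in state Z."""
    zombies = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as handle:
                pid, comm, state, ppid = parse_stat(handle.read())
        except (OSError, ValueError, IndexError):
            continue  # the process went away or the entry was unreadable
        if state == "Z":
            zombies.append((pid, comm, ppid))
    return zombies


if __name__ == "__main__":
    for pid, comm, ppid in find_zombies():
        print(f"zombie pid={pid} comm={comm} parent={ppid}")
```

Knowing the parent PID at least shows which process would have to reap the zombie. Whether that would have released the stuck DRBD device in this case is not something the output above can confirm.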

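I tried to migrate all primary instances off that node (I only have 2), but that failed with DRBD-related errors. I then rebooted the node. The shutdown hung because DRBD was stuck, with this message: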

No response from the DRBD driver! Is the module loaded?

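So I pressed the power button to turn the machine off. The machine came back up without any errors, and a few minutes later the Ganeti instances started automatically.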

On the primary node, I ran:

$ gnt-instance info myinstance
...
    on primary:   /dev/drbd4 (147:4) *RECOVERING* 12.80%, ETA 288s, status *DEGRADED*
    on secondary: /dev/drbd4 (147:4) *RECOVERING* 12.80%, ETA 275s, status *DEGRADED* *UNCERTAIN STATE*
....

After a few minutes the recovery completed, and the disks are now in sync.

Conclusion: everything works now, but I wish I had not needed to reboot the node.

Thanks to gf_ for the help.

Source: https://serverfault.com/questions/780428