Debian

zpool stuck in a resilver loop

  • December 8, 2019

I have the following zpool:

   NAME                        STATE     READ WRITE CKSUM
   zfspool                     ONLINE       0     0     0
     mirror-0                  ONLINE       0     0     0
       wwn-0x5000cca266f3d8ee  ONLINE       0     0     0
       wwn-0x5000cca266f1ae00  ONLINE       0     0     0

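This morning the host had an incident, which I am still investigating. The load was very high and many things stopped working, but I could still log in.

On reboot, the host hung during boot, waiting for services that depend on data in the pool above.

Suspecting a problem with the pool, I pulled one of the drives and rebooted again. This time the host came up.

A scrub showed that all the data on the remaining disk was fine. Once it finished, I reinserted the drive I had removed. The drive started resilvering, but it got to about 4% and then started over.

smartctl shows no problems with either drive (no errors logged, WHEN_FAILED is empty).

However, I cannot tell which disk is being resilvered. In fact the pool looks healthy, as if it did not need a resilver at all.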

errors: No known data errors
root@host1:/var/log# zpool status
 pool: zfspool
state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
       continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scan: resilver in progress since Sun Dec  8 12:20:53 2019
       46.7G scanned at 15.6G/s, 45.8G issued at 15.3G/s, 5.11T total
       0B resilvered, 0.87% done, 0 days 00:05:40 to go
config:

       NAME                        STATE     READ WRITE CKSUM
       zfspool                     ONLINE       0     0     0
         mirror-0                  ONLINE       0     0     0
           wwn-0x5000cca266f3d8ee  ONLINE       0     0     0
           wwn-0x5000cca266f1ae00  ONLINE       0     0     0

errors: No known data errors
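A restart can also be spotted from zpool status alone, because the "since" timestamp on the scan line moves forward each time the resilver begins again. The following is a minimal sketch, assuming a POSIX shell and the status layout shown above; resilver_since is a hypothetical helper name, not a ZFS command.

```shell
# Hypothetical helper: print the resilver start time from `zpool status`
# text supplied on stdin. Prints nothing if no resilver is in progress.
resilver_since() {
    sed -n 's/^[[:space:]]*scan: resilver in progress since //p'
}

# Usage on the affected host (not run here):
#   zpool status zfspool | resilver_since
# Running it twice a minute apart and getting two different timestamps
# suggests the resilver restarted in between.
```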

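What is the best way out of this resilver loop? Other answers suggest detaching the drive that is being resilvered, but as I said, neither drive appears to be resilvering.

Edit:

zpool events shows roughly 1000 repetitions of the following: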

Dec  8 2019 13:22:12.493980068 sysevent.fs.zfs.resilver_start
       version = 0x0
       class = "sysevent.fs.zfs.resilver_start"
       pool = "zfspool"
       pool_guid = 0x990e3eff72d0c352
       pool_state = 0x0
       pool_context = 0x0
       time = 0x5ded4d64 0x1d7189a4
       eid = 0xf89

Dec  8 2019 13:22:12.493980068 sysevent.fs.zfs.history_event
       version = 0x0
       class = "sysevent.fs.zfs.history_event"
       pool = "zfspool"
       pool_guid = 0x990e3eff72d0c352
       pool_state = 0x0
       pool_context = 0x0
       history_hostname = "host1"
       history_internal_str = "func=2 mintxg=7381953 maxtxg=9049388"
       history_internal_name = "scan setup"
       history_txg = 0x8a192e
       history_time = 0x5ded4d64
       time = 0x5ded4d64 0x1d7189a4
       eid = 0xf8a

Dec  8 2019 13:22:17.485979213 sysevent.fs.zfs.history_event
       version = 0x0
       class = "sysevent.fs.zfs.history_event"
       pool = "zfspool"
       pool_guid = 0x990e3eff72d0c352
       pool_state = 0x0
       pool_context = 0x0
       history_hostname = "host1"
       history_internal_str = "errors=0"
       history_internal_name = "scan aborted, restarting"
       history_txg = 0x8a192f
       history_time = 0x5ded4d69
       time = 0x5ded4d69 0x1cf7744d
       eid = 0xf8b

Dec  8 2019 13:22:17.733979170 sysevent.fs.zfs.history_event
       version = 0x0
       class = "sysevent.fs.zfs.history_event"
       pool = "zfspool"
       pool_guid = 0x990e3eff72d0c352
       pool_state = 0x0
       pool_context = 0x0
       history_hostname = "host1"
       history_internal_str = "errors=0"
       history_internal_name = "starting deferred resilver"
       history_txg = 0x8a192f
       history_time = 0x5ded4d69
       time = 0x5ded4d69 0x2bbfa222
       eid = 0xf8c

Dec  8 2019 13:22:17.733979170 sysevent.fs.zfs.resilver_start
       version = 0x0
       class = "sysevent.fs.zfs.resilver_start"
       pool = "zfspool"
       pool_guid = 0x990e3eff72d0c352
       pool_state = 0x0
       pool_context = 0x0
       time = 0x5ded4d69 0x2bbfa222
       eid = 0xf8d

...
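The hex time fields in these events are epoch seconds followed by nanoseconds, so the gap between a "scan setup" and the next "scan aborted, restarting" can be computed directly. The following is a minimal sketch, assuming a POSIX shell; event_epoch and event_gap are hypothetical helper names, not part of ZFS.

```shell
# Hypothetical helper: convert a hex seconds field (e.g. 0x5ded4d64)
# from `zpool events -v` into decimal epoch seconds.
event_epoch() {
    printf '%d\n' "$1"
}

# Hypothetical helper: seconds elapsed between two hex seconds fields,
# given as earlier then later.
event_gap() {
    echo $(( $2 - $1 ))
}

# Example with values from the events above:
#   event_gap 0x5ded4d64 0x5ded4d69
# compares the "scan setup" and "scan aborted, restarting" events.
```

With the values above the gap is five seconds, which matches the printed timestamps 13:22:12 and 13:22:17.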

This is now resolved.

The following GitHub issue provided the answer:

https://github.com/zfsonlinux/zfs/issues/9551

In this case, the red flag is probably the rapidly repeating "starting deferred resilver" events shown by zpool events -v.
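To put a number on how often this is happening, the matching events can be counted. This is a minimal sketch, assuming a POSIX shell and the event layout shown in the question; count_deferred_resilvers is a hypothetical helper name.

```shell
# Hypothetical helper: count "starting deferred resilver" history events
# in `zpool events -v` output supplied on stdin.
count_deferred_resilvers() {
    grep -c 'history_internal_name = "starting deferred resilver"'
}

# Usage on the affected host (not run here):
#   zpool events -v | count_deferred_resilvers
```

A count that keeps growing every few seconds points to the same loop described here.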

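The first suggestion in the linked issue is to disable the zfs-zed service. In my case, it was not enabled to begin with.

The second suggestion is to check whether the pool has the resilver_defer feature enabled. There appears to be a potential problem when a pool ends up on a newer ZFS release without the pool features that come with that release being enabled. Over the past 2 years or so this pool has moved between several machines and operating systems, so it makes sense that it was created under an older version of ZFS and is now running under a newer one on the latest host: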

root@host1:/# zpool get all | grep feature
...
zfspool  feature@resilver_defer         disabled                       local
...
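The same check can be scripted instead of read by eye. This is a minimal sketch, assuming a POSIX shell with awk and the column layout of the zpool get output above; pools_missing_resilver_defer is a hypothetical helper name.

```shell
# Hypothetical helper: read `zpool get all` output on stdin and print the
# name of every pool whose feature@resilver_defer value is "disabled".
pools_missing_resilver_defer() {
    awk '$2 == "feature@resilver_defer" && $3 == "disabled" { print $1 }'
}

# Usage on the affected host (not run here):
#   zpool get all | pools_missing_resilver_defer
```

The helper only reads text and changes nothing on the pool.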

After seeing this, I enabled the feature. The GitHub issue seems to indicate that this is risky, so make sure you have backups.

root@host1:/# zpool set feature@resilver_defer=enabled zfspool

After that, zpool status showed the resilver getting further than it had before:

root@host1:/# zpool status
 pool: zfspool
state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
       continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scan: resilver in progress since Sun Dec  8 13:53:43 2019
       847G scanned at 2.03G/s, 396G issued at 969M/s, 5.11T total
       0B resilvered, 7.56% done, 0 days 01:25:14 to go
config:

       NAME                        STATE     READ WRITE CKSUM
       zfspool                     ONLINE       0     0     0
         mirror-0                  ONLINE       0     0     0
           wwn-0x5000cca266f3d8ee  ONLINE       0     0     0
           wwn-0x5000cca266f1ae00  ONLINE       0     0     0

errors: No known data errors

Source: https://serverfault.com/questions/994806