zpool stuck in a resilver loop
I have the following zpool:
    NAME                        STATE     READ WRITE CKSUM
    zfspool                     ONLINE       0     0     0
      mirror-0                  ONLINE       0     0     0
        wwn-0x5000cca266f3d8ee  ONLINE       0     0     0
        wwn-0x5000cca266f1ae00  ONLINE       0     0     0
This morning the host experienced an incident (still digging into what happened; load was very high and a lot of things weren't working, but I could still get into the machine).
On reboot, the host hung during startup waiting for services that depend on data in the pool above.
Suspecting a problem with the pool, I pulled one of the drives and rebooted again. This time the host came up.
A scrub showed that all the data on the remaining disk was fine. When it finished, I re-inserted the removed drive. The drive started to resilver, but got about 4% in and then started over from the beginning.
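For reference, the scrub and re-add steps were nothing exotic; roughly the following, with the device name here being illustrative since it doesn't matter which of the two was pulled:

    # start a scrub on the surviving disk and watch its progress
    zpool scrub zfspool
    zpool status zfspool

    # after physically re-inserting the pulled drive, bring it back online
    zpool online zfspool wwn-0x5000cca266f1ae00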
smartctl shows no problems with either drive (no logged errors, WHEN_FAILED empty).
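That was just the standard per-drive SMART query, checking the error log and the WHEN_FAILED column of the attribute table:

    smartctl -a /dev/disk/by-id/wwn-0x5000cca266f3d8ee
    smartctl -a /dev/disk/by-id/wwn-0x5000cca266f1ae00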
However, I can't tell which disk is actually being resilvered, and in fact it looks like the pool is fine and doesn't need a resilver at all.
errors: No known data errors

root@host1:/var/log# zpool status
  pool: zfspool
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sun Dec 8 12:20:53 2019
        46.7G scanned at 15.6G/s, 45.8G issued at 15.3G/s, 5.11T total
        0B resilvered, 0.87% done, 0 days 00:05:40 to go
config:

        NAME                        STATE     READ WRITE CKSUM
        zfspool                     ONLINE       0     0     0
          mirror-0                  ONLINE       0     0     0
            wwn-0x5000cca266f3d8ee  ONLINE       0     0     0
            wwn-0x5000cca266f1ae00  ONLINE       0     0     0

errors: No known data errors
What's the best way out of this resilver loop? Other answers suggest detaching the drive that is resilvering, but as I said, it doesn't look like either one is.
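For context, the detach those answers describe would be something like the following (device name illustrative), which I didn't want to run blind without knowing which side was actually resilvering:

    zpool detach zfspool wwn-0x5000cca266f1ae00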
Edit:
zpool events shows roughly 1000 repetitions of the following:
Dec 8 2019 13:22:12.493980068 sysevent.fs.zfs.resilver_start
        version = 0x0
        class = "sysevent.fs.zfs.resilver_start"
        pool = "zfspool"
        pool_guid = 0x990e3eff72d0c352
        pool_state = 0x0
        pool_context = 0x0
        time = 0x5ded4d64 0x1d7189a4
        eid = 0xf89

Dec 8 2019 13:22:12.493980068 sysevent.fs.zfs.history_event
        version = 0x0
        class = "sysevent.fs.zfs.history_event"
        pool = "zfspool"
        pool_guid = 0x990e3eff72d0c352
        pool_state = 0x0
        pool_context = 0x0
        history_hostname = "host1"
        history_internal_str = "func=2 mintxg=7381953 maxtxg=9049388"
        history_internal_name = "scan setup"
        history_txg = 0x8a192e
        history_time = 0x5ded4d64
        time = 0x5ded4d64 0x1d7189a4
        eid = 0xf8a

Dec 8 2019 13:22:17.485979213 sysevent.fs.zfs.history_event
        version = 0x0
        class = "sysevent.fs.zfs.history_event"
        pool = "zfspool"
        pool_guid = 0x990e3eff72d0c352
        pool_state = 0x0
        pool_context = 0x0
        history_hostname = "host1"
        history_internal_str = "errors=0"
        history_internal_name = "scan aborted, restarting"
        history_txg = 0x8a192f
        history_time = 0x5ded4d69
        time = 0x5ded4d69 0x1cf7744d
        eid = 0xf8b

Dec 8 2019 13:22:17.733979170 sysevent.fs.zfs.history_event
        version = 0x0
        class = "sysevent.fs.zfs.history_event"
        pool = "zfspool"
        pool_guid = 0x990e3eff72d0c352
        pool_state = 0x0
        pool_context = 0x0
        history_hostname = "host1"
        history_internal_str = "errors=0"
        history_internal_name = "starting deferred resilver"
        history_txg = 0x8a192f
        history_time = 0x5ded4d69
        time = 0x5ded4d69 0x2bbfa222
        eid = 0xf8c

Dec 8 2019 13:22:17.733979170 sysevent.fs.zfs.resilver_start
        version = 0x0
        class = "sysevent.fs.zfs.resilver_start"
        pool = "zfspool"
        pool_guid = 0x990e3eff72d0c352
        pool_state = 0x0
        pool_context = 0x0
        time = 0x5ded4d69 0x2bbfa222
        eid = 0xf8d

...
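That output comes from the verbose event log; the same stream can also be watched live while the loop is happening:

    zpool events -v zfspool   # verbose dump, as quoted above
    zpool events -f           # follow new events as they arrive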
This is now resolved.
The following issue on GitHub provided the answer:
https://github.com/zfsonlinux/zfs/issues/9551
In this situation, the red flag is probably the rapidly cycling "starting deferred resilver" events, as seen in zpool events -v.
The first suggestion in the link is to disable the zfs-zed service. In my case it was never enabled to begin with.
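For anyone who does have it enabled, that workaround is ordinary service management on a systemd host (zfs-zed being the ZED service name on ZFS on Linux):

    systemctl status zfs-zed          # check whether ZED is running
    systemctl disable --now zfs-zed   # the workaround from the issue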
The second suggestion is to verify that the zpool has the resilver_defer feature enabled. There seems to be a potential issue when a pool is upgraded without enabling the feature corresponding to that upgrade. This pool has moved across multiple machines/OSes over the last 2 years or so, so it makes sense that it was created on an older version of ZFS and is now on a newer version of ZFS on the latest host:
root@host1:/# zpool get all | grep feature
...
zfspool  feature@resilver_defer  disabled  local
...
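The same thing can be queried directly, without the grep; it should show something like:

    root@host1:/# zpool get feature@resilver_defer zfspool
    NAME     PROPERTY                VALUE     SOURCE
    zfspool  feature@resilver_defer  disabled  local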
After seeing this, I enabled the feature. The GitHub link suggests this can be dangerous, so make sure you have backups first.
root@host1:/# zpool set feature@resilver_defer=enabled zfspool
After that, zpool status showed the resilver making it further than it ever had before:
root@host1:/# zpool status
  pool: zfspool
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sun Dec 8 13:53:43 2019
        847G scanned at 2.03G/s, 396G issued at 969M/s, 5.11T total
        0B resilvered, 7.56% done, 0 days 01:25:14 to go
config:

        NAME                        STATE     READ WRITE CKSUM
        zfspool                     ONLINE       0     0     0
          mirror-0                  ONLINE       0     0     0
            wwn-0x5000cca266f3d8ee  ONLINE       0     0     0
            wwn-0x5000cca266f1ae00  ONLINE       0     0     0

errors: No known data errors