zpool stuck in a resilver loop
I have the following zpool:
    NAME                        STATE     READ WRITE CKSUM
    zfspool                     ONLINE       0     0     0
      mirror-0                  ONLINE       0     0     0
        wwn-0x5000cca266f3d8ee  ONLINE       0     0     0
        wwn-0x5000cca266f1ae00  ONLINE       0     0     0
This morning the host experienced an incident (still digging into what happened; load was very high and a lot of things weren't working, but I could still get into the machine).
On reboot, the host hung during startup waiting for services that depend on data in the pool above.
Suspecting a problem with the pool, I pulled one of the drives and rebooted again. This time the host came up.
A scrub showed that all the data on the remaining disk was fine. When it finished, I re-inserted the removed drive. The drive started to resilver, but got about 4% in and then started over from the beginning.
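For reference, the scrub and re-add steps were nothing exotic; roughly the following, with the device name here being illustrative since it doesn't matter which of the two was pulled:

    # start a scrub on the surviving disk and watch its progress
    zpool scrub zfspool
    zpool status zfspool

    # after physically re-inserting the pulled drive, bring it back online
    zpool online zfspool wwn-0x5000cca266f1ae00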
smartctl shows no problems with either drive (no logged errors, WHEN_FAILED empty).
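That was just the standard per-drive SMART query, checking the error log and the WHEN_FAILED column of the attribute table:

    smartctl -a /dev/disk/by-id/wwn-0x5000cca266f3d8ee
    smartctl -a /dev/disk/by-id/wwn-0x5000cca266f1ae00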
However, I can't tell which disk is actually being resilvered, and in fact it looks like the pool is fine and doesn't need a resilver at all.
errors: No known data errors

root@host1:/var/log# zpool status
  pool: zfspool
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sun Dec 8 12:20:53 2019
        46.7G scanned at 15.6G/s, 45.8G issued at 15.3G/s, 5.11T total
        0B resilvered, 0.87% done, 0 days 00:05:40 to go
config:

        NAME                        STATE     READ WRITE CKSUM
        zfspool                     ONLINE       0     0     0
          mirror-0                  ONLINE       0     0     0
            wwn-0x5000cca266f3d8ee  ONLINE       0     0     0
            wwn-0x5000cca266f1ae00  ONLINE       0     0     0

errors: No known data errors
What's the best way out of this resilver loop? Other answers suggest detaching the drive that is resilvering, but as I said, it doesn't look like either one is.
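For context, the detach those answers describe would be something like the following (device name illustrative), which I didn't want to run blind without knowing which side was actually resilvering:

    zpool detach zfspool wwn-0x5000cca266f1ae00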
Edit:
zpool events shows roughly 1000 repetitions of the following:
Dec 8 2019 13:22:12.493980068 sysevent.fs.zfs.resilver_start
        version = 0x0
        class = "sysevent.fs.zfs.resilver_start"
        pool = "zfspool"
        pool_guid = 0x990e3eff72d0c352
        pool_state = 0x0
        pool_context = 0x0
        time = 0x5ded4d64 0x1d7189a4
        eid = 0xf89

Dec 8 2019 13:22:12.493980068 sysevent.fs.zfs.history_event
        version = 0x0
        class = "sysevent.fs.zfs.history_event"
        pool = "zfspool"
        pool_guid = 0x990e3eff72d0c352
        pool_state = 0x0
        pool_context = 0x0
        history_hostname = "host1"
        history_internal_str = "func=2 mintxg=7381953 maxtxg=9049388"
        history_internal_name = "scan setup"
        history_txg = 0x8a192e
        history_time = 0x5ded4d64
        time = 0x5ded4d64 0x1d7189a4
        eid = 0xf8a

Dec 8 2019 13:22:17.485979213 sysevent.fs.zfs.history_event
        version = 0x0
        class = "sysevent.fs.zfs.history_event"
        pool = "zfspool"
        pool_guid = 0x990e3eff72d0c352
        pool_state = 0x0
        pool_context = 0x0
        history_hostname = "host1"
        history_internal_str = "errors=0"
        history_internal_name = "scan aborted, restarting"
        history_txg = 0x8a192f
        history_time = 0x5ded4d69
        time = 0x5ded4d69 0x1cf7744d
        eid = 0xf8b

Dec 8 2019 13:22:17.733979170 sysevent.fs.zfs.history_event
        version = 0x0
        class = "sysevent.fs.zfs.history_event"
        pool = "zfspool"
        pool_guid = 0x990e3eff72d0c352
        pool_state = 0x0
        pool_context = 0x0
        history_hostname = "host1"
        history_internal_str = "errors=0"
        history_internal_name = "starting deferred resilver"
        history_txg = 0x8a192f
        history_time = 0x5ded4d69
        time = 0x5ded4d69 0x2bbfa222
        eid = 0xf8c

Dec 8 2019 13:22:17.733979170 sysevent.fs.zfs.resilver_start
        version = 0x0
        class = "sysevent.fs.zfs.resilver_start"
        pool = "zfspool"
        pool_guid = 0x990e3eff72d0c352
        pool_state = 0x0
        pool_context = 0x0
        time = 0x5ded4d69 0x2bbfa222
        eid = 0xf8d

...
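That output comes from the verbose event log; the same stream can also be watched live while the loop is happening:

    zpool events -v zfspool   # verbose dump, as quoted above
    zpool events -f           # follow new events as they arrive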
This is now resolved.
The following issue on GitHub provided the answer:
https://github.com/zfsonlinux/zfs/issues/9551
In this situation, the red flag is probably the rapidly cycling "starting deferred resilver" events, as seen in zpool events -v.
The first suggestion in the link is to disable the zfs-zed service. In my case it was never enabled to begin with.
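For anyone who does have it enabled, that workaround is ordinary service management on a systemd host (zfs-zed being the ZED service name on ZFS on Linux):

    systemctl status zfs-zed          # check whether ZED is running
    systemctl disable --now zfs-zed   # the workaround from the issue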
The second suggestion is to verify that the zpool has the resilver_defer feature enabled. There seems to be a potential issue when a pool is upgraded without enabling the feature corresponding to that upgrade. This pool has moved across multiple machines/OSes over the last 2 years or so, so it makes sense that it was created on an older version of ZFS and is now on a newer version of ZFS on the latest host:
root@host1:/# zpool get all | grep feature
...
zfspool  feature@resilver_defer  disabled  local
...
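The same thing can be queried directly, without the grep; it should show something like:

    root@host1:/# zpool get feature@resilver_defer zfspool
    NAME     PROPERTY                VALUE     SOURCE
    zfspool  feature@resilver_defer  disabled  local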
After seeing this, I enabled the feature. The GitHub link suggests this can be dangerous, so make sure you have backups first.
root@host1:/# zpool set feature@resilver_defer=enabled zfspool
After that, zpool status showed the resilver making it further than it ever had before:
root@host1:/# zpool status
  pool: zfspool
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sun Dec 8 13:53:43 2019
        847G scanned at 2.03G/s, 396G issued at 969M/s, 5.11T total
        0B resilvered, 7.56% done, 0 days 01:25:14 to go
config:

        NAME                        STATE     READ WRITE CKSUM
        zfspool                     ONLINE       0     0     0
          mirror-0                  ONLINE       0     0     0
            wwn-0x5000cca266f3d8ee  ONLINE       0     0     0
            wwn-0x5000cca266f1ae00  ONLINE       0     0     0

errors: No known data errors