e2fsck takes a very long time to run

June 13, 2020

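I am running e2fsck on one of my disk partitions (ext4), but it seems to be taking forever. It has been running for nearly 10 hours now and is still at 42%. The partition is about 800 GB, and the disk it resides on is about 1 TB.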

Running iostat shows the following output:

iostat -xzhcd  /dev/sdc 2 5
Linux 3.13.0-37-generic (divick-desktop)    Monday 03 April 2017    _x86_64_    (2 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          2.97    0.00    0.41   50.22    0.00   46.40

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdc
                49.12     0.00    6.87    0.00   223.95     0.02    65.20     1.01  147.22  145.40 4611.03 143.47  98.57

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          4.25    0.00    9.63   71.67    0.00   14.45

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdc
                 0.00     0.00    1.50    0.00     6.00     0.00     8.00     1.00  592.00  592.00    0.00 665.33  99.80

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          2.71    0.00    6.63   59.34    0.00   31.33

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdc
                 0.00     0.00    1.50    0.00     6.00     0.00     8.00     1.00  592.00  592.00    0.00 666.67 100.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          3.76    0.00    9.25   56.94    0.00   30.06

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdc
                 0.00     0.00    3.50    0.00    14.00     0.00     8.00     1.00  508.00  508.00    0.00 285.71 100.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          3.39    0.00    7.63   73.73    0.00   15.25

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdc
                 0.00     0.00    1.50    0.00     6.00     0.00     8.00     1.00  593.33  593.33    0.00 666.67 100.00

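Why is the r_await time so high (500 to 600 ms, i.e. roughly 0.5 s, since iostat reports it in milliseconds)? Is this a sign of a failing disk, or is something else causing it?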

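Interpreting the results of running SMART tests on the disk is a little confusing. I see the following line in the SMART test output:

SMART overall-health self-assessment test result: PASSED

But looking at the detailed output I see: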

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
 1 Raw_Read_Error_Rate     0x002f   192   192   051    Pre-fail  Always       -       13824
 3 Spin_Up_Time            0x0027   119   111   021    Pre-fail  Always       -       7008
 4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       515
 5 Reallocated_Sector_Ct   0x0033   165   165   140    Pre-fail  Always       -       671
 7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
 9 Power_On_Hours          0x0032   086   086   000    Old_age   Always       -       10561
10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       511
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       182
193 Load_Cycle_Count        0x0032   128   128   000    Old_age   Always       -       218580
194 Temperature_Celsius     0x0022   101   080   000    Old_age   Always       -       46
196 Reallocated_Event_Count 0x0032   018   018   000    Old_age   Always       -       182
197 Current_Pending_Sector  0x0032   198   197   000    Old_age   Always       -       480
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       35
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       210

It is not clear to me whether the disk is actually failing.

The SMART output you listed seems to indicate that the drive is about to fail. In particular:
197 Current_Pending_Sector  0x0032   198   197   000    Old_age   Always       -       480
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       35
When the RAW_VALUE of either or both of these attributes is non-zero, I recommend replacing the drive immediately.
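The rule of thumb above can be sketched as a small check over the attribute table. This is a minimal illustration in Python, not part of smartmontools: the function name `failing_attributes` and the set of watched IDs (197 and 198, the two named above) are assumptions made for the example, and it expects text in the column layout shown in the question.

```python
# Attribute IDs named in the answer above; the set itself is an assumption
# of this sketch, not a smartmontools default.
WATCHED_IDS = {197, 198}


def failing_attributes(table_text, watched=WATCHED_IDS):
    """Return {attribute_name: raw_value} for watched attributes whose raw value is non-zero."""
    flagged = {}
    for line in table_text.splitlines():
        fields = line.split()
        # Data rows have at least 10 columns and start with a numeric ID;
        # this skips the header row and any blank lines.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attr_id, name, raw = int(fields[0]), fields[1], fields[9]
        if attr_id in watched and raw.isdigit() and int(raw) > 0:
            flagged[name] = int(raw)
    return flagged


# Rows copied from the SMART output in the question.
sample = """\
197 Current_Pending_Sector  0x0032   198   197   000    Old_age   Always       -       480
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       35
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
"""

print(failing_attributes(sample))
```

For the rows in the question this flags both attributes (480 pending sectors and 35 offline-uncorrectable sectors), while the zero-valued UDMA_CRC_Error_Count row is ignored.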

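The Raw_Read_Error_Rate of 13824 in the SMART output suggests that read requests to the drive are failing, which would explain the high r_await and iowait in the iostat output. The drive is probably taking a long time to process each read request and then failing or aborting it after a timeout. I would also check the dmesg output for driver or device errors to confirm this.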
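To put the iostat figures in perspective, here is a small Python sketch of the arithmetic. iostat reports `r_await` in milliseconds; the values below are copied from the last four samples in the question (the first sample is the since-boot average and is left out), and the helper name `summarize` is made up for the example.

```python
# (r/s, rkB/s, r_await in ms) from the last four iostat samples in the question.
SAMPLES = [
    (1.50, 6.00, 592.00),
    (1.50, 6.00, 592.00),
    (3.50, 14.00, 508.00),
    (1.50, 6.00, 593.33),
]


def summarize(samples):
    """Average the samples and convert r_await from milliseconds to seconds."""
    n = len(samples)
    avg_reads = sum(s[0] for s in samples) / n
    avg_kb = sum(s[1] for s in samples) / n
    avg_await_s = sum(s[2] for s in samples) / n / 1000.0
    return avg_reads, avg_kb, avg_await_s


reads, kb, await_s = summarize(SAMPLES)
print(f"{reads:.2f} reads/s, {kb:.2f} kB/s, {await_s:.3f} s average read latency")
```

That works out to about 2 reads per second at roughly 0.57 s each, with the device close to 100% utilized. A healthy spinning disk usually serves a random read in something on the order of ten milliseconds, so these numbers are consistent with the drive struggling on each read rather than with e2fsck simply having a lot of work to do.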

Source: https://serverfault.com/questions/842131
