mdadm reports a disk failure, but SMART finds no problems

July 26, 2014

I have a 9-disk RAID 5 array.

Today I received a mail from my server:

This is an automatically generated mail message from mdadm
running on Eldorado

A Fail event had been detected on md device /dev/md0.

It could be related to component device /dev/sdi1.

Faithfully yours, etc.

P.S. The /proc/mdstat file currently contains the following:

Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] 
md0 : active raid5 sdb1[1] sdi1[9](F) sdd1[5] sdh1[3] sdj1[7] sde1[4] sdg1[6] sdf1[0] sdc1[2]
 7801484288 blocks level 5, 64k chunk, algorithm 2 [9/8] [UUUUUUUU_]

unused devices: <none>

This looks like a problem with /dev/sdi.
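
The failed member can also be picked out of /proc/mdstat mechanically, because md tags faulty components with `(F)`. A minimal sketch, assuming POSIX grep and sed with the common `-o` extension; the function name `failed_members` is made up for illustration:

```shell
# Print the component devices that md has flagged as faulty, one per line.
# Reads mdstat-formatted text on stdin, so it also works on saved output.
failed_members() {
    # Match entries such as "sdi1[9](F)", then strip everything from "[" on.
    grep -o '[[:alnum:]]*\[[0-9]*\](F)' | sed 's/\[.*//'
}
```

Running `failed_members < /proc/mdstat` against the status above should print `sdi1`.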

However, I ran

smartctl -t long -d 3ware,7 /dev/twa0

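(the drive is on a 3ware controller, and I had already run the short and conveyance tests beforehand). In any case, smartctl did not report any serious problems: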

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
 1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       7
 3 Spin_Up_Time            0x0027   228   109   021    Pre-fail  Always       -       1591
 4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       609
 5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
 7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
 9 Power_On_Hours          0x0032   079   079   000    Old_age   Always       -       15445
10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       607
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       606
193 Load_Cycle_Count        0x0032   134   134   000    Old_age   Always       -       199738
194 Temperature_Celsius     0x0022   113   106   000    Old_age   Always       -       34
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Conveyance offline  Completed without error       00%     15434         -
# 2  Short offline       Completed without error       00%     15434         -

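So for now I am not sure what caused the failure, or whether I can re-add the drive or need to replace it.

I am on Ubuntu 12.04 Server with mdadm v3.2.5.

Any clues?

I am aware of the thread "Ubuntu 12.04 Server Software RAID1 - Faulty Spare - Smart Output Passed - Confused", which seems to mirror this problem. However, that post has not been answered yet.

Best regards, Stefan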

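Assuming you are using consumer-grade drives, the most likely cause is that the drive took too long to respond to a request and the controller card assumed the drive had failed.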
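Consumer-grade drive firmware spends much longer trying to recover data from hard-to-read sectors than server-grade firmware does. That makes these drives more reliable in single-disk operation, but in a RAID array they can get marked as "failed" when there is actually nothing wrong with the drive.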
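
One way to see how a drive is configured in this respect is to query its SCT Error Recovery Control setting with smartctl's `-l scterc` log option. This is only a sketch: the wrapper name `show_erc` is made up, the port and device defaults are copied from the smartctl call in the question, and whether the query passes through a 3ware controller (or is supported by the drive at all) is an assumption.

```shell
# Query the SCT Error Recovery Control (ERC) setting of a drive behind a
# 3ware controller. Arguments: port number, controller device.
# Defaults (port 7, /dev/twa0) are taken from the question.
show_erc() {
    port=${1:-7}
    ctrl=${2:-/dev/twa0}
    smartctl -l scterc -d "3ware,${port}" "$ctrl"
}
```

Calling `show_erc` with no arguments checks the drive from the question. If the drive reports no read/write recovery limit, it may retry a bad sector for a long time, which fits the explanation above.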
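Most likely there is nothing wrong with your drive. If you are feeling paranoid, you can run a surface scan with badblocks (read-only or read-write), but I would just put it back into the array.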
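
Roughly, the scan and the re-add could look like the sketch below. The array and partition names come from the mail in the question. The helper names and the `DRY_RUN` switch are made up for illustration, and plain `--add` (a full rebuild) is used on the assumption that the array has no write-intent bitmap.

```shell
# With DRY_RUN=1 the commands are only printed, not executed.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Step 1: drop the faulty entry from the array (arguments: array, member).
remove_member() { run mdadm "$1" --remove "$2"; }

# Step 2: surface scan; badblocks only reads unless given -n or -w.
scan_member() { run badblocks -sv "$1"; }

# Step 3: add the partition back; md then rebuilds onto it.
add_member() { run mdadm "$1" --add "$2"; }
```

As root this would be `remove_member /dev/md0 /dev/sdi1`, then `scan_member /dev/sdi1`, and, if the scan output looks clean, `add_member /dev/md0 /dev/sdi1`. The rebuild progress shows up in /proc/mdstat.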

Quoted from: https://serverfault.com/questions/615728
