Linux
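SMART short offline test never finishes on any drive of a RAID1
I have a strange situation here. After about three years of use in a low-traffic server, one of the two Samsung drives in a RAID1 failed yesterday: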
Personalities : [raid1]
md0 : active raid1 sdb1[2](F) sda1[0]
      732572608 blocks [2/1] [U_]
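For anyone who wants to watch for this state automatically, here is a minimal sketch of reading the two markers in that output: the `(F)` suffix on a member and the underscore in the `[U_]` slot map. The function name `degraded_arrays` is made up for illustration, and the parser only handles the simple layout shown above.

```python
import re

def degraded_arrays(mdstat_text):
    """Return {array: [faulty members]} for arrays with a faulty or missing member."""
    result = {}
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\d+)\s*:\s*(.*)$", line)
        if header:
            current = header.group(1)
            # Members look like "sdb1[2](F)"; the (F) suffix marks a faulty device.
            failed = re.findall(r"(\w+)\[\d+\]\(F\)", header.group(2))
            if failed:
                result[current] = failed
            continue
        # Status line looks like "[2/1] [U_]"; an underscore is a slot that is not up.
        status = re.search(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]", line)
        if current and status and "_" in status.group(3):
            result.setdefault(current, [])
    return result

# The mdstat output from the question.
sample = """Personalities : [raid1]
md0 : active raid1 sdb1[2](F) sda1[0]
      732572608 blocks [2/1] [U_]
"""
print(degraded_arrays(sample))  # {'md0': ['sdb1']}
```

On a live system the text would come from reading `/proc/mdstat` instead of the `sample` string.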
Since smartd had not reported anything, I checked the SMART attributes. Compared with sda, the only suspicious reading on sdb (the failed drive) is:
sda: 191 G-Sense_Error_Rate      0x0022   252   252   000    Old_age   Always       -       0
sdb: 191 G-Sense_Error_Rate      0x0022   098   098   000    Old_age   Always       -       27650
G-sense errors in a server rack? Maybe the sensor is broken?
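The comparison between the two drives can also be done mechanically. This is only a sketch, assuming the usual ten-column attribute rows that `smartctl -A` prints. The helper names `parse_attributes` and `zero_vs_nonzero` are invented here, and the check is deliberately crude: it flags attributes whose raw value is zero on one drive and non-zero on the other.

```python
def parse_attributes(text):
    """Map attribute ID -> (name, raw value) from 'smartctl -A' style rows."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # A data row has at least 10 columns: a numeric ID, a name, a 0x flag, ...
        if len(fields) >= 10 and fields[0].isdigit() and fields[2].startswith("0x"):
            # fields[9] is the first token of RAW_VALUE, e.g. "40" in "40 (Min/Max 20/48)".
            attrs[int(fields[0])] = (fields[1], int(fields[9]))
    return attrs

def zero_vs_nonzero(a, b):
    """List (id, name, raw_a, raw_b) where exactly one of the raw values is zero."""
    out = []
    for attr_id in sorted(a.keys() & b.keys()):
        raw_a, raw_b = a[attr_id][1], b[attr_id][1]
        if (raw_a == 0) != (raw_b == 0):
            out.append((attr_id, a[attr_id][0], raw_a, raw_b))
    return out

# Two rows per drive, copied from the question.
sda = """  5 Reallocated_Sector_Ct 0x0033 252 252 010 Pre-fail Always - 0
191 G-Sense_Error_Rate 0x0022 252 252 000 Old_age Always - 0"""
sdb = """  5 Reallocated_Sector_Ct 0x0033 252 252 010 Pre-fail Always - 0
191 G-Sense_Error_Rate 0x0022 098 098 000 Old_age Always - 27650"""

print(zero_vs_nonzero(parse_attributes(sda), parse_attributes(sdb)))
# [(191, 'G-Sense_Error_Rate', 0, 27650)]
```

Raw values are vendor-specific, so a difference found this way is a hint about where to look, not a verdict on the drive.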
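But there is another odd reading on both drives: the most recent short offline tests are logged as "Interrupted (host reset)", and if I start a new test, e.g. with
smartctl --test=short /dev/sda
the selective self-test log shows:
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Self_test_in_progress [90% left] (0-65535)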
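However, this short test never finishes. Even after several hours the status is still the same, on both drives. Could this be a firmware bug in the drives? Or is the controller failing?
Here is the full dump for both drives, each with a short self-test started: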
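To make "never finishes" concrete, the self-test execution status byte that smartctl prints in parentheses (249 in the dumps below) can be decoded and compared across readings. In the ATA SMART data the high nibble of that byte is the status code and the low nibble is the remaining fraction in tenths. The sketch below assumes that layout; the function names and the partial `STATUS_NAMES` table are illustrative.

```python
import re

# Partial table of status codes (high nibble); other codes are reported numerically.
STATUS_NAMES = {
    0: "completed without error",
    1: "aborted by host",
    2: "interrupted by host reset",
    15: "in progress",
}

def decode_status(status_byte):
    """Split the status byte into (status name, percent remaining)."""
    code = status_byte >> 4
    percent_left = (status_byte & 0x0F) * 10
    return STATUS_NAMES.get(code, "other (code %d)" % code), percent_left

def status_byte_from_report(report):
    """Extract the number in 'Self-test execution status: ( 249)' from smartctl text."""
    match = re.search(r"Self-test execution status:\s*\(\s*(\d+)\)", report)
    return int(match.group(1)) if match else None

def looks_stuck(readings):
    """True if several readings are all 'in progress' with an unchanged percentage."""
    decoded = [decode_status(b) for b in readings]
    return len(decoded) > 1 and all(
        d == decoded[0] and d[0] == "in progress" for d in decoded
    )

print(decode_status(249))  # ('in progress', 90)
```

Readings taken a few minutes apart (the drives recommend a 2 minute polling time for the short test) that all decode to `('in progress', 90)` would match the behaviour described above.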
/dev/sda:
smartctl 6.1 2013-03-16 r3800 [x86_64-linux-3.12.13-gentoo] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     SAMSUNG SpinPoint F3
Device Model:     SAMSUNG HD754JJ
Serial Number:    S281J9CZ500175
LU WWN Device Id: 5 0024e9 2026e8417
Firmware Version: 1AJ10001
User Capacity:    750,156,374,016 bytes [750 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    7200 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 6
SATA Version is:  SATA 2.6, 3.0 Gb/s
Local Time is:    Fri Mar 21 09:04:35 2014 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      ( 249) Self-test routine in progress...
                                        90% of test remaining.
Total time to complete Offline
data collection:                ( 6540) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 109) minutes.
SCT capabilities:              (0x003f) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   051    Pre-fail  Always       -       59
  2 Throughput_Performance  0x0026   055   052   000    Old_age   Always       -       6038
  3 Spin_Up_Time            0x0023   072   071   025    Pre-fail  Always       -       8729
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       10
  5 Reallocated_Sector_Ct   0x0033   252   252   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   252   252   051    Old_age   Always       -       0
  8 Seek_Time_Performance   0x0024   252   252   015    Old_age   Offline      -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       23571
 10 Spin_Retry_Count        0x0032   252   252   051    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   252   252   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       21
191 G-Sense_Error_Rate      0x0022   252   252   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0022   252   252   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0002   060   053   000    Old_age   Always       -       40 (Min/Max 20/48)
195 Hardware_ECC_Recovered  0x003a   100   100   000    Old_age   Always       -       0
196 Reallocated_Event_Count 0x0032   252   252   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   252   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   252   252   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0036   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x002a   001   001   000    Old_age   Always       -       102119
223 Load_Retry_Count        0x0032   252   252   000    Old_age   Always       -       0
225 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       21

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Aborted by host               90%     23571         -
# 2  Short offline       Aborted by host               90%     23571         -
# 3  Short offline       Aborted by host               90%     23571         -
# 4  Short offline       Aborted by host               90%     23571         -
# 5  Short offline       Interrupted (host reset)      90%     23571         -
# 6  Short offline       Interrupted (host reset)      90%     23571         -
# 7  Short offline       Interrupted (host reset)      90%     23571         -
# 8  Short offline       Interrupted (host reset)      90%     23571         -
# 9  Extended offline    Interrupted (host reset)      90%     23571         -
#10  Short offline       Interrupted (host reset)      90%     23571         -
#11  Short offline       Interrupted (host reset)      90%     23571         -
#12  Short offline       Completed without error       00%     23559         -
#13  Short offline       Completed without error       00%     23535         -
#14  Short offline       Completed without error       00%     23511         -
#15  Extended offline    Completed without error       00%     23495         -
#16  Short offline       Completed without error       00%     23487         -
#17  Short offline       Completed without error       00%     23463         -
#18  Short offline       Completed without error       00%     23439         -
#19  Short offline       Completed without error       00%     23415         -
#20  Short offline       Completed without error       00%     23391         -
#21  Short offline       Completed without error       00%     23367         -

SMART Selective self-test log data structure revision number 0
Note: revision number not 1 implies that no selective self-test has ever been run
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Self_test_in_progress [90% left] (0-65535)
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
/dev/sdb:
smartctl 6.1 2013-03-16 r3800 [x86_64-linux-3.12.13-gentoo] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     SAMSUNG SpinPoint F3
Device Model:     SAMSUNG HD754JJ
Serial Number:    S281J9CZ500174
LU WWN Device Id: 5 0024e9 2026e840e
Firmware Version: 1AJ10001
User Capacity:    750,156,374,016 bytes [750 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    7200 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 6
SATA Version is:  SATA 2.6, 3.0 Gb/s
Local Time is:    Fri Mar 21 09:05:10 2014 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      ( 249) Self-test routine in progress...
                                        90% of test remaining.
Total time to complete Offline
data collection:                ( 6960) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 116) minutes.
SCT capabilities:              (0x003f) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   051    Pre-fail  Always       -       69
  2 Throughput_Performance  0x0026   055   054   000    Old_age   Always       -       6442
  3 Spin_Up_Time            0x0023   071   071   025    Pre-fail  Always       -       8885
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       10
  5 Reallocated_Sector_Ct   0x0033   252   252   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   252   252   051    Old_age   Always       -       0
  8 Seek_Time_Performance   0x0024   252   252   015    Old_age   Offline      -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       23571
 10 Spin_Retry_Count        0x0032   252   252   051    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   252   252   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       21
191 G-Sense_Error_Rate      0x0022   098   098   000    Old_age   Always       -       27650
192 Power-Off_Retract_Count 0x0022   252   252   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0002   064   054   000    Old_age   Always       -       35 (Min/Max 20/46)
195 Hardware_ECC_Recovered  0x003a   100   100   000    Old_age   Always       -       0
196 Reallocated_Event_Count 0x0032   252   252   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   252   252   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   252   252   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0036   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x002a   001   001   000    Old_age   Always       -       71575
223 Load_Retry_Count        0x0032   252   252   000    Old_age   Always       -       0
225 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       21

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Aborted by host               90%     23571         -
# 2  Short offline       Aborted by host               90%     23571         -
# 3  Short offline       Aborted by host               90%     23571         -
# 4  Short offline       Aborted by host               90%     23571         -
# 5  Extended offline    Interrupted (host reset)      90%     23571         -
# 6  Short offline       Interrupted (host reset)      90%     23571         -
# 7  Short offline       Interrupted (host reset)      90%     23571         -
# 8  Short offline       Interrupted (host reset)      90%     23571         -
# 9  Short offline       Interrupted (host reset)      90%     23571         -
#10  Short offline       Interrupted (host reset)      90%     23571         -
#11  Short offline       Interrupted (host reset)      90%     23571         -
#12  Short offline       Interrupted (host reset)      90%     23571         -
#13  Short offline       Completed without error       00%     23558         -
#14  Short offline       Completed without error       00%     23534         -
#15  Short offline       Completed without error       00%     23510         -
#16  Short offline       Completed without error       00%     23486         -
#17  Extended offline    Completed without error       00%     23471         -
#18  Short offline       Completed without error       00%     23462         -
#19  Short offline       Completed without error       00%     23438         -
#20  Short offline       Completed without error       00%     23414         -
#21  Short offline       Completed without error       00%     23390         -

SMART Selective self-test log data structure revision number 0
Note: revision number not 1 implies that no selective self-test has ever been run
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Self_test_in_progress [90% left] (0-65535)
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
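Thanks for any hints!
A controller fault seems likely here. Before you spend too much time on it, I suggest reseating all the cables. Over time something may have worked loose and be causing the problem.
Also, you mentioned smartd... did you disable it while you were trying to run the self-tests? It could be interfering with the manual tests.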
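One way to see whether something on the host keeps cutting the tests short is to tally the Status column of the self-test log from the question. A rough sketch, assuming the row layout smartctl prints for that log; `tally_self_tests` is a made-up helper name and the sample contains only three of the logged rows.

```python
import re
from collections import Counter

# One log row: "# 1  Short offline  Aborted by host  90%  23571  -"
ROW = re.compile(
    r"^#\s*(\d+)\s+(Short offline|Extended offline)\s+(.*?)\s+(\d+)%\s+(\d+)\s+(\S+)$"
)

def tally_self_tests(log_text):
    """Count self-test log rows by their Status column."""
    counts = Counter()
    for line in log_text.splitlines():
        match = ROW.match(line.strip())
        if match:
            counts[match.group(3)] += 1
    return counts

# Three rows taken from the sda log in the question.
sample = """# 1  Short offline       Aborted by host               90%     23571         -
# 5  Short offline       Interrupted (host reset)      90%     23571         -
#12  Short offline       Completed without error       00%     23559         -"""

print(tally_self_tests(sample))
```

A run of "Aborted by host" entries would be consistent with another process issuing SMART commands while the test runs, though the log alone does not say which process that was.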
Is there anything in dmesg about why sdb was marked as failed? Both drives report their health as PASSED, and I don't believe mdadm actually uses any SMART data to determine drive health.
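To narrow the kernel log down to the lines that matter for that question, a small filter like the following may help. It is a sketch only: the function names are hypothetical, the defaults `sdb` and `md0` come from the question, and since the ATA port behind sdb is not known here it simply keeps every `ataN` line.

```python
import re
import subprocess

def relevant_lines(log_text, device="sdb", array="md0"):
    """Keep kernel log lines that mention the disk, the md array, or any ATA port."""
    pattern = re.compile(
        r"\b(?:%s\d*|%s|ata\d+(?:\.\d+)?)\b" % (re.escape(device), re.escape(array))
    )
    return [line for line in log_text.splitlines() if pattern.search(line)]

def read_kernel_log():
    """Return the output of the dmesg command (may need root on some systems)."""
    return subprocess.run(
        ["dmesg"], capture_output=True, text=True, check=False
    ).stdout
```

Calling `relevant_lines(read_kernel_log())` on the affected server should surface link resets or I/O errors logged around the time md kicked sdb1 out of the array, if the kernel logged any.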