Raid

HDD SMART interpretation

  • March 7, 2020

I need your opinion on whether the drive below is failing.

When I run "smartctl -a /dev/sda -d megaraid,1", two errors appear at the end of the output, reading "Error: WP at LBA". I don't see anything suspicious in the SMART attributes.

Here is the full output of "smartctl -a /dev/sda -d megaraid,1":

This HDD is one of two drives in a hardware RAID 1 (mirror) configuration on a Dell H330 controller in a Dell PowerEdge server.

smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-957.21.3.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Toshiba 3.5" MG03ACAxxx(Y) Enterprise HDD
Device Model:     TOSHIBA MG03ACA300
Serial Number:    73VCK8GDF
LU WWN Device Id: 5 000039 4ebc82c58
Firmware Version: FL1A
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Thu Feb 27 23:05:39 2020 EET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART Status not supported: ATA return descriptor not supported by controller firmware
SMART overall-health self-assessment test result: PASSED
Warning: This result is based on an Attribute check.

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                       was completed without error.
                                       Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                       without error or no self-test has ever
                                       been run.
Total time to complete Offline
data collection:                (  120) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                       Auto Offline data collection on/off support.
                                       Suspend Offline collection upon new
                                       command.
                                       Offline surface scan supported.
                                       Self-test supported.
                                       No Conveyance Self-test supported.
                                       Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                       power-saving mode.
                                       Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                       General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 510) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                       SCT Error Recovery Control supported.
                                       SCT Feature Control supported.
                                       SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
 1 Raw_Read_Error_Rate     0x000b   100   100   050    Pre-fail  Always       -       0
 2 Throughput_Performance  0x0005   100   100   050    Pre-fail  Offline      -       0
 3 Spin_Up_Time            0x0027   100   100   001    Pre-fail  Always       -       8874
 4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       27
 5 Reallocated_Sector_Ct   0x0033   100   100   050    Pre-fail  Always       -       0
 7 Seek_Error_Rate         0x000b   100   100   050    Pre-fail  Always       -       0
 8 Seek_Time_Performance   0x0005   100   100   050    Pre-fail  Offline      -       0
 9 Power_On_Hours          0x0032   068   068   000    Old_age   Always       -       12964
10 Spin_Retry_Count        0x0033   100   100   030    Pre-fail  Always       -       0
12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       27
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       6
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       25
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       42
194 Temperature_Celsius     0x0022   100   100   000    Old_age   Always       -       31 (Min/Max 11/48)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
220 Disk_Shift              0x0002   100   100   000    Old_age   Always       -       0
222 Loaded_Hours            0x0032   068   068   000    Old_age   Always       -       12994
223 Load_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
224 Load_Friction           0x0022   100   100   000    Old_age   Always       -       0
226 Load-in_Time            0x0026   100   100   000    Old_age   Always       -       103
240 Head_Flying_Hours       0x0001   100   100   001    Pre-fail  Offline      -       0

SMART Error Log Version: 1
ATA Error Count: 2
       CR = Command Register [HEX]
       FR = Features Register [HEX]
       SC = Sector Count Register [HEX]
       SN = Sector Number Register [HEX]
       CL = Cylinder Low Register [HEX]
       CH = Cylinder High Register [HEX]
       DH = Device/Head Register [HEX]
       DC = Device Command Register [HEX]
       ER = Error register [HEX]
       ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 2 occurred at disk power-on lifetime: 12901 hours (537 days + 13 hours)
 When the command that caused the error occurred, the device was active or idle.

 After command completion occurred, registers were:
 ER ST SC SN CL CH DH
 -- -- -- -- -- -- --
 40 41 10 0e fb 74 40  Error: WP at LBA = 0x0074fb0e = 7666446

 Commands leading to the command that caused the error were:
 CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
 -- -- -- -- -- -- -- --  ----------------  --------------------
 61 08 00 48 7a e0 40 00  42d+20:47:35.187  WRITE FPDMA QUEUED
 61 08 20 58 89 8a 40 00  42d+20:47:35.187  WRITE FPDMA QUEUED
 61 10 20 48 89 8a 40 00  42d+20:47:35.187  WRITE FPDMA QUEUED
 61 08 20 48 7a e0 40 00  42d+20:47:35.183  WRITE FPDMA QUEUED
 61 08 20 40 89 8a 40 00  42d+20:47:35.183  WRITE FPDMA QUEUED

Error 1 occurred at disk power-on lifetime: 12901 hours (537 days + 13 hours)
 When the command that caused the error occurred, the device was active or idle.

 After command completion occurred, registers were:
 ER ST SC SN CL CH DH
 -- -- -- -- -- -- --
 40 41 00 0e fb 74 40  Error: WP at LBA = 0x0074fb0e = 7666446

 Commands leading to the command that caused the error were:
 CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
 -- -- -- -- -- -- -- --  ----------------  --------------------
 61 10 10 18 94 de 40 00  42d+20:47:32.312  WRITE FPDMA QUEUED
 60 00 08 00 fc 74 40 00  42d+20:47:32.311  READ FPDMA QUEUED
 60 00 00 00 fb 74 40 00  42d+20:47:32.311  READ FPDMA QUEUED
 60 00 00 00 fa 74 40 00  42d+20:47:32.284  READ FPDMA QUEUED
 60 00 00 00 f9 74 40 00  42d+20:47:32.264  READ FPDMA QUEUED

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
   1        0        0  Not_testing
   2        0        0  Not_testing
   3        0        0  Not_testing
   4        0        0  Not_testing
   5        0        0  Not_testing
Selective self-test flags (0x0):
 After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Later edit 1:

I also checked the iDRAC on the PowerEdge server, and under Storage menu > Summary > Recently Logged Storage Events I found events that correspond to the 2 SMART errors.

The event states: "Disk media error on Disk 1 in Backplane 1 of RAID Controller in Slot 1 was corrected during recovery". Please find the screenshot below.

[Screenshot: iDRAC > Storage menu > Summary > Recently Logged Storage Events]

Later edit 2:

A few days later, Current_Pending_Sector rose to 1 for a few hours and then dropped back to 0.

Reallocated_Sector_Ct, Reallocated_Event_Count, and Offline_Uncorrectable stayed at 0 the whole time.

Another error also appeared in the SMART error log: "Error: UNC at LBA".

No further errors appeared in iDRAC, though.

We decided to replace the drive with a new one, because we no longer trust it.

Thanks!

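The two logged errors indicate that your HDD was unable to read/write a specific LBA. However, no Reallocated_Sector_Ct / Reallocated_Event_Count / Current_Pending_Sector events were recorded, which seems to indicate that there is no problem on the platter side.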

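However, this does not mean you can dismiss the errors as software-induced: some LBAs were not read/written correctly, so you had a real problem. When such errors show up without corresponding bad sectors, they are usually caused by: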

  • bad SATA/power cables
  • a bad power supply
  • too much vibration.

On a proper PowerEdge server you should not have cable problems (i.e. you are using a SATA backplane). Sporadic issues can happen, but they are very rare.

On the other hand, you have a non-zero G-Sense_Error_Rate, so the read/write failures may be related to strong vibration of the server or disk.

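I would monitor both the SATA and dmesg logs to make sure the problem does not happen again. If you see it again, note the affected LBA range and compare it with the one above, LBA = 0x0074fb0e = 7666446 (this kind of behavior is relatively common on consumer-grade disks, although surprising for an enterprise HDD).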
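As a rough sketch of that comparison, the following Python pulls the LBAs out of the smartctl error log text and checks whether each one falls close to the LBA already seen. The function names and the 2048-sector window are illustrative assumptions, not anything smartctl defines; the regular expression only targets the "Error: ... at LBA = 0x... = ..." line format shown in the output above.

```python
import re

# Matches the summary line smartctl prints for each logged error, e.g.
# "40 41 10 0e fb 74 40  Error: WP at LBA = 0x0074fb0e = 7666446"
ERROR_LBA_RE = re.compile(r"Error: (.+?) at LBA = 0x([0-9a-fA-F]+) = (\d+)")


def extract_error_lbas(smartctl_output):
    """Return (error_type, lba) pairs found in `smartctl -a` text output."""
    results = []
    for match in ERROR_LBA_RE.finditer(smartctl_output):
        error_type, hex_lba, dec_lba = match.groups()
        lba = int(hex_lba, 16)
        # The line carries the LBA in hex and decimal; keep it only if both agree.
        if lba == int(dec_lba):
            results.append((error_type, lba))
    return results


def near_known_lba(lba, known_lba, window=2048):
    """True if lba is within `window` sectors of known_lba (window is an arbitrary choice)."""
    return abs(lba - known_lba) <= window


# The two error summary lines from the output above.
sample = """\
  40 41 10 0e fb 74 40  Error: WP at LBA = 0x0074fb0e = 7666446
  40 41 00 0e fb 74 40  Error: WP at LBA = 0x0074fb0e = 7666446
"""
KNOWN_LBA = 0x0074FB0E

for error_type, lba in extract_error_lbas(sample):
    print(error_type, lba, near_known_lba(lba, KNOWN_LBA))
```

If new errors keep landing on or near the same LBA, that points to a localized media problem; errors scattered across the disk would fit the cable, power, or vibration explanations better.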

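**Update:** based on your iDRAC log, it seems a media error was corrected by patrol read, using the value stored in the other mirror leg. This really does look like a genuine bad block; however, the fact that the corresponding SMART counters did not increase is puzzling. In the past I have seen disks that reallocate a sector only after it has reported a read/write error twice, but that would be strange for a Toshiba enterprise disk.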

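Another possible explanation is that something (perhaps strong vibration) caused a bad/misaligned/torn write, leaving the sector unreadable. Since the sector was not physically damaged, however, Patrol Read overwrote it successfully and no reallocation took place.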

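Finally, it could be a genuine case of bitrot: the stored data no longer matches the HDD's internal ECC checksum. In such cases the HDD is designed to return a read error; however, this would not explain the write errors reported above.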

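In any case, occasional media corrections are to be expected. As stated above, however, I would monitor the situation and replace the disk if such reports become more frequent.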
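One way to do that monitoring is to save the `smartctl -a` output periodically and compare the sector-related raw counters between snapshots. The sketch below is a minimal Python illustration under stated assumptions: the helper names are made up, the parser relies only on the column layout of the attribute table shown above, and the "after" snapshot is constructed for illustration from the reported brief rise of Current_Pending_Sector to 1, not copied from real output.

```python
WATCHED = (
    "Reallocated_Sector_Ct",
    "Reallocated_Event_Count",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
)


def parse_raw_values(smartctl_output):
    """Map attribute name -> integer raw value from the smartctl attribute table."""
    values = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows have at least 10 columns: ID, name, flag, value, worst,
        # thresh, type, updated, when_failed, raw value (plus optional extras).
        if len(fields) >= 10 and fields[0].isdigit() and fields[2].startswith("0x"):
            try:
                values[fields[1]] = int(fields[9])
            except ValueError:
                continue  # raw value is not a plain integer; skip this row
    return values


def increased_counters(before, after, names=WATCHED):
    """Return {name: (old, new)} for watched counters whose raw value went up."""
    return {
        name: (before.get(name, 0), after[name])
        for name in names
        if name in after and after[name] > before.get(name, 0)
    }


# Rows taken from the output above.
snapshot_before = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   050    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
"""
# Illustrative later snapshot (assumed values, see the note above).
snapshot_after = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   050    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
"""

before = parse_raw_values(snapshot_before)
after = parse_raw_values(snapshot_after)
print(increased_counters(before, after))
```

Because a pending sector can appear and clear again within hours, as happened here, a check like this only catches the change if snapshots are taken often enough.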

Quoted from: https://serverfault.com/questions/1004967