Hard-Drive
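Relatively new WD Red Pro produces ATA status: 41 (DRDY ERR), error: 40 (UNC) on FreeBSD 12.2
I am running a TrueNAS server based on FreeBSD 12.2. I migrated my storage to 10 TB WD Red Pro drives, which have now been running for 42 days.
Suddenly, during a ZFS scrub, one of the disks produced 5 errors. They all read more or less like this: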
(ada2:ahcich14:0:0:0): READ_FPDMA_QUEUED. ACB: 60 b8 08 3a 0f 40 f8 01 00 07 00 00
(ada2:ahcich14:0:0:0): CAM status: ATA Status Error
(ada2:ahcich14:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC )
(ada2:ahcich14:0:0:0): RES: 41 40 90 3b 0f 40 f8 01 00 30 06
(ada2:ahcich14:0:0:0): Retrying command, 3 more tries remain
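After the incident I ran an extended SMART self-test. It produced no errors (apart from the ones already logged), and in particular no reallocated sectors: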
smartctl 7.1 2019-12-30 r5022 [FreeBSD 12.2-RELEASE-p2 amd64] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     WDC WD102KFBX-68M95N0
Serial Number:    [deleted]
LU WWN Device Id: 5 000cca 0b0cd3041
Firmware Version: 83.00A83
User Capacity:    10,000,831,348,736 bytes [10.0 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    [deleted]
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 (  87) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (1108) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0004   132   132   054    Old_age   Offline      -       96
  3 Spin_Up_Time            0x0007   100   100   024    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       3
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000a   100   100   067    Old_age   Always       -       0
  8 Seek_Time_Performance   0x0004   128   128   020    Old_age   Offline      -       18
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       1077
 10 Spin_Retry_Count        0x0012   100   100   060    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       3
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       215
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       215
194 Temperature_Celsius     0x0002   142   142   000    Old_age   Always       -       42 (Min/Max 25/67)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
ATA Error Count: 5
    CR = Command Register [HEX]
    FR = Features Register [HEX]
    SC = Sector Count Register [HEX]
    SN = Sector Number Register [HEX]
    CL = Cylinder Low Register [HEX]
    CH = Cylinder High Register [HEX]
    DH = Device/Head Register [HEX]
    DC = Device Command Register [HEX]
    ER = Error register [HEX]
    ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 5 occurred at disk power-on lifetime: 1050 hours (43 days + 18 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 41 00 00 00 00 00  Error: UNC at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 b8 10 08 3a 0f 40 08  15d+21:45:25.729  READ FPDMA QUEUED
  60 80 38 b8 60 0f 40 08  15d+21:45:18.777  READ FPDMA QUEUED
  60 b8 30 f8 58 0f 40 08  15d+21:45:18.775  READ FPDMA QUEUED
  60 b8 28 40 51 0f 40 08  15d+21:45:18.775  READ FPDMA QUEUED
  60 b8 20 80 49 0f 40 08  15d+21:45:15.608  READ FPDMA QUEUED

Error 4 occurred at disk power-on lifetime: 1050 hours (43 days + 18 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 41 00 00 00 00 00  Error: UNC at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 b8 28 10 d8 0e 40 08  15d+21:45:10.298  READ FPDMA QUEUED
  60 80 40 48 ef 0e 40 08  15d+21:45:03.370  READ FPDMA QUEUED
  60 b8 38 88 e7 0e 40 08  15d+21:45:03.178  READ FPDMA QUEUED
  60 b8 30 d0 df 0e 40 08  15d+21:45:00.444  READ FPDMA QUEUED
  60 20 20 f0 d3 0e 40 08  15d+21:45:00.286  READ FPDMA QUEUED

Error 3 occurred at disk power-on lifetime: 1050 hours (43 days + 18 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 41 00 00 00 00 00  Error: UNC at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 b8 00 90 81 23 40 08  15d+21:41:08.578  READ FPDMA QUEUED
  60 80 10 08 91 23 40 08  15d+21:41:08.336  READ FPDMA QUEUED
  60 b8 08 48 89 23 40 08  15d+21:41:01.627  READ FPDMA QUEUED
  60 b8 f8 d0 79 23 40 08  15d+21:40:57.546  READ FPDMA QUEUED
  60 b8 f0 18 72 23 40 08  15d+21:40:56.899  READ FPDMA QUEUED

Error 2 occurred at disk power-on lifetime: 1050 hours (43 days + 18 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 41 00 00 00 00 00  Error: UNC at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 b8 18 f0 d5 17 40 08  15d+21:34:13.263  READ FPDMA QUEUED
  60 20 50 10 0c 18 40 08  15d+21:34:06.288  READ FPDMA QUEUED
  60 b8 48 58 04 18 40 08  15d+21:34:06.288  READ FPDMA QUEUED
  60 b8 40 98 fc 17 40 08  15d+21:34:06.288  READ FPDMA QUEUED
  60 b8 38 e0 f4 17 40 08  15d+21:34:06.288  READ FPDMA QUEUED

Error 1 occurred at disk power-on lifetime: 1050 hours (43 days + 18 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 41 00 00 00 00 00  Error: UNC at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 b8 50 28 8b 17 40 08  15d+21:33:33.959  READ FPDMA QUEUED
  60 b8 48 70 83 17 40 08  15d+21:33:16.648  READ FPDMA QUEUED
  60 80 40 e8 82 17 40 08  15d+21:33:16.647  READ FPDMA QUEUED
  ea 00 00 00 00 00 40 08  15d+21:33:16.640  FLUSH CACHE EXT
  61 08 30 f0 fd 3f 40 08  15d+21:33:16.638  WRITE FPDMA QUEUED

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      1072         -
# 2  Short offline       Completed without error       00%      1023         -
# 3  Extended offline    Completed without error       00%       946         -
# 4  Short offline       Completed without error       00%       855         -
# 5  Short offline       Completed without error       00%       687         -
# 6  Extended offline    Completed without error       00%       610         -
# 7  Short offline       Completed without error       00%       519         -
# 8  Short offline       Completed without error       00%       279         -
# 9  Extended offline    Completed without error       00%       202         -
#10  Short offline       Completed without error       00%       111         -
#11  Short offline       Completed without error       00%        11         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
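At first I thought I might have bought a defective disk. In that case I would have expected the SMART health assessment to fail, but it did not. I don't think the PSU is faulty, because it is less than a year old. It is also a 550 W PSU, and the machine draws about 100 W. I don't think it is a faulty cable either, because I ran other disks on it for almost a year without problems. Moreover, I did once have a defective cable with those other disks; I replaced it, and the symptoms back then looked different.
I am considering an RMA for the drive, but I am not sure whether it qualifies. What do you think? Could this be a transient error? Any advice is appreciated.
Thank you very much for your replies. Based on what you said, I decided to return the drive to Western Digital. The drive has been replaced.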
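UNC stands for UNCorrectable error, which is generally caused by a medium error (i.e. a physical sector has gone bad). However, your SMART log reports the errors at LBA 0x0. While that is theoretically possible, it seems strange to me that you hit read errors at exactly that address. The reported SMART pending sector count (0) seems to confirm this, although I have seen many disks that do not update that field correctly. Can you read LBA 0? Please report what happens when you run the following command: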
dd if=<your_disk> of=/dev/null bs=512 count=1 iflag=direct
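If dd on your system does not accept these flags, the same single-sector read can be sketched in Python. This is a minimal illustration, not a diagnostic tool: `read_sector` is a hypothetical helper name, the 512-byte sector size is taken from the "Sector Sizes" line of the smartctl output, and the device path in the usage note is only an assumption based on the `ada2` name in the kernel messages. It reads through the normal OS read path and does not request direct I/O.

```python
import os

SECTOR_SIZE = 512  # logical sector size, per the smartctl output above


def read_sector(device_path, lba, sector_size=SECTOR_SIZE):
    """Read exactly one logical sector at the given LBA.

    Raises OSError if the read fails or returns fewer bytes than a sector.
    """
    fd = os.open(device_path, os.O_RDONLY)
    try:
        # pread reads at an absolute byte offset without moving the file position.
        data = os.pread(fd, sector_size, lba * sector_size)
    finally:
        os.close(fd)
    if len(data) != sector_size:
        raise OSError(f"short read at LBA {lba}: got {len(data)} bytes")
    return data
```

For example, `read_sector("/dev/ada2", 0)` (run as root, with the device path adjusted to your system) either returns 512 bytes or raises an `OSError`, which is the same pass/fail signal the dd command gives.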
If you can read that LBA successfully, you may be facing a cabling problem, or your HDD's electronics may be failing or flaky.
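As a cross-check on the address, the kernel message itself carries LBA bytes in its ACB (command) and RES (result) lines. The sketch below decodes them under an assumption that should be verified against the FreeBSD CAM source: that the bytes are printed in the order command/status, features/error, LBA low, LBA mid, LBA high, device, then the three extended LBA bytes. The helper names `parse_hex_bytes` and `lba48` are made up for this example, and reading the NCQ sector count from the features bytes (positions 1 and 9 of the ACB) is likewise part of the assumed layout.

```python
def parse_hex_bytes(text):
    """Turn a string like '60 b8 08' into a list of integers."""
    return [int(token, 16) for token in text.split()]


def lba48(fields):
    """Assemble a 48-bit LBA from the ASSUMED CAM print order:
    bytes 2-4 hold LBA bits 0-23, bytes 6-8 hold LBA bits 24-47."""
    low = fields[2] | (fields[3] << 8) | (fields[4] << 16)
    high = fields[6] | (fields[7] << 8) | (fields[8] << 16)
    return (high << 24) | low


# Byte strings copied from the kernel messages in the question.
acb = parse_hex_bytes("60 b8 08 3a 0f 40 f8 01 00 07 00 00")
res = parse_hex_bytes("41 40 90 3b 0f 40 f8 01 00 30 06")

start_lba = lba48(acb)   # first LBA of the queued read
error_lba = lba48(res)   # LBA reported back with the UNC error
# For READ FPDMA QUEUED the sector count travels in the features fields
# (assumed to be bytes 1 and 9 of the ACB).
sector_count = acb[1] | (acb[9] << 8)

print(f"request start LBA: {start_lba:#x}")
print(f"reported error LBA: {error_lba:#x}")
print(f"sectors requested: {sector_count}")
print(f"error offset into request: {error_lba - start_lba}")
```

If the layout assumption holds, the reported error LBA falls inside the requested range and is nowhere near 0, which would be worth comparing with the `LBA = 0x00000000` shown in the SMART error log before drawing conclusions from that field.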