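Confirming that a disk is broken after it passes all diagnostics
I have a system with a disk that is probably failing, but the disk passes every diagnostic I run. I have been unable to confirm that it is broken. What are my options?
I could simply replace the disk, but because this situation closely resembles another, more serious one I am dealing with (long story), I want to arrive at an actual diagnosis rather than discard hardware at random.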
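The problem and its history:
- I have a Debian Linux PC (500 MHz P3) that serves as a router and runs nagios and munin.
- It crashed every few weeks. I could not get any logs or dmesg output (it is an old Compaq that only boots when configured as keyboardless, so a keyboard cannot be attached once it is running).
- At the time I simply replaced the machine with another Compaq (P4 2.4 GHz), assuming the hardware was at fault. It still crashes every two weeks, however.
- The difference is that on this machine I can still SSH in after a crash. It reports all kinds of errors on hda.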
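I want to confirm that the disk is broken, but nothing I have tried confirms it:
- The SMART error log shows no errors. Usually, when a disk starts acting up, the SMART health check may still pass, but read errors show up in the error log.
- The SMART self-test (smartctl -t long /dev/sda) completes without errors.
- The reallocated sector count (a telltale parameter) is 31, and it was already 31 years ago when the disk was still in my desktop. The number has never changed.
- dd if=/dev/sda of=/dev/null bs=4096 passes with flying colors.
What else can I do to assess the health of the drive?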
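Again, this is not about getting the router fully working again. It is a disk forensics question: I happen to have another server that may have the same problem, and knowing the answer could help me a great deal there.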
For the record, here are the logs.
This is the smartctl -a output:

    smartctl 5.40 2010-07-12 r3124 [i686-pc-linux-gnu] (local build)
    Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

    === START OF INFORMATION SECTION ===
    Model Family:     Seagate Barracuda 7200.7 and 7200.7 Plus family
    Device Model:     ST3120026A
    Serial Number:    5JT1CLQM
    Firmware Version: 3.06
    User Capacity:    120,034,123,776 bytes
    Device is:        In smartctl database [for details use: -P show]
    ATA Version is:   6
    ATA Standard is:  ATA/ATAPI-6 T13 1410D revision 2
    Local Time is:    Mon Jul  1 21:18:33 2013 CEST
    SMART support is: Available - device has SMART capability.
    SMART support is: Enabled

    === START OF READ SMART DATA SECTION ===
    SMART overall-health self-assessment test result: PASSED

    General SMART Values:
    Offline data collection status:  (0x82) Offline data collection activity
                                            was completed without error.
                                            Auto Offline Data Collection: Enabled.
    Self-test execution status:      (  24) The self-test routine was aborted by
                                            the host.
    Total time to complete Offline
    data collection:                 ( 430) seconds.
    Offline data collection
    capabilities:                    (0x5b) SMART execute Offline immediate.
                                            Auto Offline data collection on/off support.
                                            Suspend Offline collection upon new
                                            command.
                                            Offline surface scan supported.
                                            Self-test supported.
                                            No Conveyance Self-test supported.
                                            Selective Self-test supported.
    SMART capabilities:            (0x0003) Saves SMART data before entering
                                            power-saving mode.
                                            Supports SMART auto save timer.
    Error logging capability:        (0x01) Error logging supported.
                                            No General Purpose Logging support.
    Short self-test routine
    recommended polling time:        (   1) minutes.
    Extended self-test routine
    recommended polling time:        (  85) minutes.

    SMART Attributes Data Structure revision number: 10
    Vendor Specific SMART Attributes with Thresholds:
    ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
      1 Raw_Read_Error_Rate     0x000f   050   046   006    Pre-fail  Always       -       47766662
      3 Spin_Up_Time            0x0003   097   096   000    Pre-fail  Always       -       0
      4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       10
      5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       31
      7 Seek_Error_Rate         0x000f   084   060   030    Pre-fail  Always       -       820305
      9 Power_On_Hours          0x0032   048   048   000    Old_age   Always       -       46373
     10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
     12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       605
    194 Temperature_Celsius     0x0022   036   065   000    Old_age   Always       -       36
    195 Hardware_ECC_Recovered  0x001a   050   046   000    Old_age   Always       -       47766662
    197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
    198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
    199 UDMA_CRC_Error_Count    0x003e   200   196   000    Old_age   Always       -       6
    200 Multi_Zone_Error_Rate   0x0000   100   253   000    Old_age   Offline      -       0
    202 Data_Address_Mark_Errs  0x0032   100   253   000    Old_age   Always       -       0

    SMART Error Log Version: 1
    No Errors Logged

    SMART Self-test log structure revision number 1
    Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
    # 1  Extended offline    Aborted by host               80%     46361         -
    # 2  Extended offline    Completed without error       00%     46358         -
    # 3  Short offline       Completed without error       00%     12046         -
    # 4  Extended offline    Completed without error       00%     10472         -
    # 5  Short offline       Completed without error       00%     10471         -
    # 6  Short offline       Completed without error       00%     10471         -
    # 7  Short offline       Completed without error       00%      6770         -
    # 8  Extended offline    Aborted by host               90%      5958         -
    # 9  Extended offline    Aborted by host               90%      5951         -
    #10  Short offline       Completed without error       00%      5024         -
    #11  Extended offline    Aborted by host               80%      5024         -
    #12  Short offline       Completed without error       00%      3697         -
    #13  Short offline       Completed without error       00%       237         -
    #14  Short offline       Completed without error       00%       145         -
    #15  Short offline       Completed without error       00%        69         -
    #16  Extended offline    Completed without error       00%        68         -
    #17  Short offline       Completed without error       00%        66         -
    #18  Short offline       Completed without error       00%        49         -
    #19  Short offline       Completed without error       00%        29         -
    #20  Short offline       Completed without error       00%        29         -

    SMART Selective self-test log data structure revision number 1
     SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
        1        0        0  Not_testing
        2        0        0  Not_testing
        3        0        0  Not_testing
        4        0        0  Not_testing
        5        0        0  Not_testing
    Selective self-test flags (0x0):
      After scanning selected spans, do NOT read-scan remainder of disk.
    If Selective self-test is pending on power-up, resume after 0 minute delay.
These are the dmesg errors at the time of a crash (repeated for a bunch of different sectors):

    [1755091.211136] sd 0:0:0:0: [sda] Unhandled error code
    [1755091.211144] sd 0:0:0:0: [sda] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
    [1755091.211151] sd 0:0:0:0: [sda] CDB: Read(10): 28 00 08 fe ad 38 00 00 08 00
    [1755091.211166] end_request: I/O error, dev sda, sector 150908216
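The CDB in the dmesg output can be cross-checked against the reported sector by decoding it by hand. What follows is only a minimal sketch, assuming a POSIX shell and the standard Read(10) layout (opcode 0x28, bytes 2-5 holding a big-endian LBA, bytes 7-8 holding the transfer length in blocks). The function name decode_read10 is made up for illustration and is not part of any tool.

```shell
#!/bin/sh
# decode_read10: hypothetical helper, not part of smartmontools or the kernel.
# Takes a Read(10) CDB as ten space-separated hex bytes and prints
# "<lba> <block count>" in decimal.
decode_read10() {
    # Split the single argument into positional parameters $1..$10.
    set -- $1
    if [ "$#" -ne 10 ] || [ "$1" != "28" ]; then
        echo "not a Read(10) CDB" >&2
        return 1
    fi
    # Bytes 2-5 ($3..$6): logical block address, big-endian.
    # Bytes 7-8 ($8..$9): number of blocks to transfer, big-endian.
    echo "$(( 0x$3$4$5$6 )) $(( 0x$8$9 ))"
}

decode_read10 "28 00 08 fe ad 38 00 00 08 00"
# prints: 150908216 8

# Assumption: to re-read exactly that region on the live system (needs root,
# GNU dd), something like the following could be used:
#   dd if=/dev/sda of=/dev/null bs=512 skip=150908216 count=8 iflag=direct
```

The decoded LBA matches the sector in the end_request line, and 8 blocks of 512 bytes correspond to a single 4096-byte read, so the log is at least internally consistent about which request failed.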
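You can't, not reliably.
Or rather, you have already done so with the options available to you.
As a study by Google found, failing disks do not necessarily show unusual SMART values (the reverse is more reliable: when unusual values do show up, the disk tends to fail).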
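Leaving that aside for a moment, keep in mind that although a lot in computing is standardized, there are real bugs in both hardware and software, error margins that can add up, and so on. The real world is not perfect, and it is not unheard of for a hard disk to be incompatible with a particular controller, or vice versa. Sometimes the cause is a firmware bug; sometimes a completely different component is misbehaving, such as a substandard PSU that falters under particular load peaks. Even temperature changes, age... the list can be extended almost at will.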
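The standard procedure here is therefore to put the disk into a significantly different system configuration and re-run the tests. Since you have already swapped the entire system, you have correctly concluded that something must be wrong with the disk. (Unless you did not change everything else as you told us; the cable and HBA come to mind, in which case the assumption does not hold.)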
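If the cable or controller was carried over, one cheap thing to watch is attribute 199 (UDMA_CRC_Error_Count), which typically counts transfer errors on the link between drive and host rather than media errors. The following is only a minimal sketch, assuming the smartctl -A table layout shown in the question (raw value in the tenth column). crc_errors is a hypothetical helper name.

```shell
#!/bin/sh
# crc_errors: hypothetical helper. Reads `smartctl -A` output on stdin and
# prints the raw value of attribute 199 (UDMA_CRC_Error_Count).
# Exits non-zero if the attribute is not present in the input.
crc_errors() {
    awk '$1 == 199 { print $10; found = 1 } END { exit !found }'
}

# Example using the attribute line from the question's output:
printf '%s\n' \
    '199 UDMA_CRC_Error_Count 0x003e 200 196 000 Old_age Always - 6' \
    | crc_errors
# prints: 6

# On a live system (needs root; /dev/sda is the device from the question):
#   smartctl -A /dev/sda | crc_errors
```

Recording this value before and after a crash, or after swapping the cable, would show whether the count is still growing. A count that stays at 6 suggests those errors are historical, though that alone does not clear the cable or controller.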
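Edit: I just realized there is one more option. You can check whether a firmware version newer than the one currently on this particular drive exists. If so, its changelog may point to an issue that matches your situation.
In summary, to be completely sure (in this particular case!) that the drive is misbehaving, you would need to send it back to the manufacturer.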