Ubuntu
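Linux: rebuilding software RAID 1 fails when adding a partition
Yesterday I ran into a software RAID problem and had to replace a disk. I removed the partitions from the array using: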
mdadm /dev/mdx -r /dev/sdbx
After the failed drive had been replaced in the data center, I copied the partition table to the new disk (sdb was the bad device):
sgdisk -R /dev/sdb /dev/sda
and gave it new random GUIDs:
sgdisk -G /dev/sdb
Then I added all the partitions back using:
mdadm /dev/mdx -a /dev/sdbx
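The replacement steps described so far can be sketched as a dry run that only prints the commands instead of executing them. This is a minimal sketch, not the poster's actual script: the function name `replace_plan` and the `array:partition-number` argument format are hypothetical, and the device mapping is taken from the /proc/mdstat output in this post.

```shell
#!/usr/bin/env bash
# Dry-run sketch: print the commands for rebuilding onto a replaced mirror half.
# replace_plan and its argument format are hypothetical helpers for illustration.
replace_plan() {
    local good=$1 new=$2
    shift 2
    # Copy the partition table from the healthy disk to the new one,
    # then randomize the GUIDs on the new disk.
    echo "sgdisk -R $new $good"
    echo "sgdisk -G $new"
    # Every remaining argument is an "array:partition-number" pair.
    local pair
    for pair in "$@"; do
        echo "mdadm /dev/${pair%%:*} -a ${new}${pair##*:}"
    done
}

replace_plan /dev/sda /dev/sdb md0:1 md1:2 md2:3 md3:4 md4:5 md5:6
```

Printing the plan first makes it easy to check that the source and target of `sgdisk -R` are the right way round before anything is written to disk.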
This went fine for every partition except one, whose rebuild aborted after a few hours at around 60%. This is the current RAID status:
cat /proc/mdstat
Personalities : [raid1]
md5 : active raid1 sda6[0] sdb6[2](S)
      2633910528 blocks super 1.2 [2/1] [U_]
md4 : active raid1 sda5[0] sdb5[2]
      16768896 blocks super 1.2 [2/2] [UU]
md3 : active raid1 sda4[0] sdb4[2]
      2096064 blocks super 1.2 [2/2] [UU]
md2 : active raid1 sda3[0] sdb3[2]
      268304192 blocks super 1.2 [2/2] [UU]
md1 : active raid1 sda2[0] sdb2[2]
      523968 blocks super 1.2 [2/2] [UU]
md0 : active raid1 sda1[0] sdb1[2]
      8384448 blocks super 1.2 [2/2] [UU]
unused devices: <none>
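In the status above, md5 is the only array showing [2/1] [U_], meaning one of its two members is missing, and sdb6 is marked (S) as a spare. A minimal sketch for picking out such degraded arrays automatically follows; the function name `degraded_arrays` is hypothetical, and it assumes the usual layout where the member status such as [U_] is the last field of the "blocks" line.

```shell
#!/usr/bin/env bash
# Print the name of every md array whose member status (e.g. [U_]) has a gap.
# Reads the file given as $1, or /proc/mdstat when no argument is passed.
degraded_arrays() {
    awk '
        /^md[0-9]+ :/ { name = $1 }                           # header line: remember the array name
        /blocks/ && $NF ~ /^\[[U_]*_[U_]*\]$/ { print name }  # status field contains an underscore
    ' "${1:-/proc/mdstat}"
}
```

Run as `degraded_arrays` on the affected host; against the status shown in this post it should report only md5.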
In syslog I can see messages like these:
Jan 23 14:24:04 rescue kernel: [11163.329021] ata1.00: exception Emask 0x0 SAct 0xf00000 SErr 0x0 action 0x0
Jan 23 14:24:04 rescue kernel: [11163.376449] ata1.00: configured for UDMA/133
Jan 23 14:24:04 rescue kernel: [11163.376475] sd 0:0:0:0: [sda] Unhandled sense code
Jan 23 14:24:04 rescue kernel: [11163.376477] sd 0:0:0:0: [sda]
Jan 23 14:24:04 rescue kernel: [11163.376479] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jan 23 14:24:04 rescue kernel: [11163.376481] sd 0:0:0:0: [sda]
Jan 23 14:24:04 rescue kernel: [11163.376483] Sense Key : Medium Error [current] [descriptor]
Jan 23 14:24:04 rescue kernel: [11163.376486] Descriptor sense data with sense descriptors (in hex):
Jan 23 14:24:04 rescue kernel: [11163.376487]         72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
Jan 23 14:24:04 rescue kernel: [11163.376495]         ce 1f 0d 58
Jan 23 14:24:04 rescue kernel: [11163.376498] sd 0:0:0:0: [sda]
Jan 23 14:24:04 rescue kernel: [11163.376501] Add. Sense: Unrecovered read error - auto reallocate failed
Jan 23 14:24:04 rescue kernel: [11163.376503] sd 0:0:0:0: [sda] CDB:
Jan 23 14:24:04 rescue kernel: [11163.376504] Read(16): 88 00 00 00 00 00 ce 1f 0b 80 00 00 04 00 00 00
Jan 23 14:24:04 rescue kernel: [11163.376513] end_request: I/O error, dev sda, sector 3458141528
and
Jan 23 14:35:22 rescue kernel: [11840.396206] ata1.00: configured for UDMA/133
Jan 23 14:35:22 rescue kernel: [11840.396212] ata1.00: device reported invalid CHS sector 0
Jan 23 14:35:22 rescue kernel: [11840.396216] ata1.00: device reported invalid CHS sector 0
Jan 23 14:35:22 rescue kernel: [11840.396220] ata1.00: device reported invalid CHS sector 0
Jan 23 14:35:22 rescue kernel: [11840.396223] ata1.00: device reported invalid CHS sector 0
Jan 23 14:35:22 rescue kernel: [11840.396230] ata1: EH complete
Jan 23 14:35:52 rescue kernel: [11870.888343] ata1.00: exception Emask 0x0 SAct 0x40000007 SErr 0x0 action 0x6 frozen
Jan 23 14:35:52 rescue kernel: [11870.945207] ata1.00: cmd 60/00:08:80:c3:58/04:00:ce:00:00/40 tag 1 ncq 524288 in
Jan 23 14:35:52 rescue kernel: [11870.945207]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan 23 14:35:52 rescue kernel: [11870.982487] ata1.00: cmd 60/80:10:00:c0:58/03:00:ce:00:00/40 tag 2 ncq 458752 in
Jan 23 14:35:52 rescue kernel: [11870.982487]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan 23 14:35:53 rescue kernel: [11871.019291] ata1.00: cmd 60/00:f0:80:cb:58/04:00:ce:00:00/40 tag 30 ncq 524288 in
Jan 23 14:35:53 rescue kernel: [11871.019291]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan 23 14:35:53 rescue kernel: [11871.055486] ata1: hard resetting link
Jan 23 14:35:53 rescue kernel: [11871.707811] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Jan 23 14:35:53 rescue kernel: [11871.708270] ACPI Error: [DSSP] Namespace lookup failure, AE_NOT_FOUND (20131218/psargs-359)
Jan 23 14:35:53 rescue kernel: [11871.708279] ACPI Error: Method parse/execution failed [\_SB_.PCI0.SAT0.SPT0._GTF] (Node ffff88041d869a88), AE_NOT_FOUND (20131218/psparse-536)
Jan 23 14:35:53 rescue kernel: [11871.709174] ACPI Error: [DSSP] Namespace lookup failure, AE_NOT_FOUND (20131218/psargs-359)
Jan 23 14:35:53 rescue kernel: [11871.709182] ACPI Error: Method parse/execution failed [\_SB_.PCI0.SAT0.SPT0._GTF] (Node ffff88041d869a88), AE_NOT_FOUND (20131218/psparse-536)
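In the first log excerpt, the last four bytes of the descriptor sense data (ce 1f 0d 58) appear to encode the failing LBA, and they match the sector reported in the end_request line. A minimal sketch of that conversion, assuming bash; the function name `sense_lba_to_sector` is hypothetical:

```shell
#!/usr/bin/env bash
# Convert the LBA bytes from the kernel's sense data dump to a decimal sector number.
sense_lba_to_sector() {
    local hex
    hex=$(printf '%s' "$*" | tr -d ' ')   # join the bytes, e.g. "ce1f0d58"
    echo $((16#$hex))                      # bash base-16 arithmetic
}

sense_lba_to_sector ce 1f 0d 58   # prints 3458141528
```

The result is the same sector number as in "end_request: I/O error, dev sda, sector 3458141528", so the medium error sits on sda, the surviving half of the mirror, not on the new disk.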
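I am able to mount /dev/md5 and list the files, but I cannot add the new partition to the array.
Is there any way to fix this without losing the data on the partition?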
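If not, is it possible to format only that single partition and then add the new drive again? I should have a recent backup of that partition, so that would not be a problem. I would just prefer not to wipe all the partitions, if possible.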
SMART output:
/dev/sda:
smartctl -a /dev/sda
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.14.27] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF INFORMATION SECTION ===
Device Model:     ST3000DM001-1CH166
Serial Number:    Z1F1XJHC
LU WWN Device Id: 5 000c50 04f3fc2c7
Firmware Version: CC24
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Fri Jan 23 16:16:32 2015 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Error SMART Values Read failed: scsi error aborted command
Smartctl: SMART Read Values failed.
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: UNKNOWN!
SMART Status, Attributes and Thresholds cannot be read.
SMART Error Log Version: 1
ATA Error Count: 107 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.
Error 107 occurred at disk power-on lifetime: 13180 hours (549 days + 4 hours)
  When the command that caused the error occurred, the device was active or idle.
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455
  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 ff ff ff ef 00      15:56:49.931  READ DMA EXT
  25 00 08 ff ff ff ef 00      15:56:48.680  READ DMA EXT
  ef 10 02 00 00 00 a0 00      15:56:48.644  SET FEATURES [Reserved for Serial ATA]
  27 00 00 00 00 00 e0 00      15:56:48.644  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00      15:56:48.644  IDENTIFY DEVICE
Error 106 occurred at disk power-on lifetime: 13180 hours (549 days + 4 hours)
  When the command that caused the error occurred, the device was active or idle.
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455
  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 ff ff ff ef 00      15:56:45.363  READ DMA EXT
  25 00 08 ff ff ff ef 00      15:56:44.071  READ DMA EXT
  25 00 08 ff ff ff ef 00      15:56:42.789  READ DMA EXT
  25 00 08 ff ff ff ef 00      15:56:42.755  READ DMA EXT
  25 00 08 ff ff ff ef 00      15:56:42.722  READ DMA EXT
Error 105 occurred at disk power-on lifetime: 13180 hours (549 days + 4 hours)
  When the command that caused the error occurred, the device was active or idle.
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455
  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 ff ff ff ef 00      15:56:15.716  READ DMA EXT
  25 00 08 ff ff ff ef 00      15:56:12.832  READ DMA EXT
  25 00 08 ff ff ff ef 00      15:56:11.540  READ DMA EXT
  25 00 08 ff ff ff ef 00      15:56:10.290  READ DMA EXT
  25 00 08 ff ff ff ef 00      15:56:09.448  READ DMA EXT
Error 104 occurred at disk power-on lifetime: 13180 hours (549 days + 4 hours)
  When the command that caused the error occurred, the device was active or idle.
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455
  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 ff ff ff ef 00      15:56:02.563  READ DMA EXT
  25 00 08 ff ff ff ef 00      15:55:59.655  READ DMA EXT
  25 00 08 ff ff ff ef 00      15:55:58.319  READ DMA EXT
  25 00 08 ff ff ff ef 00      15:55:58.069  READ DMA EXT
  25 00 08 ff ff ff ef 00      15:55:57.838  READ DMA EXT
Error 103 occurred at disk power-on lifetime: 13180 hours (549 days + 4 hours)
  When the command that caused the error occurred, the device was active or idle.
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455
  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 80 ff ff ff ef 00      15:55:51.995  READ DMA EXT
  25 00 08 ff ff ff ef 00      15:55:50.735  READ DMA EXT
  ef 10 02 00 00 00 a0 00      15:55:50.700  SET FEATURES [Reserved for Serial ATA]
  27 00 00 00 00 00 e0 00      15:55:50.700  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00      15:55:50.699  IDENTIFY DEVICE
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      4561         -
# 2  Extended offline    Completed without error       00%      2977         -
# 3  Extended offline    Completed without error       00%         5         -
Device does not support Selective Self Tests/Logging
/dev/sdb:
smartctl -a /dev/sdb
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.14.27] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF INFORMATION SECTION ===
Device Model:     ST33000650NS
Serial Number:    Z295TK0G
LU WWN Device Id: 5 000c50 04f891ded
Firmware Version: 0004
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Size:      512 bytes logical/physical
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Fri Jan 23 16:15:30 2015 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.
General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (  600) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 255) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x10bd) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   078   053   044    Pre-fail  Always       -       70825960
  3 Spin_Up_Time            0x0003   093   093   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       11
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       1
  7 Seek_Error_Rate         0x000f   088   060   030    Pre-fail  Always       -       791126750
  9 Power_On_Hours          0x0032   092   092   000    Old_age   Always       -       7155
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       11
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   090   090   000    Old_age   Always       -       10
188 Command_Timeout         0x0032   100   099   000    Old_age   Always       -       1
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   066   043   045    Old_age   Always   In_the_past 34 (5 173 37 27)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       8
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       11
194 Temperature_Celsius     0x0022   034   057   000    Old_age   Always       -       34 (0 24 0 0)
195 Hardware_ECC_Recovered  0x001a   018   007   000    Old_age   Always       -       70825960
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
SMART Error Log Version: 1
ATA Error Count: 18 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.
Error 18 occurred at disk power-on lifetime: 5559 hours (231 days + 15 hours)
  When the command that caused the error occurred, the device was active or idle.
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: WP at LBA = 0x0fffffff = 268435455
  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 00 18 ff ff ff 4f 00  26d+03:52:28.560  WRITE FPDMA QUEUED
  60 00 00 ff ff ff 4f 00  26d+03:52:28.560  READ FPDMA QUEUED
  60 00 08 ff ff ff 4f 00  26d+03:52:28.559  READ FPDMA QUEUED
  60 00 08 ff ff ff 4f 00  26d+03:52:28.559  READ FPDMA QUEUED
  60 00 08 ff ff ff 4f 00  26d+03:52:28.559  READ FPDMA QUEUED
Error 17 occurred at disk power-on lifetime: 5559 hours (231 days + 15 hours)
  When the command that caused the error occurred, the device was active or idle.
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455
  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 08 ff ff ff 4f 00  26d+03:52:13.471  READ FPDMA QUEUED
  60 00 58 d0 57 44 43 00  26d+03:52:13.471  READ FPDMA QUEUED
  61 00 02 08 90 6d 49 00  26d+03:52:13.471  WRITE FPDMA QUEUED
  ea 00 00 00 00 00 a0 00  26d+03:52:13.470  FLUSH CACHE EXT
  60 00 00 e0 42 20 4e 00  26d+03:52:13.422  READ FPDMA QUEUED
Error 16 occurred at disk power-on lifetime: 5559 hours (231 days + 15 hours)
  When the command that caused the error occurred, the device was active or idle.
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455
  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 00 ff ff ff 4f 00  26d+03:51:56.176  READ FPDMA QUEUED
  60 00 08 ff ff ff 4f 00  26d+03:51:56.176  READ FPDMA QUEUED
  60 00 08 ff ff ff 4f 00  26d+03:51:56.175  READ FPDMA QUEUED
  60 00 00 e0 0d 20 4e 00  26d+03:51:56.116  READ FPDMA QUEUED
  60 00 00 e0 0c 20 4e 00  26d+03:51:56.114  READ FPDMA QUEUED
Error 15 occurred at disk power-on lifetime: 5559 hours (231 days + 15 hours)
  When the command that caused the error occurred, the device was active or idle.
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455
  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 08 50 59 cb 43 00  26d+03:51:24.077  READ FPDMA QUEUED
  60 00 08 ff ff ff 4f 00  26d+03:51:24.077  READ FPDMA QUEUED
  60 00 00 e0 c5 1c 4e 00  26d+03:51:24.076  READ FPDMA QUEUED
  ea 00 00 00 00 00 a0 00  26d+03:51:24.071  FLUSH CACHE EXT
  60 00 08 28 46 c1 43 00  26d+03:51:22.717  READ FPDMA QUEUED
Error 14 occurred at disk power-on lifetime: 5559 hours (231 days + 15 hours)
  When the command that caused the error occurred, the device was active or idle.
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455
  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 00 ff ff ff 4f 00  26d+03:51:02.317  READ FPDMA QUEUED
  61 00 08 ff ff ff 4f 00  26d+03:51:02.317  WRITE FPDMA QUEUED
  ea 00 00 00 00 00 a0 00  26d+03:51:02.316  FLUSH CACHE EXT
  60 00 08 ff ff ff 4f 00  26d+03:51:02.303  READ FPDMA QUEUED
  60 00 08 ff ff ff 4f 00  26d+03:51:02.300  READ FPDMA QUEUED
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      7071         -
# 2  Extended offline    Completed without error       00%      7060         -
# 3  Extended offline    Completed without error       00%      5600         -
# 4  Short offline       Completed without error       00%      2489         -
SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
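It seems to me that the problem lies largely with sda. It is currently the only working half of the mirror, so if it cannot be read, there is no way to copy sda6 onto sdb6 cleanly and resync the mirror. I note that sda has run for some 8,600 hours since it last passed a self-test (13,180 power-on hours now, against 4,561 at its last extended test), so it is hardly surprising that a hardware fault has appeared there as well.
If you can still read the contents of /dev/md5, you have dodged a bullet: it means the unreadable blocks are not inside a file. Back up the contents of that partition, then have sda replaced as well, this time with a reasonably new disk. Once everything is stable, recreate the md5 device and restore from the backup.
Once you have this system back up, make sure you have a cron job that runs a smartctl self-test on both drives at least every month or two; otherwise this is exactly the kind of warning you get that things are going pear-shaped.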
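As a rough illustration of the self-test advice, the sketch below computes how many power-on hours have passed since the newest entry in a drive's self-test log. It is only a sketch under stated assumptions: the function name `hours_since_selftest` is hypothetical, the log layout is assumed to match the smartctl output shown in this post (LifeTime as the second-to-last field of the "# 1" line), and the crontab line in the comment is an example schedule, not a recommendation from the original answer.

```shell
#!/usr/bin/env bash
# Hours elapsed since the newest entry ("# 1") in a smartctl self-test log.
# $1 = current power-on hours; the log text (smartctl -l selftest DEVICE) is read from stdin.
hours_since_selftest() {
    awk -v now="$1" '$1 == "#" && $2 == "1" { print now - $(NF-1); exit }'
}

# Example crontab entry (assumed path and schedule): start a long self-test
# on the first day of every month at 03:00.
#   0 3 1 * * /usr/sbin/smartctl -t long /dev/sda
```

For sda in this post, `smartctl -l selftest /dev/sda | hours_since_selftest 13180` would give 8619, using the power-on lifetime from the error log because the attribute table could not be read.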