Linux
mdadm stops rebuilding a RAID5 array at 99.9%
I recently installed three new disks in my QNAP TS-412 NAS.

These three new disks were to be combined with the already-present disk into a 4-disk RAID5 array, so I started the migration process.

After several attempts (each taking roughly 24 hours), the migration appeared to succeed, but it left the NAS unresponsive.

At that point I reset the NAS. Everything went downhill from there:

- The NAS booted, but marked the first disk as failed and removed it from all arrays, crippling them.
- I ran checks on that disk but found nothing wrong with it (which would have been odd anyway, since it is nearly new).
- The management interface offers no recovery options, so I figured I would have to do it by hand.
Using mdadm I have already successfully rebuilt all of the QNAP-internal RAID1 arrays (/dev/md4, /dev/md13 and /dev/md9), leaving only the RAID5 array, /dev/md0. I have now tried it several times, using these commands:

mdadm -w /dev/md0

(Needed because the NAS mounts the array read-only after /dev/sda3 is removed from it; the array cannot be modified in RO mode.)

mdadm /dev/md0 --re-add /dev/sda3

After that the array starts rebuilding, but it stalls at 99.9% while the system is extremely slow and/or unresponsive. (Logging in over SSH fails most of the time.)
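To tell whether the recovery is genuinely stuck rather than merely slow, the md driver's sysfs counters can be polled alongside the kernel log. A small sketch (paths assume the stock md sysfs layout):

# What md thinks it is doing; should read "recover" while rebuilding:
cat /sys/block/md0/md/sync_action
# Exact recovery position, reported as "sectors done / sectors total":
cat /sys/block/md0/md/sync_completed
# Recent kernel messages often show the I/O errors behind a stall:
dmesg | tail -n 30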
The current state of things:
[admin@nas01 ~]# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md4 : active raid1 sdd2[2](S) sdc2[1] sdb2[0]
      530048 blocks [2/2] [UU]

md0 : active raid5 sda3[4] sdd3[3] sdc3[2] sdb3[1]
      8786092608 blocks super 1.0 level 5, 64k chunk, algorithm 2 [4/3] [_UUU]
      [===================>.]  recovery = 99.9% (2928697160/2928697536) finish=0.0min speed=110K/sec

md13 : active raid1 sda4[0] sdb4[1] sdd4[3] sdc4[2]
      458880 blocks [4/4] [UUUU]
      bitmap: 0/57 pages [0KB], 4KB chunk

md9 : active raid1 sda1[0] sdd1[3] sdc1[2] sdb1[1]
      530048 blocks [4/4] [UUUU]
      bitmap: 2/65 pages [8KB], 4KB chunk

unused devices: <none>
(It has been sitting at 2928697160/2928697536 for a few hours now.)

[admin@nas01 ~]# mdadm -D /dev/md0
/dev/md0:
        Version : 01.00.03
  Creation Time : Thu Jan 10 23:35:00 2013
     Raid Level : raid5
     Array Size : 8786092608 (8379.07 GiB 8996.96 GB)
  Used Dev Size : 2928697536 (2793.02 GiB 2998.99 GB)
   Raid Devices : 4
  Total Devices : 4
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Mon Jan 14 09:54:51 2013
          State : clean, degraded, recovering
 Active Devices : 3
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 1

         Layout : left-symmetric
     Chunk Size : 64K

 Rebuild Status : 99% complete

           Name : 3
           UUID : 0c43bf7b:282339e8:6c730d6b:98bc3b95
         Events : 34111

    Number   Major   Minor   RaidDevice State
       4       8        3        0      spare rebuilding   /dev/sda3
       1       8       19        1      active sync   /dev/sdb3
       2       8       35        2      active sync   /dev/sdc3
       3       8       51        3      active sync   /dev/sdd3
Inspecting /mnt/HDA_ROOT/.logs/kmsg shows that the actual problem appears to be with /dev/sdb3:

<6>[71052.730000] sd 3:0:0:0: [sdb] Unhandled sense code
<6>[71052.730000] sd 3:0:0:0: [sdb] Result: hostbyte=0x00 driverbyte=0x08
<6>[71052.730000] sd 3:0:0:0: [sdb] Sense Key : 0x3 [current] [descriptor]
<4>[71052.730000] Descriptor sense data with sense descriptors (in hex):
<6>[71052.730000]        72 03 00 00 00 00 00 0c 00 0a 80 00 00 00 00 01
<6>[71052.730000]        5d 3e d9 c8
<6>[71052.730000] sd 3:0:0:0: [sdb] ASC=0x0 ASCQ=0x0
<6>[71052.730000] sd 3:0:0:0: [sdb] CDB: cdb[0]=0x88: 88 00 00 00 00 01 5d 3e d9 c8 00 00 00 c0 00 00
<3>[71052.730000] end_request: I/O error, dev sdb, sector 5859367368
<4>[71052.730000] raid5_end_read_request: 27 callbacks suppressed
<4>[71052.730000] raid5:md0: read error not correctable (sector 5857246784 on sdb3).
<4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
<4>[71052.730000] raid5:md0: read error not correctable (sector 5857246792 on sdb3).
<4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
<4>[71052.730000] raid5:md0: read error not correctable (sector 5857246800 on sdb3).
<4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
<4>[71052.730000] raid5:md0: read error not correctable (sector 5857246808 on sdb3).
<4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
<4>[71052.730000] raid5:md0: read error not correctable (sector 5857246816 on sdb3).
<4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
<4>[71052.730000] raid5:md0: read error not correctable (sector 5857246824 on sdb3).
<4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
<4>[71052.730000] raid5:md0: read error not correctable (sector 5857246832 on sdb3).
<4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
<4>[71052.730000] raid5:md0: read error not correctable (sector 5857246840 on sdb3).
<4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
<4>[71052.730000] raid5:md0: read error not correctable (sector 5857246848 on sdb3).
<4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
<4>[71052.730000] raid5:md0: read error not correctable (sector 5857246856 on sdb3).
<4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
<4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
<4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
<4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
<4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
<4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
<4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
<4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
<4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
<4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
<4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
<4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
<4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
<4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
<4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
The above sequence repeats at a steady rate for various (random?) sectors in the 585724XXXX range. My questions are:

- Why does it halt so close to the end, while still using so many resources that the system grinds to a halt (the md0_raid5 and md0_resync processes are still running)?
- Is there any way to see what is causing it to fail/stall? <- Presumably the sdb3 read errors (a quick way to confirm that is sketched after this list).
- How can I get the operation to complete without losing the 3 TB of data? (For example, by skipping the troublesome sectors on sdb3 while keeping the rest of the data intact?)
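One way to confirm that sdb really is the culprit is to query its SMART counters and try reading the reported sector directly. This is only a sketch: smartctl (from smartmontools) may not ship with the QNAP firmware and might have to be run elsewhere with the disk attached, and the sector number below is the one taken from the kmsg excerpt above.

# Look for pending/reallocated/uncorrectable sector counts on the suspect disk:
smartctl -a /dev/sdb | grep -i -E 'pending|reallocat|uncorrect'

# Try to read around the sector reported in kmsg (5859367368, 512-byte sectors).
# A failing read here confirms a media error on sdb itself:
dd if=/dev/sdb of=/dev/null bs=512 skip=5859367368 count=8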
It probably halts just before completion because it needs some status back from the failing disk that it never gets.

All of your data is (or should be) intact anyway, just spread over 3 of the 4 disks.

You said the NAS kicks the failing disk out of the array, so the array should still be running, albeit in degraded mode.

Can you mount it?
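For instance, mounting it read-only lets you check the data without writing anything to the degraded array (the mount point below is just a placeholder; any empty directory works):

mkdir -p /mnt/md0_ro                # hypothetical mount point
mount -o ro /dev/md0 /mnt/md0_ro    # read-only, so nothing on the degraded array changes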
You can force the array to run by doing the following:

- Print out the array's details: mdadm -D /dev/md0
- Stop the array: mdadm --stop /dev/md0
- Re-create the array and force md to accept it: mdadm -C /dev/md0 --assume-clean /dev/sd[abcd]3 (with the original level, chunk size, metadata version and device order; see the sketch at the end)
That last step is perfectly safe as long as:

- you don't write to the array, and
- you use exactly the same creation parameters as before.

That last flag (--assume-clean) is what prevents a rebuild and skips any integrity check.

You should then be able to mount the array and recover your data.
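To make "the exact same creation parameters" concrete, here is a sketch of what the re-create could look like for this particular array, with every parameter taken from the mdadm -D output above (metadata 1.0, RAID5, 4 devices, 64K chunk, left-symmetric layout, device order sda3 sdb3 sdc3 sdd3). Treat it as an illustration, not a prescription, and double-check each value against your own -D output before running anything; --assume-clean plus matching parameters is what makes this non-destructive.

# Stop the stuck array first:
mdadm --stop /dev/md0

# Re-create with the parameters reported by mdadm -D, in the original device order:
mdadm -C /dev/md0 --assume-clean \
      --metadata=1.0 --level=5 --raid-devices=4 \
      --chunk=64 --layout=left-symmetric \
      /dev/sda3 /dev/sdb3 /dev/sdc3 /dev/sdd3

If any of these values are in doubt, mdadm -E /dev/sdb3 (or any other member partition) prints the on-disk superblock and can be used to cross-check the chunk size, layout and device roles before the old metadata is overwritten.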