Linux

mdadm stops rebuilding a RAID5 array at 99.9%

  • January 14, 2013

I recently installed three new disks in my QNAP TS-412 NAS.

These three new disks were to be combined with the one already present into a 4-disk RAID5 array, so I started the migration process.

After several attempts (each taking roughly 24 hours), the migration appeared to succeed, but it left the NAS unresponsive.

At that point I reset the NAS. Everything went downhill from there:

  • The NAS started, but marked the first disk as failed and removed it from all arrays, leaving them crippled.
  • I ran checks on the disk but found nothing wrong (which would have been odd anyway, since it is almost new); a sketch of how one might check it follows this list.
  • The management interface offered no recovery options, so I figured I would have to do it by hand.
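For reference, such a disk check might look roughly like this, assuming smartmontools is available on the NAS (the device name /dev/sda is the disk in question):

# Overall health verdict from the drive's own SMART self-assessment
smartctl -H /dev/sda

# Full attribute dump; watch Reallocated_Sector_Ct and Current_Pending_Sector
smartctl -a /dev/sda

# Optionally run a short self-test and read the result a few minutes later
smartctl -t short /dev/sda
smartctl -l selftest /dev/sda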

I have since successfully rebuilt all of QNAP's internal RAID1 arrays (being /dev/md4, /dev/md13 and /dev/md9) using mdadm, leaving only the RAID5 array, /dev/md0.

I have now tried this several times, using these commands:

mdadm -w /dev/md0

(Required because the array is mounted read-only by the NAS after /dev/sda3 was removed from it; the array cannot be modified in RO mode.)

mdadm /dev/md0 --re-add /dev/sda3

After which the array starts rebuilding. But it stalls at 99.9%, with the system being very slow and/or unresponsive (logging in over SSH fails most of the time). A way to keep an eye on the rebuild is sketched below.
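One way to watch the rebuild and rule out the kernel's own rate limiting as the cause of the crawl (these are standard md tunables, nothing QNAP-specific is assumed):

# Refresh the rebuild status every few seconds
watch -n 5 cat /proc/mdstat

# The kernel throttles resync between these two bounds (in KB/s);
# a 110K/sec crawl at 99.9% is far below even the usual minimum
cat /proc/sys/dev/raid/speed_limit_min
cat /proc/sys/dev/raid/speed_limit_max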

The current state of affairs:

[admin@nas01 ~]# cat /proc/mdstat                            
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] 
md4 : active raid1 sdd2[2](S) sdc2[1] sdb2[0]
     530048 blocks [2/2] [UU]

md0 : active raid5 sda3[4] sdd3[3] sdc3[2] sdb3[1]
     8786092608 blocks super 1.0 level 5, 64k chunk, algorithm 2 [4/3] [_UUU]
     [===================>.]  recovery = 99.9% (2928697160/2928697536) finish=0.0min speed=110K/sec

md13 : active raid1 sda4[0] sdb4[1] sdd4[3] sdc4[2]
     458880 blocks [4/4] [UUUU]
     bitmap: 0/57 pages [0KB], 4KB chunk

md9 : active raid1 sda1[0] sdd1[3] sdc1[2] sdb1[1]
     530048 blocks [4/4] [UUUU]
     bitmap: 2/65 pages [8KB], 4KB chunk

unused devices: <none>

(It has now been stuck at 2928697160/2928697536 for several hours.)

[admin@nas01 ~]# mdadm -D /dev/md0
/dev/md0:
       Version : 01.00.03
 Creation Time : Thu Jan 10 23:35:00 2013
    Raid Level : raid5
    Array Size : 8786092608 (8379.07 GiB 8996.96 GB)
 Used Dev Size : 2928697536 (2793.02 GiB 2998.99 GB)
  Raid Devices : 4
 Total Devices : 4
Preferred Minor : 0
   Persistence : Superblock is persistent

   Update Time : Mon Jan 14 09:54:51 2013
         State : clean, degraded, recovering
Active Devices : 3
Working Devices : 4
Failed Devices : 0
 Spare Devices : 1

        Layout : left-symmetric
    Chunk Size : 64K

Rebuild Status : 99% complete

          Name : 3
          UUID : 0c43bf7b:282339e8:6c730d6b:98bc3b95
        Events : 34111

   Number   Major   Minor   RaidDevice State
      4       8        3        0      spare rebuilding   /dev/sda3
      1       8       19        1      active sync   /dev/sdb3
      2       8       35        2      active sync   /dev/sdc3
      3       8       51        3      active sync   /dev/sdd3

Checking /mnt/HDA_ROOT/.logs/kmsg reveals that the actual problem seems to be with /dev/sdb3:

<6>[71052.730000] sd 3:0:0:0: [sdb] Unhandled sense code
<6>[71052.730000] sd 3:0:0:0: [sdb] Result: hostbyte=0x00 driverbyte=0x08
<6>[71052.730000] sd 3:0:0:0: [sdb] Sense Key : 0x3 [current] [descriptor]
<4>[71052.730000] Descriptor sense data with sense descriptors (in hex):
<6>[71052.730000]         72 03 00 00 00 00 00 0c 00 0a 80 00 00 00 00 01 
<6>[71052.730000]         5d 3e d9 c8 
<6>[71052.730000] sd 3:0:0:0: [sdb] ASC=0x0 ASCQ=0x0
<6>[71052.730000] sd 3:0:0:0: [sdb] CDB: cdb[0]=0x88: 88 00 00 00 00 01 5d 3e d9 c8 00 00 00 c0 00 00
<3>[71052.730000] end_request: I/O error, dev sdb, sector 5859367368
<4>[71052.730000] raid5_end_read_request: 27 callbacks suppressed
<4>[71052.730000] raid5:md0: read error not correctable (sector 5857246784 on sdb3).
<4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
<4>[71052.730000] raid5:md0: read error not correctable (sector 5857246792 on sdb3).
<4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
<4>[71052.730000] raid5:md0: read error not correctable (sector 5857246800 on sdb3).
<4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
<4>[71052.730000] raid5:md0: read error not correctable (sector 5857246808 on sdb3).
<4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
<4>[71052.730000] raid5:md0: read error not correctable (sector 5857246816 on sdb3).
<4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
<4>[71052.730000] raid5:md0: read error not correctable (sector 5857246824 on sdb3).
<4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
<4>[71052.730000] raid5:md0: read error not correctable (sector 5857246832 on sdb3).
<4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
<4>[71052.730000] raid5:md0: read error not correctable (sector 5857246840 on sdb3).
<4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
<4>[71052.730000] raid5:md0: read error not correctable (sector 5857246848 on sdb3).
<4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
<4>[71052.730000] raid5:md0: read error not correctable (sector 5857246856 on sdb3).
<4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
<4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
<4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
<4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
<4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
<4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
<4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
<4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
<4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
<4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
<4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
<4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
<4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
<4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
<4>[71052.730000] raid5: some error occurred in a active device:1 of md0.

The above sequence repeats at a steady rate for various (random?) sectors in the 585724XXXX range.
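As a sanity check, the raw-disk sector in the end_request line and the partition-relative sector in the raid5 line should differ by exactly the start offset of sdb3; assuming sysfs is available, that can be verified like this:

# Start sector of sdb3 on the raw disk
cat /sys/block/sdb/sdb3/start
# -> should print 2120584, since 5859367368 - 5857246784 = 2120584

# Translate the raid5 error sector back to a raw-disk sector
echo $((5857246784 + 2120584))
# -> 5859367368, matching the end_request line above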

My questions:

  • Why does it halt just before finishing, while still consuming so many resources that the system stalls (the md0_raid5 and md0_resync processes are still running)?
  • Is there any way to see what is causing it to fail/stall? <- Presumably the sdb3 errors.
  • How can I complete the operation without losing the 3TB of data? (E.g. by skipping the troublesome sectors on sdb3 while keeping the intact data?)

It probably stops before finishing because it needs some kind of status back from the failing disk, and it isn't getting one.

In any case, all of your data is (or should be) intact on just 3 of the 4 disks.

You said it kicks the failing disk out of the array, so it should still be running, albeit in degraded mode.

Can you mount it? (A read-only attempt is sketched below.)
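A minimal sketch of such a read-only attempt, assembling a degraded array from the three healthy members (per the mdadm -D output above, those are sdb3, sdc3 and sdd3; the mount point /mnt/raid is a placeholder):

# Stop the half-rebuilt array first
mdadm --stop /dev/md0

# Assemble from the three intact members; --run starts it despite the missing disk
mdadm --assemble --run /dev/md0 /dev/sdb3 /dev/sdc3 /dev/sdd3

# Mount read-only so nothing is written while the array is fragile
mkdir -p /mnt/raid
mount -o ro /dev/md0 /mnt/raid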

You can force the array to run by doing the following:

  • Print out the array details: mdadm -D /dev/md0
  • Stop the array: mdadm --stop /dev/md0
  • Recreate the array, forcing md to accept it: mdadm -C /dev/md0 --assume-clean /dev/sd[abcd]3 (a fuller sketch with explicit parameters follows this list)
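Spelled out with the parameters reported by mdadm -D above (metadata 1.0, level 5, 4 devices, 64K chunk, left-symmetric layout, device order matching the RaidDevice numbers), the recreate might look like this; treat it as a sketch and double-check every value against your own -D output first, since any mismatch here scrambles the data layout:

mdadm -C /dev/md0 -e 1.0 -l 5 -n 4 -c 64 \
      --layout=left-symmetric --assume-clean \
      /dev/sda3 /dev/sdb3 /dev/sdc3 /dev/sdd3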

The latter step is completely safe, as long as:

  • you do not write to the array, and
  • you use exactly the same creation parameters as before.

That last flag, --assume-clean, prevents the rebuild from starting and skips any integrity testing.

You should then be able to mount it and recover your data.
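Before mounting, a read-only filesystem check is a cheap safeguard; this assumes the volume is ext3/ext4 (typical for QNAP firmware of that era), and -n makes no changes:

# -n answers "no" to every repair prompt, so the check is non-destructive
e2fsck -n /dev/md0

# If that looks sane, mount read-only (same /mnt/raid placeholder as above)
mount -o ro /dev/md0 /mnt/raid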

Quoted from: https://serverfault.com/questions/468804