Linux

Second drive in RAID 1 keeps failing

  • April 27, 2011

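I have a bit of a problem here. I have an Ubuntu Linux server with 2 SAS drives set up in software RAID 1 (created with mdadm). The RAID runs fine for about a day: I can run cat /proc/mdstat and it shows both disks active and everything healthy. Then, out of the blue, the second disk fails and the array drops into degraded mode.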
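Since the failure shows up 12-24 hours after a rebuild, it may help to check /proc/mdstat from a script rather than by hand. The following is a minimal sketch, assuming the usual /proc/mdstat layout in which a status such as [UU] marks a healthy mirror and [U_] a missing member. The array name md0 and the sample text are hypothetical and not taken from the post.

```python
import re

# Hypothetical /proc/mdstat content for a degraded two-disk RAID 1.
# The array name (md0) and the block count are illustrative only.
SAMPLE_MDSTAT = """\
Personalities : [raid1]
md0 : active raid1 sda1[0] sdb1[1](F)
      35543936 blocks [2/1] [U_]

unused devices: <none>
"""


def degraded_arrays(mdstat_text):
    """Return {array_name: status} for every array with a missing member."""
    degraded = {}
    current = None
    for line in mdstat_text.splitlines():
        # An array section starts with a line such as "md0 : active raid1 ...".
        header = re.match(r"^(md\S+)\s*:", line)
        if header:
            current = header.group(1)
            continue
        # The member status looks like [UU] (healthy) or [U_] (degraded).
        status = re.search(r"\[([U_]+)\]", line)
        if current and status and "_" in status.group(1):
            degraded[current] = status.group(1)
    return degraded


if __name__ == "__main__":
    # On a real system, read the text with open("/proc/mdstat").read().
    print(degraded_arrays(SAMPLE_MDSTAT))
```

Run from cron, a non-empty result would give a timestamp for when the array degraded, which can then be matched against kern.log.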

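I then remove the disk from the RAID set, reboot the server, and re-add the disk to the set. The RAID rebuilds itself without any problems, and I have a healthy RAID 1 again, working with the same disks. Then, once again, after 12-24 hours or so, the second drive fails.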

The hard drives are brand new, so I assume the hardware is fine. Here is the output I was able to capture from kern.log and syslog when the disk failed.

Can anyone interpret this, or does anyone know what might be going on?

Thanks!

kern.log

Feb 28 20:34:55 CSTEP-APPS20 kernel: [    9.180815] sd 2:0:0:0: Attached scsi generic sg1 type 0
Feb 28 20:34:55 CSTEP-APPS20 kernel: [    9.181086] sd 2:0:1:0: Attached scsi generic sg2 type 0
Feb 28 20:34:55 CSTEP-APPS20 kernel: [    9.181376] sd 2:0:1:0: [sdb] 71096640 512-byte logical blocks: (36.4 GB/33.9 GiB)
Feb 28 20:34:55 CSTEP-APPS20 kernel: [    9.182584] sd 2:0:1:0: [sdb] Write Protect is off
Feb 28 20:34:55 CSTEP-APPS20 kernel: [    9.182591] sd 2:0:1:0: [sdb] Mode Sense: cb 00 10 08
Feb 28 20:34:55 CSTEP-APPS20 kernel: [    9.182835] sd 2:0:0:0: [sda] 71096640 512-byte logical blocks: (36.4 GB/33.9 GiB)
Feb 28 20:34:55 CSTEP-APPS20 kernel: [    9.183802] sd 2:0:1:0: [sdb] Write cache: disabled, read cache: enabled, supports DPO and FUA
Feb 28 20:34:55 CSTEP-APPS20 kernel: [    9.185146] sd 2:0:0:0: [sda] Write Protect is off
Feb 28 20:34:55 CSTEP-APPS20 kernel: [    9.185151] sd 2:0:0:0: [sda] Mode Sense: cb 00 10 08
Feb 28 20:34:55 CSTEP-APPS20 kernel: [    9.188191] sd 2:0:0:0: [sda] Write cache: disabled, read cache: enabled, supports DPO and FUA
Feb 28 20:34:55 CSTEP-APPS20 kernel: [    9.191403] sd 2:0:1:0: [sdb] Attached SCSI disk
Feb 28 20:34:55 CSTEP-APPS20 kernel: [    9.299351] sd 2:0:0:0: [sda] Attached SCSI disk
Mar  1 09:01:22 CSTEP-APPS20 kernel: [44807.010040] sd 2:0:1:0: [sdb] CDB: Read(10): 28 00 03 4b c7 88 00 00 10 00
Mar  1 09:01:32 CSTEP-APPS20 kernel: [44817.560056] sd 2:0:1:0: [sdb] CDB: Test Unit Ready: 00 00 00 00 00 00
Mar  1 09:02:03 CSTEP-APPS20 kernel: [44848.470035] sd 2:0:1:0: [sdb] CDB: Read(10): 28 00 03 4b c7 c0 00 00 80 00
Mar  1 09:02:03 CSTEP-APPS20 kernel: [44848.720124] sd 2:0:1:0: [sdb] CDB: Read(10): 28 00 03 4b c7 88 00 00 10 00
Mar  1 09:02:04 CSTEP-APPS20 kernel: [44849.512078] sd 2:0:1:0: [sdb] CDB: Read(10): 28 00 03 4b c7 88 00 00 10 00
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.380147] sd 2:0:1:0: Device offlined - not ready after error recovery
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.380153] sd 2:0:1:0: Device offlined - not ready after error recovery
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.380167] sd 2:0:1:0: rejecting I/O to offline device
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.380285] sd 2:0:1:0: rejecting I/O to offline device
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.380403] sd 2:0:1:0: [sdb] Unhandled error code
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.380407] sd 2:0:1:0: [sdb] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.380416] sd 2:0:1:0: [sdb] CDB: Read(10): 28 00 03 4b c7 88 00 00 10 00
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.380677] sd 2:0:1:0: [sdb] Unhandled error code
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.380680] sd 2:0:1:0: [sdb] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.380684] sd 2:0:1:0: [sdb] CDB: Read(10): 28 00 03 4b c7 c0 00 00 80 00
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.380915] sd 2:0:1:0: rejecting I/O to offline device

and syslog

Mar  1 09:01:43 CSTEP-APPS20 kernel: [44827.860060] mptscsih: ioc0: WARNING - Issuing Reset from mptscsih_IssueTaskMgmt!!
Mar  1 09:01:43 CSTEP-APPS20 kernel: [44827.860070] mptbase: ioc0: Initiating recovery
Mar  1 09:02:03 CSTEP-APPS20 kernel: [44848.470023] mptscsih: ioc0: task abort: SUCCESS (sc=ffff88016197b400)
Mar  1 09:02:03 CSTEP-APPS20 kernel: [44848.470030] mptscsih: ioc0: attempting task abort! (sc=ffff880156fa4c00)
Mar  1 09:02:03 CSTEP-APPS20 kernel: [44848.470035] sd 2:0:1:0: [sdb] CDB: Read(10): 28 00 03 4b c7 c0 00 00 80 00
Mar  1 09:02:03 CSTEP-APPS20 kernel: [44848.470050] mptscsih: ioc0: task abort: SUCCESS (sc=ffff880156fa4c00)
Mar  1 09:02:03 CSTEP-APPS20 kernel: [44848.470073] scsi target2:0:0: Beginning Domain Validation
Mar  1 09:02:03 CSTEP-APPS20 kernel: [44848.720120] mptscsih: ioc0: attempting target reset! (sc=ffff88016197b400)
Mar  1 09:02:03 CSTEP-APPS20 kernel: [44848.720124] sd 2:0:1:0: [sdb] CDB: Read(10): 28 00 03 4b c7 88 00 00 10 00
Mar  1 09:02:04 CSTEP-APPS20 kernel: [44849.262008] mptscsih: ioc0: target reset: SUCCESS (sc=ffff88016197b400)
Mar  1 09:02:04 CSTEP-APPS20 kernel: [44849.512073] mptscsih: ioc0: attempting bus reset! (sc=ffff88016197b400)
Mar  1 09:02:04 CSTEP-APPS20 kernel: [44849.512078] sd 2:0:1:0: [sdb] CDB: Read(10): 28 00 03 4b c7 88 00 00 10 00
Mar  1 09:02:05 CSTEP-APPS20 kernel: [44850.046491] mptscsih: ioc0: bus reset: SUCCESS (sc=ffff88016197b400)
Mar  1 09:02:15 CSTEP-APPS20 kernel: [44860.553909] mptscsih: ioc0: attempting host reset! (sc=ffff88016197b400)
Mar  1 09:02:15 CSTEP-APPS20 kernel: [44860.553915] mptbase: ioc0: Initiating recovery
Mar  1 09:02:35 CSTEP-APPS20 kernel: [44879.870026] mptscsih: ioc0: host reset: SUCCESS (sc=ffff88016197b400)
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.380147] sd 2:0:1:0: Device offlined - not ready after error recovery
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.380153] sd 2:0:1:0: Device offlined - not ready after error recovery
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.380167] sd 2:0:1:0: rejecting I/O to offline device
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.380285] sd 2:0:1:0: rejecting I/O to offline device
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.380403] sd 2:0:1:0: [sdb] Unhandled error code
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.380407] sd 2:0:1:0: [sdb] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.380416] sd 2:0:1:0: [sdb] CDB: Read(10): 28 00 03 4b c7 88 00 00 10 00
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.380429] end_request: I/O error, dev sdb, sector 55297928
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.380562] __ratelimit: 24 callbacks suppressed
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.380566] raid1: sdb1: rescheduling sector 55295880
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.380677] sd 2:0:1:0: [sdb] Unhandled error code
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.380680] sd 2:0:1:0: [sdb] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.380684] sd 2:0:1:0: [sdb] CDB: Read(10): 28 00 03 4b c7 c0 00 00 80 00
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.380695] end_request: I/O error, dev sdb, sector 55297984
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.380817] raid1: sdb1: rescheduling sector 55295936
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.380915] sd 2:0:1:0: rejecting I/O to offline device
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.381019] end_request: I/O error, dev sdb, sector 63983488
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.381142] md: super_written gets error=-5, uptodate=0
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.381146] raid1: Disk failure on sdb1, disabling device.
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.381148] raid1: Operation continuing on 1 devices.
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.398144] scsi target2:0:0: Ending Domain Validation
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.398226] scsi target2:0:0: FAST-160 WIDE SCSI 320.0 MB/s DT IU RTI WRFLOW PCOMP (6.25 ns, offset 127)
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.398295] scsi target2:0:1: Beginning Domain Validation
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.648493] scsi target2:0:1: Domain Validation Initial Inquiry Failed
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.648623] scsi target2:0:1: Ending Domain Validation
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.648691] scsi target2:0:1: asynchronous
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.648760] scsi target2:0:8: Beginning Domain Validation
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.649386] scsi target2:0:8: Ending Domain Validation
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.649458] scsi target2:0:8: asynchronous
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.653384] RAID1 conf printout:
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.653390]  --- wd:1 rd:2
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.653395]  disk 0, wo:0, o:1, dev:sda1
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.653399]  disk 1, wo:1, o:0, dev:sdb1
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.693763] RAID1 conf printout:
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.693767]  --- wd:1 rd:2
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.693771]  disk 0, wo:0, o:1, dev:sda1
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.714266] raid1: sda1: redirecting sector 55295880 to another mirror
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.719943] raid1: sda1: redirecting sector 55295936 to another mirror

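It looks like the device /dev/sdb is going offline. You may have a cabling problem, but most likely it is the disk. Of course, it could also be a conflict between the disk firmware and the controller.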

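I would run the manufacturer's diagnostics on the disks right away. Their being brand new is no reason to rule out a defect. (In fact, because they are brand new, I would suspect them a bit more than disks that have been running for a few months.)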
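To tie the SCSI lines in the log to the I/O errors that md reports, the Read(10) command blocks can be decoded by hand. The sketch below assumes the standard READ(10) layout (opcode 0x28, a 4-byte big-endian logical block address in bytes 2-5, a 2-byte transfer length in bytes 7-8). The 2048-sector offset is inferred from the difference between the sdb and sdb1 sector numbers in the log and is presumably where the sdb1 partition starts; the post does not state it.

```python
def decode_read10(cdb_hex):
    """Decode a SCSI READ(10) CDB given as space-separated hex bytes."""
    cdb = bytes.fromhex(cdb_hex.replace(" ", ""))
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a READ(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")     # bytes 2-5: logical block address
    blocks = int.from_bytes(cdb[7:9], "big")  # bytes 7-8: transfer length
    return lba, blocks


# Assumption: sdb1 starts 2048 sectors into sdb (inferred from the log).
ASSUMED_PARTITION_START = 2048

if __name__ == "__main__":
    # The two CDBs that time out in the log above.
    for cdb in ("28 00 03 4b c7 88 00 00 10 00",
                "28 00 03 4b c7 c0 00 00 80 00"):
        lba, blocks = decode_read10(cdb)
        print(lba, blocks, lba - ASSUMED_PARTITION_START)
    # prints:
    # 55297928 16 55295880
    # 55297984 128 55295936
```

The decoded addresses match the "end_request: I/O error, dev sdb" sectors, and after subtracting the assumed offset they match the "raid1: sdb1: rescheduling sector" values. Only two nearby reads were outstanding when the controller started resetting, and the final result is hostbyte=DID_NO_CONNECT with no sense data logged, which seems consistent with the drive dropping off the bus rather than reporting a specific unreadable sector.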

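I don't see why you assume the drives are fine. Even new drives fail. Heck, in my professional experience, infant mortality is just as common as old-age mortality in hard drives. That is why many shops put their equipment through a burn-in period.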

Swap the drive for a known-good one and see what happens, or at least check the number of bad blocks via SMART or a diagnostic tool.
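One way to get the bad-block count is smartctl from smartmontools, which for SCSI/SAS drives reports an "Elements in grown defect list" line. The sketch below is only an illustration: it assumes smartmontools is installed, that the suspect drive is /dev/sdb, and that the report contains that line. The sample text is hypothetical, not output from the poster's drive.

```python
import re
import subprocess


def grown_defects(smartctl_output):
    """Return the grown defect count from smartctl text, or None if absent."""
    match = re.search(r"Elements in grown defect list:\s*(\d+)", smartctl_output)
    return int(match.group(1)) if match else None


def read_smart(device="/dev/sdb"):
    """Run `smartctl -a` on the device (needs root and smartmontools)."""
    result = subprocess.run(["smartctl", "-a", device],
                            capture_output=True, text=True, check=False)
    return result.stdout


# Hypothetical excerpt; a real report contains many more lines.
SAMPLE = "SMART Health Status: OK\nElements in grown defect list: 3\n"

if __name__ == "__main__":
    print(grown_defects(SAMPLE))  # prints 3
    # On the server itself: print(grown_defects(read_smart("/dev/sdb")))
```

A count that grows between runs would point at the disk itself. A count that stays at zero while the drive keeps going offline would make cabling, the backplane, or the controller look more likely.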

Source: https://serverfault.com/questions/241749