Debian

HP DL360 G7 P410i controller troubleshooting

  • February 17, 2014

The server is an HP DL360 G7 with a P410i disk controller, 2x E5620 CPUs, and 16 GB of RAM. Linux mysql 2.6.32-5-amd64 #1 SMP Mon Feb 25 00:26:11 UTC 2013 x86_64 GNU/Linux (Debian 6.0.7)

hpacucli "ctrl all show status"

Smart Array P410i in Slot 0 (Embedded)
  Controller Status: OK
  Cache Status: OK
  Battery/Capacitor Status: OK

hpacucli "ctrl all show config"

Smart Array P410i in Slot 0 (Embedded)    (sn: 5001438014555B80)

  array A (SAS, Unused Space: 0 MB)


     logicaldrive 1 (136.7 GB, RAID 1+0, OK)

     physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 72 GB, OK)
     physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS, 72 GB, OK)
     physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SAS, 72 GB, OK)
     physicaldrive 1I:1:4 (port 1I:box 1:bay 4, SAS, 72 GB, OK)

  SEP (Vendor ID PMCSIERA, Model  SRC 8x6G) 250 (WWID: 5001438014555B8F)

hpacucli "ctrl slot=0 ld all show"

Smart Array P410i in Slot 0 (Embedded)

  array A

     logicaldrive 1 (136.7 GB, RAID 1+0, OK)
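The status check above can be scripted so that a degraded controller, cache, or battery stands out. The following is only a sketch: the function name `check_status` is made up for illustration, and it assumes the `Name Status: Value` line layout shown in the output above.

```shell
#!/bin/bash
# Hypothetical helper: reads the text of `hpacucli ctrl all show status`
# on stdin and prints every "... Status: ..." line whose value is not "OK".
# Exit code: 0 = all OK, 1 = at least one non-OK line, 2 = no status lines seen.
check_status() {
    awk -F': ' '
        / Status: / {
            seen++
            v = $2
            sub(/[ \t\r]+$/, "", v)      # ignore trailing whitespace
            if (v != "OK") { print; bad++ }
        }
        END {
            if (seen == 0) { print "no status lines found"; exit 2 }
            exit (bad > 0)
        }'
}

# Usage (needs root and hpacucli installed):
#   hpacucli ctrl all show status | check_status && echo "controller healthy"
```

With the output shown above, this would print nothing and return 0.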

I ran the following script overnight:

#!/bin/bash
# Disk stress test: write 55 copies of the ISO, delete them, repeat 200 times.
mkdir -p /isotest
for i in {1..200}; do
   for j in {1..55}; do cp -v /root/ubuntu.iso "/isotest/ubuntu.iso${j}"; done
   rm /isotest/ubuntu.iso*
done

/root/ubuntu.iso is about 2 GB in size.
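The script above only exercises writes and never checks what was written. A variant that also compares checksums might catch silent corruption. This is a sketch, not what was actually run: `copy_and_verify` and the `copyN` file names are hypothetical. It also assumes that reading a copy back right away is a useful check, although such a read may be served from the page cache rather than from disk.

```shell
#!/bin/bash
# Hypothetical helper: copy SRC into DESTDIR COUNT times and compare the
# SHA-256 of each copy against the source.
# Prints each mismatching file. Returns 0 if all copies match,
# 1 if any differ, 2 on a setup or copy error.
copy_and_verify() {
    local src=$1 destdir=$2 count=$3
    local want got failed=0 j
    [ -r "$src" ] || return 2
    mkdir -p "$destdir" || return 2
    want=$(sha256sum < "$src" | cut -d' ' -f1)
    for ((j = 1; j <= count; j++)); do
        cp "$src" "$destdir/copy$j" || return 2
        got=$(sha256sum < "$destdir/copy$j" | cut -d' ' -f1)
        if [ "$got" != "$want" ]; then
            echo "MISMATCH: $destdir/copy$j"
            failed=1
        fi
    done
    return "$failed"
}

# Usage, mirroring the inner loop of the script above:
#   copy_and_verify /root/ubuntu.iso /isotest 55
```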

There are some errors in syslog. I think they are related to the disk controller:

Mar 28 06:59:17 mysql kernel: [850337.524306] INFO: task mandb:25565 blocked for more than 120 seconds.
Mar 28 06:59:17 mysql kernel: [850337.524337] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 28 06:59:17 mysql kernel: [850337.524381] mandb         D ffff88022740fa20     0 25565  25197 0x00000000
Mar 28 06:59:17 mysql kernel: [850337.524385]  ffff88041ec4b880 0000000000000082 0000000000000000 000000009d778d11
Mar 28 06:59:17 mysql kernel: [850337.524388]  ffffea000defe260 ffffea000defe260 000000000000f9e0 ffff88014d913fd8
Mar 28 06:59:17 mysql kernel: [850337.524390]  00000000000157c0 00000000000157c0 ffff88013228a350 ffff88013228a648
Mar 28 06:59:17 mysql kernel: [850337.524393] Call Trace:
Mar 28 06:59:17 mysql kernel: [850337.524404]  [<ffffffff810168ec>] ? read_tsc+0xa/0x20
Mar 28 06:59:17 mysql kernel: [850337.524408]  [<ffffffff8106bdca>] ? timekeeping_get_ns+0xe/0x2e
Mar 28 06:59:17 mysql kernel: [850337.524412]  [<ffffffff810b4761>] ? sync_page+0x0/0x46
Mar 28 06:59:17 mysql kernel: [850337.524416]  [<ffffffff812fc8f2>] ? io_schedule+0x73/0xb7
Mar 28 06:59:17 mysql kernel: [850337.524418]  [<ffffffff810b47a2>] ? sync_page+0x41/0x46
Mar 28 06:59:17 mysql kernel: [850337.524421]  [<ffffffff812fcd02>] ? __wait_on_bit_lock+0x3f/0x84
Mar 28 06:59:17 mysql kernel: [850337.524423]  [<ffffffff810b472e>] ? __lock_page+0x5d/0x63
Mar 28 06:59:17 mysql kernel: [850337.524426]  [<ffffffff810652e0>] ? wake_bit_function+0x0/0x23
Mar 28 06:59:17 mysql kernel: [850337.524428]  [<ffffffff810b473d>] ? lock_page+0x9/0x1f
Mar 28 06:59:17 mysql kernel: [850337.524431]  [<ffffffff810b4853>] ? find_lock_page+0x25/0x45
Mar 28 06:59:17 mysql kernel: [850337.524433]  [<ffffffff810b4e63>] ? filemap_fault+0x1a5/0x2f6
Mar 28 06:59:17 mysql kernel: [850337.524438]  [<ffffffff810cadf2>] ? __do_fault+0x54/0x3c3
Mar 28 06:59:17 mysql kernel: [850337.524455]  [<ffffffffa01702d2>] ? __ext3_journal_stop+0x1f/0x3d [ext3]
Mar 28 06:59:17 mysql kernel: [850337.524458]  [<ffffffff810cd146>] ? handle_mm_fault+0x3b8/0x80f
Mar 28 06:59:17 mysql kernel: [850337.524461]  [<ffffffff81101d8e>] ? notify_change+0x2b3/0x2c5
Mar 28 06:59:17 mysql kernel: [850337.524464]  [<ffffffff81103eb5>] ? mntput_no_expire+0x23/0xee
Mar 28 06:59:17 mysql kernel: [850337.524467]  [<ffffffff81300096>] ? do_page_fault+0x2e0/0x2fc
Mar 28 06:59:17 mysql kernel: [850337.524469]  [<ffffffff812fdf35>] ? page_fault+0x25/0x30

There are no other error messages.
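To see how often this happens and which processes are affected, the "blocked for more than" lines can be counted per task name. This is a rough sketch: `hung_task_summary` is a made-up name, and it assumes the kernel message format shown in the trace above.

```shell
#!/bin/bash
# Hypothetical helper: reads syslog lines on stdin and prints
# "<count> <task name>" for every task the kernel reported as hung,
# most frequent first.
hung_task_summary() {
    awk '
        match($0, /INFO: task [^:]+:[0-9]+ blocked for more than/) {
            # Skip the 11-character prefix "INFO: task " to reach the name.
            s = substr($0, RSTART + 11, RLENGTH - 11)
            sub(/:.*/, "", s)            # drop ":<pid> blocked ..."
            count[s]++
        }
        END { for (t in count) print count[t], t }' | sort -rn
}

# Usage:
#   hung_task_summary < /var/log/syslog
```

If many unrelated tasks show up, that suggests a general I/O stall rather than a problem with one program.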

Or could this error be related to memory? I have run memtest86+ on this server for several days without any errors.

When the server was in the data center, I could not boot it. It kept showing this error:

Fatal PCI Express Device Error PCI ? B00/D00/F00

After it was shipped to my workplace, it booted normally. The ILO event log contains the following errors:

Uncorrectable PCI Express Error (Embedded device, Bus 0, Device 0, Function 0, Error status 0x00000000)
Uncorrectable Memory Error ((Processor 1, Memory Module 2))
Uncorrectable Memory Error ((Processor 1, Memory Module 3))
An Unrecoverable System Error (NMI) has occurred (System error code 0x00000000, 0x00000000)

I have updated the BIOS, disk controller, and drive firmware to the latest versions.

You have bad RAM or a system board issue. I suspect a system board fault, because the Smart Array P410 controller is onboard.

The ILO messages are pretty specific. What do you see if you look at hplog -v? That is the system's IML log.

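For now, I would reseat all components and see whether you can get the system to boot in a minimal configuration: one CPU and the minimum number of DIMMs installed.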

You can also download the bootable HP SmartStart .ISO and mount it through ILO to run a diagnostics loop.

This is a G7 ProLiant, so the server should still be under standard warranty. Call HP.

Source: https://serverfault.com/questions/493153