IPMI

Reboots caused by the Supermicro BMC watchdog

  • March 4, 2017

I recently bought a Supermicro X10SLL-F motherboard, which has a built-in BMC (Aspeed AST2400 chip). I want to use the built-in watchdog controller while the server runs Linux (Gentoo Hardened).

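I enabled the watchdog function in the BIOS, then switched the motherboard jumper from hard reset to NMI (this selects the watchdog timeout action; I chose NMI for testing, to avoid reboots). On the software side, I installed the watchdog daemon (sys-apps/watchdog) and added it to the default runlevel. It is configured to ping the watchdog device (/dev/watchdog, which exists) every 10 seconds. The watchdog timeout is set to 250 seconds.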
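
For readers unfamiliar with what "pinging" the device means, here is a minimal sketch of the keepalive loop a watchdog daemon performs. It is not the sys-apps/watchdog implementation; the function name `keepalive` and its parameters are made up for illustration. It relies only on the standard Linux watchdog device behaviour: any write restarts the countdown, and writing the character `V` just before closing asks the driver to stop the timer (the "magic close", which drivers loaded with `nowayout` ignore).

```python
import os
import time


def keepalive(path="/dev/watchdog", interval=10.0, pings=3, disarm=True):
    """Pet a watchdog device `pings` times, `interval` seconds apart.

    Hypothetical helper for illustration only; a real daemon loops forever
    and also runs health checks before each ping.
    """
    fd = os.open(path, os.O_WRONLY)  # opening the device arms the timer
    try:
        for i in range(pings):
            os.write(fd, b"\0")  # any write restarts the countdown
            if i < pings - 1:
                time.sleep(interval)
        if disarm:
            # Magic close: ask the driver to stop the timer before we exit.
            os.write(fd, b"V")
    finally:
        os.close(fd)
```

Running this against the real /dev/watchdog arms the hardware timer, so it should only be tried on a machine that is allowed to reset. With `disarm=False` the timer keeps running after the function returns, which is the same situation as a daemon that has died.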

The tools clearly see the watchdog hardware. With ipmitool (built with openipmi support):

# ipmitool mc watchdog get
Watchdog Timer Use:     SMS/OS (0x44)
Watchdog Timer Is:      Started/Running
Watchdog Timer Actions: Hard Reset (0x01)
Pre-timeout interval:   0 seconds
Timer Expiration Flags: 0x10
Initial Countdown:      254 sec
Present Countdown:      253 sec
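
The hex values in parentheses in the ipmitool output can be decoded by hand. The sketch below assumes the bit layout of the IPMI 2.0 "Get Watchdog Timer" response as I understand it (timer use in bits 2:0 with bit 6 meaning "running", timeout action in bits 2:0 and pre-timeout interrupt in bits 6:4 of the actions byte, one expiration flag per timer use); the function name `decode_watchdog` is hypothetical and the layout should be checked against the specification.

```python
TIMER_USE = {1: "BIOS FRB2", 2: "BIOS/POST", 3: "OS Load", 4: "SMS/OS", 5: "OEM"}
TIMEOUT_ACTION = {0: "No action", 1: "Hard Reset", 2: "Power Down", 3: "Power Cycle"}
PRE_TIMEOUT = {0: "None", 1: "SMI", 2: "NMI / Diagnostic Interrupt",
               3: "Messaging Interrupt"}


def decode_watchdog(timer_use, actions, expiration_flags):
    """Decode three raw bytes of a Get Watchdog Timer response (assumed layout)."""
    return {
        "timer_use": TIMER_USE.get(timer_use & 0x07, "reserved"),
        "running": bool(timer_use & 0x40),
        "dont_log": bool(timer_use & 0x80),
        "timeout_action": TIMEOUT_ACTION.get(actions & 0x07, "reserved"),
        "pre_timeout_interrupt": PRE_TIMEOUT.get((actions >> 4) & 0x07, "reserved"),
        # One flag bit per timer use: bit 1 = FRB2 ... bit 5 = OEM.
        "expired_uses": [name for bit, name in TIMER_USE.items()
                         if expiration_flags & (1 << bit)],
    }


print(decode_watchdog(0x44, 0x01, 0x10))
```

Under that layout, 0x44 reads as "SMS/OS, running", 0x01 as "Hard Reset with no pre-timeout interrupt", and 0x10 as the SMS/OS expiration flag, which agrees with the text ipmitool prints next to each value.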

With FreeIPMI:

# bmc-watchdog --get
Timer Use:                   SMS/OS
Timer:                       Running
Logging:                     Enabled
Timeout Action:              Hard Reset
Pre-Timeout Interrupt:       None
Pre-Timeout Interval:        0 seconds
Timer Use BIOS FRB2 Flag:    Clear
Timer Use BIOS POST Flag:    Clear
Timer Use BIOS OS Load Flag: Clear
Timer Use BIOS SMS/OS Flag:  Set
Timer Use BIOS OEM Flag:     Clear
Initial Countdown:           254 seconds
Current Countdown:           253 seconds
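
To watch the countdown over time instead of reading it by eye, the "Key: Value" output of either tool can be parsed with a few lines. This is a minimal sketch; the helper names `parse_watchdog_output` and `countdown_seconds` are made up, and the sample text is copied from the output above.

```python
import re


def parse_watchdog_output(text):
    """Turn 'Key: Value' lines into a dict, ignoring lines without a colon."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def countdown_seconds(fields):
    """Return the remaining countdown in seconds, or None if it is absent."""
    # bmc-watchdog says "Current Countdown", ipmitool says "Present Countdown".
    for key in ("Current Countdown", "Present Countdown"):
        match = re.match(r"(\d+)", fields.get(key, ""))
        if match:
            return int(match.group(1))
    return None


sample = """Timer Use:                   SMS/OS
Timer:                       Running
Initial Countdown:           254 seconds
Current Countdown:           253 seconds"""
print(countdown_seconds(parse_watchdog_output(sample)))
```

A value that stays close to the initial countdown on every poll means the keepalive pings are reaching the BMC, which is what both tools reported here.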

However, after some time I get the following, even though the tools above keep reporting healthy "current countdown" values:

[  294.107534] Uhhuh. NMI received for unknown reason 21 on CPU 0.
[  294.107998] Do you have a strange power saving mode enabled?
[  294.108437] Dazed and confused, but trying to continue

This is an NMI, apparently caused by a watchdog timeout. Less than a minute later the machine hard-resets.
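
As background on the "unknown reason 21" wording: to the best of my understanding, the x86 kernel prints the NMI status byte read from I/O port 0x61 in hex and only recognises two cause bits in it, SERR (0x80) and IOCHK (0x40). The sketch below mirrors that check under this assumption; `classify_nmi_reason` is a hypothetical helper, not a kernel interface.

```python
NMI_REASON_SERR = 0x80   # PCI system error bit (assumed from the x86 kernel headers)
NMI_REASON_IOCHK = 0x40  # I/O channel check bit (same assumption)


def classify_nmi_reason(reason_hex):
    """Name the known cause bits in an NMI reason byte given as a hex string."""
    reason = int(reason_hex, 16)
    causes = []
    if reason & NMI_REASON_SERR:
        causes.append("SERR")
    if reason & NMI_REASON_IOCHK:
        causes.append("IOCHK")
    return causes or ["unknown"]


print(classify_nmi_reason("21"))
```

If that reading is right, 0x21 has neither bit set, so the kernel cannot attribute the NMI to a source it knows and falls back to the "Dazed and confused" message.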

What is wrong here, and in which direction should I dig?

Edit: IPMI-related kernel messages:

[    0.353090] ipmi message handler version 39.2
[    0.353353] ipmi device interface
[    0.353623] IPMI System Interface driver.
[    0.353898] ipmi_si: probing via ACPI
[    0.354172] ipmi_si 00:08: [io  0x0ca2] regsize 1 spacing 1 irq 0
[    0.354444] ipmi_si: Adding ACPI-specified kcs state machine
[    0.354790] ipmi_si: probing via SMBIOS
[    0.355051] ipmi_si: SMBIOS: io 0xca2 regsize 1 spacing 1 irq 0
[    0.355317] ipmi_si: Adding SMBIOS-specified kcs state machine duplicate interface
[    0.355836] ipmi_si: probing via SPMI
[    0.356095] ipmi_si: SPMI: io 0xca2 regsize 1 spacing 1 irq 0
[    0.356362] ipmi_si: Adding SPMI-specified kcs state machine duplicate interface
[    0.356906] ipmi_si: Trying ACPI-specified kcs state machine at i/o address 0xca2, slave address 0x0, irq 0
[    0.390536] ipmi_si: The BMC does not support clearing the recv irq bit, compensating, but the BMC needs to be fixed.
[    0.418476] ipmi_si 00:08: Found new BMC (man_id: 0x002a7c, prod_id: 0x0801, dev_id: 0x20)
[    0.419004] ipmi_si 00:08: IPMI kcs interface initialized
[    0.419272] IPMI SSIF Interface driver
[    0.420350] IPMI Watchdog: driver initialized
[    0.420635] Copyright (C) 2004 MontaVista Software - IPMI Powerdown via sys_reboot.
[    0.421444] IPMI poweroff: ATCA Detect mfg 0x2A7C prod 0x801
[    0.421710] IPMI poweroff: Found a chassis style poweroff function

Edit: I tried bmc-watchdog configured with "-u 4 -p 2 -a 0 -F -P -L -O -i 300 -e 10". So only the SMS/OS timer is in use, the pre-timeout interrupt is set to NMI, and the timeout action is set to NONE:

# bmc-watchdog --get
Timer Use:                   SMS/OS
Timer:                       Running
Logging:                     Enabled
Timeout Action:              None
Pre-Timeout Interrupt:       NMI / Diagnostic Interrupt
Pre-Timeout Interval:        0 seconds
Timer Use BIOS FRB2 Flag:    Clear
Timer Use BIOS POST Flag:    Clear
Timer Use BIOS OS Load Flag: Clear
Timer Use BIOS SMS/OS Flag:  Set
Timer Use BIOS OEM Flag:     Clear
Initial Countdown:           300 seconds
Current Countdown:           290 seconds

But this changed nothing at all.

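Edit. Additionally, when I trigger the watchdog timer by echoing \0x00 to /dev/watchdog and then leave it untouched, the system correctly reboots after the default 10-second timeout. So the watchdog itself works well, yet the system still reboots exactly 350 seconds after boot.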

Edit. I checked the BMC System Event Log (SEL) and found this after the reboot:

Sensor #202 | Watchdog 2 | Assertion Event | Timer interrupt ; Timer use at expiration = SMS/OS ; Interrupt type = none
Sensor #202 | Watchdog 2 | Assertion Event | Timer expired, status only ; Timer use at expiration = SMS/OS ; Interrupt type = none

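The interesting part is that the event is marked "status only", and yet the system reboots anyway. When I deliberately trigger a watchdog timeout, the log is different: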

Sensor #202 | Watchdog 2 | Assertion Event | Timer interrupt ; Timer use at expiration = SMS/OS ; Interrupt type = none
Sensor #202 | Watchdog 2 | Assertion Event | Hard Reset ; Timer use at expiration = SMS/OS ; Interrupt type = none
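
The difference between the two pairs of SEL entries is easier to spot once each line is split into fields. A minimal sketch, assuming the "|" and ";" separated layout shown above (other SEL viewers format entries differently); `parse_sel_line` is a hypothetical helper.

```python
def parse_sel_line(line):
    """Split one SEL line of the form shown above into a dict."""
    sensor, name, event_type, detail = [p.strip() for p in line.split("|")][:4]
    items = [d.strip() for d in detail.split(";")]
    record = {"sensor": sensor, "name": name,
              "event_type": event_type, "event": items[0]}
    for item in items[1:]:
        key, sep, value = item.partition("=")
        if sep:
            record[key.strip()] = value.strip()
    return record


unexpected = parse_sel_line(
    "Sensor #202 | Watchdog 2 | Assertion Event | Timer expired, status only ; "
    "Timer use at expiration = SMS/OS ; Interrupt type = none")
deliberate = parse_sel_line(
    "Sensor #202 | Watchdog 2 | Assertion Event | Hard Reset ; "
    "Timer use at expiration = SMS/OS ; Interrupt type = none")
print(unexpected["event"], "vs", deliberate["event"])
```

Both records carry the same timer use and interrupt type; only the `event` field differs, so the BMC itself logged the unexpected expiration as one that should not have reset the machine.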

Finally, I found a somewhat odd solution: leave the watchdog jumper (JWD1) open, with neither NMI nor hard reset selected. The watchdog stays enabled in the BIOS setup.

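In this configuration the watchdog works as expected: the system stayed up for 25 minutes with bmc-watchdog running, and it rebooted after the watchdog daemon was killed.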

Source: https://serverfault.com/questions/695650