Linux

Machine check events logged

  • May 15, 2019

The following error appeared in /var/log/messages:

Sep 19 13:18:15 wdc kernel: [2772302.630416] Machine check events logged

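Shortly afterwards, the entire server became unresponsive. This is from the Dom0 log of a Xen server (running the latest version on Debian Squeeze).

Can anyone shed light on what this error means? Should I order new hardware?

Edit: Also, the message seems to imply that something was logged. Where can I find it?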

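For more information, check the log file below, which should contain a detailed description of the problem. Whether this file exists depends on how logging is configured in /etc/mcelog/mcelog.conf.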

/var/log/mcelog

Or simply run the command:

mcelog
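To see how often the kernel has reported these events, the system log can be scanned for the notice quoted in the question. The following is a minimal Python sketch, not part of mcelog itself: the helper names `find_mce_notices` and `read_lines` are hypothetical, and the paths /var/log/messages and /var/log/mcelog are simply the ones mentioned in this thread, which may differ or be unreadable without root on your system.

```python
import re
from pathlib import Path

# Matches the kernel notice quoted in the question.
MCE_NOTICE = re.compile(r"Machine check events logged")


def find_mce_notices(lines):
    """Return the log lines that contain the kernel's MCE notice."""
    return [line.rstrip("\n") for line in lines if MCE_NOTICE.search(line)]


def read_lines(path):
    """Read a log file as a list of lines; return [] if it cannot be read."""
    try:
        return Path(path).read_text(errors="replace").splitlines()
    except OSError:
        # Missing file, permission denied, and similar cases.
        return []


if __name__ == "__main__":
    # Paths are assumptions taken from this thread; adjust for your distribution.
    notices = find_mce_notices(read_lines("/var/log/messages"))
    print(f"{len(notices)} MCE notice(s) found in /var/log/messages")
    for line in notices:
        print("  " + line)

    decoded = read_lines("/var/log/mcelog")
    if decoded:
        print(f"/var/log/mcelog has {len(decoded)} line(s) of decoded output")
    else:
        print("/var/log/mcelog is missing, empty, or unreadable")
```

The notices only say that something was recorded. The decoded details still come from mcelog, either in its log file or on its standard output.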

mcelog decodes kernel machine check logs on x86 machines. From man mcelog:

X86 CPUs report errors detected by the CPU as machine check events (MCEs). These
can be data corruption detected in the CPU caches, in main memory by an integrated
memory controller, data transfer errors on the front side bus or CPU interconnect or
other internal errors. Possible causes can be cosmic radiation, instable power
supplies, cooling problems, broken hardware, or bad luck.
Most errors can be corrected by the CPU by internal error correction mechanisms.
Uncorrected errors cause machine check exceptions which may panic the machine.
When a corrected error happens the x86 kernel writes a record describing the MCE into
a internal ring buffer available through the /dev/mcelog device. mcelog retrieves
errors from /dev/mcelog, decodes them into a human readable format and prints them on
the standard output or optionally into the system log.

You can find more information about mcelog and its configuration, errors, and triggers on the mcelog project web page.

Source: https://serverfault.com/questions/430005