Memory

Server kernel panics shortly after boot, not sure how to interpret the logs

  • November 3, 2018

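We just received a brand-new dual-CPU server, and it keeps crashing with a kernel panic shortly after boot. This happens even during setup, while the OS is idle. I was able to install an OS and enable mcelog to try to understand what is going on, although I'm not sure how to read the output. From what I've read online, it could be a defective DIMM on one of the sockets (socket 1), but I ran memtest several times and it found no errors. Could this be a software problem? I have tried two operating systems and the same thing happens on both, although it is more frequent on Debian/Proxmox than on CentOS.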

Server specs:

Dual Intel 8-core Xeon E5-2620 v4

2 x 32GB DDR4 2400MHz RECC DIMMs

Motherboard: Supermicro X10DRL-i

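It is not CPU temperature: during memtest and OS installation the temperature never exceeded 35ºC. I was also able to run a few short benchmarks on the CPUs before a crash, and the temperatures were normal.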

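How can I find out what is happening here? I can access the server for a few minutes before it crashes, and I have downloaded the vmcore dump, but I'm not sure what to do with it.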

Here is the mce log from one boot, which crashed roughly 50 seconds after startup:

[   56.367615] mce: [Hardware Error]: Machine check events logged
[   70.420914] mce: [Hardware Error]: Machine check events logged
[   71.886789] Disabling lock debugging due to kernel taint
[   71.886894] mce: [Hardware Error]: CPU 24: Machine Check Exception: 5 Bank 20: fa00004000020e0f
[   71.887009] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff8138fb97> {intel_idle+0xd7/0x160}
[   71.887122] mce: [Hardware Error]: TSC 206cc7cd362 
[   71.887184] mce: [Hardware Error]: PROCESSOR 0:406f1 TIME 1487438906 SOCKET 1 APIC 11 microcode b00001d
[   71.887289] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
[   71.889392] mce: [Hardware Error]: CPU 30: Machine Check Exception: 5 Bank 20: fa00004000020e0f
[   71.889489] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff8138fb97> {intel_idle+0xd7/0x160}
[   71.889595] mce: [Hardware Error]: TSC 206cc7cd11d 
[   71.889657] mce: [Hardware Error]: PROCESSOR 0:406f1 TIME 1487438906 SOCKET 1 APIC 1d microcode b00001d
[   71.889760] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
[   71.891804] mce: [Hardware Error]: CPU 14: Machine Check Exception: 5 Bank 20: fa00004000020e0f
[   71.891901] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff8138fb97> {intel_idle+0xd7/0x160}
[   71.892007] mce: [Hardware Error]: TSC 206cc7cd10e 
[   71.892068] mce: [Hardware Error]: PROCESSOR 0:406f1 TIME 1487438906 SOCKET 1 APIC 1c microcode b00001d
[   71.892171] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
[   71.894217] mce: [Hardware Error]: CPU 13: Machine Check Exception: 5 Bank 20: fa00004000020e0f
[   71.894314] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff8138fb97> {intel_idle+0xd7/0x160}
[   71.894420] mce: [Hardware Error]: TSC 206cc7cd23c 
[   71.894480] mce: [Hardware Error]: PROCESSOR 0:406f1 TIME 1487438906 SOCKET 1 APIC 1a microcode b00001d
[   71.894585] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
[   71.896634] mce: [Hardware Error]: CPU 29: Machine Check Exception: 5 Bank 20: fa00004000020e0f
[   71.896730] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff8138fb97> {intel_idle+0xd7/0x160}
[   71.896835] mce: [Hardware Error]: TSC 206cc7cd194 
[   71.896896] mce: [Hardware Error]: PROCESSOR 0:406f1 TIME 1487438906 SOCKET 1 APIC 1b microcode b00001d
[   71.897000] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
[   71.899053] mce: [Hardware Error]: CPU 28: Machine Check Exception: 5 Bank 20: fa00004000020e0f
[   71.899150] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff8138fb97> {intel_idle+0xd7/0x160}
[   71.899256] mce: [Hardware Error]: TSC 206cc7cd719 
[   71.899335] mce: [Hardware Error]: PROCESSOR 0:406f1 TIME 1487438906 SOCKET 1 APIC 19 microcode b00001d
[   71.899438] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
[   71.901485] mce: [Hardware Error]: CPU 12: Machine Check Exception: 5 Bank 20: fa00004000020e0f
[   71.901582] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff8138fb97> {intel_idle+0xd7/0x160}
[   71.901687] mce: [Hardware Error]: TSC 206cc7cd720 
[   71.901748] mce: [Hardware Error]: PROCESSOR 0:406f1 TIME 1487438906 SOCKET 1 APIC 18 microcode b00001d
[   71.901851] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
[   71.903934] mce: [Hardware Error]: CPU 10: Machine Check Exception: 5 Bank 20: fa00004000020e0f
[   71.904031] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff8138fb97> {intel_idle+0xd7/0x160}
[   71.904136] mce: [Hardware Error]: TSC 206cc7cd851 
[   71.904197] mce: [Hardware Error]: PROCESSOR 0:406f1 TIME 1487438906 SOCKET 1 APIC 14 microcode b00001d
[   71.904300] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
[   71.906306] mce: [Hardware Error]: CPU 26: Machine Check Exception: 5 Bank 20: fa00004000020e0f
[   71.906403] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff8138fb97> {intel_idle+0xd7/0x160}
[   71.906508] mce: [Hardware Error]: TSC 206cc7cd863 
[   71.906569] mce: [Hardware Error]: PROCESSOR 0:406f1 TIME 1487438906 SOCKET 1 APIC 15 microcode b00001d
[   71.909482] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
[   71.914367] mce: [Hardware Error]: CPU 11: Machine Check Exception: 5 Bank 20: fa00004000020e0f
[   71.917304] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff8138fb97> {intel_idle+0xd7/0x160}
[   71.920287] mce: [Hardware Error]: TSC 206cc7cd515 
[   71.923159] mce: [Hardware Error]: PROCESSOR 0:406f1 TIME 1487438906 SOCKET 1 APIC 16 microcode b00001d
[   71.926031] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
[   71.930820] mce: [Hardware Error]: CPU 27: Machine Check Exception: 5 Bank 20: fa00004000020e0f
[   71.933685] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff8138fb97> {intel_idle+0xd7/0x160}
[   71.936557] mce: [Hardware Error]: TSC 206cc7cd449 
[   71.939384] mce: [Hardware Error]: PROCESSOR 0:406f1 TIME 1487438906 SOCKET 1 APIC 17 microcode b00001d
[   71.944180] mce: [Hardware Error]: CPU 9: Machine Check Exception: 5 Bank 20: fa00004000020e0f
[   71.947059] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff8138fb97> {intel_idle+0xd7/0x160}
[   71.949956] mce: [Hardware Error]: TSC 206cc7cd766 
[   71.952786] mce: [Hardware Error]: PROCESSOR 0:406f1 TIME 1487438906 SOCKET 1 APIC 12 microcode b00001d
[   71.957580] mce: [Hardware Error]: CPU 25: Machine Check Exception: 5 Bank 20: fa00004000020e0f
[   71.960480] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff8138fb97> {intel_idle+0xd7/0x160}
[   71.963366] mce: [Hardware Error]: TSC 206cc7cd751 
[   71.966210] mce: [Hardware Error]: PROCESSOR 0:406f1 TIME 1487438906 SOCKET 1 APIC 13 microcode b00001d
[   71.971031] mce: [Hardware Error]: CPU 31: Machine Check Exception: 5 Bank 20: fa00004000020e0f
[   71.973919] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff8138fb97> {intel_idle+0xd7/0x160}
[   71.976817] mce: [Hardware Error]: TSC 206cc7cd7f7 
[   71.979690] mce: [Hardware Error]: PROCESSOR 0:406f1 TIME 1487438906 SOCKET 1 APIC 1f microcode b00001d
[   71.984474] mce: [Hardware Error]: CPU 15: Machine Check Exception: 5 Bank 20: fa00004000020e0f
[   71.987371] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff8138fb97> {intel_idle+0xd7/0x160}
[   71.990290] mce: [Hardware Error]: TSC 206cc7cd803 
[   71.993151] mce: [Hardware Error]: PROCESSOR 0:406f1 TIME 1487438906 SOCKET 1 APIC 1e microcode b00001d
[   71.997992] mce: [Hardware Error]: CPU 8: Machine Check Exception: 5 Bank 20: fa00004000020e0f
[   72.000918] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff8138fb97> {intel_idle+0xd7/0x160}
[   72.003828] mce: [Hardware Error]: TSC 206cc7cd374 
[   72.006692] mce: [Hardware Error]: PROCESSOR 0:406f1 TIME 1487438906 SOCKET 1 APIC 10 microcode b00001d
[   72.011533] mce: [Hardware Error]: Machine check: Processor context corrupt
[   72.014436] Kernel panic - not syncing: Fatal machine check
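
For anyone who lands here with a similar log: below is a minimal Python sketch for getting an overview of a dump like the one above without reading every record. It counts which banks and sockets the machine checks were reported on, and splits the status word into its flag bits. The function names (`decode_status`, `summarize`) are my own, not part of mcelog. The flag bit positions and the rough error-class check follow my reading of the `IA32_MCi_STATUS` layout in the Intel SDM; treat the error-class part in particular as an assumption and verify it against the manual, or against `mcelog --ascii`, for your CPU model.

```python
import re
from collections import Counter

# Architectural flag bits of IA32_MCi_STATUS (positions as described in the Intel SDM).
STATUS_FLAGS = {
    63: "VAL",    # the register holds a valid error
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # uncorrected error
    60: "EN",     # error reporting was enabled
    59: "MISCV",  # MCi_MISC holds extra information
    58: "ADDRV",  # MCi_ADDR holds a valid address
    57: "PCC",    # processor context corrupt
}

MCE_RE = re.compile(
    r"CPU (\d+): Machine Check Exception: (\d+) Bank (\d+): ([0-9a-f]+)"
)
SOCKET_RE = re.compile(r"SOCKET (\d+) APIC ([0-9a-f]+)")


def decode_status(status):
    """Split a 64-bit MCi_STATUS value into flags and error-code fields."""
    flags = [
        name
        for bit, name in sorted(STATUS_FLAGS.items(), reverse=True)
        if (status >> bit) & 1
    ]
    mca_code = status & 0xFFFF             # bits 15:0, architectural error code
    model_code = (status >> 16) & 0xFFFF   # bits 31:16, model-specific code

    # Rough classification of the compound error code (assumption, see text).
    if mca_code & 0x0800:
        kind = "bus/interconnect"
    elif mca_code & 0x0100:
        kind = "cache hierarchy"
    elif mca_code & 0x0080:
        kind = "memory controller"
    elif mca_code & 0x0010:
        kind = "TLB"
    else:
        kind = "simple/other"
    return {
        "flags": flags,
        "mca_code": mca_code,
        "model_code": model_code,
        "kind": kind,
    }


def summarize(log_text):
    """Count MCE records per bank and per socket, and collect status values."""
    banks, sockets, statuses = Counter(), Counter(), set()
    for line in log_text.splitlines():
        mce = MCE_RE.search(line)
        if mce:
            banks[int(mce.group(3))] += 1
            statuses.add(int(mce.group(4), 16))
        sock = SOCKET_RE.search(line)
        if sock:
            sockets[int(sock.group(1))] += 1
    return banks, sockets, statuses
```

To use it, save the kernel messages to a text file (the file name is up to you), pass the contents to `summarize`, and then call `decode_status` on each returned status value. For the log above, every record carries the same status on bank 20 of socket 1, with the PCC flag set and ADDRV clear, which matches the final "Processor context corrupt" line. If the classification assumption holds, the code falls in the bus/interconnect class rather than the memory-controller class, which would also be consistent with memtest finding nothing.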

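I know this is a late reply, but I had completely forgotten about this question. It turned out that one of the CPUs was not seated properly, or had come loose during shipping. At least that is what the vendor told me, since they said they did not replace anything.

After they shipped it back, everything worked fine.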

Source: https://serverfault.com/questions/833571