Dell
Dell R320, Xeon E5-2450 v1, Oracle Linux 8: clocksource 'tsc' marked as unstable, random crashes under load
I recently got access to a Dell R320 with a Xeon E5-2450 v1. All firmware has been updated to the latest version using the Lifecycle Controller. At boot, dmesg reports:
microcode: microcode updated early to revision 0x71a, date = 2020-03-24
[ 12.384040] clocksource: timekeeping watchdog on CPU9: Marking clocksource 'tsc' as unstable because the skew is too large:
[ 12.395572] clocksource: 'hpet' wd_now: 3b1bb82 wd_last: 2e247ff mask: ffffffff
[ 12.413476] clocksource: 'tsc' cs_now: 1c62267fd4b cs_last: 1c30b8dcf7f mask: ffffffffffffffff
[ 12.425567] tsc: Marking TSC unstable due to clocksource watchdog
[ 12.431666] TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
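To put numbers on the skew the watchdog complains about, the two counter readings can be turned into tick deltas. The following is only a rough sketch: the function names are hypothetical, and the two frequencies are assumptions (14.31818 MHz is a common HPET rate, and 2.1 GHz is the nominal base clock of the E5-2450), not values taken from this machine.

```python
import re

# Matches the watchdog lines of the form
#   clocksource: 'hpet' wd_now: ... wd_last: ... mask: ...
#   clocksource: 'tsc'  cs_now: ... cs_last: ... mask: ...
PATTERN = re.compile(
    r"clocksource:\s+'(?P<name>[^']+)'\s+(?:wd|cs)_now:\s+(?P<now>[0-9a-f]+)\s+"
    r"(?:wd|cs)_last:\s+(?P<last>[0-9a-f]+)\s+mask:\s+(?P<mask>[0-9a-f]+)"
)


def tick_deltas(log_text):
    """Return {clocksource name: ticks elapsed between 'last' and 'now'}."""
    deltas = {}
    for m in PATTERN.finditer(log_text):
        now = int(m.group("now"), 16)
        last = int(m.group("last"), 16)
        mask = int(m.group("mask"), 16)
        # Applying the mask handles counter wrap-around (e.g. the 32-bit HPET).
        deltas[m.group("name")] = (now - last) & mask
    return deltas


def elapsed_seconds(ticks, frequency_hz):
    """Convert a tick count to seconds for a given counter frequency."""
    return ticks / frequency_hz


if __name__ == "__main__":
    sample = (
        "[ 12.395572] clocksource: 'hpet' wd_now: 3b1bb82 wd_last: 2e247ff mask: ffffffff\n"
        "[ 12.413476] clocksource: 'tsc' cs_now: 1c62267fd4b cs_last: 1c30b8dcf7f "
        "mask: ffffffffffffffff\n"
    )
    # ASSUMED frequencies; check dmesg on the actual machine for the real ones.
    assumed_hz = {"hpet": 14_318_180, "tsc": 2_100_000_000}
    for name, ticks in tick_deltas(sample).items():
        print(name, ticks, "ticks,", elapsed_seconds(ticks, assumed_hz[name]), "s (assumed rate)")
```

If the two clocks agreed, both deltas would convert to roughly the same number of seconds; a large gap is what the watchdog reports as skew.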
Then, if I run
phoronix-test-suite stress-run stress-ng
the system becomes unresponsive after approx. one minute. During the test, I see watchdog events from the network adapter:
[ 705.412997] NETDEV WATCHDOG: eno1 (tg3): transmit queue 0 timed out
[ 705.412997] WARNING: CPU: 9 PID: 6812 at net/sched/sch_generic.c:473 dev_watchdog+0x27d/0x281
[ 705.412997] Modules linked in: xt_CHECKSUM ipt_REJECT nf_nat_tftp nft_objref nf_conntrack_tftp nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nf_tables_set tun rfkill scsi_transport_iscsi ip_set xt_conntrack xt_multiport xt_nat xt_addrtype xt_mark xt_MASQUERADE nft_counter xt_comment nft_compat nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 veth sunrpc iTCO_wdt intel_rapl_msr iTCO_vendor_support dcdbas intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel vfat fat kvm irqbypass crct10dif_pclmul crc32_pclmul mgag200 ghash_clmulni_intel drm_vram_helper aesni_intel ttm crypto_simd cryptd glue_helper drm_kms_helper pcspkr drm syscopyarea sysfillrect sysimgblt fb_sys_fops lpc_ich i2c_algo_bit zfs(POE) joydev zunicode(POE) zzstd(OE) zlua(OE) mei_me zavl(POE) mei icp(POE) zcommon(POE) znvpair(POE) ipmi_ssif spl(OE) ioatdma dca ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter
[ 705.412997] sch_fq_codel ip_tables xfs libcrc32c sd_mod sg ahci libahci libata mpt3sas tg3 raid_class scsi_transport_sas wmi fuse
[ 705.412997] CPU: 9 PID: 6812 Comm: stress-ng Kdump: loaded Tainted: P OE 5.4.17-2136.300.7.el8uek.x86_64 #2
[ 705.412997] Hardware name: Dell Inc. PowerEdge R320/0KM5PX, BIOS 2.4.2 01/29/2015
[ 705.412997] RIP: 0010:dev_watchdog+0x27d/0x281
[ 705.412997] Code: 48 85 c0 75 e6 eb a0 4c 89 e7 c6 05 9b 59 17 01 01 e8 c7 a9 fa ff 89 d9 4c 89 e6 48 c7 c7 68 3b 53 ac 48 89 c2 e8 be f1 82 ff <0f> 0b eb 82 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 66 66 66
[ 705.412997] RSP: 0000:ffffac6d003d0e50 EFLAGS: 00010282
[ 705.412997] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000006
[ 705.412997] RDX: 0000000000000007 RSI: 0000000000000092 RDI: ffff9e853f457d00
[ 705.412997] RBP: ffffac6d003d0e80 R08: 0000000000000514 R09: 00000000ffffffff
[ 705.412997] R10: 0000000000000000 R11: ffff9e851d84f3d0 R12: ffff9e850d8e4000
[ 705.412997] R13: 0000000000000005 R14: ffff9e850d8e4480 R15: ffff9e8537d377c0
[ 705.412997] FS: 00007fa4baba5740(0000) GS:ffff9e853f440000(0000) knlGS:0000000000000000
[ 705.412997] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 705.412997] CR2: 00007f54983fad0c CR3: 0000000b99992006 CR4: 00000000000606e0
[ 705.412997] Call Trace:
[ 705.412997] <IRQ>
[ 705.412997] ? pfifo_fast_enqueue+0x160/0x151
[ 705.412997] call_timer_fn+0x32/0x12c
[ 705.412997] run_timer_softirq+0x1a5/0x42e
[ 705.412997] __do_softirq+0xe1/0x2e7
[ 705.412997] ? hrtimer_interrupt+0x12a/0x222
[ 705.412997] irq_exit+0xf3/0xf8
[ 705.412997] smp_apic_timer_interrupt+0x79/0x130
[ 705.412997] apic_timer_interrupt+0xf/0x14
[ 705.412997] </IRQ>
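If I add
mitigations=off
to the kernel command-line parameters at boot, phoronix lasts 4 to 7 minutes before the system becomes unresponsive again. The same thing happens with KVM guests: I tried to install Debian 11 five times, and each time the installation froze during the initial package installation or while unpacking the kernel. Screenshot of the freeze message: https://ibb.co/k2Jk4QG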
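When comparing runs like these, it can help to confirm that the boot option actually took effect and which clocksource the kernel ended up using. A minimal sketch, assuming a standard Linux /proc and sysfs layout; the helper names are hypothetical:

```python
from pathlib import Path

CMDLINE_PATH = "/proc/cmdline"
CLOCKSOURCE_PATH = "/sys/devices/system/clocksource/clocksource0/current_clocksource"


def cmdline_has(cmdline, option):
    """True if `option` appears as a whole whitespace-separated token."""
    return option in cmdline.split()


def read_text(path):
    """Return the stripped file contents, or None if the file does not exist."""
    p = Path(path)
    return p.read_text().strip() if p.exists() else None


if __name__ == "__main__":
    cmdline = read_text(CMDLINE_PATH)
    if cmdline is not None:
        print("mitigations=off active:", cmdline_has(cmdline, "mitigations=off"))
    print("current clocksource:", read_text(CLOCKSOURCE_PATH))
```

Note that the option has to be a single token with no spaces around the equals sign; the token check above treats "mitigations = off" as not set.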
Has anyone had a similar problem? Thanks!
PS: The current kernel is
5.4.17-2136.300.7.el8uek.x86_64
and I also tried 4.18.0-305.19.1.el8_4.x86_64,
with no difference.
Replacing the CPU with an E5-2470 v2 solved the problem; it seems the previous CPU was somehow faulty.