Linux

在 VirtualBox 中執行的 Linux 上的核心 oops 破壞了伺服器上一些與 IO 相關的功能

  • June 27, 2013

我們在 Windows 7 機器上的 VirtualBox 中執行 CentOS 6.3 版時遇到問題。症狀如下:

一切正常工作幾個小時,甚至幾天。然後發生了一些破壞系統的事情。

發生這種情況後我們仍然可以做什麼

  • 訪問網路伺服器
  • 使用現有的 SSH 會話執行 top 和 free

什麼不起作用

  • 啟動新的 SSH 會話(輸入使用者名和密碼後掛起)
  • 在現有 SSH 會話中執行 ls(掛起)
  • SSI 包括從我們的網路伺服器獲取遠端機器數據
  • 可能更多

發生這種情況時,我們在伺服器上看到的內容如下

  • 平均負載從基本上沒有到大約 3
  • CPU 使用率仍然很低 (5%)
  • 磁碟活動低(執行 iostat)
  • 大量可用記憶體
  • 大量可用磁碟空間

在 /var/log/messages 我們得到以下資訊:

Jun 14 01:10:48 devvm kernel: e1000 0000:00:03.0: eth0: Detected Tx Unit Hang
Jun 14 01:10:48 devvm kernel:  Tx Queue             <0>
Jun 14 01:10:48 devvm kernel:  TDH                  <2e>
Jun 14 01:10:48 devvm kernel:  TDT                  <30>
Jun 14 01:10:48 devvm kernel:  next_to_use          <30>
Jun 14 01:10:48 devvm kernel:  next_to_clean        <2e>
Jun 14 01:10:48 devvm kernel: buffer_info[next_to_clean]
Jun 14 01:10:48 devvm kernel:  time_stamp           <1038284db>
Jun 14 01:10:48 devvm kernel:  next_to_watch        <2f>
Jun 14 01:10:48 devvm kernel:  jiffies              <103828b42>
Jun 14 01:10:48 devvm kernel:  next_to_watch.status <0>
Jun 14 01:10:50 devvm kernel: e1000 0000:00:03.0: eth0: Detected Tx Unit Hang
Jun 14 01:10:50 devvm kernel:  Tx Queue             <0>
Jun 14 01:10:50 devvm kernel:  TDH                  <2e>
Jun 14 01:10:50 devvm kernel:  TDT                  <30>
Jun 14 01:10:50 devvm kernel:  next_to_use          <30>
Jun 14 01:10:50 devvm kernel:  next_to_clean        <2e>
Jun 14 01:10:50 devvm kernel: buffer_info[next_to_clean]
Jun 14 01:10:50 devvm kernel:  time_stamp           <1038284db>
Jun 14 01:10:50 devvm kernel:  next_to_watch        <2f>
Jun 14 01:10:50 devvm kernel:  jiffies              <103829312>
Jun 14 01:10:50 devvm kernel:  next_to_watch.status <0>
Jun 14 01:10:52 devvm kernel: ------------[ cut here ]------------
Jun 14 01:10:52 devvm kernel: WARNING: at net/sched/sch_generic.c:261 dev_watchdog+0x26d/0x280() (Not tainted)
Jun 14 01:10:52 devvm kernel: Hardware name: VirtualBox
Jun 14 01:10:52 devvm kernel: NETDEV WATCHDOG: eth0 (e1000): transmit queue 0 timed out
Jun 14 01:10:52 devvm kernel: Modules linked in: vboxsf(U) ipv6 ppdev parport_pc parport microcode sg vboxguest(U) i2c_piix4 i2c_core e1000 snd_intel8x0 snd_ac97_codec ac97_bus snd_seq snd_seq_device snd_pcm snd_timer snd soundcore snd_page_alloc pcnet32 mii ext4 mbcache jbd2 sd_mod crc_t10dif ahci dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
Jun 14 01:10:52 devvm kernel: Pid: 0, comm: swapper Not tainted 2.6.32-279.el6.x86_64 #1
Jun 14 01:10:52 devvm kernel: Call Trace:
Jun 14 01:10:52 devvm kernel: <IRQ>  [<ffffffff8106b747>] ? warn_slowpath_common+0x87/0xc0
Jun 14 01:10:52 devvm kernel: [<ffffffff8106b836>] ? warn_slowpath_fmt+0x46/0x50
Jun 14 01:10:52 devvm kernel: [<ffffffff814595fd>] ? dev_watchdog+0x26d/0x280
Jun 14 01:10:52 devvm kernel: [<ffffffff81099138>] ? sched_clock_cpu+0xb8/0x110
Jun 14 01:10:52 devvm kernel: [<ffffffff81459390>] ? dev_watchdog+0x0/0x280
Jun 14 01:10:52 devvm kernel: [<ffffffff8107e897>] ? run_timer_softirq+0x197/0x340
Jun 14 01:10:52 devvm kernel: [<ffffffff810a21c0>] ? tick_sched_timer+0x0/0xc0
Jun 14 01:10:52 devvm kernel: [<ffffffff8102b40d>] ? lapic_next_event+0x1d/0x30
Jun 14 01:10:52 devvm kernel: [<ffffffff81073ec1>] ? __do_softirq+0xc1/0x1e0
Jun 14 01:10:52 devvm kernel: [<ffffffff81096c50>] ? hrtimer_interrupt+0x140/0x250
Jun 14 01:10:52 devvm kernel: [<ffffffff8100c24c>] ? call_softirq+0x1c/0x30
Jun 14 01:10:52 devvm kernel: [<ffffffff8100de85>] ? do_softirq+0x65/0xa0
Jun 14 01:10:52 devvm kernel: [<ffffffff81073ca5>] ? irq_exit+0x85/0x90
Jun 14 01:10:52 devvm kernel: [<ffffffff81505be0>] ? smp_apic_timer_interrupt+0x70/0x9b
Jun 14 01:10:52 devvm kernel: [<ffffffff8100bc13>] ? apic_timer_interrupt+0x13/0x20
Jun 14 01:10:52 devvm kernel: <EOI>  [<ffffffff810387cb>] ? native_safe_halt+0xb/0x10
Jun 14 01:10:52 devvm kernel: [<ffffffff810149cd>] ? default_idle+0x4d/0xb0
Jun 14 01:10:52 devvm kernel: [<ffffffff81009e06>] ? cpu_idle+0xb6/0x110
Jun 14 01:10:52 devvm kernel: [<ffffffff814e433a>] ? rest_init+0x7a/0x80
Jun 14 01:10:52 devvm kernel: [<ffffffff81c21f7b>] ? start_kernel+0x424/0x430
Jun 14 01:10:52 devvm kernel: [<ffffffff81c2133a>] ? x86_64_start_reservations+0x125/0x129
Jun 14 01:10:52 devvm kernel: [<ffffffff81c21438>] ? x86_64_start_kernel+0xfa/0x109
Jun 14 01:10:52 devvm kernel: ---[ end trace 2c7bb984812cf120 ]---
Jun 14 01:10:52 devvm kernel: e1000 0000:00:03.0: eth0: Reset adapter
Jun 14 01:10:53 devvm abrtd: Directory 'oops-2013-06-14-01:10:53-1537-0' creation detected
Jun 14 01:10:53 devvm abrt-dump-oops: Reported 1 kernel oopses to Abrt
Jun 14 01:10:53 devvm abrtd: Can't open file '/var/spool/abrt/oops-2013-06-14-01:10:53-1537-0/uid': No such file or directory
Jun 14 01:10:55 devvm kernel: Bridge firewalling registered

在此之後,我們每隔兩分鐘看到一段時間:

Jun 14 01:14:22 devvm kernel: INFO: task events/0:19 blocked for more than 120 seconds.
Jun 14 01:14:22 devvm kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 14 01:14:22 devvm kernel: events/0      D 0000000000000000     0    19      2 0x00000000
Jun 14 01:14:22 devvm kernel: ffff880116c4fb90 0000000000000046 00000000ffffffff 0000000000000008
Jun 14 01:14:22 devvm kernel: 0000000000016680 0000000000016680 ffff880028210400 0000000000016680
Jun 14 01:14:22 devvm kernel: ffff880116c4daf8 ffff880116c4ffd8 000000000000fb88 ffff880116c4daf8
Jun 14 01:14:22 devvm kernel: Call Trace:
Jun 14 01:14:22 devvm kernel: [<ffffffff8105b483>] ? perf_event_task_sched_out+0x33/0x80
Jun 14 01:14:22 devvm kernel: [<ffffffff814fe6a5>] schedule_timeout+0x215/0x2e0
Jun 14 01:14:22 devvm kernel: [<ffffffff8100975d>] ? __switch_to+0x13d/0x320
Jun 14 01:14:22 devvm kernel: [<ffffffff814fe323>] wait_for_common+0x123/0x180
Jun 14 01:14:22 devvm kernel: [<ffffffff81060250>] ? default_wake_function+0x0/0x20
Jun 14 01:14:22 devvm kernel: [<ffffffff814fe43d>] wait_for_completion+0x1d/0x20
Jun 14 01:14:22 devvm kernel: [<ffffffff8108d093>] __cancel_work_timer+0x1b3/0x1e0
Jun 14 01:14:22 devvm kernel: [<ffffffff8108cbe0>] ? wq_barrier_func+0x0/0x20
Jun 14 01:14:22 devvm kernel: [<ffffffff8108d0f0>] cancel_work_sync+0x10/0x20
Jun 14 01:14:22 devvm kernel: [<ffffffffa01c5ca5>] e1000_down_and_stop+0x25/0x50 [e1000]
Jun 14 01:14:22 devvm kernel: [<ffffffffa01cb695>] e1000_down+0x155/0x200 [e1000]
Jun 14 01:14:22 devvm kernel: [<ffffffffa01cbcb0>] ? e1000_reset_task+0x0/0xe0 [e1000]
Jun 14 01:14:22 devvm kernel: [<ffffffffa01cbd1e>] e1000_reset_task+0x6e/0xe0 [e1000]
Jun 14 01:14:22 devvm kernel: [<ffffffff8108c760>] worker_thread+0x170/0x2a0
Jun 14 01:14:22 devvm kernel: [<ffffffff810920d0>] ? autoremove_wake_function+0x0/0x40
Jun 14 01:14:22 devvm kernel: [<ffffffff8108c5f0>] ? worker_thread+0x0/0x2a0
Jun 14 01:14:22 devvm kernel: [<ffffffff81091d66>] kthread+0x96/0xa0
Jun 14 01:14:22 devvm kernel: [<ffffffff8100c14a>] child_rip+0xa/0x20
Jun 14 01:14:22 devvm kernel: [<ffffffff81091cd0>] ? kthread+0x0/0xa0
Jun 14 01:14:22 devvm kernel: [<ffffffff8100c140>] ? child_rip+0x0/0x20
Jun 14 01:14:22 devvm kernel: INFO: task parted:8069 blocked for more than 120 seconds.
Jun 14 01:14:22 devvm kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 14 01:14:22 devvm kernel: parted        D 0000000000000003     0  8069   7994 0x00000080
Jun 14 01:14:22 devvm kernel: ffff8800908b3bb8 0000000000000082 0000000000000000 ffff88010ab50080
Jun 14 01:14:22 devvm kernel: ffff880116c7d500 0000000000000001 0000000000000000 0000000000000000
Jun 14 01:14:22 devvm kernel: ffff88010ab50638 ffff8800908b3fd8 000000000000fb88 ffff88010ab50638
Jun 14 01:14:22 devvm kernel: Call Trace:
Jun 14 01:14:22 devvm kernel: [<ffffffff814fe6a5>] schedule_timeout+0x215/0x2e0
Jun 14 01:14:22 devvm kernel: [<ffffffff814fe323>] wait_for_common+0x123/0x180
Jun 14 01:14:22 devvm kernel: [<ffffffff81060250>] ? default_wake_function+0x0/0x20
Jun 14 01:14:22 devvm kernel: [<ffffffff8112b6d0>] ? lru_add_drain_per_cpu+0x0/0x10
Jun 14 01:14:22 devvm kernel: [<ffffffff814fe43d>] wait_for_completion+0x1d/0x20
Jun 14 01:14:22 devvm kernel: [<ffffffff8108d177>] flush_work+0x77/0xc0
Jun 14 01:14:22 devvm kernel: [<ffffffff8108cbe0>] ? wq_barrier_func+0x0/0x20
Jun 14 01:14:22 devvm kernel: [<ffffffff8108d2f3>] schedule_on_each_cpu+0x133/0x180
Jun 14 01:14:22 devvm kernel: [<ffffffff811ad440>] ? invalidate_bh_lru+0x0/0x50
Jun 14 01:14:22 devvm kernel: [<ffffffff8112ae35>] lru_add_drain_all+0x15/0x20
Jun 14 01:14:22 devvm kernel: [<ffffffff811adf6a>] invalidate_bdev+0x2a/0x50
Jun 14 01:14:22 devvm kernel: [<ffffffff8125e9a4>] blkdev_ioctl+0x3b4/0x6e0
Jun 14 01:14:22 devvm kernel: [<ffffffff811b381c>] block_ioctl+0x3c/0x40
Jun 14 01:14:22 devvm kernel: [<ffffffff8118dec2>] vfs_ioctl+0x22/0xa0
Jun 14 01:14:22 devvm kernel: [<ffffffff8118e064>] do_vfs_ioctl+0x84/0x580
Jun 14 01:14:22 devvm kernel: [<ffffffff8118e5e1>] sys_ioctl+0x81/0xa0
Jun 14 01:14:22 devvm kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b

在 /var/spool/abrt/oops-2013-06-14-01:10:53-1537-0 我們可以看到以下資訊:

在回溯中:

WARNING: at net/sched/sch_generic.c:261 dev_watchdog+0x26d/0x280() (Not tainted)
Hardware name: VirtualBox
NETDEV WATCHDOG: eth0 (e1000): transmit queue 0 timed out
Modules linked in: vboxsf(U) ipv6 ppdev parport_pc parport microcode sg vboxguest(U) i2c_piix4 i2c_core e1000 snd_intel8x0 snd_ac97_codec ac97_bus snd_seq snd_seq_device snd_pcm snd_timer snd soundcore snd_page_alloc pcnet32 mii ext4 mbcache jbd2 sd_mod crc_t10dif ahci dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
Pid: 0, comm: swapper Not tainted 2.6.32-279.el6.x86_64 #1
Call Trace:
<IRQ>  [<ffffffff8106b747>] ? warn_slowpath_common+0x87/0xc0
[<ffffffff8106b836>] ? warn_slowpath_fmt+0x46/0x50
[<ffffffff814595fd>] ? dev_watchdog+0x26d/0x280
[<ffffffff81099138>] ? sched_clock_cpu+0xb8/0x110
[<ffffffff81459390>] ? dev_watchdog+0x0/0x280
[<ffffffff8107e897>] ? run_timer_softirq+0x197/0x340
[<ffffffff810a21c0>] ? tick_sched_timer+0x0/0xc0
[<ffffffff8102b40d>] ? lapic_next_event+0x1d/0x30
[<ffffffff81073ec1>] ? __do_softirq+0xc1/0x1e0
[<ffffffff81096c50>] ? hrtimer_interrupt+0x140/0x250
[<ffffffff8100c24c>] ? call_softirq+0x1c/0x30
[<ffffffff8100de85>] ? do_softirq+0x65/0xa0
[<ffffffff81073ca5>] ? irq_exit+0x85/0x90
[<ffffffff81505be0>] ? smp_apic_timer_interrupt+0x70/0x9b
[<ffffffff8100bc13>] ? apic_timer_interrupt+0x13/0x20
<EOI>  [<ffffffff810387cb>] ? native_safe_halt+0xb/0x10
[<ffffffff810149cd>] ? default_idle+0x4d/0xb0
[<ffffffff81009e06>] ? cpu_idle+0xb6/0x110
[<ffffffff814e433a>] ? rest_init+0x7a/0x80
[<ffffffff81c21f7b>] ? start_kernel+0x424/0x430
[<ffffffff81c2133a>] ? x86_64_start_reservations+0x125/0x129
[<ffffffff81c21438>] ? x86_64_start_kernel+0xfa/0x109

在命令行中:

ro root=/dev/mapper/vg_01-lv_root rd_NO_LUKS LANG=en_US.UTF-8  KEYBOARDTYPE=pc KEYTABLE=sv-latin1 rd_NO_MD SYSFONT=latarcyrheb-sun16 rd_LVM_LV=vg_01/lv_root crashkernel=129M@0M rhgb quiet rd_LVM_LV=vg_01/lv_swap rd_NO_DM rhgb quie

附加資訊:

# uname -a
Linux devvm 2.6.32-279.el6.x86_64 #1 SMP Fri Jun 22 12:19:21 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

# cat /etc/redhat-release
CentOS release 6.3 (Final)

VirtualBox 4.2.6 版。

任何有關我們如何進行故障排除的見解都值得讚賞。如果您需要更多資訊,請告訴我。

嘗試切換網卡類型。可能是其中一個驅動程序的錯誤。

引用自:https://serverfault.com/questions/515765