Linux
在 VirtualBox 中執行的 Linux 上的核心 oops 破壞了伺服器上一些與 IO 相關的功能
我們在 Windows 7 機器上的 VirtualBox 中執行 CentOS 6.3 版時遇到問題。症狀如下:
一切正常工作幾個小時,甚至幾天。然後發生了一些破壞系統的事情。
發生這種情況後我們仍然可以做什麼:
- 訪問網路伺服器
- 使用現有的 SSH 會話執行 top 和 free
什麼不起作用:
- 啟動新的 SSH 會話(輸入使用者名和密碼後掛起)
- 在現有 SSH 會話中執行 ls(掛起)
- SSI 包括從我們的網路伺服器獲取遠端機器數據
- 可能更多
發生這種情況時,我們在伺服器上看到的內容如下:
- 平均負載從基本上沒有到大約 3
- CPU 使用率仍然很低 (5%)
- 磁碟活動低(執行 iostat)
- 大量可用記憶體
- 大量可用磁碟空間
在 /var/log/messages 我們得到以下資訊:
Jun 14 01:10:48 devvm kernel: e1000 0000:00:03.0: eth0: Detected Tx Unit Hang Jun 14 01:10:48 devvm kernel: Tx Queue <0> Jun 14 01:10:48 devvm kernel: TDH <2e> Jun 14 01:10:48 devvm kernel: TDT <30> Jun 14 01:10:48 devvm kernel: next_to_use <30> Jun 14 01:10:48 devvm kernel: next_to_clean <2e> Jun 14 01:10:48 devvm kernel: buffer_info[next_to_clean] Jun 14 01:10:48 devvm kernel: time_stamp <1038284db> Jun 14 01:10:48 devvm kernel: next_to_watch <2f> Jun 14 01:10:48 devvm kernel: jiffies <103828b42> Jun 14 01:10:48 devvm kernel: next_to_watch.status <0> Jun 14 01:10:50 devvm kernel: e1000 0000:00:03.0: eth0: Detected Tx Unit Hang Jun 14 01:10:50 devvm kernel: Tx Queue <0> Jun 14 01:10:50 devvm kernel: TDH <2e> Jun 14 01:10:50 devvm kernel: TDT <30> Jun 14 01:10:50 devvm kernel: next_to_use <30> Jun 14 01:10:50 devvm kernel: next_to_clean <2e> Jun 14 01:10:50 devvm kernel: buffer_info[next_to_clean] Jun 14 01:10:50 devvm kernel: time_stamp <1038284db> Jun 14 01:10:50 devvm kernel: next_to_watch <2f> Jun 14 01:10:50 devvm kernel: jiffies <103829312> Jun 14 01:10:50 devvm kernel: next_to_watch.status <0> Jun 14 01:10:52 devvm kernel: ------------[ cut here ]------------ Jun 14 01:10:52 devvm kernel: WARNING: at net/sched/sch_generic.c:261 dev_watchdog+0x26d/0x280() (Not tainted) Jun 14 01:10:52 devvm kernel: Hardware name: VirtualBox Jun 14 01:10:52 devvm kernel: NETDEV WATCHDOG: eth0 (e1000): transmit queue 0 timed out Jun 14 01:10:52 devvm kernel: Modules linked in: vboxsf(U) ipv6 ppdev parport_pc parport microcode sg vboxguest(U) i2c_piix4 i2c_core e1000 snd_intel8x0 snd_ac97_codec ac97_bus snd_seq snd_seq_device snd_pcm snd_timer snd soundcore snd_page_alloc pcnet32 mii ext4 mbcache jbd2 sd_mod crc_t10dif ahci dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan] Jun 14 01:10:52 devvm kernel: Pid: 0, comm: swapper Not tainted 2.6.32-279.el6.x86_64 #1 Jun 14 01:10:52 devvm kernel: Call Trace: Jun 14 01:10:52 devvm kernel: <IRQ> [<ffffffff8106b747>] ? warn_slowpath_common+0x87/0xc0 Jun 14 01:10:52 devvm kernel: [<ffffffff8106b836>] ? warn_slowpath_fmt+0x46/0x50 Jun 14 01:10:52 devvm kernel: [<ffffffff814595fd>] ? dev_watchdog+0x26d/0x280 Jun 14 01:10:52 devvm kernel: [<ffffffff81099138>] ? sched_clock_cpu+0xb8/0x110 Jun 14 01:10:52 devvm kernel: [<ffffffff81459390>] ? dev_watchdog+0x0/0x280 Jun 14 01:10:52 devvm kernel: [<ffffffff8107e897>] ? run_timer_softirq+0x197/0x340 Jun 14 01:10:52 devvm kernel: [<ffffffff810a21c0>] ? tick_sched_timer+0x0/0xc0 Jun 14 01:10:52 devvm kernel: [<ffffffff8102b40d>] ? lapic_next_event+0x1d/0x30 Jun 14 01:10:52 devvm kernel: [<ffffffff81073ec1>] ? __do_softirq+0xc1/0x1e0 Jun 14 01:10:52 devvm kernel: [<ffffffff81096c50>] ? hrtimer_interrupt+0x140/0x250 Jun 14 01:10:52 devvm kernel: [<ffffffff8100c24c>] ? call_softirq+0x1c/0x30 Jun 14 01:10:52 devvm kernel: [<ffffffff8100de85>] ? do_softirq+0x65/0xa0 Jun 14 01:10:52 devvm kernel: [<ffffffff81073ca5>] ? irq_exit+0x85/0x90 Jun 14 01:10:52 devvm kernel: [<ffffffff81505be0>] ? smp_apic_timer_interrupt+0x70/0x9b Jun 14 01:10:52 devvm kernel: [<ffffffff8100bc13>] ? apic_timer_interrupt+0x13/0x20 Jun 14 01:10:52 devvm kernel: <EOI> [<ffffffff810387cb>] ? native_safe_halt+0xb/0x10 Jun 14 01:10:52 devvm kernel: [<ffffffff810149cd>] ? default_idle+0x4d/0xb0 Jun 14 01:10:52 devvm kernel: [<ffffffff81009e06>] ? cpu_idle+0xb6/0x110 Jun 14 01:10:52 devvm kernel: [<ffffffff814e433a>] ? rest_init+0x7a/0x80 Jun 14 01:10:52 devvm kernel: [<ffffffff81c21f7b>] ? start_kernel+0x424/0x430 Jun 14 01:10:52 devvm kernel: [<ffffffff81c2133a>] ? x86_64_start_reservations+0x125/0x129 Jun 14 01:10:52 devvm kernel: [<ffffffff81c21438>] ? x86_64_start_kernel+0xfa/0x109 Jun 14 01:10:52 devvm kernel: ---[ end trace 2c7bb984812cf120 ]--- Jun 14 01:10:52 devvm kernel: e1000 0000:00:03.0: eth0: Reset adapter Jun 14 01:10:53 devvm abrtd: Directory 'oops-2013-06-14-01:10:53-1537-0' creation detected Jun 14 01:10:53 devvm abrt-dump-oops: Reported 1 kernel oopses to Abrt Jun 14 01:10:53 devvm abrtd: Can't open file '/var/spool/abrt/oops-2013-06-14-01:10:53-1537-0/uid': No such file or directory Jun 14 01:10:55 devvm kernel: Bridge firewalling registered
在此之後,我們每隔兩分鐘看到一段時間:
Jun 14 01:14:22 devvm kernel: INFO: task events/0:19 blocked for more than 120 seconds. Jun 14 01:14:22 devvm kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Jun 14 01:14:22 devvm kernel: events/0 D 0000000000000000 0 19 2 0x00000000 Jun 14 01:14:22 devvm kernel: ffff880116c4fb90 0000000000000046 00000000ffffffff 0000000000000008 Jun 14 01:14:22 devvm kernel: 0000000000016680 0000000000016680 ffff880028210400 0000000000016680 Jun 14 01:14:22 devvm kernel: ffff880116c4daf8 ffff880116c4ffd8 000000000000fb88 ffff880116c4daf8 Jun 14 01:14:22 devvm kernel: Call Trace: Jun 14 01:14:22 devvm kernel: [<ffffffff8105b483>] ? perf_event_task_sched_out+0x33/0x80 Jun 14 01:14:22 devvm kernel: [<ffffffff814fe6a5>] schedule_timeout+0x215/0x2e0 Jun 14 01:14:22 devvm kernel: [<ffffffff8100975d>] ? __switch_to+0x13d/0x320 Jun 14 01:14:22 devvm kernel: [<ffffffff814fe323>] wait_for_common+0x123/0x180 Jun 14 01:14:22 devvm kernel: [<ffffffff81060250>] ? default_wake_function+0x0/0x20 Jun 14 01:14:22 devvm kernel: [<ffffffff814fe43d>] wait_for_completion+0x1d/0x20 Jun 14 01:14:22 devvm kernel: [<ffffffff8108d093>] __cancel_work_timer+0x1b3/0x1e0 Jun 14 01:14:22 devvm kernel: [<ffffffff8108cbe0>] ? wq_barrier_func+0x0/0x20 Jun 14 01:14:22 devvm kernel: [<ffffffff8108d0f0>] cancel_work_sync+0x10/0x20 Jun 14 01:14:22 devvm kernel: [<ffffffffa01c5ca5>] e1000_down_and_stop+0x25/0x50 [e1000] Jun 14 01:14:22 devvm kernel: [<ffffffffa01cb695>] e1000_down+0x155/0x200 [e1000] Jun 14 01:14:22 devvm kernel: [<ffffffffa01cbcb0>] ? e1000_reset_task+0x0/0xe0 [e1000] Jun 14 01:14:22 devvm kernel: [<ffffffffa01cbd1e>] e1000_reset_task+0x6e/0xe0 [e1000] Jun 14 01:14:22 devvm kernel: [<ffffffff8108c760>] worker_thread+0x170/0x2a0 Jun 14 01:14:22 devvm kernel: [<ffffffff810920d0>] ? autoremove_wake_function+0x0/0x40 Jun 14 01:14:22 devvm kernel: [<ffffffff8108c5f0>] ? worker_thread+0x0/0x2a0 Jun 14 01:14:22 devvm kernel: [<ffffffff81091d66>] kthread+0x96/0xa0 Jun 14 01:14:22 devvm kernel: [<ffffffff8100c14a>] child_rip+0xa/0x20 Jun 14 01:14:22 devvm kernel: [<ffffffff81091cd0>] ? kthread+0x0/0xa0 Jun 14 01:14:22 devvm kernel: [<ffffffff8100c140>] ? child_rip+0x0/0x20 Jun 14 01:14:22 devvm kernel: INFO: task parted:8069 blocked for more than 120 seconds. Jun 14 01:14:22 devvm kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Jun 14 01:14:22 devvm kernel: parted D 0000000000000003 0 8069 7994 0x00000080 Jun 14 01:14:22 devvm kernel: ffff8800908b3bb8 0000000000000082 0000000000000000 ffff88010ab50080 Jun 14 01:14:22 devvm kernel: ffff880116c7d500 0000000000000001 0000000000000000 0000000000000000 Jun 14 01:14:22 devvm kernel: ffff88010ab50638 ffff8800908b3fd8 000000000000fb88 ffff88010ab50638 Jun 14 01:14:22 devvm kernel: Call Trace: Jun 14 01:14:22 devvm kernel: [<ffffffff814fe6a5>] schedule_timeout+0x215/0x2e0 Jun 14 01:14:22 devvm kernel: [<ffffffff814fe323>] wait_for_common+0x123/0x180 Jun 14 01:14:22 devvm kernel: [<ffffffff81060250>] ? default_wake_function+0x0/0x20 Jun 14 01:14:22 devvm kernel: [<ffffffff8112b6d0>] ? lru_add_drain_per_cpu+0x0/0x10 Jun 14 01:14:22 devvm kernel: [<ffffffff814fe43d>] wait_for_completion+0x1d/0x20 Jun 14 01:14:22 devvm kernel: [<ffffffff8108d177>] flush_work+0x77/0xc0 Jun 14 01:14:22 devvm kernel: [<ffffffff8108cbe0>] ? wq_barrier_func+0x0/0x20 Jun 14 01:14:22 devvm kernel: [<ffffffff8108d2f3>] schedule_on_each_cpu+0x133/0x180 Jun 14 01:14:22 devvm kernel: [<ffffffff811ad440>] ? invalidate_bh_lru+0x0/0x50 Jun 14 01:14:22 devvm kernel: [<ffffffff8112ae35>] lru_add_drain_all+0x15/0x20 Jun 14 01:14:22 devvm kernel: [<ffffffff811adf6a>] invalidate_bdev+0x2a/0x50 Jun 14 01:14:22 devvm kernel: [<ffffffff8125e9a4>] blkdev_ioctl+0x3b4/0x6e0 Jun 14 01:14:22 devvm kernel: [<ffffffff811b381c>] block_ioctl+0x3c/0x40 Jun 14 01:14:22 devvm kernel: [<ffffffff8118dec2>] vfs_ioctl+0x22/0xa0 Jun 14 01:14:22 devvm kernel: [<ffffffff8118e064>] do_vfs_ioctl+0x84/0x580 Jun 14 01:14:22 devvm kernel: [<ffffffff8118e5e1>] sys_ioctl+0x81/0xa0 Jun 14 01:14:22 devvm kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
在 /var/spool/abrt/oops-2013-06-14-01:10:53-1537-0 我們可以看到以下資訊:
在回溯中:
WARNING: at net/sched/sch_generic.c:261 dev_watchdog+0x26d/0x280() (Not tainted) Hardware name: VirtualBox NETDEV WATCHDOG: eth0 (e1000): transmit queue 0 timed out Modules linked in: vboxsf(U) ipv6 ppdev parport_pc parport microcode sg vboxguest(U) i2c_piix4 i2c_core e1000 snd_intel8x0 snd_ac97_codec ac97_bus snd_seq snd_seq_device snd_pcm snd_timer snd soundcore snd_page_alloc pcnet32 mii ext4 mbcache jbd2 sd_mod crc_t10dif ahci dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan] Pid: 0, comm: swapper Not tainted 2.6.32-279.el6.x86_64 #1 Call Trace: <IRQ> [<ffffffff8106b747>] ? warn_slowpath_common+0x87/0xc0 [<ffffffff8106b836>] ? warn_slowpath_fmt+0x46/0x50 [<ffffffff814595fd>] ? dev_watchdog+0x26d/0x280 [<ffffffff81099138>] ? sched_clock_cpu+0xb8/0x110 [<ffffffff81459390>] ? dev_watchdog+0x0/0x280 [<ffffffff8107e897>] ? run_timer_softirq+0x197/0x340 [<ffffffff810a21c0>] ? tick_sched_timer+0x0/0xc0 [<ffffffff8102b40d>] ? lapic_next_event+0x1d/0x30 [<ffffffff81073ec1>] ? __do_softirq+0xc1/0x1e0 [<ffffffff81096c50>] ? hrtimer_interrupt+0x140/0x250 [<ffffffff8100c24c>] ? call_softirq+0x1c/0x30 [<ffffffff8100de85>] ? do_softirq+0x65/0xa0 [<ffffffff81073ca5>] ? irq_exit+0x85/0x90 [<ffffffff81505be0>] ? smp_apic_timer_interrupt+0x70/0x9b [<ffffffff8100bc13>] ? apic_timer_interrupt+0x13/0x20 <EOI> [<ffffffff810387cb>] ? native_safe_halt+0xb/0x10 [<ffffffff810149cd>] ? default_idle+0x4d/0xb0 [<ffffffff81009e06>] ? cpu_idle+0xb6/0x110 [<ffffffff814e433a>] ? rest_init+0x7a/0x80 [<ffffffff81c21f7b>] ? start_kernel+0x424/0x430 [<ffffffff81c2133a>] ? x86_64_start_reservations+0x125/0x129 [<ffffffff81c21438>] ? x86_64_start_kernel+0xfa/0x109
在命令行中:
ro root=/dev/mapper/vg_01-lv_root rd_NO_LUKS LANG=en_US.UTF-8 KEYBOARDTYPE=pc KEYTABLE=sv-latin1 rd_NO_MD SYSFONT=latarcyrheb-sun16 rd_LVM_LV=vg_01/lv_root crashkernel=129M@0M rhgb quiet rd_LVM_LV=vg_01/lv_swap rd_NO_DM rhgb quie
附加資訊:
# uname -a Linux devvm 2.6.32-279.el6.x86_64 #1 SMP Fri Jun 22 12:19:21 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux # cat /etc/redhat-release CentOS release 6.3 (Final)
VirtualBox 4.2.6 版。
任何有關我們如何進行故障排除的見解都值得讚賞。如果您需要更多資訊,請告訴我。
嘗試切換網卡類型。可能是其中一個驅動程序的錯誤。