Linux

Kernel errors with 3.13.0-71-generic

  • May 13, 2016

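After upgrading several hosts to Ubuntu 12.04.5 LTS with the LTS enablement stack (linux-generic-lts-trusty 3.13.0.40.35), we have seen a sudden spike in kernel errors. They only start to appear after a few days of use and (to my untrained eye) do not seem to have much in common.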

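Are there known issues in 3.13.0-71-generic? What can we do to resolve this (or at least figure out what is going on)? These errors have occurred in the field, but we have not yet been able to reproduce them in-house on the same hardware, so we have not had a chance to see whether upgrading to the latest Trusty kernel fixes the problem.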

The call traces are as follows:

Apr  4 23:35:37 hostname kernel: [319114.311718] INFO: task python2.7:5769 blocked for more than 300 seconds.
Apr  4 23:35:37 hostname kernel: [319114.311959]       Tainted: P           OX 3.13.0-71-generic #114~precise1-Ubuntu
Apr  4 23:35:37 hostname kernel: [319114.312201] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr  4 23:35:37 hostname kernel: [319114.312454] python2.7       D ffffffff81811520     0  5769   5767 0x00000000
Apr  4 23:35:37 hostname kernel: [319114.312457]  ffff8800023c3be8 0000000000000082 ffff8800023c3ba8 ffff8800023c3fd8
Apr  4 23:35:37 hostname kernel: [319114.312459]  0000000000013180 0000000000013180 ffffffff81c144a0 ffff88000238b000
Apr  4 23:35:37 hostname kernel: [319114.312460]  ffff8800023c3bc8 ffff8805ae5374a8 ffff8805ae5374ac 00000000ffffffff
Apr  4 23:35:37 hostname kernel: [319114.312462] Call Trace:
Apr  4 23:35:37 hostname kernel: [319114.312467]  [<ffffffff81764799>] schedule+0x29/0x70
Apr  4 23:35:38 hostname kernel: [319114.312469]  [<ffffffff81764abe>] schedule_preempt_disabled+0xe/0x10
Apr  4 23:35:38 hostname kernel: [319114.312470]  [<ffffffff817668f4>] __mutex_lock_slowpath+0x114/0x1b0
Apr  4 23:35:38 hostname kernel: [319114.312472]  [<ffffffff817669b3>] mutex_lock+0x23/0x37
Apr  4 23:35:38 hostname kernel: [319114.312474]  [<ffffffff811da631>] do_last+0x281/0x7d0
Apr  4 23:35:38 hostname kernel: [319114.312475]  [<ffffffff811dac44>] path_openat+0xc4/0x4c0
Apr  4 23:35:38 hostname kernel: [319114.312477]  [<ffffffff811855eb>] ? __handle_mm_fault+0x1db/0x360
Apr  4 23:35:38 hostname kernel: [319114.312478]  [<ffffffff81185823>] ? handle_mm_fault+0xb3/0x160
Apr  4 23:35:38 hostname kernel: [319114.312480]  [<ffffffff811dbed3>] do_filp_open+0x43/0xa0
Apr  4 23:35:38 hostname kernel: [319114.312483]  [<ffffffff811e900e>] ? __alloc_fd+0xce/0x120
Apr  4 23:35:38 hostname kernel: [319114.312486]  [<ffffffff811ca786>] do_sys_open+0x136/0x2a0
Apr  4 23:35:38 hostname kernel: [319114.312488]  [<ffffffff811ca90e>] SyS_open+0x1e/0x20
Apr  4 23:35:38 hostname kernel: [319114.312491]  [<ffffffff8177145d>] system_call_fastpath+0x1a/0x1f
Apr  4 23:35:38 hostname kernel: [319114.312496] INFO: task python2.7:6320 blocked for more than 300 seconds.
Apr  4 23:35:38 hostname kernel: [319114.312758]       Tainted: P           OX 3.13.0-71-generic #114~precise1-Ubuntu
Apr  4 23:35:38 hostname kernel: [319114.313031] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr  4 23:35:38 hostname kernel: [319114.313319] python2.7       D ffffffff81811520     0  6320   6314 0x00000000
Apr  4 23:35:38 hostname kernel: [319114.313320]  ffff880021ebdbe8 0000000000000086 0000000000000286 ffff880021ebdfd8
Apr  4 23:35:40 hostname kernel: [319114.313322]  0000000000013180 0000000000013180 ffffffff81c144a0 ffff880002393000
Apr  4 23:35:40 hostname kernel: [319114.313323]  ffff880021ebdbc8 ffff8805ae5374a8 ffff8805ae5374ac 00000000ffffffff

Apr 4 15:00:41 hostname kernel: [191113.073832] INFO: task python2.7:8525 blocked for more than 300 seconds.
Apr 4 15:01:00 hostname kernel: [191113.073859] Tainted: P OX 3.13.0-71-generic #114~precise1-Ubuntu
Apr 4 15:01:15 hostname kernel: [191113.073882] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 4 15:01:15 hostname kernel: [191113.073906] python2.7 D 0000000000000000 0 8525 8517 0x00000000
Apr 4 15:01:15 hostname kernel: [191113.073909] ffff880212b3dbe8 0000000000000082 ffff880212b3dba8 ffff880212b3dfd8
Apr 4 15:01:15 hostname kernel: [191113.073911] 0000000000013180 0000000000013180 ffff88000e04e000 ffff8803251de000
Apr 4 15:01:15 hostname kernel: [191113.073913] ffff880212b3dbd8 ffff8802888190a8 ffff8802888190ac 00000000ffffffff
Apr 4 15:01:15 hostname kernel: [191113.073915] Call Trace:
Apr 4 15:01:15 hostname kernel: [191113.073921] [<ffffffff81764799>] schedule+0x29/0x70
Apr 4 15:01:15 hostname kernel: [191113.073923] [<ffffffff81764abe>] schedule_preempt_disabled+0xe/0x10
Apr 4 15:01:15 hostname kernel: [191113.073926] [<ffffffff817668f4>] __mutex_lock_slowpath+0x114/0x1b0
Apr 4 15:01:15 hostname kernel: [191113.073927] [<ffffffff817669b3>] mutex_lock+0x23/0x37
Apr 4 15:01:15 hostname kernel: [191113.073930] [<ffffffff811da631>] do_last+0x281/0x7d0
Apr 4 15:01:15 hostname kernel: [191113.073931] [<ffffffff811dac44>] path_openat+0xc4/0x4c0
Apr 4 15:01:15 hostname kernel: [191113.073934] [<ffffffff811855eb>] ? __handle_mm_fault+0x1db/0x360
Apr 4 15:01:15 hostname kernel: [191113.073935] [<ffffffff81185823>] ? handle_mm_fault+0xb3/0x160

Apr  6 19:56:45 hostname kernel: [450264.877269] Out of memory: Kill process 26196 (python2.7) score 14 or sacrifice child
Apr  6 19:56:45 hostname kernel: [450264.877307] Killed process 26196 (python2.7) total-vm:76966004kB, anon-rss:88036kB, file-rss:170036kB
Apr  6 20:12:01 hostname kernel: [451123.424257] INFO: task cron:32543 blocked for more than 300 seconds.
Apr  6 20:12:01 hostname kernel: [451123.424286]       Tainted: P           OX 3.13.0-71-generic #114~precise1-Ubuntu
Apr  6 20:12:01 hostname kernel: [451123.424312] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr  6 20:12:01 hostname kernel: [451123.424339] cron            D ffffffff81811520     0 32543   1398 0x00000000
Apr  6 20:12:01 hostname kernel: [451123.424343]  ffff880050453be8 0000000000000086 ffff880050453bd8 ffff880050453fd8
Apr  6 20:12:01 hostname kernel: [451123.424346]  0000000000013180 0000000000013180 ffff880873f20000 ffff88086e8d6000
Apr  6 20:12:01 hostname kernel: [451123.424348]  0000000000000286 ffff88086dedbb00 ffff88086dedbb04 00000000ffffffff
Apr  6 20:12:01 hostname kernel: [451123.424350] Call Trace:
Apr  6 20:12:01 hostname kernel: [451123.424356]  [<ffffffff81764799>] schedule+0x29/0x70
Apr  6 20:12:01 hostname kernel: [451123.424359]  [<ffffffff81764abe>] schedule_preempt_disabled+0xe/0x10
Apr  6 20:12:01 hostname kernel: [451123.424362]  [<ffffffff817668f4>] __mutex_lock_slowpath+0x114/0x1b0
Apr  6 20:12:01 hostname kernel: [451123.424364]  [<ffffffff817669b3>] mutex_lock+0x23/0x37
Apr  6 20:12:01 hostname kernel: [451123.424366]  [<ffffffff811da631>] do_last+0x281/0x7d0
Apr  6 20:12:01 hostname kernel: [451123.424368]  [<ffffffff811dac44>] path_openat+0xc4/0x4c0
Apr  6 20:12:01 hostname kernel: [451123.424371]  [<ffffffff811855eb>] ? __handle_mm_fault+0x1db/0x360
Apr  6 20:12:01 hostname kernel: [451123.424373]  [<ffffffff81185823>] ? handle_mm_fault+0xb3/0x160
Apr  6 20:12:01 hostname kernel: [451123.424375]  [<ffffffff811dbed3>] do_filp_open+0x43/0xa0
Apr  6 20:12:01 hostname kernel: [451123.424378]  [<ffffffff811e900e>] ? __alloc_fd+0xce/0x120
Apr  6 20:12:31 hostname kernel: [451123.424381]  [<ffffffff811ca786>] do_sys_open+0x136/0x2a0
Apr  6 20:12:31 hostname kernel: [451123.424383]  [<ffffffff811ca90e>] SyS_open+0x1e/0x20
Apr  6 20:12:31 hostname kernel: [451123.424387]  [<ffffffff8177145d>] system_call_fastpath+0x1a/0x1f

This one may be bad memory:

Apr 5 19:58:53 hostname kernel: [462034.034881] apache2 invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
Apr 5 19:58:53 hostname kernel: [462034.034885] apache2 cpuset=/ mems_allowed=0
Apr 5 19:58:53 hostname kernel: [462034.034888] CPU: 6 PID: 19720 Comm: apache2 Tainted: P OX 3.13.0-71-generic #114~precise1-Ubuntu
Apr 5 19:58:53 hostname kernel: [462034.034889] Hardware name: Supermicro C7Z87-OCE/C7Z87-OCE, BIOS 2.2 01/30/2015
Apr 5 19:58:53 hostname kernel: [462034.034890] 0000000000000000 ffff88089e46b888 ffffffff8175bca1 0000000000000007
Apr 5 19:58:53 hostname kernel: [462034.034893] ffff880203b91800 ffff88089e46b8d8 ffffffff8175172b ffff880800000000
Apr 5 19:58:53 hostname kernel: [462034.034895] 000201da81381898 ffff88001e730000 ffff880003f28000 0000000000000000
Apr 5 19:58:53 hostname kernel: [462034.034897] Call Trace:
Apr 5 19:58:53 hostname kernel: [462034.034902] [<ffffffff8175bca1>] dump_stack+0x46/0x58
Apr 5 19:58:53 hostname kernel: [462034.034905] [<ffffffff8175172b>] dump_header+0x7e/0xbd
Apr 5 19:58:53 hostname kernel: [462034.034907] [<ffffffff817517c1>] oom_kill_process.part.5+0x57/0x2d7
Apr 5 19:58:53 hostname kernel: [462034.034910] [<ffffffff8115cb27>] oom_kill_process+0x47/0x50
Apr 5 19:58:53 hostname kernel: [462034.034912] [<ffffffff8115ce65>] out_of_memory+0x145/0x1d0
Apr 5 19:58:53 hostname kernel: [462034.034915] [<ffffffff81162e17>] __alloc_pages_nodemask+0xab7/0xbb0
Apr 5 19:58:53 hostname kernel: [462034.034919] [<ffffffff811a4102>] alloc_pages_current+0xb2/0x170
Apr 5 19:58:53 hostname kernel: [462034.034921] [<ffffffff811591c7>] __page_cache_alloc+0xb7/0xd0
Apr 5 19:58:53 hostname kernel: [462034.034923] [<ffffffff8115afbd>] filemap_fault+0x28d/0x440
Apr 5 19:58:53 hostname kernel: [462034.034926] [<ffffffff811811ef>] __do_fault+0x6f/0x530
Apr 5 19:58:53 hostname kernel: [462034.034928] [<ffffffff81185046>] handle_pte_fault+0x96/0x230
Apr 5 19:58:53 hostname kernel: [462034.034930] [<ffffffff81764799>] ? schedule+0x29/0x70
Apr 5 19:58:53 hostname kernel: [462034.034932] [<ffffffff811855eb>] __handle_mm_fault+0x1db/0x360
Apr 5 19:58:53 hostname kernel: [462034.034934] [<ffffffff81185823>] handle_mm_fault+0xb3/0x160
Apr 5 19:58:53 hostname kernel: [462034.034937] [<ffffffff8176c720>] __do_page_fault+0x1b0/0x580
Apr 5 19:58:53 hostname kernel: [462034.034940] [<ffffffff8101ce89>] ? read_tsc+0x9/0x20
Apr 5 19:58:53 hostname kernel: [462034.034943] [<ffffffff810d329c>] ? ktime_get_ts+0x4c/0xe0
Apr 5 19:58:53 hostname kernel: [462034.034946] [<ffffffff811deb4d>] ? poll_select_copy_remaining+0xed/0x140
Apr 5 19:58:53 hostname kernel: [462034.034948] [<ffffffff8176cb0a>] do_page_fault+0x1a/0x70
Apr 5 19:58:53 hostname kernel: [462034.034950] [<ffffffff81768b28>] page_fault+0x28/0x30

Since I have to answer this question in order to close it: per Michael Hampton's comment, updating the kernel (to .85) solved the problem.
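
To find hosts that are still on an older build, something like the following Python sketch could be run on each machine. It assumes that ".85" above refers to the ABI number in a release string such as 3.13.0-85-generic, which is my reading rather than something stated explicitly. The helper names `kernel_abi` and `is_at_least` are hypothetical.

```python
import platform
import re


def kernel_abi(release):
    """Split a release such as '3.13.0-71-generic' into ('3.13.0', 71).

    Returns None if the string does not follow that pattern.
    """
    match = re.match(r"^(\d+\.\d+\.\d+)-(\d+)", release)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


def is_at_least(release, base="3.13.0", min_abi=85):
    """Return True/False for kernels in the `base` series, None otherwise.

    min_abi=85 is an assumption taken from the ".85" in the answer.
    """
    parsed = kernel_abi(release)
    if parsed is None or parsed[0] != base:
        return None  # different series, so the comparison is not meaningful
    return parsed[1] >= min_abi


if __name__ == "__main__":
    running = platform.release()  # same string that `uname -r` prints
    print(running, is_at_least(running))
```

A result of False marks a host in the 3.13.0 series that has not yet booted into the newer kernel. None means the host runs a different series and needs a separate check.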

Source: https://serverfault.com/questions/769592