Linux
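Kernel errors with 3.13.0-71-generic
After upgrading a number of hosts to Ubuntu 12.04.5 LTS with the LTS enablement stack (linux-generic-lts-trusty 3.13.0.40.35), we saw a sudden spike in kernel errors. They only started to appear after a few days of use and (to my untrained eye) don't seem to have much in common.
Are there known issues in 3.13.0-71-generic? What can we do to fix this (or at least figure out what is going on)? The errors have occurred in the field, but we have not yet been able to reproduce them in-house on identical hardware, so we have had no chance to check whether upgrading to the latest Trusty kernel fixes the problem.
The call traces are as follows: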
Apr 4 23:35:37 hostname kernel: [319114.311718] INFO: task python2.7:5769 blocked for more than 300 seconds.
Apr 4 23:35:37 hostname kernel: [319114.311959] Tainted: P OX 3.13.0-71-generic #114~precise1-Ubuntu
Apr 4 23:35:37 hostname kernel: [319114.312201] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 4 23:35:37 hostname kernel: [319114.312454] python2.7 D ffffffff81811520 0 5769 5767 0x00000000
Apr 4 23:35:37 hostname kernel: [319114.312457] ffff8800023c3be8 0000000000000082 ffff8800023c3ba8 ffff8800023c3fd8
Apr 4 23:35:37 hostname kernel: [319114.312459] 0000000000013180 0000000000013180 ffffffff81c144a0 ffff88000238b000
Apr 4 23:35:37 hostname kernel: [319114.312460] ffff8800023c3bc8 ffff8805ae5374a8 ffff8805ae5374ac 00000000ffffffff
Apr 4 23:35:37 hostname kernel: [319114.312462] Call Trace:
Apr 4 23:35:37 hostname kernel: [319114.312467] [<ffffffff81764799>] schedule+0x29/0x70
Apr 4 23:35:38 hostname kernel: [319114.312469] [<ffffffff81764abe>] schedule_preempt_disabled+0xe/0x10
Apr 4 23:35:38 hostname kernel: [319114.312470] [<ffffffff817668f4>] __mutex_lock_slowpath+0x114/0x1b0
Apr 4 23:35:38 hostname kernel: [319114.312472] [<ffffffff817669b3>] mutex_lock+0x23/0x37
Apr 4 23:35:38 hostname kernel: [319114.312474] [<ffffffff811da631>] do_last+0x281/0x7d0
Apr 4 23:35:38 hostname kernel: [319114.312475] [<ffffffff811dac44>] path_openat+0xc4/0x4c0
Apr 4 23:35:38 hostname kernel: [319114.312477] [<ffffffff811855eb>] ? __handle_mm_fault+0x1db/0x360
Apr 4 23:35:38 hostname kernel: [319114.312478] [<ffffffff81185823>] ? handle_mm_fault+0xb3/0x160
Apr 4 23:35:38 hostname kernel: [319114.312480] [<ffffffff811dbed3>] do_filp_open+0x43/0xa0
Apr 4 23:35:38 hostname kernel: [319114.312483] [<ffffffff811e900e>] ? __alloc_fd+0xce/0x120
Apr 4 23:35:38 hostname kernel: [319114.312486] [<ffffffff811ca786>] do_sys_open+0x136/0x2a0
Apr 4 23:35:38 hostname kernel: [319114.312488] [<ffffffff811ca90e>] SyS_open+0x1e/0x20
Apr 4 23:35:38 hostname kernel: [319114.312491] [<ffffffff8177145d>] system_call_fastpath+0x1a/0x1f
Apr 4 23:35:38 hostname kernel: [319114.312496] INFO: task python2.7:6320 blocked for more than 300 seconds.
Apr 4 23:35:38 hostname kernel: [319114.312758] Tainted: P OX 3.13.0-71-generic #114~precise1-Ubuntu
Apr 4 23:35:38 hostname kernel: [319114.313031] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 4 23:35:38 hostname kernel: [319114.313319] python2.7 D ffffffff81811520 0 6320 6314 0x00000000
Apr 4 23:35:38 hostname kernel: [319114.313320] ffff880021ebdbe8 0000000000000086 0000000000000286 ffff880021ebdfd8
Apr 4 23:35:40 hostname kernel: [319114.313322] 0000000000013180 0000000000013180 ffffffff81c144a0 ffff880002393000
Apr 4 23:35:40 hostname kernel: [319114.313323] ffff880021ebdbc8 ffff8805ae5374a8 ffff8805ae5374ac 00000000ffffffff

Apr 4 15:00:41 hostname kernel: [191113.073832] INFO: task python2.7:8525 blocked for more than 300 seconds.
Apr 4 15:01:00 hostname kernel: [191113.073859] Tainted: P OX 3.13.0-71-generic #114~precise1-Ubuntu
Apr 4 15:01:15 hostname kernel: [191113.073882] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 4 15:01:15 hostname kernel: [191113.073906] python2.7 D 0000000000000000 0 8525 8517 0x00000000
Apr 4 15:01:15 hostname kernel: [191113.073909] ffff880212b3dbe8 0000000000000082 ffff880212b3dba8 ffff880212b3dfd8
Apr 4 15:01:15 hostname kernel: [191113.073911] 0000000000013180 0000000000013180 ffff88000e04e000 ffff8803251de000
Apr 4 15:01:15 hostname kernel: [191113.073913] ffff880212b3dbd8 ffff8802888190a8 ffff8802888190ac 00000000ffffffff
Apr 4 15:01:15 hostname kernel: [191113.073915] Call Trace:
Apr 4 15:01:15 hostname kernel: [191113.073921] [<ffffffff81764799>] schedule+0x29/0x70
Apr 4 15:01:15 hostname kernel: [191113.073923] [<ffffffff81764abe>] schedule_preempt_disabled+0xe/0x10
Apr 4 15:01:15 hostname kernel: [191113.073926] [<ffffffff817668f4>] __mutex_lock_slowpath+0x114/0x1b0
Apr 4 15:01:15 hostname kernel: [191113.073927] [<ffffffff817669b3>] mutex_lock+0x23/0x37
Apr 4 15:01:15 hostname kernel: [191113.073930] [<ffffffff811da631>] do_last+0x281/0x7d0
Apr 4 15:01:15 hostname kernel: [191113.073931] [<ffffffff811dac44>] path_openat+0xc4/0x4c0
Apr 4 15:01:15 hostname kernel: [191113.073934] [<ffffffff811855eb>] ? __handle_mm_fault+0x1db/0x360
Apr 4 15:01:15 hostname kernel: [191113.073935] [<ffffffff81185823>] ? handle_mm_fault+0xb3/0x160

Apr 6 19:56:45 hostname kernel: [450264.877269] Out of memory: Kill process 26196 (python2.7) score 14 or sacrifice child
Apr 6 19:56:45 hostname kernel: [450264.877307] Killed process 26196 (python2.7) total-vm:76966004kB, anon-rss:88036kB, file-rss:170036kB
Apr 6 20:12:01 hostname kernel: [451123.424257] INFO: task cron:32543 blocked for more than 300 seconds.
Apr 6 20:12:01 hostname kernel: [451123.424286] Tainted: P OX 3.13.0-71-generic #114~precise1-Ubuntu
Apr 6 20:12:01 hostname kernel: [451123.424312] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 6 20:12:01 hostname kernel: [451123.424339] cron D ffffffff81811520 0 32543 1398 0x00000000
Apr 6 20:12:01 hostname kernel: [451123.424343] ffff880050453be8 0000000000000086 ffff880050453bd8 ffff880050453fd8
Apr 6 20:12:01 hostname kernel: [451123.424346] 0000000000013180 0000000000013180 ffff880873f20000 ffff88086e8d6000
Apr 6 20:12:01 hostname kernel: [451123.424348] 0000000000000286 ffff88086dedbb00 ffff88086dedbb04 00000000ffffffff
Apr 6 20:12:01 hostname kernel: [451123.424350] Call Trace:
Apr 6 20:12:01 hostname kernel: [451123.424356] [<ffffffff81764799>] schedule+0x29/0x70
Apr 6 20:12:01 hostname kernel: [451123.424359] [<ffffffff81764abe>] schedule_preempt_disabled+0xe/0x10
Apr 6 20:12:01 hostname kernel: [451123.424362] [<ffffffff817668f4>] __mutex_lock_slowpath+0x114/0x1b0
Apr 6 20:12:01 hostname kernel: [451123.424364] [<ffffffff817669b3>] mutex_lock+0x23/0x37
Apr 6 20:12:01 hostname kernel: [451123.424366] [<ffffffff811da631>] do_last+0x281/0x7d0
Apr 6 20:12:01 hostname kernel: [451123.424368] [<ffffffff811dac44>] path_openat+0xc4/0x4c0
Apr 6 20:12:01 hostname kernel: [451123.424371] [<ffffffff811855eb>] ? __handle_mm_fault+0x1db/0x360
Apr 6 20:12:01 hostname kernel: [451123.424373] [<ffffffff81185823>] ? handle_mm_fault+0xb3/0x160
Apr 6 20:12:01 hostname kernel: [451123.424375] [<ffffffff811dbed3>] do_filp_open+0x43/0xa0
Apr 6 20:12:01 hostname kernel: [451123.424378] [<ffffffff811e900e>] ? __alloc_fd+0xce/0x120
Apr 6 20:12:31 hostname kernel: [451123.424381] [<ffffffff811ca786>] do_sys_open+0x136/0x2a0
Apr 6 20:12:31 hostname kernel: [451123.424383] [<ffffffff811ca90e>] SyS_open+0x1e/0x20
Apr 6 20:12:31 hostname kernel: [451123.424387] [<ffffffff8177145d>] system_call_fastpath+0x1a/0x1f
This one might be a memory problem:
Apr 5 19:58:53 hostname kernel: [462034.034881] apache2 invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
Apr 5 19:58:53 hostname kernel: [462034.034885] apache2 cpuset=/ mems_allowed=0
Apr 5 19:58:53 hostname kernel: [462034.034888] CPU: 6 PID: 19720 Comm: apache2 Tainted: P OX 3.13.0-71-generic #114~precise1-Ubuntu
Apr 5 19:58:53 hostname kernel: [462034.034889] Hardware name: Supermicro C7Z87-OCE/C7Z87-OCE, BIOS 2.2 01/30/2015
Apr 5 19:58:53 hostname kernel: [462034.034890] 0000000000000000 ffff88089e46b888 ffffffff8175bca1 0000000000000007
Apr 5 19:58:53 hostname kernel: [462034.034893] ffff880203b91800 ffff88089e46b8d8 ffffffff8175172b ffff880800000000
Apr 5 19:58:53 hostname kernel: [462034.034895] 000201da81381898 ffff88001e730000 ffff880003f28000 0000000000000000
Apr 5 19:58:53 hostname kernel: [462034.034897] Call Trace:
Apr 5 19:58:53 hostname kernel: [462034.034902] [<ffffffff8175bca1>] dump_stack+0x46/0x58
Apr 5 19:58:53 hostname kernel: [462034.034905] [<ffffffff8175172b>] dump_header+0x7e/0xbd
Apr 5 19:58:53 hostname kernel: [462034.034907] [<ffffffff817517c1>] oom_kill_process.part.5+0x57/0x2d7
Apr 5 19:58:53 hostname kernel: [462034.034910] [<ffffffff8115cb27>] oom_kill_process+0x47/0x50
Apr 5 19:58:53 hostname kernel: [462034.034912] [<ffffffff8115ce65>] out_of_memory+0x145/0x1d0
Apr 5 19:58:53 hostname kernel: [462034.034915] [<ffffffff81162e17>] __alloc_pages_nodemask+0xab7/0xbb0
Apr 5 19:58:53 hostname kernel: [462034.034919] [<ffffffff811a4102>] alloc_pages_current+0xb2/0x170
Apr 5 19:58:53 hostname kernel: [462034.034921] [<ffffffff811591c7>] __page_cache_alloc+0xb7/0xd0
Apr 5 19:58:53 hostname kernel: [462034.034923] [<ffffffff8115afbd>] filemap_fault+0x28d/0x440
Apr 5 19:58:53 hostname kernel: [462034.034926] [<ffffffff811811ef>] __do_fault+0x6f/0x530
Apr 5 19:58:53 hostname kernel: [462034.034928] [<ffffffff81185046>] handle_pte_fault+0x96/0x230
Apr 5 19:58:53 hostname kernel: [462034.034930] [<ffffffff81764799>] ? schedule+0x29/0x70
Apr 5 19:58:53 hostname kernel: [462034.034932] [<ffffffff811855eb>] __handle_mm_fault+0x1db/0x360
Apr 5 19:58:53 hostname kernel: [462034.034934] [<ffffffff81185823>] handle_mm_fault+0xb3/0x160
Apr 5 19:58:53 hostname kernel: [462034.034937] [<ffffffff8176c720>] __do_page_fault+0x1b0/0x580
Apr 5 19:58:53 hostname kernel: [462034.034940] [<ffffffff8101ce89>] ? read_tsc+0x9/0x20
Apr 5 19:58:53 hostname kernel: [462034.034943] [<ffffffff810d329c>] ? ktime_get_ts+0x4c/0xe0
Apr 5 19:58:53 hostname kernel: [462034.034946] [<ffffffff811deb4d>] ? poll_select_copy_remaining+0xed/0x140
Apr 5 19:58:53 hostname kernel: [462034.034948] [<ffffffff8176cb0a>] do_page_fault+0x1a/0x70
Apr 5 19:58:53 hostname kernel: [462034.034950] [<ffffffff81768b28>] page_fault+0x28/0x30
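Since I have to answer this question in order to close it: as suggested in Michael Hampton's comment, updating the kernel (to .85) resolved the problem.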
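For anyone comparing hosts before and after such a kernel update, it can help to tally how often the hung-task and OOM messages quoted in the question actually occur. The following is only a rough sketch, assuming the syslog message formats shown in the traces above; the function name `summarize_kernel_log` and the regular expressions are my own illustration, not part of any existing tool.

```python
import re
from collections import Counter

# Patterns written against the message formats quoted in the question.
# They are assumptions based on those excerpts, not an exhaustive grammar.
HUNG_TASK = re.compile(r"INFO: task (\S+):(\d+) blocked for more than (\d+) seconds")
OOM_INVOKED = re.compile(r"(\S+) invoked oom-killer")
OOM_KILLED = re.compile(r"Killed process (\d+) \(([^)]+)\)")


def summarize_kernel_log(text):
    """Count hung-task and OOM events per command name in raw syslog text.

    Works on the whole text at once, so it does not matter whether each
    log entry sits on its own line.
    """
    return {
        # Command names of tasks reported as blocked (hung-task detector).
        "hung_tasks": Counter(name for name, _pid, _secs in HUNG_TASK.findall(text)),
        # Command names whose allocation triggered the OOM killer.
        "oom_invoked": Counter(OOM_INVOKED.findall(text)),
        # Command names of the processes the OOM killer terminated.
        "oom_killed": Counter(name for _pid, name in OOM_KILLED.findall(text)),
    }
```

Feeding it the contents of a kernel log file (for example the text read from /var/log/kern.log, if that is where your syslog daemon writes kernel messages) returns three counters keyed by command name, which makes it easier to see whether the reports stop after the update.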