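Apache stops communicating with memcache after too many virtual hosts are created
I've noticed a very peculiar problem with Apache. I have a very large number of virtual hosts set up, about 501.
The problem starts after virtual host number 493. The first 493 virtual hosts work as expected, but as soon as I add virtual host number 494, PHP stops communicating with memcache, and every read/write access times out.
I actually use memcache as the backend session store, so the PHP function: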
session_start();
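simply times out after 30 seconds.
If I remove a random one of the 494 virtual hosts and restart Apache, it starts working again.
I have already set ulimit very high (65k), but it didn't help. I also tried turning the limit off completely, with no luck.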
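One thing that can be ruled out at this point is whether the running httpd process actually has the raised limit, since a ulimit set in a shell does not necessarily reach a daemon started elsewhere. A minimal sketch in Python, assuming a Linux-style /proc filesystem; the function names `parse_nofile_limit` and `nofile_limit_of` are made up for this example:

```python
def parse_nofile_limit(limits_text):
    """Return (soft, hard) for "Max open files" from /proc/<pid>/limits text."""
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            # Columns: "Max open files", soft limit, hard limit, units.
            soft, hard = line.split()[3:5]
            return soft, hard
    return None  # the line was not found


def nofile_limit_of(pid="self"):
    """Read the limit of a live process; pid is a number or "self"."""
    with open(f"/proc/{pid}/limits") as handle:
        return parse_nofile_limit(handle.read())
```

The values are returned as strings because the kernel may print "unlimited" instead of a number. Calling `nofile_limit_of()` with the PID of an httpd child shows what that process really runs with.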
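Do you have any ideas what else I could try?
After entering the URL in my browser and waiting 30 seconds, I tried to strace the httpd process I was connected to.
Here is the strace output: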
select(1170, [1024 1169], [], NULL, {1, 0}) = 2 (in [1024 1169], left {0, 999998})
select(1170, [1024 1169], [], NULL, {1, 0}) = 2 (in [1024 1169], left {0, 999998})
select(1170, [1024 1169], [], NULL, {1, 0}) = 2 (in [1024 1169], left {0, 999998})
select(1170, [1024 1169], [], NULL, {1, 0}) = 2 (in [1024 1169], left {0, 999998})
select(1170, [1024 1169], [], NULL, {1, 0}) = 2 (in [1024 1169], left {0, 999998})
So basically Apache is stuck on select(), and that's it: it repeats the select() system call indefinitely.
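The descriptor numbers in that trace (1024 and 1169) suggest the process holds a large number of open descriptors, so it may help to look at how many it has and how high they are numbered. A rough sketch in Python, assuming a Linux-style /proc filesystem; the function name `fd_summary` and the default threshold of 1024 (the lowest number in the trace) are choices made for this example:

```python
import os


def fd_summary(pid="self", threshold=1024):
    """Summarise the open descriptors of a process via /proc/<pid>/fd."""
    fd_dir = f"/proc/{pid}/fd"
    if not os.path.isdir(fd_dir):
        return None  # no Linux-style /proc available here
    # Every entry in the directory is named after a descriptor number.
    numbers = sorted(int(name) for name in os.listdir(fd_dir))
    return {
        "count": len(numbers),
        "highest": numbers[-1] if numbers else None,
        "at_or_above_threshold": [n for n in numbers if n >= threshold],
    }
```

Run against the PID of an httpd child (as root or the Apache user), this shows whether the count grows with the number of virtual hosts.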
The next thing I came up with was tcpdump, to see whether the packets actually get through from Apache, and they do:
22:11:28.366677 IP6 ::1.51404 > ::1.11914: Flags [S], seq 2899674987, win 32752, options [mss 16376,sackOK,TS val 1384759049 ecr 0,nop,wscale 9], length 0
22:11:28.366697 IP6 ::1.11914 > ::1.51404: Flags [S.], seq 2034630080, ack 2899674988, win 32728, options [mss 16376,sackOK,TS val 1384759049 ecr 1384759049,nop,wscale 9], length 0
22:11:28.366709 IP6 ::1.51404 > ::1.11914: Flags [.], ack 1, win 64, options [nop,nop,TS val 1384759049 ecr 1384759049], length 0
22:11:28.366752 IP6 ::1.51404 > ::1.11914: Flags [P.], seq 1:41, ack 1, win 64, options [nop,nop,TS val 1384759049 ecr 1384759049], length 40
22:11:28.366758 IP6 ::1.11914 > ::1.51404: Flags [.], ack 41, win 64, options [nop,nop,TS val 1384759049 ecr 1384759049], length 0
22:11:28.366768 IP6 ::1.51404 > ::1.11914: Flags [P.], seq 41:90, ack 1, win 64, options [nop,nop,TS val 1384759050 ecr 1384759049], length 49
22:11:28.366772 IP6 ::1.11914 > ::1.51404: Flags [.], ack 90, win 64, options [nop,nop,TS val 1384759050 ecr 1384759050], length 0
22:11:28.366779 IP6 ::1.51404 > ::1.11914: Flags [P.], seq 90:122, ack 1, win 64, options [nop,nop,TS val 1384759050 ecr 1384759050], length 32
22:11:28.366783 IP6 ::1.11914 > ::1.51404: Flags [.], ack 122, win 64, options [nop,nop,TS val 1384759050 ecr 1384759050], length 0
22:11:28.367063 IP6 ::1.11914 > ::1.51404: Flags [P.], seq 1:12, ack 122, win 64, options [nop,nop,TS val 1384759050 ecr 1384759050], length 11
22:11:28.367070 IP6 ::1.51404 > ::1.11914: Flags [.], ack 12, win 64, options [nop,nop,TS val 1384759050 ecr 1384759050], length 0
22:11:28.367266 IP6 ::1.11914 > ::1.51404: Flags [P.], seq 12:20, ack 122, win 64, options [nop,nop,TS val 1384759050 ecr 1384759050], length 8
22:11:28.367275 IP6 ::1.51404 > ::1.11914: Flags [.], ack 20, win 64, options [nop,nop,TS val 1384759050 ecr 1384759050], length 0
22:11:28.367477 IP6 ::1.11914 > ::1.51404: Flags [P.], seq 20:25, ack 122, win 64, options [nop,nop,TS val 1384759050 ecr 1384759050], length 5
22:11:28.367489 IP6 ::1.51404 > ::1.11914: Flags [.], ack 25, win 64, options [nop,nop,TS val 1384759050 ecr 1384759050], length 0
22:11:28.367629 IP6 ::1.51404 > ::1.11914: Flags [P.], seq 122:181, ack 25, win 64, options [nop,nop,TS val 1384759050 ecr 1384759050], length 59
22:11:28.367859 IP6 ::1.11914 > ::1.51404: Flags [P.], seq 25:33, ack 181, win 64, options [nop,nop,TS val 1384759051 ecr 1384759050], length 8
22:11:28.367869 IP6 ::1.51404 > ::1.11914: Flags [P.], seq 181:230, ack 33, win 64, options [nop,nop,TS val 1384759051 ecr 1384759051], length 49
22:11:28.368102 IP6 ::1.11914 > ::1.51404: Flags [P.], seq 33:41, ack 230, win 64, options [nop,nop,TS val 1384759051 ecr 1384759051], length 8
22:11:28.368138 IP6 ::1.51404 > ::1.11914: Flags [F.], seq 230, ack 41, win 64, options [nop,nop,TS val 1384759051 ecr 1384759051], length 0
22:11:28.368195 IP6 ::1.11914 > ::1.51404: Flags [F.], seq 41, ack 231, win 64, options [nop,nop,TS val 1384759051 ecr 1384759051], length 0
22:11:28.368206 IP6 ::1.51404 > ::1.11914: Flags [.], ack 42, win 64, options [nop,nop,TS val 1384759051 ecr 1384759051], length 0
The next thing I did was attach GDB to the Apache process while making a curl call to a page containing session_start(). This is the output:
232 *(*new)->local_addr = *sock->local_addr;
241 if (sock->local_addr->sa.sin.sin_family == AF_INET) {
238 (*new)->local_addr->pool = connection_context;
241 if (sock->local_addr->sa.sin.sin_family == AF_INET) {
238 (*new)->local_addr->pool = connection_context;
241 if (sock->local_addr->sa.sin.sin_family == AF_INET) {
245 else if (sock->local_addr->sa.sin.sin_family == AF_INET6) {
246 (*new)->local_addr->ipaddr_ptr = &(*new)->local_addr->sa.sin6.sin6_addr;
249 (*new)->remote_addr->port = ntohs((*new)->remote_addr->sa.sin.sin_port);
250 if (sock->local_port_unknown) {
256 if (apr_is_option_set(sock, APR_TCP_NODELAY) == 1) {
257 apr_set_option(*new, APR_TCP_NODELAY, 1);
266 if (sock->local_interface_unknown ||
267 !memcmp(sock->local_addr->ipaddr_ptr,
266 if (sock->local_interface_unknown ||
276 (*new)->local_interface_unknown = 1;
293 apr_pool_cleanup_register((*new)->pool, (void *)(*new), socket_cleanup,
292 (*new)->inherit = 0;
293 apr_pool_cleanup_register((*new)->pool, (void *)(*new), socket_cleanup,
296 }
unixd_accept (accepted=0x7fff14ecddf0, lr=0x7fe93a905aa8, ptrans=<value optimized out>) at /usr/src/debug/httpd-2.2.15/os/unix/unixd.c:507
507 if (status == APR_SUCCESS) {
508 *accepted = csd;
649 }
child_main (child_num_arg=<value optimized out>) at /usr/src/debug/httpd-2.2.15/server/mpm/prefork/prefork.c:650
650 SAFE_ACCEPT(accept_mutex_off()); /* unlock after "accept" */
652 if (status == APR_EGENERAL) {
656 else if (status != APR_SUCCESS) {
665 current_conn = ap_run_create_connection(ptrans, ap_server_conf, csd, my_child_num, sbh, bucket_alloc);
666 if (current_conn) {
667 ap_process_connection(current_conn, csd);
At this point there is a long pause (~30 seconds) until PHP times out. After that, I get this:
668 ap_lingering_close(current_conn);
676 if (ap_mpm_pod_check(pod) == APR_SUCCESS) { /* selected as idle? */
680 ap_scoreboard_image->global->running_generation) { /* restart? */
679 else if (ap_my_generation !=
680 ap_scoreboard_image->global->running_generation) { /* restart? */
679 else if (ap_my_generation !=
551 while (!die_now && !shutdown_pending) {
559 apr_pool_clear(ptrans);
562 && requests_this_child++ >= ap_max_requests_per_child)) {
561 if ((ap_max_requests_per_child > 0
562 && requests_this_child++ >= ap_max_requests_per_child)) {
561 if ((ap_max_requests_per_child > 0
562 && requests_this_child++ >= ap_max_requests_per_child)) {
561 if ((ap_max_requests_per_child > 0
566 (void) ap_update_child_status(sbh, SERVER_READY, (request_rec *) NULL);
573 SAFE_ACCEPT(accept_mutex_on());
575 if (num_listensocks == 1) {
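The strangest thing is that I can't reproduce it on another machine. Same operating system, same packages, same configuration (Puppet), same kernel, different hardware.
After a few weeks of debugging and watching the problem, I finally stumbled upon this message: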
You MUST recompile PHP with a larger value of FD_SETSIZE. It is set to 1024, but you have descriptors numbered at least as high as 1073. --enable-fd-setsize=2048 is recommended, but you may want to set it to equal the maximum number of open files supported by your system, in order to avoid seeing this error again at a later date.
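The message refers to the fixed-size fd_set used by select(): it can only represent descriptors numbered below FD_SETSIZE. As a rough illustration (in Python rather than PHP, and not the code path PHP or the memcache extension actually takes), the sketch below duplicates a pipe onto a descriptor numbered at least 1073 and waits on it with both select() and poll(). The function name `select_vs_poll` and the default of 1073 (taken from the message) are choices made for this example, and it assumes a Unix system where the process may raise its soft nofile limit.

```python
import fcntl
import os
import resource
import select


def select_vs_poll(min_fd=1073):
    """Wait on a descriptor numbered >= min_fd with select(), then poll()."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    needed = min_fd + 16
    if soft != resource.RLIM_INFINITY and soft < needed:
        if hard != resource.RLIM_INFINITY and hard < needed:
            return None  # cannot open a descriptor that high here
        resource.setrlimit(resource.RLIMIT_NOFILE, (needed, hard))

    read_end, write_end = os.pipe()
    fds = [read_end, write_end]
    try:
        # Duplicate the read end onto the lowest free number >= min_fd.
        high_fd = fcntl.fcntl(read_end, fcntl.F_DUPFD, min_fd)
        fds.append(high_fd)
        os.write(write_end, b"x")  # make the descriptor readable

        try:
            select.select([high_fd], [], [], 1.0)
            select_ok = True
        except ValueError:
            # Python rejects descriptors that do not fit in an fd_set.
            select_ok = False

        poller = select.poll()
        poller.register(high_fd, select.POLLIN)
        poll_ok = bool(poller.poll(1000))  # timeout in milliseconds
        return {"fd": high_fd, "select_ok": select_ok, "poll_ok": poll_ok}
    finally:
        for fd in fds:
            os.close(fd)
```

On a typical Linux build, where FD_SETSIZE is 1024, I would expect Python to refuse the select() call while poll() reports the descriptor as readable. poll() takes a list of descriptors rather than a fixed-size bitmap, so it has no such ceiling.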
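I will try this fix, but good grief, why would the PHP folks do this? It's ugly; hardcoding the nofile limit is a completely broken design. Not to mention that, if this is the solution, having to recompile PHP for every minor version and security patch and maintain my own package is a big hassle.
Edit: After more extensive debugging, it seems that it's not just PHP that is "broken by design"; the memcache extension itself has plenty of problems too.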
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=629896
https://bugs.php.net/bug.php?id=59876
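The bugs have been open for a long time and nothing has happened. I guess I should just ditch the memcache extension and find a solution that doesn't depend on it :-/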