PHP-FPM socket reuse causing errors
I have a web server container running in App Engine that serves a REST API. I went with a fairly standard setup: nginx + PHP-FPM over a TCP socket (for some reason I have not been able to get a unix socket working). The DB connection is also a TCP socket, running over a Google Cloud VPN.
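For reference, a minimal sketch of the unix-socket variant; the socket path and the ownership lines are only an example, not what is actually deployed:

# nginx: point fastcgi_pass at a socket file instead of 127.0.0.1:9000
fastcgi_pass unix:/var/run/php-fpm.sock;

# php-fpm www.conf: listen on the same socket and let the nginx user reach it
listen = /var/run/php-fpm.sock
listen.owner = www-data
listen.group = www-data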
The problem
I am getting roughly 25% availability on the API. Usually a request will 504 Gateway Timeout after the maximum time (the App Engine nginx proxy is set to 60 seconds). Sometimes it will 502 Bad Gateway when PHP-FPM times out (request_terminate_timeout). I am trying to work out whether this is a misconfiguration of the App Engine nginx proxy, of my nginx, or of my PHP-FPM. Nginx should either be closing the sockets or reusing them, but it seems to be doing neither.
Client side
When I run siege against any given endpoint (25 users) for a few minutes, I see:

HTTP/1.1 504  60.88 secs:      176 bytes ==> GET /path/to/rest
...15 lines...
HTTP/1.1 504  61.23 secs:      176 bytes ==> GET /path/to/rest
HTTP/1.1 200  57.54 secs:    13143 bytes ==> GET /path/to/rest
HTTP/1.1 200  57.68 secs:    13143 bytes ==> GET /path/to/rest
HTTP/1.1 504  60.76 secs:      176 bytes ==> GET /path/to/rest
...15 lines...
HTTP/1.1 504  61.06 secs:      176 bytes ==> GET /path/to/rest
HTTP/1.1 200  33.35 secs:    13143 bytes ==> GET /path/to/rest
HTTP/1.1 200  32.97 secs:    13143 bytes ==> GET /path/to/rest
HTTP/1.1 200  36.61 secs:    13143 bytes ==> GET /path/to/rest
HTTP/1.1 200  39.00 secs:    13143 bytes ==> GET /path/to/rest
HTTP/1.1 200  42.47 secs:    13143 bytes ==> GET /path/to/rest
HTTP/1.1 200  48.51 secs:    13143 bytes ==> GET /path/to/rest
HTTP/1.1 200  56.45 secs:    13143 bytes ==> GET /path/to/rest

# Another run
HTTP/1.1 200   7.65 secs:    13143 bytes ==> GET /path/to/rest
...10 lines...
HTTP/1.1 200   8.20 secs:    13143 bytes ==> GET /path/to/rest
HTTP/1.1 502  47.15 secs:      166 bytes ==> GET /path/to/rest
HTTP/1.1 502  47.15 secs:      166 bytes ==> GET /path/to/rest
HTTP/1.1 200   8.30 secs:    13143 bytes ==> GET /path/to/rest
HTTP/1.1 504  61.15 secs:      176 bytes ==> GET /path/to/rest
HTTP/1.1 502  54.46 secs:      166 bytes ==> GET /path/to/rest
HTTP/1.1 502  54.33 secs:      166 bytes ==> GET /path/to/rest
HTTP/1.1 502  54.25 secs:      166 bytes ==> GET /path/to/rest
HTTP/1.1 502  53.63 secs:      166 bytes ==> GET /path/to/rest
HTTP/1.1 502  48.40 secs:      166 bytes ==> GET /path/to/rest
HTTP/1.1 200   7.31 secs:    13143 bytes ==> GET /path/to/rest
HTTP/1.1 200   6.97 secs:    13143 bytes ==> GET /path/to/rest
HTTP/1.1 200   7.27 secs:    13143 bytes ==> GET /path/to/rest
HTTP/1.1 200   7.26 secs:    13143 bytes ==> GET /path/to/rest
HTTP/1.1 502  54.99 secs:      166 bytes ==> GET /path/to/rest
HTTP/1.1 502  60.08 secs:      166 bytes ==> GET /path/to/rest
HTTP/1.1 502  60.56 secs:      166 bytes ==> GET /path/to/rest
HTTP/1.1 200   6.83 secs:    13143 bytes ==> GET /path/to/rest
HTTP/1.1 502  60.85 secs:      166 bytes ==> GET /path/to/rest
HTTP/1.1 502  59.99 secs:      166 bytes ==> GET /path/to/rest
HTTP/1.1 502  58.99 secs:      166 bytes ==> GET /path/to/rest
HTTP/1.1 502  52.40 secs:      166 bytes ==> GET /path/to/rest
HTTP/1.1 502  52.21 secs:      166 bytes ==> GET /path/to/rest
HTTP/1.1 502  59.61 secs:      166 bytes ==> GET /path/to/rest
HTTP/1.1 502  52.65 secs:      166 bytes ==> GET /path/to/rest
HTTP/1.1 200   7.13 secs:    13143 bytes ==> GET /path/to/rest
HTTP/1.1 200   6.96 secs:    13143 bytes ==> GET /path/to/rest
HTTP/1.1 200   7.48 secs:    13143 bytes ==> GET /path/to/rest
HTTP/1.1 200   7.81 secs:    13143 bytes ==> GET /path/to/rest
HTTP/1.1 200   6.89 secs:    13143 bytes ==> GET /path/to/rest
HTTP/1.1 502  59.26 secs:      166 bytes ==> GET /path/to/rest
HTTP/1.1 200   6.80 secs:    13143 bytes ==> GET /path/to/rest
HTTP/1.1 502  59.44 secs:      166 bytes ==> GET /path/to/rest
This also happens with only 1 user:
HTTP/1.1 502  55.43 secs:      166 bytes ==> GET /path/to/rest
HTTP/1.1 200   7.71 secs:    13143 bytes ==> GET /path/to/rest
HTTP/1.1 200   7.54 secs:    13143 bytes ==> GET /path/to/rest
HTTP/1.1 502  59.21 secs:      166 bytes ==> GET /path/to/rest
Server side
The nginx logs for each case:
# 200
Normal logging i.e. [notice] GET /path/to/rest (param1, param2) ...

# 502
[error] 1059#0: *1395 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 172.18.0.3, server: gaeapp, request: "GET /path/to/rest HTTP/1.1", upstream: "fastcgi://127.0.0.1:9000", host: "api.example.com"

# 504
[error] 34#0: *326 upstream timed out (110: Operation timed out) while reading response header from upstream, client: 172.18.0.3, server: gaeapp, request: "GET /path/to/rest HTTP/1.1", upstream: "fastcgi://127.0.0.1:9000", host: "api.example.com"
netstat -t looks like this:

# Before starting
tcp        0      0 5c2ad0938ce9:http-alt   172.18.0.3:33971        ESTABLISHED
tcp        0      0 5c2ad0938ce9:http-alt   172.18.0.3:34072        ESTABLISHED

# During the siege
tcp        0      0 localhost:56144         localhost:9000          ESTABLISHED
tcp        0      0 5c2ad0938ce9:http-alt   172.18.0.3:34565        ESTABLISHED
tcp        0      0 5c2ad0938ce9:53073      192.168.2.29:postgresql ESTABLISHED
tcp        0      0 5c2ad0938ce9:http-alt   172.18.0.3:33971        ESTABLISHED
tcp        0      0 localhost:56148         localhost:9000          ESTABLISHED
tcp        0      0 5c2ad0938ce9:53071      192.168.2.29:postgresql ESTABLISHED
tcp        0      0 5c2ad0938ce9:http-alt   172.18.0.3:34580        ESTABLISHED
tcp        0      0 5c2ad0938ce9:http-alt   172.18.0.3:34584        ESTABLISHED
tcp        0      0 localhost:56106         localhost:9000          ESTABLISHED
tcp        0      0 localhost:56191         localhost:9000          ESTABLISHED
tcp        0      0 5c2ad0938ce9:http-alt   172.18.0.3:34566        ESTABLISHED
tcp        0      0 localhost:56113         localhost:9000          ESTABLISHED
tcp        0      0 localhost:56150         localhost:9000          ESTABLISHED
tcp        0      0 5c2ad0938ce9:http-alt   172.18.0.3:34591        ESTABLISHED
tcp        0      0 5c2ad0938ce9:http-alt   172.18.0.3:34574        ESTABLISHED
tcp        0      0 5c2ad0938ce9:http-alt   172.18.0.3:34072        ESTABLISHED
tcp        0      0 5c2ad0938ce9:53102      192.168.2.29:postgresql ESTABLISHED
tcp        0      0 5c2ad0938ce9:53051      192.168.2.29:postgresql ESTABLISHED
tcp        0      0 5c2ad0938ce9:http-alt   172.18.0.3:34572        ESTABLISHED
tcp        8      0 localhost:9000          localhost:56146         ESTABLISHED
tcp        0      0 localhost:9000          localhost:56117         TIME_WAIT
tcp        8      0 localhost:9000          localhost:56179         ESTABLISHED
tcp        8      0 localhost:9000          localhost:56160         ESTABLISHED
tcp        0      0 localhost:9000          localhost:56168         TIME_WAIT
tcp        0      0 localhost:9000          localhost:56170         TIME_WAIT
tcp        0      0 localhost:9000          localhost:56111         TIME_WAIT
tcp        0      0 localhost:9000          localhost:56115         TIME_WAIT
tcp        8      0 localhost:9000          localhost:56123         ESTABLISHED
tcp        0      0 localhost:9000          localhost:56109         TIME_WAIT
tcp        8      0 localhost:9000          localhost:56113         ESTABLISHED
tcp        0      0 localhost:9000          localhost:56140         TIME_WAIT
tcp        0      0 localhost:9000          localhost:56181         TIME_WAIT
tcp        0      0 localhost:9000          localhost:56121         TIME_WAIT
tcp        8      0 localhost:9000          localhost:56191         ESTABLISHED
tcp        0      0 localhost:9000          localhost:56119         TIME_WAIT
tcp        0      0 localhost:9000          localhost:56142         TIME_WAIT
tcp        8      0 localhost:9000          localhost:56106         ESTABLISHED
tcp        0      0 localhost:9000          localhost:56110         TIME_WAIT
tcp        8      0 localhost:9000          localhost:56144         ESTABLISHED
tcp        8      0 localhost:9000          localhost:56148         ESTABLISHED
tcp        8      0 localhost:9000          localhost:56150         ESTABLISHED

# A minute or so after ending the siege
tcp        0      0 5c2ad0938ce9:53319      192.168.2.29:postgresql ESTABLISHED
tcp        0      0 5c2ad0938ce9:http-alt   172.18.0.3:34578        ESTABLISHED
tcp        0      0 5c2ad0938ce9:http-alt   172.18.0.3:34576        ESTABLISHED
tcp        0      0 5c2ad0938ce9:http-alt   172.18.0.3:34570        ESTABLISHED
tcp        0      0 5c2ad0938ce9:http-alt   172.18.0.3:34565        ESTABLISHED
tcp        0      0 5c2ad0938ce9:http-alt   172.18.0.3:33971        ESTABLISHED
tcp        0      0 5c2ad0938ce9:http-alt   172.18.0.3:34580        ESTABLISHED
tcp        0      0 5c2ad0938ce9:http-alt   172.18.0.3:34584        ESTABLISHED
tcp        0      0 5c2ad0938ce9:http-alt   172.18.0.3:34566        ESTABLISHED
tcp        0      0 localhost:56396         localhost:9000          ESTABLISHED
tcp        0      0 5c2ad0938ce9:http-alt   172.18.0.3:34591        ESTABLISHED
tcp        0      0 5c2ad0938ce9:http-alt   172.18.0.3:34574        ESTABLISHED
tcp        0      0 5c2ad0938ce9:http-alt   172.18.0.3:34072        ESTABLISHED
tcp        0      0 5c2ad0938ce9:http-alt   172.18.0.3:34572        ESTABLISHED
tcp        8      0 localhost:9000          localhost:56396         ESTABLISHED
Configuration
nginx.conf
user www-data;
worker_processes auto;
worker_cpu_affinity auto;

events {
    worker_connections 512;
}

http {
    server_tokens off;
    fastcgi_ignore_client_abort off;
    keepalive_timeout 650;
    keepalive_requests 10000;
    gzip on;
    ..more gzip settings..

    server {
        charset utf-8;
        client_max_body_size 512M;
        listen 8080;
        rewrite_log on;
        root /app/web;
        index index.php;

        location / {
            try_files $uri /index.php?$args;
        }

        location ~ \.php$ {
            fastcgi_pass 127.0.0.1:9000;
            include /etc/nginx/fastcgi_params;
            fastcgi_keep_conn off;
            fastcgi_param SCRIPT_FILENAME $document_root/$fastcgi_script_name;
        }
    }

    include /etc/nginx/conf.d/*.conf;  # There are no extra conf files
}
php-fpm (www.conf)
[www]
user = www-data
group = www-data
listen = 127.0.0.1:9000
pm = ondemand
pm.process_idle_timeout = 10s
request_terminate_timeout = 45
More information
Disabling keepalive is a bad idea, because App Engine constantly polls the container with health checks, which produces a lot of dead TIME_WAIT sockets (I tried it).

Before request_terminate_timeout was set there were a lot of CLOSE_WAIT sockets instead of TIME_WAIT. Setting request_terminate_timeout = 45 does help in a sense, because the worker gets killed and serves 200s again once it respawns. A lower terminate timeout just produces more 502s and fewer 504s.

process_idle_timeout is ignored because the sockets are technically not idle.

Setting fastcgi_keep_conn on has no measurable effect on nginx's behaviour.
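For completeness, this is a rough sketch of how FastCGI connection reuse is usually configured in nginx: an upstream block with keepalive in addition to fastcgi_keep_conn on. The upstream name php_backend is purely illustrative and not part of my config:

# Hypothetical illustration only: connection reuse to PHP-FPM normally needs
# an upstream group with "keepalive" as well as "fastcgi_keep_conn on".
upstream php_backend {
    server 127.0.0.1:9000;
    keepalive 8;                # idle connections kept open per worker
}

server {
    ...
    location ~ \.php$ {
        fastcgi_pass php_backend;
        fastcgi_keep_conn on;   # keep the connection to PHP-FPM open
        include /etc/nginx/fastcgi_params;
        fastcgi_param SCRIPT_FILENAME $document_root/$fastcgi_script_name;
    }
}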
Solution

It turned out that the problem was related to the container configuration, not to the application. After setting the MTU to a value suitable for Google's cloud network (dropping it from 1500 to 1430), there were no further problems querying the application.
This was discovered by isolating the problem to requests that open a socket to the database over the Google Cloud VPN (see the postgresql entries in the netstat output above). We happened to have a VPN routed to a second VPN whose database connections performed fine, because only the first hop carried the high-MTU traffic.
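For anyone debugging something similar, a rough sketch of how the MTU issue can be confirmed and the value lowered; the host name, interface name and the Docker detail are placeholders and assumptions, not part of the original setup:

# Find the largest payload that passes with the Don't Fragment bit set:
# 1402 bytes of ICMP payload + 28 bytes of IP/ICMP headers = 1430 MTU.
ping -M do -s 1402 db.internal.example.com

# Lower the MTU on the interface directly (temporary, for testing).
ip link set dev eth0 mtu 1430

# If the container runs on a Docker-managed bridge, the same value can be set
# in /etc/docker/daemon.json and the daemon restarted:
#   { "mtu": 1430 }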