Nginx

PHP-FPM 套接字重用導致錯誤

  • May 1, 2017

我有一個在 App Engine 中執行的網路伺服器容器,它提供 REST API。我嘗試過一個相對標準的實現——使用 TCP 套接字的 nginx + PHP-FPM(由於某種原因,我還沒有讓 unix 套接字工作)。DB 連接也是一個執行在 Google Cloud VPN 上的 TCP 套接字。

問題

我在 API 上獲得了約 25% 的可用性。通常請求會504 Gateway Timeout在最大時間之後(App Engine 的 nginx 代理設置為 60 秒)。有時它會502 Bad Gateway在 PHP-FPM 超時 ( request_terminate_timeout) 時發生。

我正在嘗試確定這是 App Engine nginx 配置錯誤、我的 nginx 還是我的 PHP-FPM 配置。Nginx 應該關閉套接字或重用它們,但似乎兩者都沒有。

客戶端

當我在siege任何給定端點(25 個使用者)停留幾分鐘時,我看到:

HTTP/1.1 504    60.88 secs:     176 bytes ==> GET  /path/to/rest
                      ...15 lines...
HTTP/1.1 504    61.23 secs:     176 bytes ==> GET  /path/to/rest

HTTP/1.1 200    57.54 secs:   13143 bytes ==> GET  /path/to/rest
HTTP/1.1 200    57.68 secs:   13143 bytes ==> GET  /path/to/rest

HTTP/1.1 504    60.76 secs:     176 bytes ==> GET  /path/to/rest
                      ...15 lines...
HTTP/1.1 504    61.06 secs:     176 bytes ==> GET  /path/to/rest

HTTP/1.1 200    33.35 secs:   13143 bytes ==> GET  /path/to/rest
HTTP/1.1 200    32.97 secs:   13143 bytes ==> GET  /path/to/rest
HTTP/1.1 200    36.61 secs:   13143 bytes ==> GET  /path/to/rest
HTTP/1.1 200    39.00 secs:   13143 bytes ==> GET  /path/to/rest
HTTP/1.1 200    42.47 secs:   13143 bytes ==> GET  /path/to/rest
HTTP/1.1 200    48.51 secs:   13143 bytes ==> GET  /path/to/rest
HTTP/1.1 200    56.45 secs:   13143 bytes ==> GET  /path/to/rest

# Another run
HTTP/1.1 200     7.65 secs:   13143 bytes ==> GET  /path/to/rest
                      ...10 lines...
HTTP/1.1 200     8.20 secs:   13143 bytes ==> GET  /path/to/rest

HTTP/1.1 502    47.15 secs:     166 bytes ==> GET  /path/to/rest
HTTP/1.1 502    47.15 secs:     166 bytes ==> GET  /path/to/rest
HTTP/1.1 200     8.30 secs:   13143 bytes ==> GET  /path/to/rest
HTTP/1.1 504    61.15 secs:     176 bytes ==> GET  /path/to/rest
HTTP/1.1 502    54.46 secs:     166 bytes ==> GET  /path/to/rest
HTTP/1.1 502    54.33 secs:     166 bytes ==> GET  /path/to/rest
HTTP/1.1 502    54.25 secs:     166 bytes ==> GET  /path/to/rest
HTTP/1.1 502    53.63 secs:     166 bytes ==> GET  /path/to/rest
HTTP/1.1 502    48.40 secs:     166 bytes ==> GET  /path/to/rest
HTTP/1.1 200     7.31 secs:   13143 bytes ==> GET  /path/to/rest
HTTP/1.1 200     6.97 secs:   13143 bytes ==> GET  /path/to/rest
HTTP/1.1 200     7.27 secs:   13143 bytes ==> GET  /path/to/rest
HTTP/1.1 200     7.26 secs:   13143 bytes ==> GET  /path/to/rest
HTTP/1.1 502    54.99 secs:     166 bytes ==> GET  /path/to/rest
HTTP/1.1 502    60.08 secs:     166 bytes ==> GET  /path/to/rest
HTTP/1.1 502    60.56 secs:     166 bytes ==> GET  /path/to/rest
HTTP/1.1 200     6.83 secs:   13143 bytes ==> GET  /path/to/rest
HTTP/1.1 502    60.85 secs:     166 bytes ==> GET  /path/to/rest
HTTP/1.1 502    59.99 secs:     166 bytes ==> GET  /path/to/rest
HTTP/1.1 502    58.99 secs:     166 bytes ==> GET  /path/to/rest
HTTP/1.1 502    52.40 secs:     166 bytes ==> GET  /path/to/rest
HTTP/1.1 502    52.21 secs:     166 bytes ==> GET  /path/to/rest
HTTP/1.1 502    59.61 secs:     166 bytes ==> GET  /path/to/rest
HTTP/1.1 502    52.65 secs:     166 bytes ==> GET  /path/to/rest
HTTP/1.1 200     7.13 secs:   13143 bytes ==> GET  /path/to/rest
HTTP/1.1 200     6.96 secs:   13143 bytes ==> GET  /path/to/rest
HTTP/1.1 200     7.48 secs:   13143 bytes ==> GET  /path/to/rest
HTTP/1.1 200     7.81 secs:   13143 bytes ==> GET  /path/to/rest
HTTP/1.1 200     6.89 secs:   13143 bytes ==> GET  /path/to/rest
HTTP/1.1 502    59.26 secs:     166 bytes ==> GET  /path/to/rest
HTTP/1.1 200     6.80 secs:   13143 bytes ==> GET  /path/to/rest
HTTP/1.1 502    59.44 secs:     166 bytes ==> GET  /path/to/rest

這也發生在只有 1 個使用者:

HTTP/1.1 502    55.43 secs:     166 bytes ==> GET  /path/to/rest
HTTP/1.1 200     7.71 secs:   13143 bytes ==> GET  /path/to/rest
HTTP/1.1 200     7.54 secs:   13143 bytes ==> GET  /path/to/rest
HTTP/1.1 502    59.21 secs:     166 bytes ==> GET  /path/to/rest

伺服器端

每種情況的 Nginx 日誌:

# 200
Normal logging i.e. [notice] GET /path/to/rest (param1, param2) ...

# 502
[error] 1059#0: *1395 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 172.18.0.3, server: gaeapp, request: "GET /path/to/rest HTTP/1.1", upstream: "fastcgi://127.0.0.1:9000", host: "api.example.com"

# 504
[error] 34#0: *326 upstream timed out (110: Operation timed out) while reading response header from upstream, client: 172.18.0.3, server: gaeapp, request: "GET /path/to/rest HTTP/1.1", upstream: "fastcgi://127.0.0.1:9000", host: "api.example.com"

netstat -t看起來像:

# Before starting
tcp        0      0 5c2ad0938ce9:http-alt   172.18.0.3:33971        ESTABLISHED 
tcp        0      0 5c2ad0938ce9:http-alt   172.18.0.3:34072        ESTABLISHED 

# During the siege    
tcp        0      0 localhost:56144         localhost:9000          ESTABLISHED 
tcp        0      0 5c2ad0938ce9:http-alt   172.18.0.3:34565        ESTABLISHED 
tcp        0      0 5c2ad0938ce9:53073      192.168.2.29:postgresql ESTABLISHED 
tcp        0      0 5c2ad0938ce9:http-alt   172.18.0.3:33971        ESTABLISHED 
tcp        0      0 localhost:56148         localhost:9000          ESTABLISHED 
tcp        0      0 5c2ad0938ce9:53071      192.168.2.29:postgresql ESTABLISHED 
tcp        0      0 5c2ad0938ce9:http-alt   172.18.0.3:34580        ESTABLISHED 
tcp        0      0 5c2ad0938ce9:http-alt   172.18.0.3:34584        ESTABLISHED 
tcp        0      0 localhost:56106         localhost:9000          ESTABLISHED 
tcp        0      0 localhost:56191         localhost:9000          ESTABLISHED 
tcp        0      0 5c2ad0938ce9:http-alt   172.18.0.3:34566        ESTABLISHED 
tcp        0      0 localhost:56113         localhost:9000          ESTABLISHED 
tcp        0      0 localhost:56150         localhost:9000          ESTABLISHED 
tcp        0      0 5c2ad0938ce9:http-alt   172.18.0.3:34591        ESTABLISHED 
tcp        0      0 5c2ad0938ce9:http-alt   172.18.0.3:34574        ESTABLISHED 
tcp        0      0 5c2ad0938ce9:http-alt   172.18.0.3:34072        ESTABLISHED 
tcp        0      0 5c2ad0938ce9:53102      192.168.2.29:postgresql ESTABLISHED 
tcp        0      0 5c2ad0938ce9:53051      192.168.2.29:postgresql ESTABLISHED 
tcp        0      0 5c2ad0938ce9:http-alt   172.18.0.3:34572        ESTABLISHED 
tcp        8      0 localhost:9000          localhost:56146         ESTABLISHED 
tcp        0      0 localhost:9000          localhost:56117         TIME_WAIT   
tcp        8      0 localhost:9000          localhost:56179         ESTABLISHED 
tcp        8      0 localhost:9000          localhost:56160         ESTABLISHED 
tcp        0      0 localhost:9000          localhost:56168         TIME_WAIT   
tcp        0      0 localhost:9000          localhost:56170         TIME_WAIT   
tcp        0      0 localhost:9000          localhost:56111         TIME_WAIT   
tcp        0      0 localhost:9000          localhost:56115         TIME_WAIT   
tcp        8      0 localhost:9000          localhost:56123         ESTABLISHED 
tcp        0      0 localhost:9000          localhost:56109         TIME_WAIT   
tcp        8      0 localhost:9000          localhost:56113         ESTABLISHED 
tcp        0      0 localhost:9000          localhost:56140         TIME_WAIT   
tcp        0      0 localhost:9000          localhost:56181         TIME_WAIT   
tcp        0      0 localhost:9000          localhost:56121         TIME_WAIT   
tcp        8      0 localhost:9000          localhost:56191         ESTABLISHED 
tcp        0      0 localhost:9000          localhost:56119         TIME_WAIT   
tcp        0      0 localhost:9000          localhost:56142         TIME_WAIT   
tcp        8      0 localhost:9000          localhost:56106         ESTABLISHED 
tcp        0      0 localhost:9000          localhost:56110         TIME_WAIT   
tcp        8      0 localhost:9000          localhost:56144         ESTABLISHED 
tcp        8      0 localhost:9000          localhost:56148         ESTABLISHED 
tcp        8      0 localhost:9000          localhost:56150         ESTABLISHED 

# A minute or so after ending the siege
tcp        0      0 5c2ad0938ce9:53319      192.168.2.29:postgresql ESTABLISHED 
tcp        0      0 5c2ad0938ce9:http-alt   172.18.0.3:34578        ESTABLISHED 
tcp        0      0 5c2ad0938ce9:http-alt   172.18.0.3:34576        ESTABLISHED 
tcp        0      0 5c2ad0938ce9:http-alt   172.18.0.3:34570        ESTABLISHED 
tcp        0      0 5c2ad0938ce9:http-alt   172.18.0.3:34565        ESTABLISHED 
tcp        0      0 5c2ad0938ce9:http-alt   172.18.0.3:33971        ESTABLISHED 
tcp        0      0 5c2ad0938ce9:http-alt   172.18.0.3:34580        ESTABLISHED 
tcp        0      0 5c2ad0938ce9:http-alt   172.18.0.3:34584        ESTABLISHED 
tcp        0      0 5c2ad0938ce9:http-alt   172.18.0.3:34566        ESTABLISHED 
tcp        0      0 localhost:56396         localhost:9000          ESTABLISHED 
tcp        0      0 5c2ad0938ce9:http-alt   172.18.0.3:34591        ESTABLISHED 
tcp        0      0 5c2ad0938ce9:http-alt   172.18.0.3:34574        ESTABLISHED 
tcp        0      0 5c2ad0938ce9:http-alt   172.18.0.3:34072        ESTABLISHED 
tcp        0      0 5c2ad0938ce9:http-alt   172.18.0.3:34572        ESTABLISHED 
tcp        8      0 localhost:9000          localhost:56396         ESTABLISHED 

配置

nginx.conf

user  www-data;
worker_processes  auto;
worker_cpu_affinity auto;
events {
   worker_connections  512;
}
http {
   server_tokens   off;
   fastcgi_ignore_client_abort off;
   keepalive_timeout  650;
   keepalive_requests 10000;
   gzip on;
   ..more gzip settings..
   server {
       charset utf-8;
       client_max_body_size 512M;
       listen 8080;
       rewrite_log on;
       root        /app/web;
       index       index.php;
       location / {
           try_files $uri /index.php?$args;
       }
       location ~ \.php$ {
           fastcgi_pass 127.0.0.1:9000;
           include /etc/nginx/fastcgi_params;
           fastcgi_keep_conn off;
           fastcgi_param SCRIPT_FILENAME $document_root/$fastcgi_script_name;
       }
   }
   include /etc/nginx/conf.d/*.conf; # There are no extra conf files
}

php-fpm (www.conf)

[www]
user = www-data
group = www-data
listen = 127.0.0.1:9000
pm = ondemand
pm.process_idle_timeout = 10s
request_terminate_timeout = 45

更多資訊

禁用 keepalive 是個壞主意,因為 App Engine 會不斷輪詢容器進行健康檢查,這會產生很多死TIME_WAIT套接字(我試過了)。

之前request_terminate_timeout有很多CLOSE_WAIT套接字而不是TIME_WAIT. 設置request_terminate_timeout = 45在某種意義上確實有幫助,因為工作程序被殺死並在200重生後再次提供服務。較低的終止超時只會產生更多502的 s 和更少504的 s。

process_idle_timeout忽略,因為套接字在技術上不是空閒的。

設置fastcgi_keep_conn on對 nginx 的行為沒有可衡量的影響。

事實證明問題與容器配置有關,而不是與應用程序有關。將 MTU 設置為適用於 Google 的云網路的值(從 1500 下降到 1430)後,查詢應用程序不再有任何問題。

這是通過將問題隔離到僅通過 Google Cloud VPN 打開到數據庫的套接字的請求而發現的(請參閱日誌postgresql中的條目netstat)。我們碰巧有一個 VPN 路由到第二個 VPN,它的數據庫連接執行良好,因為只有第一跳承載高 MTU 流量。

引用自:https://serverfault.com/questions/846420