Networking

Nginx load balancing: bad configuration or bad behavior?

  • March 12, 2019

I am currently using Nginx as a load balancer to spread network traffic across 3 nodes running a NodeJS API.

The Nginx instance runs on node1, and every request is made to node1. I see roughly 700k requests over a 2-hour window, and nginx is configured to distribute them between node1, node2 and node3 in a round-robin fashion. Here is conf.d/deva.conf:

upstream deva_api {
   server 10.8.0.30:5555 fail_timeout=5s max_fails=3;
   server 10.8.0.40:5555 fail_timeout=5s max_fails=3;
   server localhost:5555;
   keepalive 300;
}

server {

       listen 8000;

       location /log_pages {

               proxy_redirect off;
               proxy_set_header Host $host;
               proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;

               proxy_http_version 1.1;
               proxy_set_header Connection "";

               add_header 'Access-Control-Allow-Origin' '*';
               add_header 'Access-Control-Allow-Methods' 'GET, POST, PATCH, PUT, DELETE, OPTIONS';
               add_header 'Access-Control-Allow-Headers' 'Authorization,Content-Type,Origin,X-Auth-Token';
               add_header 'Access-Control-Allow-Credentials' 'true';

               if ($request_method = OPTIONS ) {
                       return 200;
               }

               proxy_pass http://deva_api;
               proxy_set_header Connection "Keep-Alive";
               proxy_set_header Proxy-Connection "Keep-Alive";

               auth_basic "Restricted";                                #For Basic Auth
               auth_basic_user_file /etc/nginx/.htpasswd;  #For Basic Auth
       }
}
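
For reference, upstream keepalive in nginx requires HTTP/1.1 and, according to the nginx docs, a cleared Connection header; the location above first clears it and then sets Connection "Keep-Alive" and a Proxy-Connection header further down, which may work against the keepalive pool. A minimal sketch of the documented pattern, reusing the names and addresses from the config above:

upstream deva_api {
   server 10.8.0.30:5555 fail_timeout=5s max_fails=3;
   server 10.8.0.40:5555 fail_timeout=5s max_fails=3;
   server localhost:5555;
   keepalive 300;                        # idle connections kept per worker process
}

server {
       listen 8000;

       location /log_pages {
               proxy_http_version 1.1;
               proxy_set_header Connection "";   # cleared, as the keepalive docs require
               proxy_set_header Host $host;
               proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
               proxy_pass http://deva_api;
       }
}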

Here is the nginx.conf configuration:

user www-data;
worker_processes auto;
pid /run/nginx.pid;
include /etc/nginx/modules-enabled/*.conf;

worker_rlimit_nofile 65535;
events {
       worker_connections 65535;
       use epoll;
       multi_accept on;
}

http {

       ##
       # Basic Settings
       ##

       sendfile on;
       tcp_nopush on;
       tcp_nodelay on;
       keepalive_timeout 120;
       send_timeout 120;
       types_hash_max_size 2048;
       server_tokens off;

       client_max_body_size 100m;
       client_body_buffer_size  5m;
       client_header_buffer_size 5m;
       large_client_header_buffers 4 1m;

       open_file_cache max=200000 inactive=20s;
       open_file_cache_valid 30s;
       open_file_cache_min_uses 2;
       open_file_cache_errors on;

       reset_timedout_connection on;

       include /etc/nginx/mime.types;
       default_type application/octet-stream;

       ##
       # SSL Settings
       ##

       ssl_protocols TLSv1 TLSv1.1 TLSv1.2; # Dropping SSLv3, ref: POODLE
       ssl_prefer_server_ciphers on;

       ##
       # Logging Settings
       ##

       access_log /var/log/nginx/access.log;
       error_log /var/log/nginx/error.log;

       ##
       # Gzip Settings
       ##

       gzip on;
       include /etc/nginx/conf.d/*.conf;
       include /etc/nginx/sites-enabled/*;
}

The problem is that, with this configuration, I get hundreds of errors like the following in error.log:

upstream prematurely closed connection while reading response header from upstream

but only on node2 and node3. I have already tried the following tests:

  1. Increasing the number of concurrent API processes on each node (I actually use PM2 as the in-node balancer)
  2. Removing one node to make nginx's job easier
  3. Applying weights in nginx (see the sketch after this list)
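
For reference, the weights in point 3 are set per server line in the upstream block; a sketch along these lines (the weight values shown are illustrative only, not the ones actually tested):

upstream deva_api {
   # Illustrative weights: favour the local node over the two remote
   # nodes that are reached through the VPN.
   server localhost:5555 weight=3;
   server 10.8.0.30:5555 weight=1 fail_timeout=5s max_fails=3;
   server 10.8.0.40:5555 weight=1 fail_timeout=5s max_fails=3;
   keepalive 300;
}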

Nothing made the results any better. In these tests I noticed that the errors occurred only on the 2 remote nodes (node2 and node3), so I tried removing them from the equation. The result was that I no longer got those errors, but I started getting 2 different ones:

recv() failed (104: Connection reset by peer) while reading response header from upstream

writev() failed (32: Broken pipe) while sending request to upstream

The problem seems to be a lack of API capacity on node1: the APIs probably cannot respond to all of the inbound traffic before the client times out (that is my guess). Given that, I increased the number of concurrent API processes on node1 and the results were better than before, but I keep getting the latter 2 errors, and I cannot increase the number of concurrent API processes on node1 any further.
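
One way to check the timeout guess is to make the proxy-side timeouts explicit instead of relying on the defaults; a sketch of the relevant directives (the values are illustrative, not taken from the running config):

location /log_pages {
       # nginx defaults for all three timeouts are 60s.
       proxy_connect_timeout 10s;            # time allowed to establish the upstream connection
       proxy_send_timeout    60s;            # max gap between two writes to the upstream
       proxy_read_timeout    60s;            # max gap between two reads from the upstream
       proxy_next_upstream   error timeout;  # retry the next server on error/timeout
       proxy_pass http://deva_api;
}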

So, the question is: why can't I use nginx as a load balancer across all the nodes? Did I make a mistake in the nginx configuration? Are there other problems I have not noticed?

EDIT: I ran some network tests between the 3 nodes. The nodes communicate with each other over OpenVPN:

PING:

node1->node3
PING 10.8.0.40 (10.8.0.40) 56(84) bytes of data.
64 bytes from 10.8.0.40: icmp_seq=1 ttl=64 time=2.85 ms
64 bytes from 10.8.0.40: icmp_seq=2 ttl=64 time=1.85 ms
64 bytes from 10.8.0.40: icmp_seq=3 ttl=64 time=3.17 ms
64 bytes from 10.8.0.40: icmp_seq=4 ttl=64 time=3.21 ms
64 bytes from 10.8.0.40: icmp_seq=5 ttl=64 time=2.68 ms

node1->node2
PING 10.8.0.30 (10.8.0.30) 56(84) bytes of data.
64 bytes from 10.8.0.30: icmp_seq=1 ttl=64 time=2.16 ms
64 bytes from 10.8.0.30: icmp_seq=2 ttl=64 time=3.08 ms
64 bytes from 10.8.0.30: icmp_seq=3 ttl=64 time=10.9 ms
64 bytes from 10.8.0.30: icmp_seq=4 ttl=64 time=3.11 ms
64 bytes from 10.8.0.30: icmp_seq=5 ttl=64 time=3.25 ms

node2->node1
PING 10.8.0.12 (10.8.0.12) 56(84) bytes of data.
64 bytes from 10.8.0.12: icmp_seq=1 ttl=64 time=2.30 ms
64 bytes from 10.8.0.12: icmp_seq=2 ttl=64 time=8.30 ms
64 bytes from 10.8.0.12: icmp_seq=3 ttl=64 time=2.37 ms
64 bytes from 10.8.0.12: icmp_seq=4 ttl=64 time=2.42 ms
64 bytes from 10.8.0.12: icmp_seq=5 ttl=64 time=3.37 ms

node2->node3
PING 10.8.0.40 (10.8.0.40) 56(84) bytes of data.
64 bytes from 10.8.0.40: icmp_seq=1 ttl=64 time=2.86 ms
64 bytes from 10.8.0.40: icmp_seq=2 ttl=64 time=4.01 ms
64 bytes from 10.8.0.40: icmp_seq=3 ttl=64 time=5.37 ms
64 bytes from 10.8.0.40: icmp_seq=4 ttl=64 time=2.78 ms
64 bytes from 10.8.0.40: icmp_seq=5 ttl=64 time=2.87 ms

node3->node1
PING 10.8.0.12 (10.8.0.12) 56(84) bytes of data.
64 bytes from 10.8.0.12: icmp_seq=1 ttl=64 time=8.24 ms
64 bytes from 10.8.0.12: icmp_seq=2 ttl=64 time=2.72 ms
64 bytes from 10.8.0.12: icmp_seq=3 ttl=64 time=2.63 ms
64 bytes from 10.8.0.12: icmp_seq=4 ttl=64 time=2.91 ms
64 bytes from 10.8.0.12: icmp_seq=5 ttl=64 time=3.14 ms

node3->node2
PING 10.8.0.30 (10.8.0.30) 56(84) bytes of data.
64 bytes from 10.8.0.30: icmp_seq=1 ttl=64 time=2.73 ms
64 bytes from 10.8.0.30: icmp_seq=2 ttl=64 time=2.38 ms
64 bytes from 10.8.0.30: icmp_seq=3 ttl=64 time=3.22 ms
64 bytes from 10.8.0.30: icmp_seq=4 ttl=64 time=2.76 ms
64 bytes from 10.8.0.30: icmp_seq=5 ttl=64 time=2.97 ms

Bandwidth check with iperf:

node1 -> node2
[ ID] Interval       Transfer     Bandwidth
[  4]  0.0-10.0 sec   229 MBytes   192 Mbits/sec

node2->node1
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec   182 MBytes   152 Mbits/sec

node3->node1
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec   160 MBytes   134 Mbits/sec

node3->node2
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec   260 MBytes   218 Mbits/sec

node2->node3
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec   241 MBytes   202 Mbits/sec

node1->node3
[ ID] Interval       Transfer     Bandwidth
[  4]  0.0-10.0 sec   187 MBytes   156 Mbits/sec

There seems to be a bottleneck in the OpenVPN tunnel, because the same tests over eth reach about 1 Gbit/s. That said, I have followed this guide on community.openvpn.net, but I only got about twice the previously measured bandwidth.
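
For context, OpenVPN throughput tuning usually comes down to a handful of directives in the server and client configs; a sketch of the kind of settings typically adjusted (illustrative values, not necessarily the exact ones from that guide or the ones applied here):

# Illustrative OpenVPN tuning for throughput (server side; mirrored on the clients).
proto udp               # avoid TCP-over-TCP overhead
sndbuf 0                # let the OS size the socket buffers
rcvbuf 0
fast-io                 # skip an extra poll()/select() before UDP writes
tun-mtu 1500
mssfix 1450             # keep TCP segments small enough to fit the tunnel MTU
cipher AES-128-GCM      # hardware-accelerated cipher where AES-NI is available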

I would like to keep OpenVPN enabled, so: are there any other tweaks that would increase the network bandwidth, or any other changes to the nginx configuration that would make it work properly?

These problems were caused by the slow OpenVPN network. By routing the requests over the internet, after adding authentication on each of the individual servers, we brought the errors down to 1-2 per day; those are now probably caused by some other issue.
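
A hedged sketch of what that change could look like on the nginx side, with the 10.8.0.x VPN addresses in the upstream replaced by public endpoints (the host names are placeholders, not details from the original post):

upstream deva_api {
   # Hypothetical public endpoints; each remote node now authenticates
   # the requests itself before they reach its local API.
   server node2.example.com:5555 fail_timeout=5s max_fails=3;
   server node3.example.com:5555 fail_timeout=5s max_fails=3;
   server localhost:5555;
   keepalive 300;
}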

Quoted from: https://serverfault.com/questions/956793