Nginx load balancing: bad configuration or bad behaviour?
I am currently using Nginx as a load balancer to spread network traffic across 3 nodes running a NodeJS API.
The Nginx instance runs on node1, and every request is made to node1. I see roughly 700k requests over a span of 2 hours, and nginx is configured to rotate them between node1, node2 and node3 in round-robin fashion. Here is
conf.d/deva.conf:

upstream deva_api {
    server 10.8.0.30:5555 fail_timeout=5s max_fails=3;
    server 10.8.0.40:5555 fail_timeout=5s max_fails=3;
    server localhost:5555;
    keepalive 300;
}

server {
    listen 8000;

    location /log_pages {
        proxy_redirect off;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;

        proxy_http_version 1.1;
        proxy_set_header Connection "";

        add_header 'Access-Control-Allow-Origin' '*';
        add_header 'Access-Control-Allow-Methods' 'GET, POST, PATCH, PUT, DELETE, OPTIONS';
        add_header 'Access-Control-Allow-Headers' 'Authorization,Content-Type,Origin,X-Auth-Token';
        add_header 'Access-Control-Allow-Credentials' 'true';

        if ($request_method = OPTIONS ) {
            return 200;
        }

        proxy_pass http://deva_api;
        proxy_set_header Connection "Keep-Alive";
        proxy_set_header Proxy-Connection "Keep-Alive";

        auth_basic "Restricted";                    #For Basic Auth
        auth_basic_user_file /etc/nginx/.htpasswd;  #For Basic Auth
    }
}
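For reference, the upstream-keepalive pattern from the nginx documentation reduces to roughly the sketch below (same upstream name and port as above, everything else stripped out); note that it sets the Connection header to an empty value exactly once, while my location block also re-sets it to "Keep-Alive" after proxy_pass:

upstream deva_api {
    server 10.8.0.30:5555;
    server 10.8.0.40:5555;
    server localhost:5555;
    keepalive 300;                       # pool of idle keepalive connections per worker
}

server {
    listen 8000;
    location /log_pages {
        proxy_http_version 1.1;          # upstream keepalive requires HTTP/1.1
        proxy_set_header Connection "";  # clear Connection so "close" is not forwarded
        proxy_pass http://deva_api;
    }
}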
And here is the nginx.conf configuration:

user www-data;
worker_processes auto;
pid /run/nginx.pid;
include /etc/nginx/modules-enabled/*.conf;
worker_rlimit_nofile 65535;

events {
    worker_connections 65535;
    use epoll;
    multi_accept on;
}

http {
    ##
    # Basic Settings
    ##
    sendfile on;
    tcp_nopush on;
    tcp_nodelay on;
    keepalive_timeout 120;
    send_timeout 120;
    types_hash_max_size 2048;
    server_tokens off;
    client_max_body_size 100m;
    client_body_buffer_size 5m;
    client_header_buffer_size 5m;
    large_client_header_buffers 4 1m;

    open_file_cache max=200000 inactive=20s;
    open_file_cache_valid 30s;
    open_file_cache_min_uses 2;
    open_file_cache_errors on;

    reset_timedout_connection on;

    include /etc/nginx/mime.types;
    default_type application/octet-stream;

    ##
    # SSL Settings
    ##
    ssl_protocols TLSv1 TLSv1.1 TLSv1.2; # Dropping SSLv3, ref: POODLE
    ssl_prefer_server_ciphers on;

    ##
    # Logging Settings
    ##
    access_log /var/log/nginx/access.log;
    error_log /var/log/nginx/error.log;

    ##
    # Gzip Settings
    ##
    gzip on;

    include /etc/nginx/conf.d/*.conf;
    include /etc/nginx/sites-enabled/*;
}
The problem is that with this configuration I get hundreds of errors like the following in error.log:
upstream prematurely closed connection while reading response header from upstream
but only for node2 and node3. I have already tried the following tests:
- increasing the number of concurrent API processes on each node (I actually use PM2 as the in-node balancer)
- removing one node to make nginx's job easier
- applying weights in nginx (a sketch of what this looked like follows this list)
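The weight test looked roughly like the sketch below (the weight values are placeholders to show the kind of change, not the exact numbers I tried):

upstream deva_api {
    # placeholder weights: favour the remote nodes over localhost
    server 10.8.0.30:5555 weight=2 fail_timeout=5s max_fails=3;
    server 10.8.0.40:5555 weight=2 fail_timeout=5s max_fails=3;
    server localhost:5555 weight=1;
    keepalive 300;
}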
Nothing made the results any better. During these tests I noticed that the errors occurred only on the 2 remote nodes (node2 and node3), so I tried taking them out of the equation. The result was that those errors disappeared, but I started getting 2 different ones:
recv() failed (104: Connection reset by peer) while reading response header from upstream
and
writev() failed (32: Broken pipe) while sending request to upstream
The problem seems to be caused by a shortage of API processes on node1: the APIs may not be able to answer all the inbound traffic before the client times out (that is my guess). That said, I increased the number of concurrent API processes on node1 and the results were better than before, but I still get the last 2 errors, and I cannot increase the number of concurrent APIs on node1 any further.
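By "concurrent APIs" I mean the PM2 instance count per node, configured along these lines (app name, script path and instance count are placeholders, not my real values):

// ecosystem.config.js (sketch), started with: pm2 start ecosystem.config.js
module.exports = {
  apps: [{
    name: "deva-api",        // placeholder app name
    script: "./server.js",   // placeholder entry point
    exec_mode: "cluster",    // PM2 cluster mode: one Node process per instance
    instances: 4             // number of concurrent API processes on the node
  }]
};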
So the question is: why can't I use nginx as a load balancer across all the nodes? Did I make a mistake in the nginx configuration? Is there some other problem I have not noticed?
EDIT: I ran some network tests between the 3 nodes. The nodes communicate with each other through OpenVPN:
PING:
node1->node3
PING 10.8.0.40 (10.8.0.40) 56(84) bytes of data.
64 bytes from 10.8.0.40: icmp_seq=1 ttl=64 time=2.85 ms
64 bytes from 10.8.0.40: icmp_seq=2 ttl=64 time=1.85 ms
64 bytes from 10.8.0.40: icmp_seq=3 ttl=64 time=3.17 ms
64 bytes from 10.8.0.40: icmp_seq=4 ttl=64 time=3.21 ms
64 bytes from 10.8.0.40: icmp_seq=5 ttl=64 time=2.68 ms

node1->node2
PING 10.8.0.30 (10.8.0.30) 56(84) bytes of data.
64 bytes from 10.8.0.30: icmp_seq=1 ttl=64 time=2.16 ms
64 bytes from 10.8.0.30: icmp_seq=2 ttl=64 time=3.08 ms
64 bytes from 10.8.0.30: icmp_seq=3 ttl=64 time=10.9 ms
64 bytes from 10.8.0.30: icmp_seq=4 ttl=64 time=3.11 ms
64 bytes from 10.8.0.30: icmp_seq=5 ttl=64 time=3.25 ms

node2->node1
PING 10.8.0.12 (10.8.0.12) 56(84) bytes of data.
64 bytes from 10.8.0.12: icmp_seq=1 ttl=64 time=2.30 ms
64 bytes from 10.8.0.12: icmp_seq=2 ttl=64 time=8.30 ms
64 bytes from 10.8.0.12: icmp_seq=3 ttl=64 time=2.37 ms
64 bytes from 10.8.0.12: icmp_seq=4 ttl=64 time=2.42 ms
64 bytes from 10.8.0.12: icmp_seq=5 ttl=64 time=3.37 ms

node2->node3
PING 10.8.0.40 (10.8.0.40) 56(84) bytes of data.
64 bytes from 10.8.0.40: icmp_seq=1 ttl=64 time=2.86 ms
64 bytes from 10.8.0.40: icmp_seq=2 ttl=64 time=4.01 ms
64 bytes from 10.8.0.40: icmp_seq=3 ttl=64 time=5.37 ms
64 bytes from 10.8.0.40: icmp_seq=4 ttl=64 time=2.78 ms
64 bytes from 10.8.0.40: icmp_seq=5 ttl=64 time=2.87 ms

node3->node1
PING 10.8.0.12 (10.8.0.12) 56(84) bytes of data.
64 bytes from 10.8.0.12: icmp_seq=1 ttl=64 time=8.24 ms
64 bytes from 10.8.0.12: icmp_seq=2 ttl=64 time=2.72 ms
64 bytes from 10.8.0.12: icmp_seq=3 ttl=64 time=2.63 ms
64 bytes from 10.8.0.12: icmp_seq=4 ttl=64 time=2.91 ms
64 bytes from 10.8.0.12: icmp_seq=5 ttl=64 time=3.14 ms

node3->node2
PING 10.8.0.30 (10.8.0.30) 56(84) bytes of data.
64 bytes from 10.8.0.30: icmp_seq=1 ttl=64 time=2.73 ms
64 bytes from 10.8.0.30: icmp_seq=2 ttl=64 time=2.38 ms
64 bytes from 10.8.0.30: icmp_seq=3 ttl=64 time=3.22 ms
64 bytes from 10.8.0.30: icmp_seq=4 ttl=64 time=2.76 ms
64 bytes from 10.8.0.30: icmp_seq=5 ttl=64 time=2.97 ms
Bandwidth check with iperf:
node1->node2
[ ID] Interval       Transfer     Bandwidth
[  4]  0.0-10.0 sec   229 MBytes   192 Mbits/sec

node2->node1
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec   182 MBytes   152 Mbits/sec

node3->node1
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec   160 MBytes   134 Mbits/sec

node3->node2
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec   260 MBytes   218 Mbits/sec

node2->node3
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec   241 MBytes   202 Mbits/sec

node1->node3
[ ID] Interval       Transfer     Bandwidth
[  4]  0.0-10.0 sec   187 MBytes   156 Mbits/sec
There seems to be a bottleneck in the OpenVPN tunnel, because the same tests over eth reach about 1 Gbit/s. That said, I followed this guide community.openvpn.net but only got roughly twice the bandwidth measured before. I would like to keep OpenVPN on, so: are there any other tweaks needed to increase the network bandwidth, or any other changes to the nginx configuration, to make this work properly?
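The kind of tuning that guide describes is essentially socket-buffer and I/O tweaks on the OpenVPN side, roughly like this sketch (the buffer values are examples rather than a recommendation for my link, and fast-io only applies to UDP mode):

# OpenVPN server config additions (sketch)
sndbuf 393216
rcvbuf 393216
push "sndbuf 393216"    # make the clients use the same buffer sizes
push "rcvbuf 393216"
fast-io                 # skip an extra poll() per UDP write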
These problems were caused by the slowness of the OpenVPN network. By routing the requests over the internet, after adding authentication on each of the different servers, we brought the errors down to 1-2 per day, and those are now probably caused by some other issue.
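In practice this means pointing the upstream at the nodes' public addresses instead of the 10.8.0.x VPN addresses, with authentication in front of each API; the hostnames below are placeholders:

upstream deva_api {
    # placeholder public hostnames replacing the VPN addresses
    server node2.example.com:5555 fail_timeout=5s max_fails=3;
    server node3.example.com:5555 fail_timeout=5s max_fails=3;
    server localhost:5555;
    keepalive 300;
}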