Networking

Too many TCP connections causing disconnects

  • November 15, 2016

I have a game server that runs over TCP connections. The server randomly disconnects users, and I think this is related to the server's TCP settings.

In my local development environment the code handles more than 8,000 concurrent users without any disconnects or errors (on localhost).

But on the actual deployed CentOS 5 64-bit server, these disconnects happen regardless of the number of concurrent TCP connections.

The server does not seem to be able to handle the throughput.

netstat -s -t
IcmpMsg:
   InType0: 31
   InType3: 87717
   InType4: 699
   InType5: 2
   InType8: 1023781
   InType11: 7211
   OutType0: 1023781
   OutType3: 603
Tcp:
   8612766 active connections openings
   14255236 passive connection openings
   12174 failed connection attempts
   319225 connection resets received
   723 connections established
   6351090913 segments received
   6180297746 segments send out
   45791634 segments retransmited
   0 bad segments received.
   1664280 resets sent
TcpExt:
   46244 invalid SYN cookies received
   3745 resets received for embryonic SYN_RECV sockets
   327 ICMP packets dropped because they were out-of-window
   1 ICMP packets dropped because socket was locked
   11475281 TCP sockets finished time wait in fast timer
   140 time wait sockets recycled by time stamp
   1569 packets rejects in established connections because of timestamp
   103783714 delayed acks sent
   6929 delayed acks further delayed because of locked socket
   Quick ack mode was activated 6210096 times
   1806 times the listen queue of a socket overflowed
   1806 SYNs to LISTEN sockets ignored
   1080380601 packets directly queued to recvmsg prequeue.
   31441059 packets directly received from backlog
   5272599307 packets directly received from prequeue
   324498008 packets header predicted
   1143146 packets header predicted and directly queued to user
   3217838883 acknowledgments not containing data received
   1027969883 predicted acknowledgments
   395 times recovered from packet loss due to fast retransmit
   257420 times recovered from packet loss due to SACK data
   5843 bad SACKs received
   Detected reordering 29 times using FACK
   Detected reordering 12 times using SACK
   Detected reordering 1 times using reno fast retransmit
   Detected reordering 809 times using time stamp
   1602 congestion windows fully recovered
   1917 congestion windows partially recovered using Hoe heuristic
   TCPDSACKUndo: 8196226
   7850525 congestion windows recovered after partial ack
   139681 TCP data loss events
   TCPLostRetransmit: 26
   10139 timeouts after reno fast retransmit
   2802678 timeouts after SACK recovery
   86212 timeouts in loss state
   273698 fast retransmits
   19494 forward retransmits
   2637236 retransmits in slow start
   33381883 other TCP timeouts
   TCPRenoRecoveryFail: 92
   19488 sack retransmits failed
   7 times receiver scheduled too late for direct processing
   6354641 DSACKs sent for old packets
   333 DSACKs sent for out of order packets
   20615579 DSACKs received
   2724 DSACKs for out of order packets received
   123034 connections reset due to unexpected data
   91876 connections reset due to early user close
   169244 connections aborted due to timeout
   28736 times unabled to send RST due to no memory
IpExt:
   InMcastPkts: 2

What stands out to me is that these counters look particularly problematic:

123034 connections reset due to unexpected data
91876 connections reset due to early user close
28736 times unabled to send RST due to no memory

How can I resolve these errors? Do I need to do some TCP tuning?

**Edit:** Some sysctl information:

sysctl -A | grep net | grep mem
net.ipv4.udp_wmem_min = 4096
net.ipv4.udp_rmem_min = 4096
net.ipv4.udp_mem = 772704       1030272 1545408
net.ipv4.tcp_rmem = 4096        87380   4194304
net.ipv4.tcp_wmem = 4096        16384   4194304
net.ipv4.tcp_mem = 196608       262144  393216
net.ipv4.igmp_max_memberships = 20
net.core.optmem_max = 20480
net.core.rmem_default = 129024
net.core.wmem_default = 129024
net.core.rmem_max = 131071
net.core.wmem_max = 131071
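
If buffer tuning does turn out to be the answer, I assume the change would be something along these lines, applied with sysctl (the values below are placeholders for illustration only, not tested recommendations):

# /etc/sysctl.conf -- illustrative values only
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216

# reload the settings without a reboot
sysctl -p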

**Edit:** ethtool information for the two detected Ethernet cards:

Settings for eth0:
       Supported ports: [ TP ]
       Supported link modes:   10baseT/Half 10baseT/Full
                               100baseT/Half 100baseT/Full
                               1000baseT/Full
       Supports auto-negotiation: Yes
       Advertised link modes:  10baseT/Half 10baseT/Full
                               100baseT/Half 100baseT/Full
                               1000baseT/Full
       Advertised auto-negotiation: Yes
       Speed: 1000Mb/s
       Duplex: Full
       Port: Twisted Pair
       PHYAD: 1
       Transceiver: internal
       Auto-negotiation: on
       Supports Wake-on: g
       Wake-on: d
       Link detected: yes

Settings for eth1:
       Supported ports: [ TP ]
       Supported link modes:   10baseT/Half 10baseT/Full
                               100baseT/Half 100baseT/Full
                               1000baseT/Full
       Supports auto-negotiation: Yes
       Advertised link modes:  10baseT/Half 10baseT/Full
                               100baseT/Half 100baseT/Full
                               1000baseT/Full
       Advertised auto-negotiation: Yes
       Speed: Unknown!
       Duplex: Half
       Port: Twisted Pair
       PHYAD: 1
       Transceiver: internal
       Auto-negotiation: on
       Supports Wake-on: g
       Wake-on: d
       Link detected: no

Did you increase the FD limit? You can find some information here: http://www.cyberciti.biz/faq/linux-increase-the-maximum-number-of-open-files/
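
For reference, a minimal sketch of checking and raising the open-file limits on a CentOS 5 box might look like the following, assuming the game server runs as a dedicated user (called gameserver here purely for illustration; the 65535 and 200000 figures are example values, not recommendations):

# per-process limit for the current shell / service user
ulimit -n

# system-wide ceiling on open file handles
cat /proc/sys/fs/file-max

# raise the per-user limit persistently in /etc/security/limits.conf
gameserver  soft  nofile  65535
gameserver  hard  nofile  65535

# raise the system-wide ceiling and apply it
echo "fs.file-max = 200000" >> /etc/sysctl.conf
sysctl -p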

If by "the server randomly disconnects users" you mean clients are dropped without the expected FIN/ACK or RST exchange, I would start by fixing the half-duplex interface, especially if both NICs in your development environment run full duplex. An interface such as eth1 sitting at half duplex with Auto-negotiation=on is usually caused by one of two things:

  1. A failed auto-negotiation between the switch and the server.
  2. A switch with auto-negotiation disabled and the port's speed and duplex set explicitly.

I have seen case #2 far more often, but that may simply be because it has been more than a decade since I last came across a genuine auto-negotiation failure. When one side is set to auto and the other side is hard-coded (or fails to respond), the Ethernet auto-negotiation behaviour is for the auto side to fall back to half duplex.

Put simply, eth1 being at half duplex means the server will either send or receive on that interface at any given moment, but not both. The hard-coded side will still be running full duplex and will transmit to the server at the same time as it is receiving data from it. The server, however, treats that as a collision, because half duplex assumes a shared collision domain, whereas full duplex eliminates it. The server then uses a back-off algorithm to schedule the retransmission, and if it keeps seeing what it believes are collisions, it keeps increasing how long it waits before retransmitting the data.

So a half-duplex interface with a full-duplex partner easily leads to client disconnects, throughput and performance problems, latency spikes, and all kinds of other trouble.
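
As a rough sketch (not a tested procedure for this particular NIC/switch combination), the mismatch can be inspected and corrected with ethtool; which command applies depends on how the switch port is configured:

# confirm the currently negotiated state of the suspect interface
ethtool eth1

# case 1: the switch port auto-negotiates -- re-enable auto-negotiation on the server side
ethtool -s eth1 autoneg on

# case 2: the switch port is hard-coded (e.g. to 100 Mb/s full duplex) -- hard-code the server side to match
ethtool -s eth1 speed 100 duplex full autoneg off

Either way, the point from above is that both ends must agree: either both sides auto-negotiate, or both are hard-coded to the same speed and duplex.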

Source: https://serverfault.com/questions/527440