Load balancer marks new instance group members as "unhealthy" after dist-upgrade (Ubuntu)
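I have several virtual machines (used as web servers) behind an instance group on GCloud.
As usual, I updated my "vm-source-image" (apt dist-upgrade), created a new template, and added it to my group. New members using this template never receive any real work requests from the load balancer: they are up and running but unused.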
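Temporary workaround
For now I apply only partial updates (security updates) with:
sudo unattended-upgrade -d
Here is the list of the remaining packages, at least one of which causes the problem: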
# apt list --upgradable
cloud-init/bionic-updates 21.3-1-g6803368d-0ubuntu1~18.04.4 all [upgradable from: 21.2-3-g899bfaa9-0ubuntu2~18.04.1]
dnsmasq-base/bionic-updates 2.79-1ubuntu0.5 amd64 [upgradable from: 2.79-1ubuntu0.4]
gce-compute-image-packages/bionic-updates 20210629.00-0ubuntu1~18.04.0 all [upgradable from: 20201222.00-0ubuntu2~18.04.0]
google-compute-engine/bionic-updates 20210629.00-0ubuntu1~18.04.0 all [upgradable from: 20201222.00-0ubuntu2~18.04.0]
google-compute-engine-oslogin/bionic-updates 20210728.00-0ubuntu1~18.04.0 amd64 [upgradable from: 20210429.00-0ubuntu1~18.04.0]
google-guest-agent/bionic-updates 20210629.00-0ubuntu1~18.04.1 amd64 [upgradable from: 20210414.00-0ubuntu1~18.04.0]
libgnutls30/bionic-updates 3.5.18-1ubuntu1.5 amd64 [upgradable from: 3.5.18-1ubuntu1.4]
libnetplan0/bionic-updates 0.99-0ubuntu3~18.04.5 amd64 [upgradable from: 0.99-0ubuntu3~18.04.4]
libpcre2-8-0/bionic 10.39-1+ubuntu18.04.1+deb.sury.org+1 amd64 [upgradable from: 10.36-2+ubuntu18.04.1+deb.sury.org+2]
netplan.io/bionic-updates 0.99-0ubuntu3~18.04.5 amd64 [upgradable from: 0.99-0ubuntu3~18.04.4]
nplan/bionic-updates 0.99-0ubuntu3~18.04.5 all [upgradable from: 0.99-0ubuntu3~18.04.4]
snapd/bionic-updates 2.51.1+18.04 amd64 [upgradable from: 2.49.2+18.04]
ubuntu-advantage-tools/bionic-updates 27.3~18.04.1 amd64 [upgradable from: 27.2.2~18.04.1]
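To narrow down which of these packages is responsible, it can help to reduce the listing to bare package names and then upgrade or hold them one at a time. The helper below is only a sketch: the function name upgradable_names is made up for illustration, and it assumes the "name/suite version arch [upgradable from: ...]" line format shown in the listing.

```shell
#!/bin/sh
# Hypothetical helper: print only the package names from
# `apt list --upgradable` output read on stdin.
# Assumes lines of the form "name/suite version arch [upgradable from: old]".
upgradable_names() {
    # Split on "/" and keep the first field of real package lines only;
    # this skips the "Listing..." header that apt prints first.
    awk -F/ '/\[upgradable from:/ { print $1 }'
}
```

It could be fed with apt list --upgradable 2>/dev/null | upgradable_names. Passing the resulting names to sudo apt-mark hold would freeze those packages until the culprit is identified, which is one possible way to keep the source image stable, not something the original setup is known to use.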
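Real solution
Since I have no "custom" packages on the machine and the problem comes from a system update, I see no solution other than reporting the problem through this post.
Of course, I am watching for new updates in the hope that a newer version of the offending package fixes it, but is there perhaps a better option?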
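More information
- The group is the backend of an "internal TCP load balancer".
- The load balancer front-end IP address is 10.0.0.116.
- The old (working) member's IP address is 10.0.0.48 (see the logs).
- The new (unused) member's IP address is 10.0.0.54 (see the logs).
- The load balancer has a simple HTTP health check named HTTPHC1.
- The instance group has another simple HTTP health check named HTTPHC2.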
Comparing the access log of the old (working) member with that of the new member:
Log of the old VM member
35.191.1.148 "/" - - - [04/Nov/2021:10:34:59 +0000] 10.0.0.48 "GET /?id=HTTPHC2 HTTP/1.1" 200 612 "-" "GoogleHC/1.0"
35.191.1.144 "/" - - - [04/Nov/2021:10:35:00 +0000] 10.0.0.48 "GET /?id=HTTPHC2 HTTP/1.1" 200 612 "-" "GoogleHC/1.0"
35.191.1.154 "/" - - - [04/Nov/2021:10:35:00 +0000] 10.0.0.48 "GET /?id=HTTPHC2 HTTP/1.1" 200 612 "-" "GoogleHC/1.0"
35.191.1.147 "/" - - - [04/Nov/2021:10:35:01 +0000] 10.0.0.48 "GET /?id=HTTPHC1 HTTP/1.1" 200 612 "-" "GoogleHC/1.0"
35.191.1.145 "/" - - - [04/Nov/2021:10:35:01 +0000] 10.0.0.48 "GET /?id=HTTPHC1 HTTP/1.1" 200 612 "-" "GoogleHC/1.0"
35.191.1.151 "/" - - - [04/Nov/2021:10:35:02 +0000] 10.0.0.48 "GET /?id=HTTPHC1 HTTP/1.1" 200 612 "-" "GoogleHC/1.0"
35.191.1.153 "/" - - - [04/Nov/2021:10:35:02 +0000] 10.0.0.48 "GET /?id=HTTPHC1 HTTP/1.1" 200 612 "-" "GoogleHC/1.0"
Log of the new VM member
35.191.1.152 "/" - - - [04/Nov/2021:10:31:01 +0000] 10.0.0.54 "GET /?id=HTTPHC2 HTTP/1.1" 200 612 "-" "GoogleHC/1.0"
35.191.1.154 "/" - - - [04/Nov/2021:10:31:02 +0000] 10.0.0.54 "GET /?id=HTTPHC2 HTTP/1.1" 200 612 "-" "GoogleHC/1.0"
35.191.1.148 "/" - - - [04/Nov/2021:10:31:02 +0000] 10.0.0.54 "GET /?id=HTTPHC2 HTTP/1.1" 200 612 "-" "GoogleHC/1.0"
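The comparison shows that the HTTPHC1 entries are missing from the new member's log.
So the new member never answers the load balancer's health check (HTTPHC1) and therefore receives no requests. That is the problem.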
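The same comparison can be made without reading the logs by eye. The sketch below counts requests per health-check id in an access log; it relies on the ?id=HTTPHC1 / ?id=HTTPHC2 query strings seen in the logs, and the function name count_health_checks is invented for this example.

```shell
#!/bin/sh
# Hypothetical helper: count requests per health-check id in an access log
# read on stdin. Relies on the "?id=HTTPHCn" query string used by the probes.
count_health_checks() {
    # Extract each "id=HTTPHCn" token, count duplicates, and print "id count".
    grep -o 'id=HTTPHC[0-9]*' | sort | uniq -c | awk '{ print $2, $1 }'
}
```

Run as count_health_checks < access.log (the log path is a placeholder that depends on the web server) on both members. A healthy member should list both ids, while the affected one lists only id=HTTPHC2.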
Added tcpdump
Between the HTTPHC1 health checker and the unused member:
# tcpdump -n host 35.191.1.151
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on ens4, link-type EN10MB (Ethernet), capture size 262144 bytes
11:30:35.109469 IP 35.191.1.151.61838 > 10.0.0.116.80: Flags [S], win 65535, options [mss 1420,sackOK,TS ecr 0,nop,wscale 8], length 0
11:30:36.119470 IP 35.191.1.151.61838 > 10.0.0.116.80: Flags [S], win 65535, options [mss 1420,sackOK,TS ecr 0,nop,wscale 8], length 0
11:30:38.167436 IP 35.191.1.151.61838 > 10.0.0.116.80: Flags [S], win 65535, options [mss 1420,sackOK,TS ecr 0,nop,wscale 8], length 0
11:30:40.110784 IP 35.191.1.151.59900 > 10.0.0.116.80: Flags [S], win 65535, options [mss 1420,sackOK,TS ecr 0,nop,wscale 8], length 0
11:30:41.111176 IP 35.191.1.151.59900 > 10.0.0.116.80: Flags [S], win 65535, options [mss 1420,sackOK,TS ecr 0,nop,wscale 8], length 0
11:30:43.159164 IP 35.191.1.151.59900 > 10.0.0.116.80: Flags [S], win 65535, options [mss 1420,sackOK,TS ecr 0,nop,wscale 8], length 0
11:30:45.112162 IP 35.191.1.151.36064 > 10.0.0.116.80: Flags [S], win 65535, options [mss 1420,sackOK,TS ecr 0,nop,wscale 8], length 0
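Note that the destination is the load balancer front-end IP, 10.0.0.116, and that there are only SYN packets, with no reply from the VM.
Between the HTTPHC2 health checker and the unused member: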
# tcpdump -n host 35.191.1.148
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on ens4, link-type EN10MB (Ethernet), capture size 262144 bytes
10:46:12.475724 IP 35.191.1.148.64638 > 10.0.0.54.80: Flags [S], win 65535, options [mss 1420,sackOK,TS ecr 0,nop,wscale 8], length 0
10:46:12.475788 IP 10.0.0.54.80 > 35.191.1.148.64638: Flags [S.], win 64768, options [mss 1420,sackOK,TS,nop,wscale 7], length 0
10:46:12.476239 IP 35.191.1.148.64638 > 10.0.0.54.80: Flags [.], ack 1, win 256, options [nop,nop,TS], length 0
10:46:12.476239 IP 35.191.1.148.64638 > 10.0.0.54.80: Flags [P.], seq 1:117, ack 1, win 256, options [nop,nop,TS], length 116: HTTP: GET /?id=HTTPHC2 HTTP/1.1
10:46:12.476301 IP 10.0.0.54.80 > 35.191.1.148.64638: Flags [.], ack 117, win 506, options [nop,nop,TS], length 0
10:46:12.476546 IP 10.0.0.54.80 > 35.191.1.148.64638: Flags [P.], seq 1:867, ack 117, win 506, options [nop,nop,TS], length 866: HTTP: HTTP/1.1 200 OK
10:46:12.476659 IP 35.191.1.148.64638 > 10.0.0.54.80: Flags [.], ack 867, win 267, options [nop,nop,TS], length 0
10:46:12.476679 IP 35.191.1.148.64638 > 10.0.0.54.80: Flags [F.], seq 117, ack 867, win 267, options [nop,nop,TS], length 0
10:46:12.476707 IP 10.0.0.54.80 > 35.191.1.148.64638: Flags [F.], seq 867, ack 118, win 506, options [nop,nop,TS], length 0
10:46:12.476879 IP 35.191.1.148.64638 > 10.0.0.54.80: Flags [.], ack 868, win 267, options [nop,nop,TS], length 0
Everything works fine here: the TCP handshake completes and the VM answers with HTTP 200.
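The difference between the two captures comes down to whether any SYN is ever answered with a SYN-ACK. As a rough sketch, that can be summarised from tcpdump's text output; the function name syn_summary is hypothetical, and it assumes the "Flags [S]," / "Flags [S.]," notation shown in the captures.

```shell
#!/bin/sh
# Hypothetical helper: read `tcpdump -n` text output on stdin and report how
# many SYN packets and how many SYN-ACK replies it contains.
syn_summary() {
    awk '
        /Flags \[S\],/   { syn++ }      # connection attempt
        /Flags \[S\.\],/ { synack++ }   # reply from the listening host
        END { printf "syn=%d synack=%d\n", syn, synack }
    '
}
```

For example, tcpdump -n -c 20 host 35.191.1.151 | syn_summary on the affected member would be expected to report SYNs but no SYN-ACKs, matching the HTTPHC1 capture.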
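Added 2021-11-16
After some research, I found that an IP alias is missing from the local routing table. Unsurprisingly, it is the load balancer front-end IP address seen in the tcpdump output. Here is the working machine: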
# ip route show dev ens4 table local
local 10.0.0.48 proto kernel scope host src 10.0.0.48
local 10.0.0.116 proto 66 scope host
# uname -r
5.4.0-1056-gcp
Here is the fully updated machine:
# ip route show dev ens4 table local
local 10.0.0.54 proto kernel scope host src 10.0.0.54
# uname -r
5.4.0-1057-gcp
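A quick way to test a new member for this symptom is to look for the front-end IP in the local table. The following is a minimal sketch, not an official check: has_local_route is a made-up name, and it assumes the "local <ip> ..." line format shown in the ip route output.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the given IP is present as a "local" route
# in `ip route show table local` output read on stdin.
has_local_route() {
    ip_addr=$1
    # Match lines whose first two fields are "local" and the wanted address;
    # exit status 0 means found, 1 means missing.
    awk -v ip="$ip_addr" '$1 == "local" && $2 == ip { found = 1 } END { exit !found }'
}
```

For instance, ip route show dev ens4 table local | has_local_route 10.0.0.116 || echo "front-end IP missing" would print the warning on the fully updated machine and stay silent on the working one.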
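Added 2021-11-20
This is now a known issue: [Cloud Networking] Potential service issue: Investigating
Google Cloud Global TCP Proxy Load Balancers may fail to serve traffic through forwarding rules configured with IPs in the 34.111.0.0/17 range. A permanent fix for the IP range is in progress.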
After testing, cloud-init turned out to be the root cause. According to this comment, disable_network_activation: true should be set to avoid conflicts with the google-guest-agent service. The solution is to add this setting to the cloud-init configuration:
cat > /etc/cloud/cloud.cfg.d/99-disable-network-activation.cfg <<EOF
# Disable network activation to prevent \`cloud-init\` from making network
# changes that conflict with \`google-guest-agent\`.
# See: https://github.com/canonical/cloud-init/pull/1048
disable_network_activation: true
EOF
This file is present in the official image ubuntu-1804-bionic-v20211103. After adding it, google-guest-agent works correctly.
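Before baking a new source image, it may be worth confirming that the setting is really in place. The check below is a sketch under the assumption that the option sits on a line of its own in a *.cfg file under /etc/cloud/cloud.cfg.d; the function name network_activation_disabled is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeed if any *.cfg file in the given directory
# (default: /etc/cloud/cloud.cfg.d) sets disable_network_activation to true.
network_activation_disabled() {
    cfg_dir=${1:-/etc/cloud/cloud.cfg.d}
    # -q: no output, -s: ignore missing files, -E: extended regex.
    grep -qsE '^[[:space:]]*disable_network_activation:[[:space:]]*true[[:space:]]*$' \
        "$cfg_dir"/*.cfg
}
```

Calling network_activation_disabled && echo "setting present" on the image should print the message once the file exists. Rebooting, or restarting the agent with sudo systemctl restart google-guest-agent, is probably needed before the local route for the front-end IP reappears; that last step is an assumption rather than something verified in this post.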