Load balancer marks new instance group member "unhealthy" after dist-upgrade (Ubuntu)

November 23, 2021

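I have some virtual machines (used as web servers) behind an instance group on GCloud.

As usual, I updated (apt dist-upgrade) my "vm-source-image", created a new template, and added it to my group.

The new member created from this template never receives any real work requests from the load balancer; it is up and running but unused.

Temporary workaround

I applied only a partial update (security updates) with: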

sudo unattended-upgrade -d

Here is the list of the remaining packages, at least one of which causes the problem:

# apt list --upgradable

cloud-init/bionic-updates 21.3-1-g6803368d-0ubuntu1~18.04.4 all [upgradable from: 21.2-3-g899bfaa9-0ubuntu2~18.04.1]
dnsmasq-base/bionic-updates 2.79-1ubuntu0.5 amd64 [upgradable from: 2.79-1ubuntu0.4]
gce-compute-image-packages/bionic-updates 20210629.00-0ubuntu1~18.04.0 all [upgradable from: 20201222.00-0ubuntu2~18.04.0]
google-compute-engine/bionic-updates 20210629.00-0ubuntu1~18.04.0 all [upgradable from: 20201222.00-0ubuntu2~18.04.0]
google-compute-engine-oslogin/bionic-updates 20210728.00-0ubuntu1~18.04.0 amd64 [upgradable from: 20210429.00-0ubuntu1~18.04.0]
google-guest-agent/bionic-updates 20210629.00-0ubuntu1~18.04.1 amd64 [upgradable from: 20210414.00-0ubuntu1~18.04.0]
libgnutls30/bionic-updates 3.5.18-1ubuntu1.5 amd64 [upgradable from: 3.5.18-1ubuntu1.4]
libnetplan0/bionic-updates 0.99-0ubuntu3~18.04.5 amd64 [upgradable from: 0.99-0ubuntu3~18.04.4]
libpcre2-8-0/bionic 10.39-1+ubuntu18.04.1+deb.sury.org+1 amd64 [upgradable from: 10.36-2+ubuntu18.04.1+deb.sury.org+2]
netplan.io/bionic-updates 0.99-0ubuntu3~18.04.5 amd64 [upgradable from: 0.99-0ubuntu3~18.04.4]
nplan/bionic-updates 0.99-0ubuntu3~18.04.5 all [upgradable from: 0.99-0ubuntu3~18.04.4]
snapd/bionic-updates 2.51.1+18.04 amd64 [upgradable from: 2.49.2+18.04]
ubuntu-advantage-tools/bionic-updates 27.3~18.04.1 amd64 [upgradable from: 27.2.2~18.04.1]
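
To bisect a list like this, it helps to have the bare package names. The following is a minimal sketch, not part of the original post: the function name `pending_packages` is made up for illustration, and it assumes the `name/suite version arch [...]` line format shown above.

```shell
# pending_packages: read `apt list --upgradable` output on stdin and print
# one package name per line. Lines without a "/" (such as the
# "Listing..." header) are skipped.
pending_packages() {
    awk -F/ 'NF > 1 { print $1 }'
}

# Example (run on the VM):
#   apt list --upgradable 2>/dev/null | pending_packages
```

The resulting names can then be upgraded one at a time on a test instance to see which upgrade breaks the health check.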

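The real solution

Since I have no "custom" packages on the machine and the root of the problem lies in a system update, I see no solution other than reporting the problem through this post.

Of course, I am watching for new updates, hoping that a newer version of the offending package will fix the problem, but is there perhaps a better option?

More information

The group is the backend of an "internal TCP load balancer".
The load balancer's frontend IP address is 10.0.0.116
The old (working) member's IP address is 10.0.0.48 (see the logs)
The new (idle) member's IP address is 10.0.0.54 (see the logs)
The load balancer has a simple HTTP health check named HTTPHC1.
The instance group has another simple HTTP health check named HTTPHC2.

Comparing the access log of the old (working) member with that of the new member:

Log of the old VM member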

35.191.1.148 "/" - - - [04/Nov/2021:10:34:59 +0000] 10.0.0.48 "GET /?id=HTTPHC2 HTTP/1.1" 200 612 "-" "GoogleHC/1.0"
35.191.1.144 "/" - - - [04/Nov/2021:10:35:00 +0000] 10.0.0.48 "GET /?id=HTTPHC2 HTTP/1.1" 200 612 "-" "GoogleHC/1.0"
35.191.1.154 "/" - - - [04/Nov/2021:10:35:00 +0000] 10.0.0.48 "GET /?id=HTTPHC2 HTTP/1.1" 200 612 "-" "GoogleHC/1.0"
35.191.1.147 "/" - - - [04/Nov/2021:10:35:01 +0000] 10.0.0.48 "GET /?id=HTTPHC1 HTTP/1.1" 200 612 "-" "GoogleHC/1.0"
35.191.1.145 "/" - - - [04/Nov/2021:10:35:01 +0000] 10.0.0.48 "GET /?id=HTTPHC1 HTTP/1.1" 200 612 "-" "GoogleHC/1.0"
35.191.1.151 "/" - - - [04/Nov/2021:10:35:02 +0000] 10.0.0.48 "GET /?id=HTTPHC1 HTTP/1.1" 200 612 "-" "GoogleHC/1.0"
35.191.1.153 "/" - - - [04/Nov/2021:10:35:02 +0000] 10.0.0.48 "GET /?id=HTTPHC1 HTTP/1.1" 200 612 "-" "GoogleHC/1.0"

Log of the new VM member

35.191.1.152 "/" - - - [04/Nov/2021:10:31:01 +0000] 10.0.0.54 "GET /?id=HTTPHC2 HTTP/1.1" 200 612 "-" "GoogleHC/1.0"
35.191.1.154 "/" - - - [04/Nov/2021:10:31:02 +0000] 10.0.0.54 "GET /?id=HTTPHC2 HTTP/1.1" 200 612 "-" "GoogleHC/1.0"
35.191.1.148 "/" - - - [04/Nov/2021:10:31:02 +0000] 10.0.0.54 "GET /?id=HTTPHC2 HTTP/1.1" 200 612 "-" "GoogleHC/1.0"

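The diff shows that the HTTPHC1 entries are missing from the new member's log.

So the new member does not respond to the load balancer's health check (HTTPHC1) and therefore receives no requests; that is the problem.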
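
The comparison above can be automated by counting probes per health check. This is a rough sketch under assumptions: `count_probes` is a hypothetical helper name, and it relies on each health check being tagged with an `?id=HTTPHC<n>` query string as in the logs shown here.

```shell
# count_probes: read access-log lines on stdin and print, for each health
# check id found in the request line, how many requests it made.
# Output format: "id=<NAME> <count>", one line per health check.
count_probes() {
    grep -o 'id=HTTPHC[0-9]*' | sort | uniq -c | awk '{ print $2, $1 }'
}

# Example (the log path is an assumption; use your web server's access log):
#   count_probes < /var/log/nginx/access.log
```

On a healthy member both HTTPHC1 and HTTPHC2 should show up; on the broken member only HTTPHC2 does.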

Another failure: the new machine is also unreachable via SSH in a browser window.

Added tcpdump

Between the HTTPHC1 health checker and the idle member:

# tcpdump -n host 35.191.1.151
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on ens4, link-type EN10MB (Ethernet), capture size 262144 bytes
11:30:35.109469 IP 35.191.1.151.61838 > 10.0.0.116.80: Flags [S], win 65535, options [mss 1420,sackOK,TS  ecr 0,nop,wscale 8], length 0
11:30:36.119470 IP 35.191.1.151.61838 > 10.0.0.116.80: Flags [S], win 65535, options [mss 1420,sackOK,TS  ecr 0,nop,wscale 8], length 0
11:30:38.167436 IP 35.191.1.151.61838 > 10.0.0.116.80: Flags [S], win 65535, options [mss 1420,sackOK,TS  ecr 0,nop,wscale 8], length 0
11:30:40.110784 IP 35.191.1.151.59900 > 10.0.0.116.80: Flags [S], win 65535, options [mss 1420,sackOK,TS  ecr 0,nop,wscale 8], length 0
11:30:41.111176 IP 35.191.1.151.59900 > 10.0.0.116.80: Flags [S], win 65535, options [mss 1420,sackOK,TS ecr 0,nop,wscale 8], length 0
11:30:43.159164 IP 35.191.1.151.59900 > 10.0.0.116.80: Flags [S], win 65535, options [mss 1420,sackOK,TS ecr 0,nop,wscale 8], length 0
11:30:45.112162 IP 35.191.1.151.36064 > 10.0.0.116.80: Flags [S], win 65535, options [mss 1420,sackOK,TS  ecr 0,nop,wscale 8], length 0

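Note that the destination is the load balancer's frontend IP, 10.0.0.116, and that the capture contains only SYN packets: the VM never replies.

Between the HTTPHC2 health checker and the idle member: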

# tcpdump -n host 35.191.1.148
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on ens4, link-type EN10MB (Ethernet), capture size 262144 bytes
10:46:12.475724 IP 35.191.1.148.64638 > 10.0.0.54.80: Flags [S], win 65535, options [mss 1420,sackOK,TS ecr 0,nop,wscale 8], length 0
10:46:12.475788 IP 10.0.0.54.80 > 35.191.1.148.64638: Flags [S.], win 64768, options [mss 1420,sackOK,TS,nop,wscale 7], length 0
10:46:12.476239 IP 35.191.1.148.64638 > 10.0.0.54.80: Flags [.], ack 1, win 256, options [nop,nop,TS], length 0
10:46:12.476239 IP 35.191.1.148.64638 > 10.0.0.54.80: Flags [P.], seq 1:117, ack 1, win 256, options [nop,nop,TS], length 116: HTTP: GET /?id=HTTPHC2 HTTP/1.1
10:46:12.476301 IP 10.0.0.54.80 > 35.191.1.148.64638: Flags [.], ack 117, win 506, options [nop,nop,TS], length 0
10:46:12.476546 IP 10.0.0.54.80 > 35.191.1.148.64638: Flags [P.], seq 1:867, ack 117, win 506, options [nop,nop,TS], length 866: HTTP: HTTP/1.1 200 OK
10:46:12.476659 IP 35.191.1.148.64638 > 10.0.0.54.80: Flags [.], ack 867, win 267, options [nop,nop,TS], length 0
10:46:12.476679 IP 35.191.1.148.64638 > 10.0.0.54.80: Flags [F.], seq 117, ack 867, win 267, options [nop,nop,TS], length 0
10:46:12.476707 IP 10.0.0.54.80 > 35.191.1.148.64638: Flags [F.], seq 867, ack 118, win 506, options [nop,nop,TS], length 0
10:46:12.476879 IP 35.191.1.148.64638 > 10.0.0.54.80: Flags [.], ack 868, win 267, options [nop,nop,TS], length 0

Everything is fine here.
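
The difference between the two captures comes down to whether any SYN gets a SYN-ACK back. The sketch below is one way to summarize that from saved `tcpdump -n` text output; `handshake_summary` is a hypothetical name, and it assumes the `Flags [S],` / `Flags [S.],` notation seen in the captures above.

```shell
# handshake_summary: read `tcpdump -n` text output on stdin and print the
# number of SYN packets ("Flags [S],") and SYN-ACK replies ("Flags [S.],").
handshake_summary() {
    awk '
        /Flags \[S\],/   { syn++ }
        /Flags \[S\.\],/ { synack++ }
        END { printf "syn=%d synack=%d\n", syn, synack }
    '
}

# Example (run on the VM; the host address is the prober from this post):
#   tcpdump -n -c 20 host 35.191.1.151 | handshake_summary
```

A result with a non-zero `syn` and `synack=0`, as in the HTTPHC1 capture, means the probes reach the interface but the VM never answers them.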

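Added 2021-11-16

After some research, I found that an IP alias is missing from the local routing table. Unsurprisingly, it is the load balancer's frontend IP address, the same one that appears as the destination in the tcpdump capture!

Here is the working machine: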

# ip route show dev ens4 table local
local 10.0.0.48 proto kernel scope host src 10.0.0.48 
local 10.0.0.116 proto 66 scope host 
# uname -r
5.4.0-1056-gcp

Here is the fully updated machine:

# ip route show dev ens4 table local
local 10.0.0.54 proto kernel scope host src 10.0.0.54
# uname -r
5.4.0-1057-gcp
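
This check is easy to script for new instances. A minimal sketch, assuming the `local <ip> ...` line format shown above; `has_local_address` is a name invented here, while the interface `ens4` and the address 10.0.0.116 are the values from this post.

```shell
# has_local_address: succeed (exit 0) if the given IP appears as a "local"
# entry in routing-table text read from stdin, fail (exit 1) otherwise.
has_local_address() {
    awk -v ip="$1" '$1 == "local" && $2 == ip { found = 1 } END { exit !found }'
}

# Example (run on the VM):
#   ip route show dev ens4 table local | has_local_address 10.0.0.116 \
#       || echo "frontend IP 10.0.0.116 is missing from the local table"
```

Comparing whole fields rather than grepping for the string avoids matching a longer address such as 10.0.0.1160 by accident.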

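Added 2021-11-20

It is now a known issue: [Cloud Networking] Potential service issue: investigating

Google Cloud global TCP proxy load balancers may fail to serve traffic with forwarding rules configured with IPs in the 34.111.0.0/17 range. A permanent fix for the IP range is in progress.

After testing, cloud-init turned out to be the root cause.
According to this comment, disable_network_activation: true should be set to avoid conflicts with the google-guest-agent service.
The solution is to add this setting to the cloud-init configuration.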
cat > /etc/cloud/cloud.cfg.d/99-disable-network-activation.cfg <<EOF
# Disable network activation to prevent \`cloud-init\` from making network
# changes that conflict with \`google-guest-agent\`.
# See: https://github.com/canonical/cloud-init/pull/1048

disable_network_activation: true
EOF
This file is present in the official image ubuntu-1804-bionic-v20211103.
After adding this file, google-guest-agent works correctly.
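
Before baking a new template from an upgraded image, one could check that the setting is actually in place. The following is only a sketch: `network_activation_disabled` is a hypothetical helper, and it inspects only the `*.cfg` drop-in files in one directory, not the main `/etc/cloud/cloud.cfg` or how cloud-init merges them.

```shell
# network_activation_disabled: succeed if any *.cfg file in the given
# directory (default: /etc/cloud/cloud.cfg.d) contains the line
# "disable_network_activation: true". Comment lines do not match.
network_activation_disabled() {
    grep -Eqs '^[[:space:]]*disable_network_activation:[[:space:]]*true[[:space:]]*$' \
        "${1:-/etc/cloud/cloud.cfg.d}"/*.cfg
}

# Example (run on the VM):
#   network_activation_disabled || echo "cloud-init may still reconfigure the network"
```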

Source: https://serverfault.com/questions/1082601
