Docker causing VM to go dark
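I'm running an Ubuntu 18.04 VM hosted on Nutanix. Every so often I hit an issue where a docker network is created with docker-compose and the machine then becomes completely unresponsive. I've only seen this 3 times, months apart, so it's hard to find a pattern. The last command I ran before it happened was: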
docker-compose -f /path/to/compose.yml up
It started creating the network and then failed:
Creating network "compose_kong-ee" with the default driver
packet_write_wait: Connection to 10.120.160.100 port 22: Broken pipe
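Now, if I reboot the machine without stopping the docker daemon, the system will crash. Looking at kern.log, I see that the br-a249 interface was disabled around the time this happened (~12:30 UTC):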
Jun 3 12:26:28 USDALXKADV01 kernel: [1397332.561962] br-a249e56f08c5: port 1(vethe00f673) entered disabled state
Jun 3 12:26:28 USDALXKADV01 kernel: [1397332.565381] veth6a978dc: renamed from eth0
Jun 3 12:26:28 USDALXKADV01 kernel: [1397332.583094] br-a249e56f08c5: port 1(vethe00f673) entered disabled state
Jun 3 12:26:28 USDALXKADV01 kernel: [1397332.590855] device vethe00f673 left promiscuous mode
Jun 3 12:26:28 USDALXKADV01 kernel: [1397332.590860] br-a249e56f08c5: port 1(vethe00f673) entered disabled state
Jun 3 12:29:15 USDALXKADV01 kernel: [1397500.520269] IPv6: ADDRCONF(NETDEV_UP): br-342fcad19ff7: link is not ready
Jun 3 12:48:52 USDALXKADV01 kernel: [1398677.266687] vmxnet3 0000:03:00.0 ens160: intr type 3, mode 0, 2 vectors allocated
Jun 3 12:48:52 USDALXKADV01 kernel: [1398677.268869] vmxnet3 0000:03:00.0 ens160: NIC Link is Up 10000 Mbps
Jun 3 12:48:52 USDALXKADV01 kernel: [1398677.271911] IPv6: ADDRCONF(NETDEV_UP): ens160: link is not ready
Jun 3 12:48:52 USDALXKADV01 kernel: [1398677.271929] IPv6: ADDRCONF(NETDEV_CHANGE): ens160: link becomes ready
Jun 3 12:50:57 USDALXKADV01 kernel: [1398801.703190] br-a249e56f08c5: port 2(vetha6450a1) entered disabled state
Jun 3 12:50:57 USDALXKADV01 kernel: [1398801.705254] veth1c11b8c: renamed from eth0
Jun 3 12:50:57 USDALXKADV01 kernel: [1398801.718831] br-a249e56f08c5: port 2(vetha6450a1) entered disabled state
Jun 3 12:50:57 USDALXKADV01 kernel: [1398801.726647] device vetha6450a1 left promiscuous mode
Jun 3 12:50:57 USDALXKADV01 kernel: [1398801.726652] br-a249e56f08c5: port 2(vetha6450a1) entered disabled state
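When I got in, this interface was up, and it corresponds to one of the two containers I was running at the time. I had issued a docker stop to one of them to facilitate an upgrade:
IP address for docker0: 172.17.0.1
IP address for br-a14bcb10b447: 172.18.0.1
IP address for br-a249e56f08c5: 172.22.0.1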
Looking at the syslog entries, I can see the time span of the entire ssh session:
Jun 3 12:22:33 USDALXKADV01 systemd[1]: Started Session 468 of user kong.
Jun 3 12:26:27 USDALXKADV01 containerd[1468]: time="2020-06-03T12:26:27.921166420Z" level=info msg="shim reaped" id=f3a678f2747a3398a15dd605299d1b18b9d173e6c68d4bb8c8c44e7a56c2ed2a
Jun 3 12:26:27 USDALXKADV01 dockerd[12217]: time="2020-06-03T12:26:27.933328211Z" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Jun 3 12:26:28 USDALXKADV01 kernel: [1397332.561962] br-a249e56f08c5: port 1(vethe00f673) entered disabled state
Jun 3 12:26:28 USDALXKADV01 systemd-networkd[1398]: vethe00f673: Lost carrier
Jun 3 12:26:28 USDALXKADV01 systemd-timesyncd[499]: Network configuration changed, trying to establish connection.
Jun 3 12:26:28 USDALXKADV01 kernel: [1397332.565381] veth6a978dc: renamed from eth0
Jun 3 12:26:28 USDALXKADV01 systemd-udevd[14938]: link_config: autonegotiation is unset or enabled, the speed and duplex are not writable.
Jun 3 12:26:28 USDALXKADV01 kernel: [1397332.583094] br-a249e56f08c5: port 1(vethe00f673) entered disabled state
Jun 3 12:26:28 USDALXKADV01 systemd-networkd[1398]: vethe00f673: Link DOWN
Jun 3 12:26:28 USDALXKADV01 networkd-dispatcher[763]: WARNING:Unknown index 173 seen, reloading interface list
Jun 3 12:26:28 USDALXKADV01 kernel: [1397332.590855] device vethe00f673 left promiscuous mode
Jun 3 12:26:28 USDALXKADV01 kernel: [1397332.590860] br-a249e56f08c5: port 1(vethe00f673) entered disabled state
Jun 3 12:26:28 USDALXKADV01 networkd-dispatcher[763]: ERROR:Unknown interface index 173 seen even after reload
Jun 3 12:26:28 USDALXKADV01 systemd-timesyncd[499]: Synchronized to time server 91.189.89.198:123 (ntp.ubuntu.com).
Jun 3 12:26:28 USDALXKADV01 systemd[1]: Starting OpenNebula delayed reconfiguration script...
Jun 3 12:26:28 USDALXKADV01 systemd[1]: Started OpenNebula delayed reconfiguration script.
Jun 3 12:27:28 USDALXKADV01 one-contextd[15036]: Started for type all to reconfigure
Jun 3 12:27:28 USDALXKADV01 one-contextd[15040]: Acquiring lock /var/run/one-context/one-context.lock
Jun 3 12:27:28 USDALXKADV01 one-contextd[15042]: Acquired lock /var/run/one-context/one-context.lock
Jun 3 12:27:29 USDALXKADV01 one-contextd[15055]: Reading context via vmtoolsd
Jun 3 12:27:29 USDALXKADV01 one-contextd[15064]: Comparing /var/run/one-context/context.sh.0KXIk7 and /var/run/one-context/context.sh.local for changes
Jun 3 12:27:29 USDALXKADV01 one-contextd[15066]: No changes in context, skipping
Jun 3 12:27:29 USDALXKADV01 one-contextd[15067]: Comparing /var/run/one-context/context.sh.0KXIk7 and /var/run/one-context/context.sh.network for changes
Jun 3 12:27:29 USDALXKADV01 one-contextd[15069]: No changes in context, skipping
Jun 3 12:27:29 USDALXKADV01 one-contextd[15070]: Done
Jun 3 12:27:29 USDALXKADV01 one-contextd[15071]: Unmounting /var/run/one-context/mount.VEaypB
Jun 3 12:27:29 USDALXKADV01 one-contextd[15075]: Releasing lock /var/run/one-context/one-context.lock
Jun 3 12:29:15 USDALXKADV01 systemd-udevd[15148]: link_config: autonegotiation is unset or enabled, the speed and duplex are not writable.
Jun 3 12:29:15 USDALXKADV01 systemd-networkd[1398]: br-342fcad19ff7: Link UP
Jun 3 12:29:15 USDALXKADV01 systemd-timesyncd[499]: Network configuration changed, trying to establish connection.
Jun 3 12:29:15 USDALXKADV01 networkd-dispatcher[763]: WARNING:Unknown index 177 seen, reloading interface list
Jun 3 12:29:15 USDALXKADV01 kernel: [1397500.520269] IPv6: ADDRCONF(NETDEV_UP): br-342fcad19ff7: link is not ready
Jun 3 12:29:16 USDALXKADV01 systemd-timesyncd[499]: Synchronized to time server 91.189.89.198:123 (ntp.ubuntu.com).
Jun 3 12:29:16 USDALXKADV01 systemd[1]: Starting OpenNebula delayed reconfiguration script...
Jun 3 12:29:16 USDALXKADV01 systemd[1]: Started OpenNebula delayed reconfiguration script.
Jun 3 12:29:16 USDALXKADV01 systemd-timesyncd[499]: Network configuration changed, trying to establish connection.
Jun 3 12:29:16 USDALXKADV01 systemd-timesyncd[499]: Synchronized to time server 91.189.89.198:123 (ntp.ubuntu.com).
Jun 3 12:30:16 USDALXKADV01 one-contextd[15249]: Started for type all to reconfigure
Jun 3 12:30:16 USDALXKADV01 one-contextd[15253]: Acquiring lock /var/run/one-context/one-context.lock
Jun 3 12:30:16 USDALXKADV01 one-contextd[15255]: Acquired lock /var/run/one-context/one-context.lock
Jun 3 12:30:16 USDALXKADV01 one-contextd[15268]: Reading context via vmtoolsd
Jun 3 12:30:16 USDALXKADV01 one-contextd[15277]: Comparing /var/run/one-context/context.sh.JXXPUJ and /var/run/one-context/context.sh.local for changes
Jun 3 12:30:16 USDALXKADV01 one-contextd[15279]: No changes in context, skipping
Jun 3 12:30:16 USDALXKADV01 one-contextd[15280]: Comparing /var/run/one-context/context.sh.JXXPUJ and /var/run/one-context/context.sh.network for changes
Jun 3 12:30:16 USDALXKADV01 one-contextd[15282]: No changes in context, skipping
Jun 3 12:30:16 USDALXKADV01 one-contextd[15283]: Done
Jun 3 12:30:16 USDALXKADV01 one-contextd[15284]: Unmounting /var/run/one-context/mount.arNJ1s
Jun 3 12:30:16 USDALXKADV01 one-contextd[15288]: Releasing lock /var/run/one-context/one-context.lock
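For anyone who wants to pull just the bridge port state changes out of logs like the ones above, here is a minimal Python sketch. The regular expression is my own guess at the message layout based only on the lines quoted in this post, and the function name bridge_port_events is made up for illustration; other kernels or syslog formats may need adjustments.

```python
import re

# Assumed layout, taken from the log lines quoted in this post:
#   Jun 3 12:26:28 HOST kernel: [1397332.561962] br-<id>: port N(vethX) entered <state> state
BRIDGE_PORT_RE = re.compile(
    r"^(?P<when>\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+\S+\s+kernel:\s+\[\s*[\d.]+\]\s+"
    r"(?P<bridge>br-[0-9a-f]+): port (?P<port>\d+)\((?P<veth>\w+)\) "
    r"entered (?P<state>\w+) state$"
)


def bridge_port_events(lines):
    """Return (timestamp, bridge, port, veth, state) for each bridge port message."""
    events = []
    for line in lines:
        match = BRIDGE_PORT_RE.match(line.strip())
        if match:
            events.append(
                (
                    match["when"],
                    match["bridge"],
                    int(match["port"]),
                    match["veth"],
                    match["state"],
                )
            )
    return events


# Two lines copied from the kern.log excerpt; only the first is a port state change.
sample = [
    "Jun 3 12:26:28 USDALXKADV01 kernel: [1397332.561962] br-a249e56f08c5: port 1(vethe00f673) entered disabled state",
    "Jun 3 12:26:28 USDALXKADV01 kernel: [1397332.565381] veth6a978dc: renamed from eth0",
]
events = bridge_port_events(sample)
for event in events:
    print(event)
```

To run it against a real file, pass an open file object (for example the result of open("/var/log/kern.log")) instead of the sample list.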
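What could cause this? I'd rather not have to reinstall docker, since that seems like a band-aid and doesn't address the crux of the issue. It seems related to how docker sets up networks, but I'm not 100% sure that's the problem.
I'm using Docker version 19.03.8, build afacb8b7f0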
Update
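I worked around this by doing a very bad thing in /var/lib/docker/network/files: deleting the networks there allowed docker to start cleanly. While this fixes the problem, it still doesn't give me a clear answer about what is going on here. It appears that, with the default docker configuration on this particular machine, there is a cap on the number of networks you can have at any given time.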
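A rough way to estimate such a cap, assuming the daemon hands each new network one fixed-size subnet from its address pools: count how many subnets of the configured size fit in each pool. The sketch below does that in Python. The pool list is what I believe to be the built-in default for this Docker generation, so treat it as an assumption and check it against your own daemon; the function name pool_capacity is made up for illustration.

```python
import ipaddress


def pool_capacity(pools):
    """Count how many subnets of the requested prefix size the pools can supply."""
    total = 0
    for pool in pools:
        base = ipaddress.ip_network(pool["base"])
        size = pool["size"]
        if size < base.prefixlen:
            raise ValueError(f"size /{size} is larger than pool {base}")
        # A /P pool split into /S subnets yields 2 ** (S - P) subnets.
        total += 2 ** (size - base.prefixlen)
    return total


# ASSUMPTION: believed built-in defaults, not read from the machine in question.
ASSUMED_DEFAULT_POOLS = [
    {"base": "172.17.0.0/16", "size": 16},
    {"base": "172.18.0.0/16", "size": 16},
    {"base": "172.19.0.0/16", "size": 16},
    {"base": "172.20.0.0/14", "size": 16},
    {"base": "172.24.0.0/14", "size": 16},
    {"base": "172.28.0.0/14", "size": 16},
    {"base": "192.168.0.0/16", "size": 20},
]

print(pool_capacity(ASSUMED_DEFAULT_POOLS))
```

If that assumption holds, the count includes the subnet taken by docker0, so the number of extra networks that compose projects can create would be correspondingly lower. Pools with a smaller subnet size (a larger prefix number) raise the count.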
Maybe:
- You are trying to use an IP range that is already set up on another device, which breaks your routes.
The default network ranges are set in /etc/docker/daemon.json:
{ "default-address-pools": [ {"base":"10.10.0.0/16","size":24} ] }
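- It may also be that you hit a soft/hard limit
- Or networking simply breaks because your VM doesn't have enough memory allocated (shm) for the network stack. Check with: cat /etc/sysctl.conf | grep vm
You should indeed tune:
the number of open files the system can handle: http://www.dba-oracle.com/t_increase_number_of_open_file_descriptors.htm
- Be sure to disable swap
Investigate your sysctl values, especially those starting with "vm".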
(Tuning, for example:
sysctl -w vm.swappiness=10; sysctl -w vm.max_map_count=262144
)
Related: Kubernetes, Docker and vm.max_map_count
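The first possibility in the list (an address range clash) can be checked offline. Below is a minimal Python sketch that compares Docker pool ranges against networks the host already uses. The host network list is hypothetical: the addresses come from the question, but the /24 and /16 prefix lengths are my assumptions, and find_overlaps is a name made up for illustration. Replace the inputs with what `ip route` reports on your machine.

```python
import ipaddress


def find_overlaps(pool_bases, host_networks):
    """Return (pool, host_network) pairs whose address ranges overlap."""
    clashes = []
    for pool in pool_bases:
        pool_net = ipaddress.ip_network(pool)
        for host in host_networks:
            if pool_net.overlaps(ipaddress.ip_network(host)):
                clashes.append((pool, host))
    return clashes


# Pool base taken from the daemon.json example in this answer.
pools = ["10.10.0.0/16"]

# ASSUMPTION: prefix lengths are guesses; only the addresses appear in the question.
host_networks = [
    "10.120.160.0/24",  # network of the VM's own address 10.120.160.100
    "172.17.0.0/16",    # docker0
    "172.18.0.0/16",    # br-a14bcb10b447
    "172.22.0.0/16",    # br-a249e56f08c5
]

print(find_overlaps(pools, host_networks))
```

An empty result means the pool does not collide with any listed network. A pool such as 10.120.0.0/16 would be reported as clashing with the VM's own network under these assumptions.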