Containerd fails to start after Nvidia config
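I followed this official tutorial to give a bare-metal k8s cluster GPU access, but I ran into errors while doing so.

Environment: Kubernetes 1.21, containerd 1.4.11, and Ubuntu 20.04.3 LTS (GNU/Linux 5.4.0-91-generic x86_64).
The Nvidia driver is preinstalled on the host OS, version 495 Headless.

I pasted the configuration below into /etc/containerd/config.toml. After restarting the service, containerd fails to start with exit 1.

The system journal is here.

containerd config.toml: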
    # persistent data location
    root = "/var/lib/containerd"
    # runtime state information
    state = "/run/containerd"
    # Kubernetes doesn't use containerd restart manager.
    disabled_plugins = ["restart"]

    # NVIDIA CONFIG START HERE
    version = 2
    [plugins]
      [plugins."io.containerd.grpc.v1.cri"]
        [plugins."io.containerd.grpc.v1.cri".containerd]
          default_runtime_name = "nvidia"

          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
            [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
              privileged_without_host_devices = false
              runtime_engine = ""
              runtime_root = ""
              runtime_type = "io.containerd.runc.v2"
              [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
                BinaryName = "/usr/bin/nvidia-container-runtime"
    # NVIDIA CONFIG ENDS HERE

    [debug]
      level = ""

    [grpc]
      max_recv_message_size = 16777216
      max_send_message_size = 16777216

    [plugins.linux]
      shim = "/usr/bin/containerd-shim"
      runtime = "/usr/bin/runc"
I can confirm that the Nvidia driver does detect the GPU (Nvidia GTX 750Ti). Running nvidia-smi gives the following output:

    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 495.44       Driver Version: 495.44       CUDA Version: 11.5     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  NVIDIA GeForce ...  Off  | 00000000:02:00.0 Off |                  N/A |
    | 34%   34C    P8     1W /  38W |      0MiB /  2000MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+

    +-----------------------------------------------------------------------------+
    | Processes:                                                                  |
    |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
    |        ID   ID                                                   Usage      |
    |=============================================================================|
    |  No running processes found                                                 |
    +-----------------------------------------------------------------------------+
Edit: a modified config.toml got it to work.
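As best I can tell, it's this:

    Dec 02 03:15:36 k8s-node0 containerd[2179737]: containerd: invalid disabled plugin URI "restart" expect io.containerd.x.vx
    Dec 02 03:15:36 k8s-node0 systemd[1]: containerd.service: Main process exited, code=exited, status=1/FAILURE

In other words, containerd rejects the bare "restart" entry in disabled_plugins because it expects a fully qualified plugin URI there.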
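So if you know that the restart-ish plugin is in fact enabled, you'll need to track down its new URI syntax. I'd actually recommend just commenting out that stanza, or using disabled_plugins = [], since the containerd ansible role we use doesn't mention anything about "restart" and does have the = [] flavor.

Tangentially, you may want to restrict your journalctl invocation in the future to just containerd.service, since that filters out a lot of distracting text:

    journalctl -u containerd.service

You can even restrict it to just the last few lines, which can sometimes help further:

    journalctl -u containerd.service --lines=250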
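To make the disabled_plugins = [] suggestion concrete, here is a minimal sketch of how the edit could be scripted. The helper name fix_disabled_plugins is hypothetical (it is not part of containerd), and the sketch assumes GNU sed and that the disabled_plugins assignment sits on a single line, as it does in the config from the question.

```shell
#!/usr/bin/env bash
# Sketch only: empty the disabled_plugins list in a containerd config file.
# fix_disabled_plugins is a hypothetical helper name, not a containerd tool.
fix_disabled_plugins() {
  local config_file="$1"

  # Keep a backup next to the original before editing in place.
  cp -- "$config_file" "${config_file}.bak"

  # Replace whatever is assigned to disabled_plugins with an empty list.
  # Assumes the assignment fits on one line (GNU sed syntax).
  sed -i -E 's/^[[:space:]]*disabled_plugins[[:space:]]*=.*$/disabled_plugins = []/' "$config_file"
}

# Assumed usage on the node (needs root); nothing below runs automatically:
#   fix_disabled_plugins /etc/containerd/config.toml
#   systemctl restart containerd
#   journalctl -u containerd.service --lines=50
```

The function only rewrites the one line and leaves a .bak copy, so the change is easy to revert if containerd still refuses to start for some other reason.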