Kubernetes

Containerd fails to start after Nvidia config

  • December 3, 2021

I followed this official tutorial to give a bare-metal k8s cluster GPU access, but I ran into an error while doing so.

Kubernetes 1.21, containerd 1.4.11, and Ubuntu 20.04.3 LTS (GNU/Linux 5.4.0-91-generic x86_64).

The Nvidia driver is preinstalled on the host OS, version 495 Headless.

After pasting the following config into /etc/containerd/config.toml and restarting the service, containerd fails to start with exit 1.

Containerd config.toml

The system journal is here.

# persistent data location
root = "/var/lib/containerd"
# runtime state information
state = "/run/containerd"

# Kubernetes doesn't use containerd restart manager.
disabled_plugins = ["restart"]

# NVIDIA CONFIG START HERE

version = 2
[plugins]
 [plugins."io.containerd.grpc.v1.cri"]
   [plugins."io.containerd.grpc.v1.cri".containerd]
     default_runtime_name = "nvidia"

     [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
       [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
         privileged_without_host_devices = false
         runtime_engine = ""
         runtime_root = ""
         runtime_type = "io.containerd.runc.v2"
         [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
           BinaryName = "/usr/bin/nvidia-container-runtime"

# NVIDIA CONFIG ENDS HERE

[debug]
 level = ""

[grpc]
 max_recv_message_size = 16777216
 max_send_message_size = 16777216

[plugins.linux]
 shim = "/usr/bin/containerd-shim"
 runtime = "/usr/bin/runc"

I can confirm that the Nvidia driver does detect the GPU (Nvidia GTX 750Ti): running nvidia-smi gives the following output

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.44       Driver Version: 495.44       CUDA Version: 11.5     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:02:00.0 Off |                  N/A |
| 34%   34C    P8     1W /  38W |      0MiB /  2000MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

I modified the config.toml, which made it work.

As best I can tell, it's this:

Dec 02 03:15:36 k8s-node0 containerd[2179737]: containerd: invalid disabled plugin URI "restart" expect io.containerd.x.vx
Dec 02 03:15:36 k8s-node0 systemd[1]: containerd.service: Main process exited, code=exited, status=1/FAILURE
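
To make the complaint in that log line concrete, here is a rough sketch of the kind of shape check it implies. It assumes, going only by the error text and the `version = 2` line in the config, that plugin names in a version 2 config need at least four dot-separated parts (io.containerd.x.vx). `is_v2_plugin_uri` is a hypothetical helper for illustration, not something containerd ships.

```shell
# Hypothetical helper: succeeds when the name has at least four
# dot-separated parts, the shape hinted at by "expect io.containerd.x.vx".
# This is an assumption drawn from the error text, not containerd's own code.
is_v2_plugin_uri() {
    case "$1" in
        *.*.*.*) return 0 ;;   # three or more dots: plausible full plugin URI
        *)       return 1 ;;   # short names such as "restart" do not qualify
    esac
}

# Example: report on a couple of names taken from the config above.
for name in "restart" "io.containerd.grpc.v1.cri"; do
    if is_v2_plugin_uri "$name"; then
        echo "$name: looks like a full plugin URI"
    else
        echo "$name: short name, likely rejected under version = 2"
    fi
done
```

Under that assumption, the bare "restart" entry is what trips the check, while the fully qualified cri key in the same file passes.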

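So if you know that the restart-ish plugin is in fact enabled, you'll need to track down its new URI syntax. I'd actually recommend just commenting out that stanza, or using disabled_plugins = [], since the containerd ansible role we use doesn't mention anything about "restart" and does have the = [] flavor.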
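
A minimal sketch of the disabled_plugins = [] route, in case it helps. `fix_disabled_plugins` is a hypothetical helper name. It assumes the key sits on a single line at the start of that line, as in the config above, and it keeps a `.bak` copy so the change can be reviewed or reverted.

```shell
# Hypothetical helper: rewrite the top-level disabled_plugins line to an
# empty list. Assumes the key is on one line with no leading whitespace.
fix_disabled_plugins() {
    config="$1"
    # Keep a backup next to the file; stop if the copy fails.
    cp "$config" "$config.bak" || return 1
    # Read from the backup and write the edited result over the original.
    sed 's/^disabled_plugins[[:space:]]*=.*$/disabled_plugins = []/' \
        "$config.bak" > "$config"
}
```

Try it on a copy of the file first. Once the result looks right, running it against /etc/containerd/config.toml (as root) and then `systemctl restart containerd` should show whether the service comes up.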


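Tangentially, you may want to restrict your journalctl invocation in the future to look only at containerd.service, since the full journal throws up a lot of distracting text: journalctl -u containerd.service. You can even restrict it to just the last few lines, which sometimes helps further: journalctl -u containerd.service --lines=250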

Quoted from: https://serverfault.com/questions/1085068