GKE cannot schedule newly created GPU pods on newly added GPU nodes

July 29, 2020

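When I add a new node pool with GPUs, Google Kubernetes Engine fails to schedule newly created pods that require GPUs onto the new nodes. I expected this to happen automatically, but apparently it does not for GPU resources: the new pods stay in the "Pending" state forever. How can I fix this?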

Edit: here is the deployment YAML file. My goal is to avoid binding the deployment to a specific node:

   ---
   apiVersion: machinelearning.seldon.io/v1alpha2
   kind: SldDeployment
   metadata:
     labels:
       app: sld
     name: trs-sld
     namespace: trs
   spec:
     annotations:
       project_name: Trs
       deployment_version: v1.0
       seldon.io/rest-connect-retries: '5'
       seldon.io/grpc-connect-retries: '5'
       seldon.io/istio-retries: '10' 
       seldon.io/istio-retries-timeout: '12' 
     name: trs
     predictors:
     - componentSpecs:
       - spec:
           containers:
           - image: eu.gcr.io/trs-141513/trs-native:latest
             imagePullPolicy: Always
             name: classifier
             resources:
               limits:
                 nvidia.com/gpu: 2
             volumeMounts:
               - mountPath: /etc/google_storage/creds
                 name: service-account-creds
                 readOnly: true
           volumes:
             - name: service-account-creds
               secret:
                 secretName: service-account-creds
           terminationGracePeriodSeconds: 20
       graph:
         children: []
         name: classifier
         endpoint:
           type: REST
         type: MODEL
       name: model
       replicas: 1
       annotations:
         predictor_version: v1.0
   ---

It turned out that the GPU driver has to be installed each time new nodes are added. For example, for nodes running the Ubuntu image:
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/ubuntu/daemonset-preloaded.yaml
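
To confirm that the driver installation took effect, one option is to check whether the GPU nodes actually advertise `nvidia.com/gpu` as an allocatable resource. The following is a minimal sketch, not part of the original answer. It assumes that GPU nodes carry the `cloud.google.com/gke-accelerator` label and that `kubectl` is already configured for the cluster. The function names are hypothetical.

```python
from __future__ import annotations

import json
import subprocess

GPU_RESOURCE = "nvidia.com/gpu"
# Assumption: GKE labels nodes that have GPUs attached with this key.
ACCELERATOR_LABEL = "cloud.google.com/gke-accelerator"


def gpu_nodes_without_capacity(node_list: dict) -> list[str]:
    """Return names of GPU-labeled nodes that advertise no allocatable GPUs.

    `node_list` is the parsed output of `kubectl get nodes -o json`.
    """
    missing = []
    for node in node_list.get("items", []):
        meta = node.get("metadata", {})
        if ACCELERATOR_LABEL not in meta.get("labels", {}):
            continue  # not a GPU node, nothing to check
        allocatable = node.get("status", {}).get("allocatable", {})
        # Kubernetes reports resource quantities as strings, e.g. "2".
        if int(allocatable.get(GPU_RESOURCE, "0")) == 0:
            missing.append(meta.get("name", "<unnamed>"))
    return missing


def check_cluster() -> list[str]:
    """Hypothetical helper: query the current kubectl context."""
    raw = subprocess.run(
        ["kubectl", "get", "nodes", "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return gpu_nodes_without_capacity(json.loads(raw))
```

Calling `check_cluster()` after applying the DaemonSet lists the GPU nodes that still report no allocatable `nvidia.com/gpu`. Pods requesting `nvidia.com/gpu: 2`, as in the deployment above, can only be scheduled onto a node that reports at least that many allocatable GPUs.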

Source: https://serverfault.com/questions/1025600
