
Kubernetes node metrics endpoint returns 401

  • May 10, 2017

I have a GKE cluster which, for the sake of simplicity, runs just Prometheus, monitoring each member node. Recently I upgraded the API server to 1.6 (which introduces RBAC) without any problem. I then added a new node, running a 1.6 kubelet. Prometheus could not access the metrics API of this new node.

(Screenshot: Prometheus targets page)

So I added a ServiceAccount, a ClusterRole, and a ClusterRoleBinding in the namespace, and configured the Deployment to use the new ServiceAccount. Then I deleted the pod for good measure:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
---

apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  name: prometheus
rules:
- apiGroups: [""]
  resources:
  - nodes
  - services
  - endpoints
  - pods
  verbs: ["get", "list", "watch"]
- apiGroups: [""]
  resources:
  - configmaps
  verbs: ["get"]
- nonResourceURLs: ["/metrics"]
  verbs: ["get"]
---

apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
- kind: ServiceAccount
  name: prometheus
  namespace: default
---

apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: default
secrets:
- name: prometheus-token-xxxxx

---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  labels:
    app: prometheus-prometheus
    component: server
    release: prometheus
  name: prometheus-server
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus-prometheus
      component: server
      release: prometheus
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
    type: RollingUpdate
  template:
    metadata:
      labels:
        app: prometheus-prometheus
        component: server
        release: prometheus
    spec:
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      serviceAccount: prometheus
      serviceAccountName: prometheus
      ...

But nothing changed.

The metrics endpoint returns HTTP/1.1 401 Unauthorized, and when I modify the Deployment to include another container with bash + curl installed and issue the request manually, I get:

# curl -vsSk -H "Authorization: Bearer $(</var/run/secrets/kubernetes.io/serviceaccount/token)" https://$NODE_IP:10250/metrics
*   Trying $NODE_IP...
* Connected to $NODE_IP ($NODE_IP) port 10250 (#0)
* found XXX certificates in /etc/ssl/certs/ca-certificates.crt
* found XXX certificates in /etc/ssl/certs
* ALPN, offering http/1.1
* SSL connection using TLS1.2 / ECDHE_RSA_AES_128_GCM_SHA256
*    server certificate verification SKIPPED
*    server certificate status verification SKIPPED
*    common name: node-running-kubelet-1-6@000000000 (does not match '$NODE_IP')
*    server certificate expiration date OK
*    server certificate activation date OK
*    certificate public key: RSA
*    certificate version: #3
*    subject: CN=node-running-kubelet-1-6@000000000
*    start date: Fri, 07 Apr 2017 22:00:00 GMT
*    expire date: Sat, 07 Apr 2018 22:00:00 GMT
*    issuer: CN=node-running-kubelet-1-6@000000000
*    compression: NULL
* ALPN, server accepted to use http/1.1
> GET /metrics HTTP/1.1
> Host: $NODE_IP:10250
> User-Agent: curl/7.47.0
> Accept: */*
> Authorization: Bearer **censored**
>
< HTTP/1.1 401 Unauthorized
< Date: Mon, 10 Apr 2017 20:04:20 GMT
< Content-Length: 12
< Content-Type: text/plain; charset=utf-8
<
* Connection #0 to host $NODE_IP left intact
  • Why does that token not allow me to access this resource?
  • How can I check what access has been granted to a ServiceAccount?
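For the second question, one way to inspect a ServiceAccount's permissions — assuming kubectl 1.6+ and credentials allowed to impersonate service accounts (names below match the manifests above) — is `kubectl auth can-i` with `--as`:

```shell
# Check what the prometheus ServiceAccount may do by impersonating it.
# Requires impersonation privileges for the caller.
kubectl auth can-i get nodes \
  --as=system:serviceaccount:default:prometheus

# Non-resource URLs such as /metrics can be checked the same way.
kubectl auth can-i get /metrics \
  --as=system:serviceaccount:default:prometheus
```

Each invocation prints `yes` or `no` depending on the RBAC rules in effect.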

Per the discussion on @JorritSalverda's ticket: https://github.com/prometheus/prometheus/issues/2606#issuecomment-294869099

Since GKE doesn't let you get client certificates that would let you authenticate yourself with the kubelet, the best solution for users on GKE seems to be to use the Kubernetes API server as a proxy for requests to the nodes.
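Concretely, instead of hitting the kubelet on port 10250 directly, a scrape (or a manual check from inside a pod) goes through the API server's node proxy. A sketch, assuming the default in-cluster token and CA paths, with a hypothetical node name to substitute:

```shell
# From inside a pod: fetch a node's metrics through the API server proxy
# instead of talking to the kubelet on :10250 directly.
TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
CACERT=/var/run/secrets/kubernetes.io/serviceaccount/ca.crt
NODE_NAME=my-node-name   # hypothetical; substitute a real node name

curl -sS --cacert "$CACERT" \
  -H "Authorization: Bearer $TOKEN" \
  "https://kubernetes.default.svc/api/v1/nodes/${NODE_NAME}/proxy/metrics"
```

This only works once the service account has `get` on the `nodes/proxy` resource, which is exactly what the ClusterRole below adds.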

To do so (quoting @JorritSalverda):

"For my Prometheus server running inside GKE I now have it running with the following relabeling:

relabel_configs:
- action: labelmap
  regex: __meta_kubernetes_node_label_(.+)
- target_label: __address__
  replacement: kubernetes.default.svc.cluster.local:443
- target_label: __scheme__
  replacement: https
- source_labels: [__meta_kubernetes_node_name]
  regex: (.+)
  target_label: __metrics_path__
  replacement: /api/v1/nodes/${1}/proxy/metrics

And the following ClusterRole bound to the service account used by Prometheus:

apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  name: prometheus
rules:
- apiGroups: [""]
  resources:
  - nodes
  - nodes/proxy
  - services
  - endpoints
  - pods
  verbs: ["get", "list", "watch"]

Because GKE clusters still have an ABAC fallback in case RBAC fails, I'm not 100% sure yet whether this covers all the required permissions."

I ran into the same issue, created ticket https://github.com/prometheus/prometheus/issues/2606 for it, and updated the configuration examples via PR https://github.com/prometheus/prometheus/pull/2641 following the discussion there.

You can see the updated relabeling for the kubernetes-nodes job at https://github.com/prometheus/prometheus/blob/master/documentation/examples/prometheus-kubernetes.yml#L76-L84

Copied for reference:

 relabel_configs:
 - action: labelmap
   regex: __meta_kubernetes_node_label_(.+)
 - target_label: __address__
   replacement: kubernetes.default.svc:443
 - source_labels: [__meta_kubernetes_node_name]
   regex: (.+)
   target_label: __metrics_path__
   replacement: /api/v1/nodes/${1}/proxy/metrics
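As an illustration of what the `__metrics_path__` rule computes, the regex `(.+)` with replacement `/api/v1/nodes/${1}/proxy/metrics` can be mimicked with sed (the node name below is made up):

```shell
# Mimic the relabel rule: apply regex (.+) with replacement
# /api/v1/nodes/${1}/proxy/metrics to a node name.
node="gke-default-pool-abc123"   # hypothetical node name
printf '%s\n' "$node" | sed -E 's|(.+)|/api/v1/nodes/\1/proxy/metrics|'
# -> /api/v1/nodes/gke-default-pool-abc123/proxy/metrics
```

Combined with `__address__` being rewritten to the API server's service address, each node target is scraped through the proxy path rather than on port 10250.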

For RBAC itself, you need to run Prometheus with a service account you create yourself:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: default

Make sure to pass that service account into the pods with the following pod spec:

spec:
  serviceAccount: prometheus

The Kubernetes manifest for setting up the appropriate RBAC role and bindings to give the prometheus service account access to the required API endpoints can then be found at https://github.com/prometheus/prometheus/blob/master/documentation/examples/rbac-setup.yml

Copied for reference:

apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  name: prometheus
rules:
- apiGroups: [""]
  resources:
  - nodes
  - nodes/proxy
  - services
  - endpoints
  - pods
  verbs: ["get", "list", "watch"]
- nonResourceURLs: ["/metrics"]
  verbs: ["get"]
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
- kind: ServiceAccount
  name: prometheus
  namespace: default

Replace the namespace in all manifests to match the one you run Prometheus in, and then apply the manifests with an account that has cluster-admin rights.
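On GKE, you may first need to grant your own account cluster-admin, since your gcloud identity does not automatically hold RBAC admin rights. A sketch, assuming gcloud is configured and the manifests are saved as rbac-setup.yml (both the binding name and the file name are arbitrary):

```shell
# Grant your own Google account cluster-admin (typically needed once per
# cluster on GKE before you can create RBAC objects).
kubectl create clusterrolebinding my-cluster-admin-binding \
  --clusterrole=cluster-admin \
  --user="$(gcloud config get-value account)"

# Apply the RBAC manifests (file name is hypothetical).
kubectl apply -f rbac-setup.yml
```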

I haven't tested this in a cluster without the ABAC fallback, so the RBAC role may still be missing something essential.

Quoted from: https://serverfault.com/questions/843751