Monitoring

Prometheus not connected to Alertmanager in GKE

  • October 19, 2021

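I used Helm to install kube-prometheus-stack 15.3.1 into a GKE cluster, in the "monitoring" namespace. In values.yaml I turned on ingress for some components and added SMTP settings and receiver details for Alertmanager. For the most part everything seems fine, except that Prometheus is firing a number of alerts and I am not receiving any alert emails.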

One of the firing alerts is:

PrometheusNotConnectedToAlertmanagers

Prometheus monitoring/prometheus-kube-prometheus-stack-prometheus-0 is not connected to any Alertmanagers

Another is:

PrometheusOperatorSyncFailed

Controller alertmanager in monitoring namespace fails to reconcile 1 objects.

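I also tried turning on ingress for Alertmanager and pointing alerts.mydomain.com at it, but any GET request I try (for example alerts.mydomain.com/v2/status) returns a 502 server error.

What do I need to do to get my Alertmanager working?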

Here is the output of kubectl get pods,svc,daemonset,deployment,statefulset -n monitoring:

NAME                                                            READY   STATUS    RESTARTS   AGE
pod/kube-prometheus-stack-grafana-58f7fcb497-hm72h              2/2     Running   0          30h
pod/kube-prometheus-stack-kube-state-metrics-6d588499f5-d957b   1/1     Running   0          2d3h
pod/kube-prometheus-stack-operator-54f89674c9-k8ml7             1/1     Running   0          2d3h
pod/kube-prometheus-stack-prometheus-node-exporter-22vpd        1/1     Running   0          3h57m
pod/kube-prometheus-stack-prometheus-node-exporter-2qsl9        1/1     Running   0          3h57m
pod/kube-prometheus-stack-prometheus-node-exporter-4d27n        1/1     Running   0          7h36m
pod/kube-prometheus-stack-prometheus-node-exporter-7rlnk        1/1     Running   0          4h47m
pod/kube-prometheus-stack-prometheus-node-exporter-7xlf4        1/1     Running   0          4h51m
pod/kube-prometheus-stack-prometheus-node-exporter-9mfnt        1/1     Running   0          3h57m
pod/kube-prometheus-stack-prometheus-node-exporter-9zblf        1/1     Running   0          2d3h
pod/kube-prometheus-stack-prometheus-node-exporter-bdcjj        1/1     Running   0          2d3h
pod/kube-prometheus-stack-prometheus-node-exporter-bs54w        1/1     Running   0          4h47m
pod/kube-prometheus-stack-prometheus-node-exporter-fp95h        1/1     Running   0          2d3h
pod/kube-prometheus-stack-prometheus-node-exporter-h4zhw        1/1     Running   0          2d3h
pod/kube-prometheus-stack-prometheus-node-exporter-pz8js        1/1     Running   0          3h58m
pod/kube-prometheus-stack-prometheus-node-exporter-rrrhk        1/1     Running   0          27h
pod/kube-prometheus-stack-prometheus-node-exporter-rszlt        1/1     Running   0          2d3h
pod/kube-prometheus-stack-prometheus-node-exporter-s62wq        1/1     Running   0          4h47m
pod/kube-prometheus-stack-prometheus-node-exporter-w9dmb        1/1     Running   0          5h32m
pod/kube-prometheus-stack-prometheus-node-exporter-xqmxk        1/1     Running   0          4h51m
pod/prometheus-kube-prometheus-stack-prometheus-0               2/2     Running   1          30h

NAME                                                     TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)          AGE
service/kube-prometheus-stack-alertmanager               NodePort    10.125.4.161    <none>        9093:30903/TCP   2d3h
service/kube-prometheus-stack-grafana                    NodePort    10.125.7.177    <none>        80:32444/TCP     2d3h
service/kube-prometheus-stack-kube-state-metrics         ClusterIP   10.125.2.56     <none>        8080/TCP         2d3h
service/kube-prometheus-stack-operator                   ClusterIP   10.125.4.171    <none>        443/TCP          2d3h
service/kube-prometheus-stack-prometheus                 NodePort    10.125.13.11    <none>        9090:30090/TCP   2d3h
service/kube-prometheus-stack-prometheus-node-exporter   ClusterIP   10.125.10.231   <none>        9100/TCP         2d3h
service/prometheus-operated                              ClusterIP   None            <none>        9090/TCP         2d3h

NAME                                                            DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
daemonset.apps/kube-prometheus-stack-prometheus-node-exporter   17        17        17      17           17          <none>          2d3h

NAME                                                       READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/kube-prometheus-stack-grafana              1/1     1            1           2d3h
deployment.apps/kube-prometheus-stack-kube-state-metrics   1/1     1            1           2d3h
deployment.apps/kube-prometheus-stack-operator             1/1     1            1           2d3h

NAME                                                           READY   AGE
statefulset.apps/prometheus-kube-prometheus-stack-prometheus   1/1     42h

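I realized that the alertmanager pod was missing even though its service existed. I found that I could get the pod back by uninstalling the Prometheus stack, reinstalling it with the default values, and then upgrading it with my own values.

The PrometheusNotConnectedToAlertmanagers alert has now stopped firing, but I am still not receiving emails. I can now reach Alertmanager through the ingress, and I can see that the configuration I put in the Helm values file did not make it through to Alertmanager: it is still running the default configuration.

I found that I was hitting the issue described here, and the logs of the kube-prometheus-stack operator pod confirmed it. I needed to keep a "null" receiver in my Alertmanager receivers (I had removed it).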
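
For anyone hitting the same thing, here is a minimal sketch of what the relevant part of values.yaml might look like with the "null" receiver kept. The SMTP host, addresses, password, and the receiver name "email-team" are placeholders I made up for illustration, not values from my setup. The Watchdog route is written out on the assumption that it mirrors the chart's default route, which (as far as I can tell) is what still references "null" when that receiver is removed.

```yaml
# values.yaml fragment (illustrative sketch, not a verified configuration).
# All hostnames, addresses, credentials and the "email-team" name are placeholders.
alertmanager:
  config:
    global:
      smtp_smarthost: 'smtp.example.com:587'        # placeholder SMTP relay
      smtp_from: 'alertmanager@example.com'         # placeholder sender
      smtp_auth_username: 'alertmanager@example.com'
      smtp_auth_password: 'changeme'                # placeholder; better kept out of plain values
    route:
      receiver: 'email-team'                        # default receiver for all alerts
      routes:
        # Assumed to match the chart default: send Watchdog to the no-op receiver.
        - match:
            alertname: Watchdog
          receiver: 'null'
    receivers:
      # Kept so that any route pointing at 'null' still resolves to a defined receiver.
      - name: 'null'
      - name: 'email-team'
        email_configs:
          - to: 'oncall@example.com'                # placeholder recipient
```

Alertmanager rejects a configuration in which a route names a receiver that is not defined, which would be consistent with the operator failing to reconcile and the default configuration staying in place.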

Source: https://serverfault.com/questions/1062725