Prometheus not connected to Alertmanager in GKE
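I installed kube-prometheus-stack 15.3.1 into a GKE cluster with helm (in the "monitoring" namespace). I used values.yaml to turn on ingress for some components and to add SMTP settings and receiver details to Alertmanager. For the most part everything seems fine, except that Prometheus is firing a number of alerts and I am not receiving any alert emails. One firing alert is: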
PrometheusNotConnectedToAlertmanagers
Prometheus monitoring/prometheus-kube-prometheus-stack-prometheus-0 is not connected to any Alertmanagers
Another is:
PrometheusOperatorSyncFailed
Controller alertmanager in monitoring namespace fails to reconcile 1 objects.
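I also tried turning on ingress for Alertmanager and pointing alerts.mydomain.com at it, but any GET request I try (for example alerts.mydomain.com/v2/status) always returns a 502 server error. What do I need to do to get my alertmanager working?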
Here is the output of
kubectl get pods,svc,daemonset,deployment,statefulset -n monitoring
:
NAME                                                            READY   STATUS    RESTARTS   AGE
pod/kube-prometheus-stack-grafana-58f7fcb497-hm72h              2/2     Running   0          30h
pod/kube-prometheus-stack-kube-state-metrics-6d588499f5-d957b   1/1     Running   0          2d3h
pod/kube-prometheus-stack-operator-54f89674c9-k8ml7             1/1     Running   0          2d3h
pod/kube-prometheus-stack-prometheus-node-exporter-22vpd        1/1     Running   0          3h57m
pod/kube-prometheus-stack-prometheus-node-exporter-2qsl9        1/1     Running   0          3h57m
pod/kube-prometheus-stack-prometheus-node-exporter-4d27n        1/1     Running   0          7h36m
pod/kube-prometheus-stack-prometheus-node-exporter-7rlnk        1/1     Running   0          4h47m
pod/kube-prometheus-stack-prometheus-node-exporter-7xlf4        1/1     Running   0          4h51m
pod/kube-prometheus-stack-prometheus-node-exporter-9mfnt        1/1     Running   0          3h57m
pod/kube-prometheus-stack-prometheus-node-exporter-9zblf        1/1     Running   0          2d3h
pod/kube-prometheus-stack-prometheus-node-exporter-bdcjj        1/1     Running   0          2d3h
pod/kube-prometheus-stack-prometheus-node-exporter-bs54w        1/1     Running   0          4h47m
pod/kube-prometheus-stack-prometheus-node-exporter-fp95h        1/1     Running   0          2d3h
pod/kube-prometheus-stack-prometheus-node-exporter-h4zhw        1/1     Running   0          2d3h
pod/kube-prometheus-stack-prometheus-node-exporter-pz8js        1/1     Running   0          3h58m
pod/kube-prometheus-stack-prometheus-node-exporter-rrrhk        1/1     Running   0          27h
pod/kube-prometheus-stack-prometheus-node-exporter-rszlt        1/1     Running   0          2d3h
pod/kube-prometheus-stack-prometheus-node-exporter-s62wq        1/1     Running   0          4h47m
pod/kube-prometheus-stack-prometheus-node-exporter-w9dmb        1/1     Running   0          5h32m
pod/kube-prometheus-stack-prometheus-node-exporter-xqmxk        1/1     Running   0          4h51m
pod/prometheus-kube-prometheus-stack-prometheus-0               2/2     Running   1          30h

NAME                                                     TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)          AGE
service/kube-prometheus-stack-alertmanager               NodePort    10.125.4.161    <none>        9093:30903/TCP   2d3h
service/kube-prometheus-stack-grafana                    NodePort    10.125.7.177    <none>        80:32444/TCP     2d3h
service/kube-prometheus-stack-kube-state-metrics         ClusterIP   10.125.2.56     <none>        8080/TCP         2d3h
service/kube-prometheus-stack-operator                   ClusterIP   10.125.4.171    <none>        443/TCP          2d3h
service/kube-prometheus-stack-prometheus                 NodePort    10.125.13.11    <none>        9090:30090/TCP   2d3h
service/kube-prometheus-stack-prometheus-node-exporter   ClusterIP   10.125.10.231   <none>        9100/TCP         2d3h
service/prometheus-operated                              ClusterIP   None            <none>        9090/TCP         2d3h

NAME                                                            DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
daemonset.apps/kube-prometheus-stack-prometheus-node-exporter   17        17        17      17           17          <none>          2d3h

NAME                                                       READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/kube-prometheus-stack-grafana              1/1     1            1           2d3h
deployment.apps/kube-prometheus-stack-kube-state-metrics   1/1     1            1           2d3h
deployment.apps/kube-prometheus-stack-operator             1/1     1            1           2d3h

NAME                                                           READY   AGE
statefulset.apps/prometheus-kube-prometheus-stack-prometheus   1/1     42h
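I realized that the alertmanager pod was missing even though its service was there. I found I could get the pod back by uninstalling the prometheus stack, reinstalling it with the default values, and then upgrading it with my own values.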
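Now the PrometheusNotConnectedToAlertmanagers alert has stopped firing, but I am still not receiving emails. I can now reach Alertmanager through the ingress, and I can see that the config I put in my Helm values file never made it through to Alertmanager: it still has the default config.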
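I found that I was hitting the issue described here, and checking the logs in the kube-prometheus-stack operator pod confirmed it. I needed to have a "null" receiver in my Alertmanager receivers (I had removed it).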
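For anyone running into the same thing, here is a rough sketch of what the relevant part of values.yaml might look like with the "null" receiver kept alongside an email receiver. The SMTP host, addresses, credentials, and the receiver name "email" are placeholders, not my real settings. It also assumes the chart's default route still sends the Watchdog alert to 'null', which I believe is why the operator rejected the config once that receiver was gone.

```yaml
alertmanager:
  config:
    global:
      resolve_timeout: 5m
      # Placeholder SMTP settings; substitute your own server and account.
      smtp_smarthost: 'smtp.example.com:587'
      smtp_from: 'alertmanager@example.com'
      smtp_auth_username: 'alertmanager@example.com'
      smtp_auth_password: 'changeme'
    route:
      group_by: ['job']
      # Hypothetical receiver name; it must match an entry under receivers.
      receiver: 'email'
      routes:
        # Assumed to mirror the chart default: Watchdog goes to 'null'.
        - match:
            alertname: Watchdog
          receiver: 'null'
    receivers:
      # Keep this entry: routes that still reference 'null' need it to exist.
      - name: 'null'
      - name: 'email'
        email_configs:
          - to: 'oncall@example.com'
```

The key point is that every receiver named anywhere under route has to appear in the receivers list, so the 'null' entry can only be dropped if nothing routes to it anymore.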