最近需要使用prometheus监控kubernetes环境下的一些pod状态,定义了一些alert,分享一下:
sum(kube_pod_container_status_restarts_total{namespace="your_service_ns"}) by (cluster, namespace, pod, container) > 10
increase(kube_pod_container_status_restarts_total{namespace="your_service_ns"}[1m]) > 3
kube_pod_container_status_terminated_reason{reason=~"OOMKilled|Error|ContainerCannotRun", namespace="your_service_ns"} > 0
min_over_time(sum by (cluster, namespace, pod, container) (kube_pod_status_phase{phase=~"Pending|Unknown|Failed",namespace="your_service_ns"})[15m:1m]) > 0
kube_deployment_status_replicas_available{namespace="your_service_ns"} != kube_deployment_spec_replicas{namespace="your_service_ns"}
kube_statefulset_status_replicas_available{namespace="your_service_ns"} != kube_statefulset_replicas{namespace="your_service_ns"}
LEo at 00:12