Solution | Supported | Advantages | Monitoring scope |
---|---|---|---|
cAdvisor+Heapster+InfluxDB+Grafana | Y | Simple | Container monitoring |
cAdvisor/exporter+Prometheus+Grafana | Y | Good extensibility | Full coverage of containers, applications, and hosts |
Prometheus + Grafana is a rising star among monitoring and alerting solutions.
Various exporters collect metrics across different dimensions and expose them in the data format Prometheus understands; Prometheus pulls the data on a schedule, Grafana visualizes it, and Alertmanager sends alerts when something goes wrong.
- cAdvisor collects container- and Pod-level performance metrics; Prometheus scrapes them from the exposed /metrics endpoint.
- prometheus-node-exporter collects host performance metrics; Prometheus scrapes them from the exposed /metrics endpoint.
- Applications expose their own in-process metrics (implementing the metrics endpoint is the application's responsibility; the application adds the annotation agreed with the platform side, and the platform side configures Prometheus scraping based on that annotation; see the scrape-config sketch after this list).
- kube-state-metrics collects state metrics for Kubernetes resource objects; Prometheus scrapes them from the exposed /metrics endpoint.
- etcd, kubelet, kube-apiserver, kube-controller-manager, and kube-scheduler expose their own /metrics endpoints, which provide cluster-level metrics for each node.
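As an illustration of the annotation-driven scraping mentioned above, here is a minimal sketch of a Prometheus scrape job that discovers Pods via the Kubernetes API and keeps only those carrying the conventional `prometheus.io/scrape: "true"` annotation. The annotation names follow the common convention and are an assumption here; your platform may agree on different ones.

```yaml
# Hypothetical scrape job: keep only Pods that opted in via annotations
- job_name: kubernetes-pods
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  # Keep Pods annotated with prometheus.io/scrape: "true"
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
    action: keep
    regex: true
  # Let Pods override the metrics path via prometheus.io/path
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
    action: replace
    target_label: __metrics_path__
    regex: (.+)
  # Let Pods declare the metrics port via prometheus.io/port
  - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
    action: replace
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2
    target_label: __address__
  # Record namespace and Pod name on the scraped series
  - source_labels: [__meta_kubernetes_namespace]
    action: replace
    target_label: kubernetes_namespace
  - source_labels: [__meta_kubernetes_pod_name]
    action: replace
    target_label: kubernetes_pod_name
```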
Implementation approach
Metric | Implementation | Examples |
---|---|---|
Pod performance | cAdvisor | Container CPU and memory utilization |
Node performance | node-exporter | Node CPU and memory utilization |
K8s resource objects | kube-state-metrics | Pod/Deployment/Service |
Official site: https://prometheus.io
Download the YAML files: https://github.com/kubernetes/kubernetes/tree/master/cluster/addons/prometheus
Modify the YAML files
```
# Use NFS storage
[root@localhost prometheus]# kubectl get storageclass
NAME                  PROVISIONER      AGE
managed-nfs-storage   fuseim.pri/ifs   9d
[root@localhost prometheus]# sed -i s/standard/managed-nfs-storage/ prometheus-statefulset.yaml

# Change the Service to NodePort
[root@localhost prometheus]# vim prometheus-service.yaml
....
spec:
  type: NodePort
  ports:
  - name: http
    port: 9090
    protocol: TCP
    targetPort: 9090
  selector:
    k8s-app: prometheus
```
Start Prometheus
```
[root@localhost prometheus]# kubectl apply -f prometheus-rbac.yaml
serviceaccount/prometheus created
clusterrole.rbac.authorization.k8s.io/prometheus created
clusterrolebinding.rbac.authorization.k8s.io/prometheus created
[root@localhost prometheus]# kubectl apply -f prometheus-configmap.yaml
configmap/prometheus-config created
[root@localhost prometheus]# kubectl apply -f prometheus-statefulset.yaml
statefulset.apps/prometheus created
[root@localhost prometheus]# vim prometheus-service.yaml
[root@localhost prometheus]# kubectl apply -f prometheus-service.yaml
service/prometheus created
```
Check the deployment
```
[root@localhost prometheus]# kubectl get pod,svc -n kube-system
NAME                                        READY   STATUS    RESTARTS   AGE
pod/coredns-5b8c57999b-z9jh8                1/1     Running   1          16d
pod/kubernetes-dashboard-644c96f9c6-bvw8w   1/1     Running   1          16d
pod/prometheus-0                            2/2     Running   0          2m40s

NAME                           TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)          AGE
service/kube-dns               ClusterIP   10.0.0.2     <none>        53/UDP,53/TCP    16d
service/kubernetes-dashboard   NodePort    10.0.0.84    <none>        443:30001/TCP    16d
service/prometheus             NodePort    10.0.0.89    <none>        9090:41782/TCP   39s

[root@localhost prometheus]# kubectl get pv,pvc -n kube-system
NAME                                                                                                 CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                                      STORAGECLASS          REASON   AGE
persistentvolume/kube-system-prometheus-data-prometheus-0-pvc-0e92f36c-8d9e-11e9-b018-525400828c1f   16Gi       RWO            Delete           Bound    kube-system/prometheus-data-prometheus-0   managed-nfs-storage            25m

NAME                                                 STATUS   VOLUME                                                                               CAPACITY   ACCESS MODES   STORAGECLASS          AGE
persistentvolumeclaim/prometheus-data-prometheus-0   Bound    kube-system-prometheus-data-prometheus-0-pvc-0e92f36c-8d9e-11e9-b018-525400828c1f   16Gi       RWO            managed-nfs-storage   25m
```
Access the Prometheus web UI at http://NodeIP:41782 (the NodePort shown above).
```
[root@localhost prometheus]# cat grafana.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: grafana
  namespace: kube-system
spec:
  serviceName: "grafana"
  replicas: 1
  selector:
    matchLabels:
      app: grafana
  template:
    metadata:
      labels:
        app: grafana
    spec:
      containers:
      - name: grafana
        image: grafana/grafana
        resources:
          limits:
            cpu: 100m
            memory: 256Mi
          requests:
            cpu: 100m
            memory: 256Mi
        volumeMounts:
        - name: grafana-data
          mountPath: /var/lib/grafana
          subPath: grafana
      securityContext:
        fsGroup: 472
        runAsUser: 472
  volumeClaimTemplates:
  - metadata:
      name: grafana-data
    spec:
      storageClassName: managed-nfs-storage
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: "1Gi"
---
apiVersion: v1
kind: Service
metadata:
  name: grafana
  namespace: kube-system
spec:
  type: NodePort
  ports:
  - port: 80
    targetPort: 3000
    nodePort: 30007
  selector:
    app: grafana

[root@localhost prometheus]# kubectl apply -f grafana.yaml
statefulset.apps/grafana created
service/grafana created

[root@localhost prometheus]# kubectl get pod,svc -n kube-system
NAME                                        READY   STATUS    RESTARTS   AGE
pod/coredns-5b8c57999b-z9jh8                1/1     Running   1          17d
pod/grafana-0                               1/1     Running   0          45s
pod/kubernetes-dashboard-644c96f9c6-bvw8w   1/1     Running   1          17d
pod/prometheus-0                            2/2     Running   0          25h

NAME                           TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)          AGE
service/grafana                NodePort    10.0.0.78    <none>        80:30007/TCP     44s
service/kube-dns               ClusterIP   10.0.0.2     <none>        53/UDP,53/TCP    17d
service/kubernetes-dashboard   NodePort    10.0.0.84    <none>        443:30001/TCP    17d
service/prometheus             NodePort    10.0.0.89    <none>        9090:41782/TCP   25h
```
Access the Grafana web UI at http://NodeIP:30007 (the NodePort shown above).
The kubelet on each node uses the metrics interface provided by cAdvisor to obtain performance metrics for all containers on that node.
Exposed endpoints (a scrape-config sketch follows the list):
- http://NodeIP:10255/metrics/cadvisor (kubelet read-only port)
- https://NodeIP:10250/metrics/cadvisor (kubelet secure port)
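A minimal sketch of the corresponding scrape job, assuming Prometheus runs inside the cluster with a service account that is allowed to reach the kubelet; the addon's prometheus-configmap.yaml already ships a similar job, so treat this as an illustration rather than a verbatim copy:

```yaml
- job_name: kubernetes-nodes-cadvisor
  scheme: https
  metrics_path: /metrics/cadvisor      # cAdvisor metrics exposed through the kubelet
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    insecure_skip_verify: true
  kubernetes_sd_configs:
  - role: node                         # discovers every node; address defaults to NodeIP:10250
  relabel_configs:
  - action: labelmap                   # carry the node's Kubernetes labels onto the series
    regex: __meta_kubernetes_node_label_(.+)
```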
Import a Grafana template
https://grafana.com/grafana/download
Cluster resource monitoring: 3119
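If you prefer to build panels by hand instead of importing dashboard 3119, typical queries over cAdvisor's standard metrics look like the sketch below; depending on the Kubernetes version the pod label is `pod_name` or `pod`, and the `image!=""` filter simply drops the aggregate/pause series:

```promql
# Per-pod CPU usage in cores, averaged over 5 minutes
sum(rate(container_cpu_usage_seconds_total{image!=""}[5m])) by (pod_name)

# Per-pod working-set memory in bytes
sum(container_memory_working_set_bytes{image!=""}) by (pod_name)
```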
Documentation: https://prometheus.io/docs/guides/node-exporter/
GitHub: https://github.com/prometheus/node_exporter
Exporter list: https://prometheus.io/docs/instrumenting/exporters/
Deploy node_exporter on all nodes
```
wget https://github.com/prometheus/node_exporter/releases/download/v0.17.0/node_exporter-0.17.0.linux-amd64.tar.gz

tar zxf node_exporter-0.17.0.linux-amd64.tar.gz
mv node_exporter-0.17.0.linux-amd64 /usr/local/node_exporter

cat <<EOF >/usr/lib/systemd/system/node_exporter.service
[Unit]
Description=https://prometheus.io

[Service]
Restart=on-failure
ExecStart=/usr/local/node_exporter/node_exporter --collector.systemd --collector.systemd.unit-whitelist=(docker|kubelet|kube-proxy|flanneld).service

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable node_exporter
systemctl restart node_exporter
```
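After the service starts, you can quickly confirm that metrics are being exposed; node_exporter listens on port 9100 by default:

```bash
systemctl status node_exporter                 # service should be active (running)
curl -s http://localhost:9100/metrics | head   # sample the exposed metrics
```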
Modify prometheus-configmap.yaml to add a scrape job for the node exporters, then redeploy (a sketch of the job to add follows).
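A minimal sketch of the job to add under `scrape_configs` in prometheus-configmap.yaml; the target addresses below are placeholders for your own node IPs, and node_exporter's default port is 9100:

```yaml
- job_name: kubernetes-nodes
  static_configs:
  - targets:
    - 192.168.1.10:9100   # placeholder; replace with your node IPs
    - 192.168.1.11:9100
```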
Check whether Prometheus is now collecting the kubernetes-nodes targets.
Import a Grafana template
Cluster resource monitoring: 9276
https://github.com/kubernetes/kube-state-metrics
kube-state-metrics is a simple service that listens to the Kubernetes API server and generates metrics about the state of objects. It is not focused on the health of individual Kubernetes components, but on the health of the various objects inside the cluster, such as Deployments, Nodes, and Pods.
```
[root@localhost prometheus]# kubectl apply -f kube-state-metrics-rbac.yaml
serviceaccount/kube-state-metrics created
clusterrole.rbac.authorization.k8s.io/kube-state-metrics created
role.rbac.authorization.k8s.io/kube-state-metrics-resizer created
clusterrolebinding.rbac.authorization.k8s.io/kube-state-metrics created
rolebinding.rbac.authorization.k8s.io/kube-state-metrics created
[root@localhost prometheus]# vim kube-state-metrics-deployment.yaml
[root@localhost prometheus]# kubectl apply -f kube-state-metrics-deployment.yaml
deployment.apps/kube-state-metrics created
configmap/kube-state-metrics-config created
[root@localhost prometheus]# kubectl apply -f kube-state-metrics-service.yaml
service/kube-state-metrics created
```
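Once the deployment above is scraped by Prometheus, the object-state metrics can be queried directly. A few illustrative queries, using standard kube-state-metrics metric names:

```promql
# Pods per namespace that are not in the Running phase
sum(kube_pod_status_phase{phase!="Running"}) by (namespace, phase)

# Deployments with unavailable replicas
kube_deployment_status_replicas_unavailable > 0

# Nodes not reporting a Ready condition
kube_node_status_condition{condition="Ready", status="true"} == 0
```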
Import a Grafana template
Cluster resource monitoring: 6417
Deploy Alertmanager
```
[root@localhost prometheus]# sed -i s/standard/managed-nfs-storage/ alertmanager-pvc.yaml
[root@localhost prometheus]# kubectl apply -f alertmanager-configmap.yaml
configmap/alertmanager-config created
[root@localhost prometheus]# kubectl apply -f alertmanager-pvc.yaml
persistentvolumeclaim/alertmanager created
[root@localhost prometheus]# kubectl apply -f alertmanager-deployment.yaml
deployment.apps/alertmanager created
[root@localhost prometheus]# kubectl apply -f alertmanager-service.yaml
service/alertmanager created

[root@localhost prometheus]# kubectl get pod -n kube-system
NAME                                    READY   STATUS    RESTARTS   AGE
alertmanager-6b5bbd5bd4-lgjn8           2/2     Running   0          95s
coredns-5b8c57999b-z9jh8                1/1     Running   1          20d
grafana-0                               1/1     Running   3          2d22h
kube-state-metrics-f86fd9f4f-j4rdc      2/2     Running   0          3h2m
kubernetes-dashboard-644c96f9c6-bvw8w   1/1     Running   1          20d
prometheus-0                            2/2     Running   0          4d
```
Configure Prometheus to talk to Alertmanager
```
[root@localhost prometheus]# vim prometheus-configmap.yaml
....
    alerting:
      alertmanagers:
      - static_configs:
        - targets: ["alertmanager:80"]
[root@localhost prometheus]# kubectl apply -f prometheus-configmap.yaml
configmap/prometheus-config configured
```
Configure alerting
Point Prometheus at the rules directory (a sketch of the rule_files setting follows).
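A hedged sketch of the relevant setting inside prometheus-configmap.yaml; the addon config typically already contains a `rule_files` entry, and the path must match the ConfigMap mount shown a few steps below:

```yaml
rule_files:
- /etc/config/rules/*.rules   # matches the /etc/config/rules mount added later
```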
Store the alert rules in a ConfigMap
```
[root@localhost prometheus]# cat prometheus-rules.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-rules
  namespace: kube-system
data:
  general.rules: |
    groups:
    - name: general.rules
      rules:
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: error
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute."
  node.rules: |
    groups:
    - name: node.rules
      rules:
      - alert: NodeFilesystemUsage
        expr: 100 - (node_filesystem_free_bytes{fstype=~"ext4|xfs"} / node_filesystem_size_bytes{fstype=~"ext4|xfs"} * 100) > 80
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }}: {{ $labels.mountpoint }} partition usage is too high"
          description: "{{ $labels.instance }}: {{ $labels.mountpoint }} partition usage is above 80% (current value: {{ $value }})"
      - alert: NodeMemoryUsage
        expr: 100 - (node_memory_MemFree_bytes+node_memory_Cached_bytes+node_memory_Buffers_bytes) / node_memory_MemTotal_bytes * 100 > 80
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }} memory usage is too high"
          description: "{{ $labels.instance }} memory usage is above 80% (current value: {{ $value }})"
      - alert: NodeCPUUsage
        expr: 100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100) > 60
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }} CPU usage is too high"
          description: "{{ $labels.instance }} CPU usage is above 60% (current value: {{ $value }})"

[root@localhost prometheus]# kubectl apply -f prometheus-rules.yaml
configmap/prometheus-rules created
```
Mount the ConfigMap into the container's rules directory
```
[root@localhost prometheus]# vim prometheus-statefulset.yaml
......
        volumeMounts:
        - name: config-volume
          mountPath: /etc/config
        - name: prometheus-data
          mountPath: /data
          subPath: ""
        - name: prometheus-rules
          mountPath: /etc/config/rules
      terminationGracePeriodSeconds: 300
      volumes:
      - name: config-volume
        configMap:
          name: prometheus-config
      - name: prometheus-rules
        configMap:
          name: prometheus-rules
......
```
Add the Alertmanager notification configuration
```
[root@localhost prometheus]# cat alertmanager-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: alertmanager-config
  namespace: kube-system
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: EnsureExists
data:
  alertmanager.yml: |
    global:
      resolve_timeout: 5m
      smtp_smarthost: 'smtp.163.com:25'
      smtp_from: 'xxxxx@163.com'
      smtp_auth_username: 'xxxxx@163.com'
      smtp_auth_password: 'xxxxx'

    receivers:
    - name: default-receiver
      email_configs:
      - to: "xxxxx@qq.com"

    route:
      group_interval: 1m
      group_wait: 10s
      receiver: default-receiver
      repeat_interval: 1m

[root@localhost prometheus]# kubectl apply -f alertmanager-configmap.yaml
configmap/alertmanager-config configured
```
Email alerts