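Prometheus Operator is a Prometheus-based Kubernetes monitoring solution developed by CoreOS, and is probably the most full-featured open-source option available today. For more information, see https://github.com/coreos/prometheus-operator
For easier management, create a dedicated namespace named monitoring. All Prometheus Operator components will be deployed into this namespace.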
# kubectl create namespace monitoring
Load prometheus-operator.tar on every node. Download: prometheus-operator.tar
# docker load -i prometheus-operator.tar
All Prometheus Operator components are packaged as a Helm chart, which makes installation and deployment straightforward.
# helm install --name prometheus-operator --namespace=monitoring stable/prometheus-operator
# kubectl get all -n monitoring
NAME READY STATUS RESTARTS AGE
pod/alertmanager-prometheus-operator-alertmanager-0 2/2 Running 0 60s
pod/prometheus-operator-grafana-6c8f4bcfb4-jp5bh 3/3 Running 0 65s
pod/prometheus-operator-kube-state-metrics-6b6d6b8bbd-gff7j 1/1 Running 0 65s
pod/prometheus-operator-operator-76f78fd685-295rb 1/1 Running 0 65s
pod/prometheus-operator-prometheus-node-exporter-44tgz 1/1 Running 0 65s
pod/prometheus-operator-prometheus-node-exporter-6t4sc 1/1 Running 0 65s
pod/prometheus-operator-prometheus-node-exporter-vnwrv 1/1 Running 0 65s
pod/prometheus-prometheus-operator-prometheus-0 3/3 Running 1 54s
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/alertmanager-operated ClusterIP None <none> 9093/TCP,6783/TCP 60s
service/prometheus-operated ClusterIP None <none> 9090/TCP 54s
service/prometheus-operator-alertmanager ClusterIP 10.105.62.219 <none> 9093/TCP 65s
service/prometheus-operator-grafana ClusterIP 10.103.30.59 <none> 80/TCP 65s
service/prometheus-operator-kube-state-metrics ClusterIP 10.105.189.63 <none> 8080/TCP 65s
service/prometheus-operator-operator ClusterIP 10.105.212.90 <none> 8080/TCP 65s
service/prometheus-operator-prometheus ClusterIP 10.104.229.158 <none> 9090/TCP 65s
service/prometheus-operator-prometheus-node-exporter ClusterIP 10.103.226.249 <none> 9100/TCP 65s
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
daemonset.apps/prometheus-operator-prometheus-node-exporter 3 3 3 3 3 <none> 65s
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/prometheus-operator-grafana 1/1 1 1 65s
deployment.apps/prometheus-operator-kube-state-metrics 1/1 1 1 65s
deployment.apps/prometheus-operator-operator 1/1 1 1 65s
NAME DESIRED CURRENT READY AGE
replicaset.apps/prometheus-operator-grafana-6c8f4bcfb4 1 1 1 65s
replicaset.apps/prometheus-operator-kube-state-metrics-6b6d6b8bbd 1 1 1 65s
replicaset.apps/prometheus-operator-operator-76f78fd685 1 1 1 65s
NAME READY AGE
statefulset.apps/alertmanager-prometheus-operator-alertmanager 1/1 60s
statefulset.apps/prometheus-prometheus-operator-prometheus 1/1 54s
# helm list
NAME REVISION UPDATED STATUS CHART APP VERSION NAMESPACE
prometheus-operator 1 Tue Jan 8 13:49:12 2019 DEPLOYED prometheus-operator-1.5.1 0.26.0 monitoring
The prometheus-operator chart automatically installs Prometheus, Alertmanager, and Grafana.
# kubectl get svc -n monitoring
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
alertmanager-operated ClusterIP None <none> 9093/TCP,6783/TCP 7m30s
prometheus-operated ClusterIP None <none> 9090/TCP 7m24s
prometheus-operator-alertmanager ClusterIP 10.105.62.219 <none> 9093/TCP 7m35s
prometheus-operator-grafana ClusterIP 10.103.30.59 <none> 80/TCP 7m35s
prometheus-operator-kube-state-metrics ClusterIP 10.105.189.63 <none> 8080/TCP 7m35s
prometheus-operator-operator ClusterIP 10.105.212.90 <none> 8080/TCP 7m35s
prometheus-operator-prometheus ClusterIP 10.104.229.158 <none> 9090/TCP 7m35s
prometheus-operator-prometheus-node-exporter ClusterIP 10.103.226.249 <none> 9100/TCP 7m35s
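The default Service type is ClusterIP, which is reachable only from inside the cluster and not from outside. To expose the web UIs, change the type of the following three Services to NodePort.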
grafana:
# kubectl edit svc prometheus-operator-grafana -n monitoring
……
spec:
  clusterIP: 10.103.30.59
  ports:
  - name: service
    port: 80
    protocol: TCP
    targetPort: 3000
  selector:
    app: grafana
    release: prometheus-operator
  sessionAffinity: None
  type: NodePort  # change this line
alertmanager:
# kubectl edit svc prometheus-operator-alertmanager -n monitoring
……
spec:
  clusterIP: 10.105.62.219
  ports:
  - name: web
    port: 9093
    protocol: TCP
    targetPort: 9093
  selector:
    alertmanager: prometheus-operator-alertmanager
    app: alertmanager
  sessionAffinity: None
  type: NodePort  # change this line
status:
  loadBalancer: {}
prometheus:
# kubectl edit svc prometheus-operator-prometheus -n monitoring
……
spec:
  clusterIP: 10.104.229.158
  ports:
  - name: web
    port: 9090
    protocol: TCP
    targetPort: web
  selector:
    app: prometheus
    prometheus: prometheus-operator-prometheus
  sessionAffinity: None
  type: NodePort  # change this line
status:
  loadBalancer: {}
# kubectl get svc -n monitoring
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
alertmanager-operated ClusterIP None <none> 9093/TCP,6783/TCP 23m
prometheus-operated ClusterIP None <none> 9090/TCP 23m
prometheus-operator-alertmanager NodePort 10.105.62.219 <none> 9093:32645/TCP 23m
prometheus-operator-grafana NodePort 10.103.30.59 <none> 80:30043/TCP 23m
prometheus-operator-kube-state-metrics ClusterIP 10.105.189.63 <none> 8080/TCP 23m
prometheus-operator-operator ClusterIP 10.105.212.90 <none> 8080/TCP 23m
prometheus-operator-prometheus NodePort 10.104.229.158 <none> 9090:32275/TCP 23m
prometheus-operator-prometheus-node-exporter ClusterIP 10.103.226.249 <none> 9100/TCP 23m
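The same change can also be made non-interactively with kubectl patch instead of kubectl edit. The following is a minimal sketch, assuming the three Service names shown in the output above. The run helper and the DRY_RUN variable are hypothetical names introduced here for illustration: by default the commands are only printed, and they are executed only when DRY_RUN=0 is set.

```shell
#!/bin/sh
# Hypothetical helper: print the command when DRY_RUN is 1 (the default),
# execute it otherwise.
DRY_RUN=${DRY_RUN:-1}
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Service names are taken from the kubectl output above.
for svc in prometheus-operator-grafana \
           prometheus-operator-alertmanager \
           prometheus-operator-prometheus; do
  # Patch only spec.type; Kubernetes allocates the node port itself.
  run kubectl patch svc "$svc" -n monitoring -p '{"spec":{"type":"NodePort"}}'
done
```

Running the script once with the default settings lets you review the three commands before applying them with DRY_RUN=0.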
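Prometheus scrapes kubelet metrics from port 10255. This port is closed by default, which leaves unhealthy targets in Prometheus, as shown in the figure below:
To open the read-only port, edit /var/lib/kubelet/config.yaml on every node and add the following: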
# /var/lib/kubelet/config.yaml
……
oomScoreAdj: -999
podPidsLimit: -1
port: 10250
readOnlyPort: 10255  # add this line
registryBurst: 10
registryPullQPS: 5
resolvConf: /etc/resolv.conf
Restart the kubelet service:
# systemctl restart kubelet.service
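On clusters with many nodes, the config edit can be scripted rather than done by hand. The following is a minimal sketch, assuming the kubelet config path used above. The function name enable_read_only_port is hypothetical. It appends the key only when no top-level readOnlyPort line exists yet; if the key is already present with another value (such as 0), it is left untouched and must be edited manually.

```shell
#!/bin/sh
# Hypothetical helper: add "readOnlyPort: 10255" to a kubelet config file
# unless a top-level readOnlyPort key is already present.
enable_read_only_port() {
  config=$1
  if grep -q '^readOnlyPort:' "$config"; then
    echo "readOnlyPort already set in $config"
  else
    # The leading newline guards against a file without a trailing newline.
    printf '\nreadOnlyPort: 10255\n' >> "$config"
    echo "readOnlyPort added to $config"
  fi
}

# Usage on a node (as root), followed by the kubelet restart shown above:
#   enable_read_only_port /var/lib/kubelet/config.yaml
#   systemctl restart kubelet.service
```

The check makes the function safe to run more than once on the same node.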
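Check the Prometheus targets again.
CoreDNS metrics are not scraped. For details on this issue, see: Don't scrape metrics from coreDNS. The fix is to change the selector of the prometheus-operator-coredns Service to kube-dns: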
# kubectl edit svc prometheus-operator-coredns -n kube-system
……
spec:
  clusterIP: None
  ports:
  - name: http-metrics
    port: 9153
    protocol: TCP
    targetPort: 9153
  selector:
    k8s-app: kube-dns  # change this line
  sessionAffinity: None
  type: ClusterIP
Prometheus scrapes etcd metrics on port 4001, but etcd listens on port 2379 by default. The fix is as follows:
# vim /etc/kubernetes/manifests/etcd.yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    scheduler.alpha.kubernetes.io/critical-pod: ""
  creationTimestamp: null
  labels:
    k8s-app: etcd-server  # add this line
    component: etcd
    tier: control-plane
  name: etcd
  namespace: kube-system
spec:
  containers:
  - command:
    - etcd
    - --advertise-client-urls=https://172.20.6.116:2379
    - --cert-file=/etc/kubernetes/pki/etcd/server.crt
    - --client-cert-auth=true
    - --data-dir=/var/lib/etcd
    - --initial-advertise-peer-urls=https://172.20.6.116:2380
    - --initial-cluster=k8s-master=https://172.20.6.116:2380
    - --key-file=/etc/kubernetes/pki/etcd/server.key
    - --listen-client-urls=https://127.0.0.1:2379,https://172.20.6.116:2379,http://172.20.6.116:4001  # add an HTTP listener on port 4001
    - --listen-peer-urls=https://172.20.6.116:2380
    - --name=k8s-master
    - --peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt
    - --peer-client-cert-auth=true
    - --peer-key-file=/etc/kubernetes/pki/etcd/peer.key
    - --peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
    - --snapshot-count=10000
    - --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
Then restart the kubelet service:
# systemctl restart kubelet.service
Because kube-controller-manager and kube-scheduler listen on 127.0.0.1 by default, Prometheus cannot scrape them through the node address. Change the listen address of both components as follows.
kube-controller-manager:
# vim /etc/kubernetes/manifests/kube-controller-manager.yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    scheduler.alpha.kubernetes.io/critical-pod: ""
  creationTimestamp: null
  labels:
    k8s-app: kube-controller-manager  # add this line
    component: kube-controller-manager
    tier: control-plane
  name: kube-controller-manager
  namespace: kube-system
spec:
  containers:
  - command:
    - kube-controller-manager
    - --address=0.0.0.0  # change the listen address
    - --allocate-node-cidrs=true
kube-scheduler:
# vim /etc/kubernetes/manifests/kube-scheduler.yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    scheduler.alpha.kubernetes.io/critical-pod: ""
  creationTimestamp: null
  labels:
    k8s-app: kube-scheduler  # add this line
    component: kube-scheduler
    tier: control-plane
  name: kube-scheduler
  namespace: kube-system
spec:
  containers:
  - command:
    - kube-scheduler
    - --address=0.0.0.0  # change the listen address
    - --kubeconfig=/etc/kubernetes/scheduler.conf
    - --leader-elect=true
Then restart the kubelet service:
# systemctl restart kubelet.service
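To confirm that both manifests carry the new listen address, a quick check can be scripted on the master node. The following is a minimal sketch, assuming the manifest paths used above. The function name manifest_has_flag is hypothetical, and the check only looks for the flag text in the file; it does not verify that the component is actually listening.

```shell
#!/bin/sh
# Hypothetical helper: succeed if a static Pod manifest contains the flag verbatim.
manifest_has_flag() {
  # -F: fixed string, -q: quiet, --: the pattern itself starts with dashes
  grep -Fq -- "$2" "$1" 2>/dev/null
}

for name in kube-controller-manager kube-scheduler; do
  file=/etc/kubernetes/manifests/$name.yaml
  if manifest_has_flag "$file" '--address=0.0.0.0'; then
    echo "$name: listen address is 0.0.0.0"
  else
    echo "$name: --address=0.0.0.0 not found in $file"
  fi
done
```

A missing or unreadable manifest is reported the same way as a missing flag, so a "not found" message is worth checking by hand.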