
Collecting Monitoring Metrics from Kubernetes Components

Author: 洗尽了浮华
Published: 2019-05-25 16:51:05
Column: 散尽浮华

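A Kubernetes cluster has been deployed in production, and the health of its components needs to be monitored in Zabbix. The component metrics fall into two groups: fixed metrics and dynamic metrics. Fixed metrics can be read from each component's metrics endpoint on the command line and are picked up in Zabbix through "auto discovery". Dynamic metrics are collected by a Python script that returns a JSON string; in Zabbix they are handled by adding a template or by configuring a discovery rule on the host.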

I. Fixed metric collection (Zabbix auto discovery; recommended collection interval: 5 min)

1. Master metrics [Scope: the 3 nodes of the Master cluster; in the test environment these are 192.168.10.93/94/95]

1) Metric key: kube_apiserver_process_cpu_seconds_total
Example command: curl -s --cacert kubernetes-ca/ca.pem --cert kubernetes-ca/admin.pem --key kubernetes-ca/admin-key.pem https://192.168.10.93:6443/metrics | grep process_cpu_seconds_total | grep -v '#' | awk '{print $2}'

2) Metric key: kube_apiserver_process_open_fds
Example command: curl -s --cacert kubernetes-ca/ca.pem --cert kubernetes-ca/admin.pem --key kubernetes-ca/admin-key.pem https://192.168.10.93:6443/metrics | grep process_open_fds | grep -v '#' | awk '{print $2}'

3) Metric key: kube_apiserver_process_virtual_memory_bytes
Example command: curl -s --cacert kubernetes-ca/ca.pem --cert kubernetes-ca/admin.pem --key kubernetes-ca/admin-key.pem https://192.168.10.93:6443/metrics | grep process_virtual_memory_bytes | grep -v '#' | awk '{print $2}'

4) Metric key: kube_apiserver_rest_client_requests_total_200_put
Example command: curl -s --cacert kubernetes-ca/ca.pem --cert kubernetes-ca/admin.pem --key kubernetes-ca/admin-key.pem https://192.168.10.93:6443/metrics | grep rest_client_requests_total | grep -v '#' | grep PUT | grep 200 | awk '{print $2}'

5) Metric key: kube_apiserver_rest_client_requests_total_200_get
Example command: curl -s --cacert kubernetes-ca/ca.pem --cert kubernetes-ca/admin.pem --key kubernetes-ca/admin-key.pem https://192.168.10.93:6443/metrics | grep rest_client_requests_total | grep -v '#' | grep GET | grep 200 | awk '{print $2}'

6) Metric key: etcd_debugging_mvcc_db_total_size_in_bytes
Example command: curl -s --cacert etcd/ca.pem --cert etcd/healthcheck-client.pem --key etcd/healthcheck-client-key.pem https://192.168.10.93:2379/metrics | grep etcd_debugging_mvcc_db_total_size_in_bytes | grep -v '#' | awk '{print $2}'

7) Metric key: etcd_server_has_leader
Example command: curl -s --cacert etcd/ca.pem --cert etcd/healthcheck-client.pem --key etcd/healthcheck-client-key.pem https://192.168.10.93:2379/metrics | grep etcd_server_has_leader | grep -v '#' | awk '{print $2}'

8) Metric key: etcd_server_leader_changes_seen_total
Example command: curl -s --cacert etcd/ca.pem --cert etcd/healthcheck-client.pem --key etcd/healthcheck-client-key.pem https://192.168.10.93:2379/metrics | grep etcd_server_leader_changes_seen_total | grep -v '#' | awk '{print $2}'

9) Metric key: etcd_server_proposals_failed_total
Example command: curl -s --cacert etcd/ca.pem --cert etcd/healthcheck-client.pem --key etcd/healthcheck-client-key.pem https://192.168.10.93:2379/metrics | grep etcd_server_proposals_failed_total | grep -v '#' | awk '{print $2}'

10) Metric key: etcd_process_cpu_seconds_total
Example command: curl -s --cacert etcd/ca.pem --cert etcd/healthcheck-client.pem --key etcd/healthcheck-client-key.pem https://192.168.10.93:2379/metrics | grep process_cpu_seconds_total | grep -v '#' | awk '{print $2}'

11) Metric key: etcd_process_open_fds
Example command: curl -s --cacert etcd/ca.pem --cert etcd/healthcheck-client.pem --key etcd/healthcheck-client-key.pem https://192.168.10.93:2379/metrics | grep process_open_fds | grep -v '#' | awk '{print $2}'

12) Metric key: etcd_process_virtual_memory_bytes
Example command: curl -s --cacert etcd/ca.pem --cert etcd/healthcheck-client.pem --key etcd/healthcheck-client-key.pem https://192.168.10.93:2379/metrics | grep process_virtual_memory_bytes | grep -v '#' | awk '{print $2}'

13) Metric key: kube_controller_manager_process_cpu_seconds_total
Example command: curl -s 192.168.10.93:10252/metrics | grep process_cpu_seconds_total | grep -v '#' | awk '{print $2}'

14) Metric key: kube_controller_manager_process_open_fds
Example command: curl -s 192.168.10.93:10252/metrics | grep process_open_fds | grep -v '#' | awk '{print $2}'

15) Metric key: kube_controller_manager_process_virtual_memory_bytes
Example command: curl -s 192.168.10.93:10252/metrics | grep process_virtual_memory_bytes | grep -v '#' | awk '{print $2}'

16) Metric key: kube_controller_manager_rest_client_requests_total_200_put
Example command: curl -s 192.168.10.93:10252/metrics | grep rest_client_requests_total | grep -v '#' | grep PUT | grep 200 | awk '{print $2}'

17) Metric key: kube_controller_manager_rest_client_requests_total_200_get
Example command: curl -s 192.168.10.93:10252/metrics | grep rest_client_requests_total | grep -v '#' | grep GET | grep 200 | awk '{print $2}'

18) Metric key: kube_scheduler_process_cpu_seconds_total
Example command: curl -s 192.168.10.93:10251/metrics | grep process_cpu_seconds_total | grep -v '#' | awk '{print $2}'

19) Metric key: kube_scheduler_process_open_fds
Example command: curl -s 192.168.10.93:10251/metrics | grep process_open_fds | grep -v '#' | awk '{print $2}'

20) Metric key: kube_scheduler_process_virtual_memory_bytes
Example command: curl -s 192.168.10.93:10251/metrics | grep process_virtual_memory_bytes | grep -v '#' | awk '{print $2}'

21) Metric key: kube_scheduler_rest_client_requests_total_200_put
Example command: curl -s 192.168.10.93:10251/metrics | grep rest_client_requests_total | grep -v '#' | grep PUT | grep 200 | awk '{print $2}'

22) Metric key: kube_scheduler_rest_client_requests_total_200_get
Example command: curl -s 192.168.10.93:10251/metrics | grep rest_client_requests_total | grep -v '#' | grep GET | grep 200 | awk '{print $2}'
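The curl | grep | awk pipelines above all perform the same step: read the Prometheus text output of a component and print the value of one sample. A minimal Python sketch of that extraction step follows. It assumes the text has already been fetched and that samples carry no trailing timestamp; the function name `extract_metric` and the sample text are illustrative and not part of the original setup.

```python
def extract_metric(text, name, required_labels=()):
    """Return the values of all samples of `name` whose label set contains
    every string in `required_labels` (for example 'method="PUT"')."""
    values = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the "# HELP" / "# TYPE" comment lines
        if not line or line.startswith('#'):
            continue
        # A sample looks like: name{label="v",...} value   or   name value
        head, _, value = line.rpartition(' ')
        metric_name = head.split('{', 1)[0]
        if metric_name != name:
            continue
        if all(label in head for label in required_labels):
            values.append(float(value))
    return values


# Hypothetical sample in the Prometheus text format
SAMPLE = """\
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 42
rest_client_requests_total{code="200",host="192.168.10.93:6443",method="GET"} 1200
rest_client_requests_total{code="200",host="192.168.10.93:6443",method="PUT"} 300
rest_client_requests_total{code="404",host="192.168.10.93:6443",method="GET"} 7
"""

print(extract_metric(SAMPLE, 'process_open_fds'))
print(extract_metric(SAMPLE, 'rest_client_requests_total',
                     ['code="200"', 'method="PUT"']))
```

Matching on `code="200"` rather than a bare `200` means a 200 that appears elsewhere in the line, for example inside the value, cannot select the wrong sample.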

2. Node metrics [Scope: the 5 Node machines; in the test environment these are 192.168.10.230/231/232/233/234]

1) Metric key: kubelet_docker_operations_errors_inspect_container
Example command: curl -s 192.168.10.230:10255/metrics | grep kubelet_docker_operations_errors | grep -v '#' | grep inspect_container | awk '{print $2}'

2) Metric key: kubelet_docker_operations_errors_inspect_image
Example command: curl -s 192.168.10.230:10255/metrics | grep kubelet_docker_operations_errors | grep -v '#' | grep inspect_image | awk '{print $2}'

3) Metric key: kubelet_docker_operations_errors_start_container
Example command: curl -s 192.168.10.230:10255/metrics | grep kubelet_docker_operations_errors | grep -v '#' | grep start_container | awk '{print $2}'

4) Metric key: kubelet_docker_operations_errors_stop_container
Example command: curl -s 192.168.10.230:10255/metrics | grep kubelet_docker_operations_errors | grep -v '#' | grep stop_container | awk '{print $2}'

5) Metric key: kubelet_node_config_error
Example command: curl -s 192.168.10.230:10255/metrics | grep kubelet_node_config_error | grep -v '#' | awk '{print $2}'

6) Metric key: kubelet_process_cpu_seconds_total
Example command: curl -s 192.168.10.230:10255/metrics | grep process_cpu_seconds_total | grep -v '#' | awk '{print $2}'

7) Metric key: kubelet_process_open_fds
Example command: curl -s 192.168.10.230:10255/metrics | grep process_open_fds | grep -v '#' | awk '{print $2}'

8) Metric key: kubelet_process_virtual_memory_bytes
Example command: curl -s 192.168.10.230:10255/metrics | grep process_virtual_memory_bytes | grep -v '#' | awk '{print $2}'

9) Metric key: kubelet_rest_client_requests_total_200_put
Example command: curl -s 192.168.10.230:10255/metrics | grep rest_client_requests_total | grep -v '#' | grep PUT | grep 200 | awk '{print $2}'

10) Metric key: kubelet_rest_client_requests_total_200_get
Example command: curl -s 192.168.10.230:10255/metrics | grep rest_client_requests_total | grep -v '#' | grep GET | grep 200 | awk '{print $2}'

11) Metric key: kube_proxy_process_cpu_seconds_total
Example command: curl -s 192.168.10.230:10249/metrics | grep process_cpu_seconds_total | grep -v '#' | awk '{print $2}'

12) Metric key: kube_proxy_process_open_fds
Example command: curl -s 192.168.10.230:10249/metrics | grep process_open_fds | grep -v '#' | awk '{print $2}'

13) Metric key: kube_proxy_process_virtual_memory_bytes
Example command: curl -s 192.168.10.230:10249/metrics | grep process_virtual_memory_bytes | grep -v '#' | awk '{print $2}'

14) Metric key: kube_proxy_rest_client_requests_total_200_put
Example command: curl -s 192.168.10.230:10249/metrics | grep rest_client_requests_total | grep -v '#' | grep PUT | grep 200 | awk '{print $2}'

15) Metric key: kube_proxy_rest_client_requests_total_200_get
Example command: curl -s 192.168.10.230:10249/metrics | grep rest_client_requests_total | grep -v '#' | grep GET | grep 200 | awk '{print $2}'
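Metrics whose names end in `_total`, such as process_cpu_seconds_total and rest_client_requests_total, are cumulative counters, so what usually matters is how fast they grow between two collections rather than the raw number. The sketch below only spells out that arithmetic; the function name and the two readings are made up for illustration, and Zabbix item preprocessing can typically perform the same conversion on its own.

```python
def per_second_rate(prev_value, curr_value, interval_seconds):
    """Average per-second increase of a cumulative counter between two samples.

    Returns None when the counter went backwards, which usually means the
    process restarted and the counter started again from zero.
    """
    if interval_seconds <= 0:
        raise ValueError('interval_seconds must be positive')
    if curr_value < prev_value:
        return None
    return (curr_value - prev_value) / float(interval_seconds)


# Hypothetical readings of process_cpu_seconds_total taken 5 minutes apart
rate = per_second_rate(1520.0, 1550.0, 300)
print(rate)
```

For process_cpu_seconds_total the result reads as the average number of CPU cores in use over the interval: 30 CPU seconds in 300 seconds is about a tenth of one core.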

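3. Cluster-wide metrics [Collecting from any one node of the Node cluster is enough; in the test environment, 192.168.10.230 alone will do. Because these metrics are read through a specific node, collection fails if that node goes down. To avoid this, consider changing the collection IP to the Nginx-Ingress IP or an internal Service IP.]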

1) Metric key: coredns_process_cpu_seconds_total
Example command: curl -s 192.168.10.230:9153/metrics | grep process_cpu_seconds_total | grep -v '#' | awk '{print $2}'

2) Metric key: coredns_process_open_fds
Example command: curl -s 192.168.10.230:9153/metrics | grep process_open_fds | grep -v '#' | awk '{print $2}'

3) Metric key: coredns_process_virtual_memory_bytes
Example command: curl -s 192.168.10.230:9153/metrics | grep process_virtual_memory_bytes | grep -v '#' | awk '{print $2}'

4) Metric key: kube_state_metrics_metrics_process_cpu_seconds_total
Example command: curl -s 192.168.10.230:8081/metrics | grep process_cpu_seconds_total | grep -v '#' | awk '{print $2}'

5) Metric key: kube_state_metrics_metrics_process_open_fds
Example command: curl -s 192.168.10.230:8081/metrics | grep process_open_fds | grep -v '#' | awk '{print $2}'

6) Metric key: kube_state_metrics_metrics_process_virtual_memory_bytes
Example command: curl -s 192.168.10.230:8081/metrics | grep process_virtual_memory_bytes | grep -v '#' | awk '{print $2}'
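Until a load-balanced address is in place, one way to soften the single-node dependency described for these cluster-wide metrics is to try several nodes in turn. The following is a rough sketch of that idea only. `fetch_with_fallback` and `fake_fetch` are hypothetical names, the fetch step is stubbed out instead of making real HTTP calls, and it is an assumption that the same port answers on the other nodes.

```python
def fetch_with_fallback(endpoints, fetch):
    """Try each endpoint in order and return (endpoint, body) for the first
    one that answers. `fetch` is any callable that takes an endpoint and
    returns the response body or raises an exception on failure."""
    errors = []
    for endpoint in endpoints:
        try:
            return endpoint, fetch(endpoint)
        except Exception as exc:  # a broad catch is acceptable for a probe
            errors.append('%s: %s' % (endpoint, exc))
    raise RuntimeError('all endpoints failed: ' + '; '.join(errors))


# Stand-in for a real HTTP call, used only to demonstrate the behaviour:
# 192.168.10.230 is treated as down, every other node answers.
def fake_fetch(endpoint):
    if endpoint.startswith('192.168.10.230'):
        raise IOError('connection refused')
    return 'process_open_fds 12\n'


nodes = ['192.168.10.230:9153', '192.168.10.231:9153', '192.168.10.232:9153']
used, body = fetch_with_fallback(nodes, fake_fetch)
print(used)
```

With the first node simulated as down, the call falls through to the second entry in the list. A real `fetch` would wrap the same curl or HTTP request used in the commands above.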

II. Dynamic metric collection

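Python script for dynamic metric collection (the separate collection scripts for each dynamic metric have been merged into a single script)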

[root@bz4ccs001ap1001 ~]# cat zabbix-metrics-find.py 
#!/usr/bin/env python
# coding:utf-8

import json
import os
import re
import sys

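# kube-state-metrics auto discovery for Zabbix
# Pass value/values (case-insensitive) as the argument to print the monitored values;
# any other argument, or no argument, prints the monitoring keys
# Collection scope: any Node; 192.168.10.230 can be used for testing. This IP should later
# be changed to the Nginx-Ingress load-balancer IP or an internal Service IP
# Recommended collection interval: 5 min
# Author: GaoKan
# Created: 2019-5-22
# Updated: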
def main():
    ip = '192.168.10.230'
    flag = 'key'
    if len(sys.argv) > 1:
        if sys.argv[1].lower() in ('value', 'values'):
            flag = 'value'
    keys = []
    values = []    
    metrics_dict = {
        #DaemonSet-Metrics
        'kube_daemonset_status_number_misscheduled' : {
            'forshort' : 'ds_misscheduled',
            'tags' : ['namespace', 'daemonset',],            
        },
        'kube_daemonset_status_number_unavailable' : {
            'forshort' : 'ds_unavailable',
            'tags' : ['namespace', 'daemonset',],
        },
        #Deployment-Metrics
        'kube_deployment_status_replicas_unavailable' : {
            'forshort' : 'deploy_unavailable',
            'tags' : ['namespace', 'deployment',],
        },
        #Pod-Metrics
        'kube_pod_container_status_waiting_reason' : {
            'forshort' : 'po_cntr_waiting_reason',
            'tags' : ['namespace', 'pod', 'container', 'reason',],
        },
        'kube_pod_container_status_terminated_reason' : {
            'forshort' : 'po_cntr_terminated_reason',
            'tags' : ['namespace', 'pod', 'container', 'reason',],
        },
        'kube_pod_container_status_restarts_total' : {
            'forshort' : 'po_cntr_restarts_total',
            'tags' : ['namespace', 'pod', 'container',],
        },
        #ReplicaSet-Metrics
        'kube_replicaset_status_ready_replicas' : {
            'forshort' : 'rs_ready_replicas',
            'tags' : ['namespace', 'replicaset',],           
        },
        'kube_replicaset_status_replicas' : {
            'forshort' : 'rs_replicas',
            'tags' : ['namespace', 'replicaset',],  
        },
        #Endpoint-Metrics
        'kube_endpoint_address_not_ready' : {
            'forshort' : 'ep_not_ready',
            'tags' : ['namespace', 'endpoint',],                       
        },
    }   
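    # Fetch the kube-state-metrics text output via curl
    metrics = os.popen('curl -s ' + ip + ':8080/metrics')
    for row in metrics:
        if row.startswith('#'):
            continue
        pos1 = row.find('{')
        pos2 = row.find('}')
        # Skip samples without a label set; every metric in metrics_dict has labels
        if pos1 == -1 or pos2 == -1:
            continue
        name = row[: pos1]
        if name in metrics_dict:
            key = metrics_dict[name]['forshort']
            labels = row[pos1 + 1 : pos2]
            for tag in metrics_dict[name]['tags']:
                # Anchor on the start of the label set or a comma so that a tag
                # such as 'pod' cannot match the tail of another label name
                match = re.search(r'(?:^|,)\s*' + tag + r'="(.*?)"', labels)
                if match:
                    key += '_' + match.group(1)
            keys.append({"{#METRICSNAME}" : key})
            values.append({"{#METRICSVALUE}" : row[pos2 + 2 :].strip()})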
    if flag == 'value':
        print(json.dumps({"data":values},indent = 4))
    else:
        print(json.dumps({"data":keys},indent = 4))

if __name__ == "__main__":
    main()
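The core of the script is the step that turns one kube-state-metrics sample line into a discovery key and a value. The sketch below isolates that step as a pure function so it can be tried without a cluster. `build_key`, `LABEL_RE`, and the sample line are illustrative and not taken from the original script, and the simple regular expression assumes label values contain no escaped quotes.

```python
import re

# Matches label pairs such as namespace="kube-system" inside the braces
LABEL_RE = re.compile(r'(\w+)="(.*?)"')


def build_key(row, short_name, tags):
    """Build the discovery key and value for one sample line.

    Returns (key, value), or None if the line has no label set."""
    start = row.find('{')
    end = row.rfind('}')
    if start == -1 or end == -1:
        return None
    labels = dict(LABEL_RE.findall(row[start + 1:end]))
    # Missing labels become empty strings instead of raising an error
    parts = [short_name] + [labels.get(tag, '') for tag in tags]
    value = row[end + 1:].strip()
    return '_'.join(parts), value


row = ('kube_pod_container_status_restarts_total{namespace="kube-system",'
       'pod="coredns-5cbf6655f-6wxqz",container="coredns"} 3\n')
print(build_key(row, 'po_cntr_restarts_total',
                ['namespace', 'pod', 'container']))
```

Parsing the label set into a dictionary first, instead of running one regular expression per tag, keeps the lookup exact: a tag name can only match a whole label name.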

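Run the script; it returns a JSON string. (The output covers the running state of the Kubernetes objects tracked by the script, such as DaemonSets, Deployments, Pods, ReplicaSets, and Endpoints. Depending on how much workload the cluster runs, there may be hundreds or thousands of entries.)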

[root@bz4ccs001ap1001 ~]# python zabbix-metrics-find.py 
{
    "data": [
        {
            "{#METRICSNAME}": "ds_misscheduled_test-rg_test-rg-005"
        }, 
        {
            "{#METRICSNAME}": "ds_misscheduled_cattle-system_cattle-node-agent"
        }, 
        {
            "{#METRICSNAME}": "ds_misscheduled_test-rg_test-rg-001"
        }, 
        {
            "{#METRICSNAME}": "ds_misscheduled_test-rg_test-rg-002"
        }, 
        {
            "{#METRICSNAME}": "ds_misscheduled_test-rg_test-rg-003"
        }, 
        {
            "{#METRICSNAME}": "ds_misscheduled_test-rg_test-rg-004"
        }, 
        {
            "{#METRICSNAME}": "ds_unavailable_test-rg_test-rg-003"
        }, 
        {
            "{#METRICSNAME}": "ds_unavailable_test-rg_test-rg-004"
        }, 
        {
            "{#METRICSNAME}": "ds_unavailable_test-rg_test-rg-005"
        }, 
...................
...................
       {
            "{#METRICSNAME}": "po_cntr_restarts_total_test-rg_test-rg-005-jvkm6_test-rg-005"
        }, 
        {
            "{#METRICSNAME}": "po_cntr_restarts_total_cattle-system_cattle-node-agent-mdl9x_agent"
        }, 
        {
            "{#METRICSNAME}": "po_cntr_restarts_total_test-rg_test-rg-005-wpsbq_test-rg-005"
        }, 
        {
            "{#METRICSNAME}": "po_cntr_restarts_total_test-rg_test-rg-004-9s57x_test-rg-004"
        }, 
        {
            "{#METRICSNAME}": "po_cntr_restarts_total_test-rg_test-rg-005-wxk54_test-rg-005"
        }, 
        {
            "{#METRICSNAME}": "po_cntr_restarts_total_cattle-system_cattle-node-agent-r46bz_agent"
        }, 
        {
            "{#METRICSNAME}": "po_cntr_restarts_total_default_mysql-ceph-test-76697d98d6-4gj9v_mysql-ceph-test"
        }, 
        {
            "{#METRICSNAME}": "po_cntr_restarts_total_kube-system_coredns-5cbf6655f-6wxqz_coredns"
        }, 
        {
            "{#METRICSNAME}": "po_cntr_restarts_total_kube-system_kube-state-metrics-576fbb446d-ctl4p_addon-resizer"
        }, 
        {
            "{#METRICSNAME}": "po_cntr_restarts_total_kube-system_kube-state-metrics-576fbb446d-ctl4p_kube-state-metrics"
        },

...................
...................
        {
            "{#METRICSNAME}": "rs_ready_replicas_test_nginx-5c689d88bb"
        }, 
        {
            "{#METRICSNAME}": "rs_ready_replicas_two-test_aicase-docker-5784b5749b"
        }, 
        {
            "{#METRICSNAME}": "rs_ready_replicas_cattle-system_cattle-cluster-agent-d59dbdb55"
        }, 
        {
            "{#METRICSNAME}": "rs_ready_replicas_test_nginx-589dcbcbd6"
        }, 
        {
            "{#METRICSNAME}": "rs_ready_replicas_test_nginx-5b677cdf4f"
        }, 
        {
            "{#METRICSNAME}": "rs_ready_replicas_default_mysql-ceph-test-76697d98d6"
        }, 
        {
            "{#METRICSNAME}": "rs_ready_replicas_kube-system_kube-state-metrics-75bbc44548"
        }, 
        {
            "{#METRICSNAME}": "rs_ready_replicas_kube-system_traefik-ingress-controller-6db4877748"
        }, 
        {
            "{#METRICSNAME}": "rs_ready_replicas_two-test_aicase-docker-57d445cbf"
        }
    ]
}

Query the values

[root@bz4ccs001ap1001 ~]# python zabbix-metrics-find.py values
{
    "data": [
        {
            "{#METRICSVALUE}": "0"
        }, 
        {
            "{#METRICSVALUE}": "0"
        }, 
        {
            "{#METRICSVALUE}": "0"
        }, 
        {
            "{#METRICSVALUE}": "0"
        }, 
        {
            "{#METRICSVALUE}": "0"
        }, 
        {
            "{#METRICSVALUE}": "0"
        }, 
        {
            "{#METRICSVALUE}": "0"
        }, 
        {
            "{#METRICSVALUE}": "0"
        }, 
        {
            "{#METRICSVALUE}": "0"
        }, 
.................
.................
        {
            "{#METRICSVALUE}": "1"
        }, 
        {
            "{#METRICSVALUE}": "27"
        }, 
        {
            "{#METRICSVALUE}": "0"
        }, 
        {
            "{#METRICSVALUE}": "3"
        }, 
        {
            "{#METRICSVALUE}": "0"
        },
.................
.................
        {
            "{#METRICSVALUE}": "1"
        }, 
        {
            "{#METRICSVALUE}": "0"
        }, 
        {
            "{#METRICSVALUE}": "2"
        }, 
        {
            "{#METRICSVALUE}": "1"
        }, 
        {
            "{#METRICSVALUE}": "0"
        }, 
        {
            "{#METRICSVALUE}": "0"
        }, 
        {
            "{#METRICSVALUE}": "0"
        }, 
        {
            "{#METRICSVALUE}": "2"
        }, 
        {
            "{#METRICSVALUE}": "0"
        }
    ]
}
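The script prints keys and values as two separate lists in the same order, so entry i of one output belongs to entry i of the other. A small sketch of pairing them by position follows. `pair_outputs` is a hypothetical helper, and the two JSON strings are shortened, made-up inputs in the same shape as the script output, not real results.

```python
import json


def pair_outputs(keys_json, values_json):
    """Combine the key output and the value output into one dict.

    Both arguments are the JSON strings printed by the script; they are
    matched purely by position."""
    keys = [item['{#METRICSNAME}'] for item in json.loads(keys_json)['data']]
    values = [item['{#METRICSVALUE}'] for item in json.loads(values_json)['data']]
    if len(keys) != len(values):
        raise ValueError('key and value lists differ in length')
    return dict(zip(keys, values))


# Shortened, made-up outputs in the same shape as the script output
keys_json = json.dumps({'data': [
    {'{#METRICSNAME}': 'ds_unavailable_test-rg_test-rg-003'},
    {'{#METRICSNAME}': 'rs_replicas_test_nginx-5c689d88bb'},
]})
values_json = json.dumps({'data': [
    {'{#METRICSVALUE}': '0'},
    {'{#METRICSVALUE}': '2'},
]})

paired = pair_outputs(keys_json, values_json)
print(paired['rs_replicas_test_nginx-5c689d88bb'])
```

Since the keys and the values come from two separate runs of the script, objects can appear or disappear in between; the length check is a cheap guard against pairing lists that no longer line up.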

Configure zabbix_agent on the monitored node

UserParameter is set to the key name, which can be chosen freely,
followed by the command that runs the monitoring script
[root@bz4ccs001ap1001 ~]# cat /etc/zabbix/zabbix_agentd.d/ceshi.conf 
UserParameter=k8s.autofind.metrics,/bin/python /root/zabbix-metrics-find.py

Query the key
[root@bz4ccs001ap1001 ~]# zabbix_agentd -t k8s.autofind.metrics
This command returns the same result as running "python zabbix-metrics-find.py" above

===================================================================================
Then log in to Zabbix and add a discovery rule under "Configuration" -> "Hosts" -> "Discovery".
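Before the discovery rule is added, it can help to confirm that the agent output has the basic shape a discovery rule expects: an object with a "data" list whose entries use {#MACRO}-style names. The check below is a loose sketch. `check_lld_payload` is a hypothetical helper, and the allowed character set for macro names is an assumption that should be compared against the documentation of the Zabbix version in use.

```python
import json
import re

# LLD macro names look like {#NAME}; the character set here is an assumption
MACRO_RE = re.compile(r'^\{#[A-Z0-9_.]+\}$')


def check_lld_payload(payload):
    """Return a list of problems found in a discovery JSON string
    (an empty list means the basic shape looks right)."""
    try:
        doc = json.loads(payload)
    except ValueError as exc:
        return ['not valid JSON: %s' % exc]
    if not isinstance(doc, dict) or not isinstance(doc.get('data'), list):
        return ['top level must be an object with a "data" list']
    problems = []
    for index, row in enumerate(doc['data']):
        if not isinstance(row, dict):
            problems.append('entry %d is not an object' % index)
            continue
        for macro in row:
            if not MACRO_RE.match(macro):
                problems.append('entry %d: bad macro name %r' % (index, macro))
    return problems


good = '{"data": [{"{#METRICSNAME}": "ds_unavailable_test-rg_test-rg-003"}]}'
bad = '{"data": [{"metricsname": "ds_unavailable_test-rg_test-rg-003"}]}'
print(check_lld_payload(good))
print(check_lld_payload(bad))
```

The output of `zabbix_agentd -t k8s.autofind.metrics` could be passed to the same function once the surrounding agent formatting is stripped off.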
Originally published: 2019-05-24