前往小程序,Get更优阅读体验!
立即前往
首页
学习
活动
专区
工具
TVP
发布
社区首页 >专栏 >夜莺随笔:监控 Linux 主机

夜莺随笔:监控 Linux 主机

作者头像
IT小白Kasar
发布2022-02-16 18:52:50
2.9K0
发布2022-02-16 18:52:50
举报
文章被收录于专栏:个人技术随笔个人技术随笔

前面对夜莺的安装方法做了一些探讨,接下来就进入使用的阶段。

正文

本文环境

  • 夜莺 v5.3
  • node_exporter 1.3.1
  • telegraf 1.21.3
  • CentOS 7.9

node-exporter 部分

node-exporter 是 promethues 官方的采集器,其安装方法非常简单。

下载 node-exporter 包

由于 github 国内访问有时候容易出现重置,所以采用南京大学的源。

代码语言:javascript
复制
wget https://s3.jcloud.sjtu.edu.cn/899a892efef34b1b944a19981040f55b-oss01/github-release/prometheus/node_exporter/releases/download/v1.3.1/node_exporter-1.3.1.linux-amd64.tar.gz
解压 node-exporter 压缩包

最后得到一个二进制文件。

代码语言:javascript
复制
mkdir /opt/node_exporter
mv node_exporter-1.3.1.linux-amd64.tar.gz /opt/node_exporter
cd /opt/node_exporter/
tar xzvf node_exporter-1.3.1.linux-amd64.tar.gz
cd node_exporter-1.3.1.linux-amd64/
运行 node-exporter

出现 Listening on 字眼即为运行正常

代码语言:javascript
复制
./node_exporter 
Promethues 配置

找到 prometheus.yml,这里由于每个人的环境不一样,所以文件所在地址也不一样,这里只演示配置,最后需要注意的是格式问题。

代码语言:javascript
复制
- job_name: "local"
    static_configs:
      - targets: ["10.240.99.198:9100"]
Prometheus 配置热刷新
代码语言:javascript
复制
curl -X POST http://127.0.0.1:9090/-/reload
配置 node_expoter systemd 守护
代码语言:javascript
复制
mkdir /usr/local/node_exporter
mv /opt/node_exporter/node_exporter-1.3.1.linux-amd64/node_exporter /usr/local/node_exporter/
代码语言:javascript
复制
[Unit]
Description=node_exporter
After=network.target

[Service]
Type=simple
User=root
ExecStart=/usr/local/node_exporter/node_exporter
Restart=on-failure

[Install]
WantedBy=multi-user.target
启动node_exporter
代码语言:javascript
复制
systemctl daemon-reload
systemctl start node_exporter
systemctl enable node_exporter
systemctl status node_exporter
状态为 active 即为正常
状态为 active 即为正常

需要注意的是,node_exporter 采集的数据在夜莺里无法看到对象列表里看到,只能在即时查询里看到数据,想要看到资源列表只能通过 telegraf 的方式监控。

无法看到主机资源
无法看到主机资源
即时查询可以看到数据
即时查询可以看到数据

telegraf 部分

Telegraf 是个 all-in-one 的架构,一个二进制可以搞定机器、网络设备、中间件、数据库、Statsd 等各种采集能力,相比散落的各类 Exporter 而言,维护成本更低一些,Telegraf 支持通过 OpenTSDB 这个 output plugin 来对接夜莺。

下载 telegraf rpm 包
代码语言:javascript
复制
wget https://mirrors.nju.edu.cn/influxdata/yum/el8-x86_64/telegraf-1.21.3-1.x86_64.rpm
安装 telegraf
代码语言:javascript
复制
yum localinstall telegraf-1.21.3-1.x86_64.rpm -y
修改 telegraf 配置

清空原有配置,贴上下面配置,需要修改的地方为 host 和 port,根据自身情况填写。

代码语言:javascript
复制
[global_tags]

[agent]
  interval = "10s"
  round_interval = true
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  collection_jitter = "0s"
  flush_interval = "10s"
  flush_jitter = "0s"
  precision = ""
  hostname = ""
  omit_hostname = false

[[outputs.opentsdb]]
  host = "http://10.240.99.198"
  port = 19000
  http_batch_size = 50
  http_path = "/opentsdb/put"
  debug = false
  separator = "_"

[[inputs.cpu]]
  percpu = true
  totalcpu = true
  collect_cpu_time = false
  report_active = true

[[inputs.disk]]
  ignore_fs = ["tmpfs", "devtmpfs", "devfs", "iso9660", "overlay", "aufs", "squashfs"]

[[inputs.diskio]]

[[inputs.kernel]]

[[inputs.mem]]

[[inputs.processes]]

[[inputs.system]]
  fielddrop = ["uptime_format"]

[[inputs.net]]
  ignore_protocol_stats = true
重启 telegraf
代码语言:javascript
复制
service telegraf restart
systemctl enable telegraf
查看夜莺前端

此时可以看到未归组对象里有刚刚启动 telegraf 的主机了。并且在监控看图 –> 对象视角里看到相对应的监控指标。

可以看到主机资源
可以看到主机资源
对象视角可以看到相关指标
对象视角可以看到相关指标
导入官方监控大盘

进入到监控大盘里,点击导入

代码语言:javascript
复制
[
  {
    "name": "Linux基本监控指标-Telegraf采集",
    "tags": "HOST",
    "configs": "{\"var\":[{\"name\":\"host\",\"definition\":\"label_values(mem_used_percent, ident)\"}]}",
    "chart_groups": [
      {
        "name": "Default chart group",
        "weight": 0,
        "charts": [
          {
            "configs": "{\"name\":\"整机CPU空闲率(%)\",\"QL\":[{\"PromQL\":\"cpu_usage_idle{cpu=\\\"cpu-total\\\", ident=\\\"$host\\\"}\"}],\"yplotline1\":35,\"yplotline2\":15,\"legend\":false,\"highLevelConfig\":{\"shared\":true,\"sharedSortDirection\":\"asc\",\"precision\":\"origin\",\"formatUnit\":1000},\"version\":1,\"layout\":{\"h\":2,\"w\":8,\"x\":0,\"y\":0,\"i\":\"0\"}}",
            "weight": 0
          },
          {
            "configs": "{\"name\":\"内存可用率(%)\",\"QL\":[{\"PromQL\":\"mem_available_percent{ident=\\\"$host\\\"}\"}],\"yplotline1\":30,\"yplotline2\":15,\"legend\":false,\"highLevelConfig\":{\"shared\":true,\"sharedSortDirection\":\"asc\",\"precision\":\"origin\",\"formatUnit\":1000},\"version\":1,\"layout\":{\"h\":2,\"w\":8,\"x\":8,\"y\":0,\"i\":\"1\"}}",
            "weight": 0
          },
          {
            "configs": "{\"name\":\"硬盘利用率(%)\",\"QL\":[{\"PromQL\":\"disk_used_percent{ident=\\\"$host\\\"}\"}],\"yplotline1\":87,\"yplotline2\":92,\"legend\":false,\"highLevelConfig\":{\"shared\":true,\"sharedSortDirection\":\"desc\",\"precision\":\"origin\",\"formatUnit\":1000},\"version\":1,\"layout\":{\"h\":2,\"w\":8,\"x\":16,\"y\":0,\"i\":\"2\"}}",
            "weight": 0
          },
          {
            "configs": "{\"name\":\"IO.UTIL(%)\",\"QL\":[{\"PromQL\":\"rate(diskio_io_time{ident=\\\"$host\\\"}[1m])/10\"}],\"yplotline1\":90,\"yplotline2\":null,\"legend\":false,\"highLevelConfig\":{\"shared\":true,\"sharedSortDirection\":\"desc\",\"precision\":\"origin\",\"formatUnit\":1000},\"version\":1,\"layout\":{\"h\":2,\"w\":8,\"x\":0,\"y\":2,\"i\":\"3\"}}",
            "weight": 0
          },
          {
            "configs": "{\"name\":\"网卡每分钟丢包数(个)\",\"QL\":[{\"PromQL\":\"increase(net_drop_in{ident=\\\"$host\\\"}[1m])\",\"Legend\":\"net_drop_in ident:{{ident}} interface:{{interface}}\"},{\"PromQL\":\"increase(net_drop_out{ident=\\\"$host\\\"}[1m])\",\"Legend\":\"net_drop_out ident:{{ident}} interface:{{interface}}\"}],\"yplotline1\":5,\"yplotline2\":20,\"legend\":false,\"highLevelConfig\":{\"shared\":true,\"sharedSortDirection\":\"desc\",\"precision\":\"short\",\"formatUnit\":1000},\"version\":1,\"layout\":{\"h\":2,\"w\":8,\"x\":8,\"y\":2,\"i\":\"4\"}}",
            "weight": 0
          },
          {
            "configs": "{\"name\":\"TCP_TIME_WAIT数量\",\"QL\":[{\"PromQL\":\"netstat_tcp_time_wait{ident=\\\"$host\\\"}\"}],\"yplotline1\":null,\"yplotline2\":20000,\"legend\":false,\"highLevelConfig\":{\"shared\":true,\"sharedSortDirection\":\"desc\",\"precision\":\"short\",\"formatUnit\":1000},\"version\":1,\"layout\":{\"h\":2,\"w\":8,\"x\":16,\"y\":2,\"i\":\"5\"}}",
            "weight": 0
          }
        ]
      }
    ]
  }
]
点击导入
点击导入
贴入上面配置至大盘 JSON 下的空白处
贴入上面配置至大盘 JSON 下的空白处
导入成功
导入成功
效果
效果
最终效果
最终效果

附录

Linux 常用告警规则

代码语言:javascript
复制
[
  {
    "name": "有地址PING不通,请注意",
    "note": "",
    "severity": 1,
    "disabled": 0,
    "prom_for_duration": 60,
    "prom_ql": "ping_result_code != 0",
    "prom_eval_interval": 15,
    "enable_stime": "00:00",
    "enable_etime": "23:59",
    "enable_days_of_week": [
      "1",
      "2",
      "3",
      "4",
      "5",
      "6",
      "0"
    ],
    "notify_recovered": 1,
    "notify_channels": [
      "email",
      "dingtalk",
      "wecom"
    ],
    "notify_repeat_step": 60,
    "callbacks": [],
    "runbook_url": "",
    "append_tags": []
  },
  {
    "name": "有监控对象失联",
    "note": "",
    "severity": 1,
    "disabled": 0,
    "prom_for_duration": 60,
    "prom_ql": "target_up != 1",
    "prom_eval_interval": 15,
    "enable_stime": "00:00",
    "enable_etime": "23:59",
    "enable_days_of_week": [
      "1",
      "2",
      "3",
      "4",
      "5",
      "6",
      "0"
    ],
    "notify_recovered": 1,
    "notify_channels": [
      "email",
      "dingtalk",
      "wecom"
    ],
    "notify_repeat_step": 60,
    "callbacks": [],
    "runbook_url": "",
    "append_tags": []
  },
  {
    "name": "有端口探测失败,请注意",
    "note": "",
    "severity": 1,
    "disabled": 0,
    "prom_for_duration": 60,
    "prom_ql": "net_response_result_code != 0",
    "prom_eval_interval": 15,
    "enable_stime": "00:00",
    "enable_etime": "23:59",
    "enable_days_of_week": [
      "1",
      "2",
      "3",
      "4",
      "5",
      "6",
      "0"
    ],
    "notify_recovered": 1,
    "notify_channels": [
      "email",
      "dingtalk",
      "wecom"
    ],
    "notify_repeat_step": 60,
    "callbacks": [],
    "runbook_url": "",
    "append_tags": []
  },
  {
    "name": "机器负载-CPU较高,请关注",
    "note": "",
    "severity": 3,
    "disabled": 0,
    "prom_for_duration": 60,
    "prom_ql": "cpu_usage_idle{cpu=\"cpu-total\"} < 25",
    "prom_eval_interval": 15,
    "enable_stime": "00:00",
    "enable_etime": "23:59",
    "enable_days_of_week": [
      "1",
      "2",
      "3",
      "4",
      "5",
      "6",
      "0"
    ],
    "notify_recovered": 1,
    "notify_channels": [
      "email",
      "dingtalk",
      "wecom"
    ],
    "notify_repeat_step": 60,
    "callbacks": [],
    "runbook_url": "",
    "append_tags": []
  },
  {
    "name": "机器负载-内存较高,请关注",
    "note": "",
    "severity": 2,
    "disabled": 0,
    "prom_for_duration": 60,
    "prom_ql": "mem_available_percent < 25",
    "prom_eval_interval": 15,
    "enable_stime": "00:00",
    "enable_etime": "23:59",
    "enable_days_of_week": [
      "1",
      "2",
      "3",
      "4",
      "5",
      "6",
      "0"
    ],
    "notify_recovered": 1,
    "notify_channels": [
      "email",
      "dingtalk",
      "wecom"
    ],
    "notify_repeat_step": 60,
    "callbacks": [],
    "runbook_url": "",
    "append_tags": []
  },
  {
    "name": "硬盘-IO非常繁忙",
    "note": "",
    "severity": 2,
    "disabled": 0,
    "prom_for_duration": 60,
    "prom_ql": "rate(diskio_io_time[1m])/10 > 99",
    "prom_eval_interval": 15,
    "enable_stime": "00:00",
    "enable_etime": "23:59",
    "enable_days_of_week": [
      "1",
      "2",
      "3",
      "4",
      "5",
      "6",
      "0"
    ],
    "notify_recovered": 1,
    "notify_channels": [
      "email",
      "dingtalk",
      "wecom"
    ],
    "notify_repeat_step": 60,
    "callbacks": [],
    "runbook_url": "",
    "append_tags": []
  },
  {
    "name": "硬盘-预计再有4小时写满",
    "note": "",
    "severity": 1,
    "disabled": 0,
    "prom_for_duration": 60,
    "prom_ql": "predict_linear(disk_free[1h], 4*3600) < 0",
    "prom_eval_interval": 15,
    "enable_stime": "00:00",
    "enable_etime": "23:59",
    "enable_days_of_week": [
      "1",
      "2",
      "3",
      "4",
      "5",
      "6",
      "0"
    ],
    "notify_recovered": 1,
    "notify_channels": [
      "email",
      "dingtalk",
      "wecom"
    ],
    "notify_repeat_step": 60,
    "callbacks": [],
    "runbook_url": "",
    "append_tags": []
  },
  {
    "name": "网卡-入向有丢包",
    "note": "",
    "severity": 3,
    "disabled": 0,
    "prom_for_duration": 60,
    "prom_ql": "increase(net_drop_in[1m]) > 0",
    "prom_eval_interval": 15,
    "enable_stime": "00:00",
    "enable_etime": "23:59",
    "enable_days_of_week": [
      "1",
      "2",
      "3",
      "4",
      "5",
      "6",
      "0"
    ],
    "notify_recovered": 1,
    "notify_channels": [
      "email",
      "dingtalk",
      "wecom"
    ],
    "notify_repeat_step": 60,
    "callbacks": [],
    "runbook_url": "",
    "append_tags": []
  },
  {
    "name": "网卡-出向有丢包",
    "note": "",
    "severity": 3,
    "disabled": 0,
    "prom_for_duration": 60,
    "prom_ql": "increase(net_drop_out[1m]) > 0",
    "prom_eval_interval": 15,
    "enable_stime": "00:00",
    "enable_etime": "23:59",
    "enable_days_of_week": [
      "1",
      "2",
      "3",
      "4",
      "5",
      "6",
      "0"
    ],
    "notify_recovered": 1,
    "notify_channels": [
      "email",
      "dingtalk",
      "wecom"
    ],
    "notify_repeat_step": 60,
    "callbacks": [],
    "runbook_url": "",
    "append_tags": []
  },
  {
    "name": "网络连接-TME_WAIT数量超过2万",
    "note": "",
    "severity": 2,
    "disabled": 0,
    "prom_for_duration": 60,
    "prom_ql": "netstat_tcp_time_wait > 20000",
    "prom_eval_interval": 15,
    "enable_stime": "00:00",
    "enable_etime": "23:59",
    "enable_days_of_week": [
      "1",
      "2",
      "3",
      "4",
      "5",
      "6",
      "0"
    ],
    "notify_recovered": 1,
    "notify_channels": [
      "email",
      "dingtalk",
      "wecom"
    ],
    "notify_repeat_step": 60,
    "callbacks": [],
    "runbook_url": "",
    "append_tags": []
  },
  {
    "name": "进程监控-有进程数为0,某进程可能挂了",
    "note": "",
    "severity": 1,
    "disabled": 0,
    "prom_for_duration": 60,
    "prom_ql": "procstat_lookup_running == 0",
    "prom_eval_interval": 15,
    "enable_stime": "00:00",
    "enable_etime": "23:59",
    "enable_days_of_week": [
      "1",
      "2",
      "3",
      "4",
      "5",
      "6",
      "0"
    ],
    "notify_recovered": 1,
    "notify_channels": [
      "email",
      "dingtalk",
      "wecom"
    ],
    "notify_repeat_step": 60,
    "callbacks": [],
    "runbook_url": "",
    "append_tags": []
  },
  {
    "name": "进程监控-进程句柄限制过小",
    "note": "",
    "severity": 3,
    "disabled": 0,
    "prom_for_duration": 60,
    "prom_ql": "procstat_rlimit_num_fds_soft < 2048",
    "prom_eval_interval": 15,
    "enable_stime": "00:00",
    "enable_etime": "23:59",
    "enable_days_of_week": [
      "1",
      "2",
      "3",
      "4",
      "5",
      "6",
      "0"
    ],
    "notify_recovered": 1,
    "notify_channels": [
      "email",
      "dingtalk",
      "wecom"
    ],
    "notify_repeat_step": 60,
    "callbacks": [],
    "runbook_url": "",
    "append_tags": []
  },
  {
    "name": "进程监控-采集失败",
    "note": "",
    "severity": 1,
    "disabled": 0,
    "prom_for_duration": 60,
    "prom_ql": "procstat_lookup_result_code != 0",
    "prom_eval_interval": 15,
    "enable_stime": "00:00",
    "enable_etime": "23:59",
    "enable_days_of_week": [
      "1",
      "2",
      "3",
      "4",
      "5",
      "6",
      "0"
    ],
    "notify_recovered": 1,
    "notify_channels": [
      "email",
      "dingtalk",
      "wecom"
    ],
    "notify_repeat_step": 60,
    "callbacks": [],
    "runbook_url": "",
    "append_tags": []
  }
]
导入告警规则
导入告警规则
贴入上面配置至告警规则 JSON 下空白处
贴入上面配置至告警规则 JSON 下空白处
效果
效果

写在最后

到了这里基本就介绍完了,整体看下来有两个结论,如果采用 exporter 为采集器,那么夜莺仅仅是充当一个类 grafana 的功能,也就是查询,如果采用 telegraf 为采集器,那么就是正常的监控应用,后面会围绕 telegraf 插件来展开

本文参与 腾讯云自媒体同步曝光计划,分享自作者个人站点/博客。
原始发表:2022-02-07 ,如有侵权请联系 cloudcommunity@tencent.com 删除

本文分享自 作者个人站点/博客 前往查看

如有侵权,请联系 cloudcommunity@tencent.com 删除。

本文参与 腾讯云自媒体同步曝光计划  ,欢迎热爱写作的你一起参与!

评论
登录后参与评论
0 条评论
热度
最新
推荐阅读
目录
  • 正文
    • 本文环境
      • node-exporter 部分
        • 下载 node-exporter 包
        • 解压 node-exporter 压缩包
        • 运行 node-exporter
        • Promethues 配置
        • Prometheus 配置热刷新
        • 配置 node_expoter systemd 守护
        • 启动node_exporter
      • telegraf 部分
        • 下载 telegraf rpm 包
        • 安装 telegraf
        • 修改 telegraf 配置
        • 重启 telegraf
        • 查看夜莺前端
        • 导入官方监控大盘
    • 附录
      • Linux 常用告警规则
      • 写在最后
      相关产品与服务
      消息队列 TDMQ
      消息队列 TDMQ (Tencent Distributed Message Queue)是腾讯基于 Apache Pulsar 自研的一个云原生消息中间件系列,其中包含兼容Pulsar、RabbitMQ、RocketMQ 等协议的消息队列子产品,得益于其底层计算与存储分离的架构,TDMQ 具备良好的弹性伸缩以及故障恢复能力。
      领券
      问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档