A Must-Read for Ops: Installing, Deploying, and Configuring Prometheus

Python运维开发

Published 2025-09-29 14:15:58

Part of the column: Python运维开发

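Prometheus is an open-source system monitoring tool. It pulls metrics from systems over HTTP and stores them in a local time-series database. Its built-in query language, PromQL, makes it easy to query the stored metrics and supports building graphs and alerting rules.

Prometheus Server: scrapes, stores, and queries metrics, and triggers alerts.
Exporter: exposes a metrics endpoint for the monitored service (for example Node Exporter or MySQL Exporter).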

1. Installing Prometheus

Download page:

https://prometheus.io/download/

1.1 Downloading Prometheus

The version I use here is fairly old and is shown for reference only; downloading a current binary release is recommended.

# Download Prometheus
wget https://github.com/prometheus/prometheus/releases/download/v2.3.0/prometheus-2.3.0.linux-amd64.tar.gz
# Unpack the archive
tar -zxvf prometheus-2.3.0.linux-amd64.tar.gz
cd prometheus-2.3.0.linux-amd64

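1.2 Directory layout

prometheus: the server executable
promtool: command-line tool for validating configuration files and rules
prometheus.yml: the main configuration file
rules/ and target/: created by hand; rules/ holds the alerting rule files and target/ holds the target files for monitored nodes (inspect them with ll rules/ and ll target/node/)
consoles/: web console templates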

1.3 Starting Prometheus

# Start Prometheus
nohup ./prometheus --config.file=prometheus.yml \
--web.enable-lifecycle \
--web.listen-address=192.168.1.139:8001 &

Prometheus listens on port 9090 by default; I use 8001 here. Open http://192.168.1.139:8001 in a browser.
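The --web.enable-lifecycle flag in the start command turns on the HTTP lifecycle endpoints, so a changed prometheus.yml can be reloaded with a POST to /-/reload instead of restarting the process. The following is a minimal sketch using only the Python standard library; the helper names build_reload_request and reload_prometheus are made up for illustration, and the address is the one used in this article.

```python
import urllib.request


def build_reload_request(base_url: str) -> urllib.request.Request:
    """Build the POST request that asks Prometheus to reload its configuration."""
    url = base_url.rstrip("/") + "/-/reload"
    # The lifecycle endpoint expects POST (or PUT), so send an empty body.
    return urllib.request.Request(url, data=b"", method="POST")


def reload_prometheus(base_url: str, timeout: float = 5.0) -> int:
    """Send the reload request and return the HTTP status code."""
    request = build_reload_request(base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


# Example call (needs a running server started with --web.enable-lifecycle):
# reload_prometheus("http://192.168.1.139:8001")
```

If the flag is missing, the server refuses the request, so a non-200 response is a hint to check the start command first.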

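2. Installing Node Exporter (host monitoring)

Download address:

cd /opt/soft/
# Unpack the archive
tar -zxvf node_exporter-1.3.1.linux-amd64.tar.gz
# Enter the installation directory
cd /opt/soft/node_exporter-1.3.1.linux-amd64
# Start it (--collector.textfile.directory points at the directory holding custom metric files; its use is explained at the end)
nohup ./node_exporter --collector.textfile.directory=./key &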

Open `http://127.0.0.1:9100/metrics`. If it returns output like the following, Node Exporter is working:

# HELP go_gc_duration_seconds A summary of the pause duration of garbage collection cycles.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 1.6425e-05
go_gc_duration_seconds{quantile="0.25"} 3.4856e-05
go_gc_duration_seconds{quantile="0.5"} 5.8672e-05
go_gc_duration_seconds{quantile="0.75"} 8.4572e-05
go_gc_duration_seconds{quantile="1"} 0.000452457
go_gc_duration_seconds_sum 12.595550358
go_gc_duration_seconds_count 185069
# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 7
# HELP go_info Information about the Go environment.
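Each non-comment line in this output is one sample: a metric name, an optional label set in braces, and a value. To make the format concrete, here is a rough Python sketch that parses such text. parse_metrics is a hypothetical helper, and it deliberately ignores parts of the exposition format such as escaped quotes inside label values and optional timestamps.

```python
import re

# metric name, optional {labels}, then the value
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the # HELP / # TYPE comment lines.
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples
```

Feeding it the response body of http://127.0.0.1:9100/metrics yields one tuple per sample, which is handy for quick checks before Prometheus itself is wired up.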

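Next, we walk through the configuration and then put Prometheus to use.

3. Configuring Prometheus

The Prometheus configuration file is written in YAML.

3.1 Configuration walkthrough

Global settings

# my global config
global:
  scrape_interval:     15s # how often targets are scraped; the default is 1 minute
  evaluation_interval: 15s # how often rules are evaluated; the default is also 1 minute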

Alertmanager settings

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - 127.0.0.1:9093 # address of the Alertmanager instance

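3.2 Rule files

The rule file paths are configurable. To keep different kinds of alerting rules clearly separated, you can split them across several rule files. Rule files define the conditions that trigger alerts, and Prometheus loads the rules from every matching file under rules/.

# Load the rules once and evaluate them periodically according to the global "evaluation_interval"
rule_files:
    - /data/prometheus-2.3.0.linux-amd64/rules/*.rules
    - /data/prometheus-2.3.0.linux-amd64/rules/*.yml
# *.rules and *.yml match every file with that suffix, e.g. 1.rules, 2.rules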

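3.3 Scrape endpoints and monitored targets

For easier maintenance, you can create separate directories per project or per business type and give each application its own target file.

Pay attention to metrics_path. node_exporter serves its metrics at /metrics, which is also the default, so the setting can be omitted for it. The example here monitors a Java backend that exposes its Prometheus metrics at /actuator/prometheus, so metrics_path is set to that value. The metrics are then fetched from "http://127.0.0.1:8080/actuator/prometheus".

- job_name: 'application'  # every metric scraped by this job automatically gets the label job="application"
  metrics_path: /actuator/prometheus
  # file_sd_configs uses file-based service discovery to find targets dynamically
  # Benefit: targets can be added or removed without restarting Prometheus
  file_sd_configs:
  - files: # path of the target files describing the monitored services
    - "/data/app/monitor/prometheus-2.3.0.linux-amd64/target/app/*.json"
    refresh_interval: 6s # how often the files are re-read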
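The URL Prometheus scrapes is assembled from the scheme, the target address, and metrics_path, with http and /metrics as the defaults. A tiny Python sketch of that composition follows; scrape_url is a hypothetical helper written for this illustration, not part of any Prometheus client library.

```python
def scrape_url(target: str, metrics_path: str = "/metrics", scheme: str = "http") -> str:
    """Compose the URL that a scrape job would request for one target."""
    if not metrics_path.startswith("/"):
        metrics_path = "/" + metrics_path
    return f"{scheme}://{target}{metrics_path}"


# node_exporter target with the default path
node_url = scrape_url("192.168.1.139:9100")
# Java backend from this section, with the overridden metrics_path
app_url = scrape_url("127.0.0.1:8080", "/actuator/prometheus")
```

Requesting the composed URL by hand with curl or a browser is a quick way to confirm that a job's metrics_path is right before adding the target.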

3.4 A complete configuration file

global:
  scrape_interval:     15s 
  evaluation_interval: 15s


alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - 127.0.0.1:9093

rule_files:
    - /data/prometheus-2.3.0.linux-amd64/rules/*.rules 
    - /data/prometheus-2.3.0.linux-amd64/rules/*.yml
scrape_configs:
  # Prometheus itself
  - job_name: 'prometheus'

    static_configs:
    - targets: ['192.168.1.139:8001']
      labels:
          instance: prometheus
  - job_name: 'node' # node_exporter targets
    file_sd_configs:
    - files:
      - "/data/prometheus-2.3.0.linux-amd64/target/node/*.json"
      refresh_interval: 6s
  - job_name: 'application' # backend application targets
    metrics_path: /actuator/prometheus
    file_sd_configs:
    - files:
      - "/data/prometheus-2.3.0.linux-amd64/target/application/*.json"
      refresh_interval: 6s

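3.5 Adding target files for monitored hosts on the server

# targets: addresses of the monitored hosts
# labels: labels attached to the target; combined with an Alertmanager notification template, they can be shown in the alert notification
[root@test node]# cat 192.168.1.139.json
[
  {
    "targets": ["192.168.1.139:9100"],
    "labels": {
      "env": "test",
      "servicename": "test",
      "hostname": "test-host"
    }
  }
]
Once created, place the file under /data/prometheus-2.3.0.linux-amd64/target/node/, the directory read by the node job; Prometheus rescans it periodically and loads the target automatically.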
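When many hosts need to be added, generating these target files from a script avoids hand-editing JSON. The sketch below is one possible approach in Python: build_target_groups and write_target_file are hypothetical helpers, the file is named after the host as in the example above, and it is written to a temporary name first so that a periodic rescan is unlikely to see a half-written file.

```python
import json
import os
import tempfile


def build_target_groups(address, labels):
    """Return the list-of-groups structure that file_sd_configs expects."""
    return [{"targets": [address], "labels": dict(labels)}]


def write_target_file(directory, address, labels):
    """Write <directory>/<host>.json atomically and return its path."""
    host = address.rsplit(":", 1)[0]
    path = os.path.join(directory, host + ".json")
    # The .tmp suffix keeps the temporary file out of the *.json glob.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(build_target_groups(address, labels), handle,
                  ensure_ascii=False, indent=2)
    os.chmod(tmp_path, 0o644)  # mkstemp creates the file as 0600
    os.replace(tmp_path, path)  # atomic rename on the same filesystem
    return path


# Example call, using the node target directory from this article:
# write_target_file("/data/prometheus-2.3.0.linux-amd64/target/node",
#                   "192.168.1.139:9100", {"env": "test"})
```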

Open http://192.168.1.139:8001 and you will see that 192.168.1.139:9100 has been added.

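3.6 Rule file (a custom rule that works together with a script)

[root@test rules]# cat zidingyi.rules
groups: # rule groups
- name: port # name of the rule group
  rules:
  - alert: "nexus (private repository)" # alert name
    expr: nexus == 0
    # for: 1m # how long the condition must hold before the alert fires; if unset it defaults to 0, so the alert goes straight from Inactive to Firing and is sent
    labels:
      severity: "critical"
    annotations: # additional information
      summary: "Port unreachable"
      description: "Template test, please ignore"
# Place zidingyi.rules under /data/prometheus-2.3.0.linux-amd64/rules/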
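The effect of the for setting on the Inactive, Pending, and Firing states can be modeled in a few lines of Python. This is a simplified sketch and not Prometheus's own implementation: alert_states is a hypothetical helper, and it assumes the rule is evaluated at a fixed interval with one true/false result per evaluation.

```python
def alert_states(results, interval_s, for_s):
    """Map a sequence of expr results to alert states, one per evaluation."""
    states = []
    active_since = None  # time at which expr first became true
    for index, is_true in enumerate(results):
        now = index * interval_s
        if not is_true:
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        # The alert fires once expr has stayed true for at least for_s seconds.
        held = now - active_since
        states.append("firing" if held >= for_s else "pending")
    return states


# Without "for" (0 seconds) the alert fires on the first true evaluation.
no_wait = alert_states([False, True, True], interval_s=15, for_s=0)
# With "for: 1m" and a 15s evaluation interval it stays pending first.
with_wait = alert_states([False, True, True, True, True, True], interval_s=15, for_s=60)
```

In this model a single false evaluation resets the timer, which is why a for duration helps filter out short blips.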

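Here we use monitoring nexus as the example.

On the node_exporter host, create a custom script:

cat /opt/soft/node_exporter-1.3.1.linux-amd64/key/key_runner.sh
#!/bin/bash
echo "nexus" $(netstat -tunlp | grep 8082 | wc -l)
# Prints "nexus 0" when nothing is listening on port 8082
# The alerting rule above uses expr: nexus == 0 to send the notification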

Add a crontab entry

*/1 * * * * /bin/bash /opt/soft/node_exporter-1.3.1.linux-amd64/key/key_runner.sh >/opt/soft/node_exporter-1.3.1.linux-amd64/key/key.prom

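Note that the script writes the custom metric into key.prom, in the format

nexus 1 or nexus 0
# The alerting rule uses the 0 or 1 value to decide whether to send a notification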
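The same check can also be written in Python. The sketch below differs from the shell script in two labeled ways: it tests the port by opening a TCP connection rather than parsing netstat output, and it writes key.prom through a temporary file followed by a rename, so node_exporter is less likely to read a partially written file. The helper names are made up for this example.

```python
import os
import socket
import tempfile


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def render_metric(name, value):
    """Format one sample line in the text format, e.g. 'nexus 1'."""
    return f"{name} {value}\n"


def write_prom_file(path, content):
    """Write content to path via a temporary file and an atomic rename."""
    directory = os.path.dirname(path) or "."
    # The collector only reads *.prom files, so the .tmp file is ignored.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write(content)
    os.chmod(tmp_path, 0o644)  # mkstemp creates the file as 0600
    os.replace(tmp_path, path)


# Example call, e.g. from cron:
# value = 1 if port_is_open("127.0.0.1", 8082) else 0
# write_prom_file("/opt/soft/node_exporter-1.3.1.linux-amd64/key/key.prom",
#                 render_metric("nexus", value))
```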

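node_exporter collects this file only if it was started with --collector.textfile.directory pointing at the directory that contains it.

Note: this article summarizes hands-on technical experience and is for reference only. For a real deployment, adjust the configuration to your own business scenario, environment, and security requirements; back up critical data and validate in a test environment before making changes.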

This article is part of the Tencent Cloud self-media syndication program and was shared from the WeChat official account Python运维开发.

Originally published: 2025-08-12
