7. Advanced Prometheus Monitoring: Custom Instrumentation of Business Applications

全栈工程师修炼指南

Published 2022-09-29 19:17:02

Included in the column: 全栈工程师修炼之路

0x00 Instrumentation and Client Libraries

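1. Introduction

Description: Prometheus can monitor a business service or application either through direct instrumentation or through a client library. Client libraries are available for many languages, including Go, Python, Java, and Ruby.

Tips: The three essential parts of instrumenting an application are importing the client module, defining the metrics, and instrumenting the target code.

Tips: Metric names must be unique. To avoid registering the same name twice, define your metrics at file (module) level rather than inside a class, function, or method.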

2. Environment Setup

Description: Install the Prometheus client library and the Flask module for Python 3. The demonstrations that follow are implemented mainly in Python.

pip install prometheus_client
pip install flask

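Tips: For the basic demos, Python 3's built-in http.server module is enough to start a simple web server.

3. Quick Start

3.1 Exposing Prometheus Metrics with Python

Description: Let's go straight to the code.

Method 1: plain Python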

#!/usr/bin/env python3
# -*- coding:utf-8 -*-
# Author:WeiyiGeek
# Desc: Hello World example exposing Prometheus metrics from Python
import http.server
from prometheus_client import start_http_server

class Hello(http.server.BaseHTTPRequestHandler):
  def do_GET(self):
    self.send_response(200, 'ok')
    self.end_headers()
    self.wfile.write(b"Hello World! - Python Prometheus_Client!")

if __name__ == '__main__':
  print("Hello World, Start Prometheus Client Server!")
  # Metrics are served on port 8000, the application itself on port 8001.
  start_http_server(8000)
  server = http.server.HTTPServer(('0.0.0.0', 8001), Hello)
  server.serve_forever()

Result: Visiting http://127.0.0.1:8001 shows "Hello World! - Python Prometheus_Client!", and http://127.0.0.1:8000/metrics shows the default metrics.

WeiyiGeek.HelloWorld

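Method 2: Python WSGI

Description: The Web Server Gateway Interface (WSGI) is the standard interface between Python web applications and web servers. The Python client provides a WSGI application for serving metrics, and by wrapping it you can add middleware such as authentication.

Description: With WSGI, the application and the metrics endpoint share a single port, so there is no need to run two servers, one generating data and one for Prometheus to scrape.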

# -*- coding:utf-8 -*-
# Author:WeiyiGeek
# Desc: Hello World example exposing Prometheus metrics through the prometheus_client WSGI app

from prometheus_client import make_wsgi_app
from wsgiref.simple_server import make_server

metrics_app = make_wsgi_app()

def HelloWSGI(environ, start_fn):
  # The metrics path ("/metrics" here) can be chosen freely
  if environ['PATH_INFO'] == '/metrics':
    return metrics_app(environ, start_fn)

  start_fn('200 OK', [])
  return [b'Hello World! - Python Prometheus_Client WSGI!']

if __name__ == '__main__':
  print("Hello World!, Start Prometheus Client Server!")
  httpd = make_server('0.0.0.0', 8000, HelloWSGI)
  httpd.serve_forever()

Example result:

WeiyiGeek.prometheus_client-WSGI

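3.2 Instrumenting with the Four Prometheus Metric Types

Counter type

Description: The counter is the most frequently used type in instrumentation. It records the number or size of events, and is typically used to track how often a particular code path is executed, how many records are processed, or how many bytes are served.

Talk is cheap, so here is the code: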

#!/usr/bin/env python3
# -*- coding:utf-8 -*-
# Author: WeiyiGeek
# Desc: Demonstrates recording metrics with the Counter type
import http.server
import random
from prometheus_client import start_http_server,Counter

# REQUESTS counts handled requests (the second argument is the help text)
REQUESTS = Counter('http_server_total', 'http server requested frequency count.')
# EXCEPTION / EXCEPTIONS count raised exceptions (via a context manager or a decorator)
EXCEPTION = Counter('http_server_execption_total', 'http server requested execption frequency count.')
EXCEPTIONS = Counter('http_server_execptions_total', 'http server requested execptions frequency count.')
# MONEYS tracks sales revenue in euros
MONEYS = Counter('http_server_euro_moneys_total','Euros made serving Http Server.')

class Hello(http.server.BaseHTTPRequestHandler): 
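  # count_exceptions() used as a function decorator
  @EXCEPTIONS.count_exceptions()
  def do_GET(self):
    # Incremented by 1 on every request
    REQUESTS.inc()
    # Exception counting is built into the client and does not interfere with application logic (the decorator form is recommended)
    # with EXCEPTIONS.count_exceptions():
    #   if random.randint(0,10) < 1:
    #     raise Exception
    # Sales revenue in euros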
    euros = random.randint(999,9999)
    MONEYS.inc(euros)
    self.send_response(200)
    self.end_headers()
    self.wfile.write("Hello World! Prometheus Counter Example for {} euros.".format(euros).encode())
  
if __name__ == '__main__':
  print("Hello World, Start Prometheus Client /metrics Server Port 8000!")
  start_http_server(8000)
  server = http.server.HTTPServer(('0.0.0.0',8001),Hello)
  server.serve_forever()

Result: Visit http://127.0.0.1:8000/metrics to view the metrics defined and recorded above.

# HELP http_server_total http server requested frequency count.
# TYPE http_server_total counter
http_server_total 23.0
# HELP http_server_created http server requested frequency count.
# TYPE http_server_created gauge
http_server_created 1.6232518107397504e+09
# HELP http_server_execption_total http server requested execption frequency count.
# TYPE http_server_execption_total counter
http_server_execption_total 0.0
# HELP http_server_execption_created http server requested execption frequency count.
# TYPE http_server_execption_created gauge
http_server_execption_created 1.6232518107397504e+09
# HELP http_server_execptions_total http server requested execptions frequency count.
# TYPE http_server_execptions_total counter
http_server_execptions_total 4.0
# HELP http_server_execptions_created http server requested execptions frequency count.
# TYPE http_server_execptions_created gauge
http_server_execptions_created 1.6232518107397504e+09
# HELP http_server_euro_moneys_total Euros made serving Http Server.
# TYPE http_server_euro_moneys_total counter
http_server_euro_moneys_total 90706.0
# HELP http_server_euro_moneys_created Euros made serving Http Server.
# TYPE http_server_euro_moneys_created gauge
http_server_euro_moneys_created 1.6232518107397504e+09
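
To check values like these from a script, the text exposition format can be read line by line. The following is a minimal sketch using only the standard library. `parse_metrics` and `SAMPLE_RE` are hypothetical names introduced here, and the parser is deliberately simplified: it handles plain `name value` and `name{labels} value` lines only, with no support for escaped label values or timestamps.

```python
import re

# Matches "name value" or "name{labels} value"; a simplification of the format.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{[^}]*\})?\s+(\S+)$')

def parse_metrics(text):
    """Return {series: float} for each sample line, skipping # HELP / # TYPE."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):
            continue
        match = SAMPLE_RE.match(line)
        if match:
            name, labels, value = match.groups()
            samples[name + (labels or '')] = float(value)
    return samples

# A few lines copied from the output above.
EXAMPLE = """\
# HELP http_server_total http server requested frequency count.
# TYPE http_server_total counter
http_server_total 23.0
http_server_euro_moneys_total 90706.0
"""

metrics = parse_metrics(EXAMPLE)
print(metrics['http_server_total'])
```

For anything beyond a quick check, a real parser is the safer choice, since label values may legally contain braces, quotes, and escapes that this regular expression does not handle.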

Tips: Configure Prometheus to scrape the instrumented application we just created:

scrape_configs:
- job_name: control
  honor_timestamps: true
  scrape_interval: 15s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  follow_redirects: true
  static_configs:
  - targets:
    - 100.201.12.103:8000

WeiyiGeek.Counter

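Gauge type

Description: A gauge holds a snapshot of some current state. Its value can go up as well as down, so a negative number may be passed to its inc method. Typical uses are the number of items in a queue, memory and disk usage, the number of active threads, or a value returned by a callback function.

Demo code: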

#!/usr/bin/env python3
# -*- coding:utf-8 -*-
# Author: WeiyiGeek
# Desc: Demonstrates recording metrics with the Gauge type
import http.server
import time
from prometheus_client import Gauge, start_http_server

# INPROGRESS / LAST track the number of requests in progress and the time the last one completed.
INPROGRESS = Gauge('app_inprogress', 'number of http server in progress')
LAST = Gauge('app_last_seconds', 'The last time a app was Served')

# INPROGRESSMODIFY / LASTTIME track the same things using the client's built-in helpers.
INPROGRESSMODIFY = Gauge('app_inprogress_modify', 'number of http server in progress modify')
LASTTIME = Gauge('app_last_time_seconds', 'The last time a app was Served')

# TIME is set from a callback; a simple `set_function()` example that returns the current time
TIME = Gauge('current_time_seconds','The Current Time')

class Hello(http.server.BaseHTTPRequestHandler): 
  @INPROGRESSMODIFY.track_inprogress()
  def do_GET(self):
    INPROGRESS.inc(3)  # +3
    self.send_response(200)
    self.end_headers()
    self.wfile.write(b"Hello World! Prometheus Gauge Type Example")
    LAST.set(time.time())
    LASTTIME.set_to_current_time()
    # set_function registers a callback that is evaluated on every scrape
    TIME.set_function(lambda: time.time())
    INPROGRESS.dec()  # -1
    print("End Time: {}".format(time.asctime(time.localtime(time.time()))))
  
if __name__ == '__main__':
  print("Hello World, Start Prometheus Client /metrics Server!\nServer: 127.0.0.1:8000/metrics ")
  start_http_server(8000)
  server = http.server.HTTPServer(('0.0.0.0',8001),Hello)
  server.serve_forever()

Result: http://127.0.0.1:8000/metrics

# HELP app_inprogress number of http server in progress
# TYPE app_inprogress gauge
app_inprogress 2.0    # each request adds +3 then -1
# HELP app_last_seconds The last time a app was Served
# TYPE app_last_seconds gauge
app_last_seconds 1.6233011413151145e+09
# HELP app_inprogress_modify number of http server in progress modify
# TYPE app_inprogress_modify gauge
app_inprogress_modify 0.0
# HELP app_last_time_seconds The last time a app was Served
# TYPE app_last_time_seconds gauge
app_last_time_seconds 1.6233011413151145e+09
# HELP current_time_seconds The Current Time
# TYPE current_time_seconds gauge
current_time_seconds 1.6233011425841126e+09

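Summary type

Description: This type is mainly used to monitor system performance, for example average latency. Besides backend latency, you may also want to track the size of the response bodies received from the backend. A summary exposes <name>_count and <name>_sum, which are the number of observations and the sum of the observed values.

Code demo: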

#!/usr/bin/env python3
# -*- coding:utf-8 -*-
# Author: WeiyiGeek
# Desc: Demonstrates recording metrics with the Summary type
import http.server
import time
from prometheus_client import Summary,start_http_server

# LATENCY tracks the time spent by the app.
LATENCY = Summary('app_latency_seconds','app 程序耗时统计单位秒')
# LATENCYMODIFY tracks the time spent by the app using a decorator.
LATENCYMODIFY = Summary('app_latencymodify_seconds','采用修饰器来验证 app 程序耗时统计单位秒')

class Hello(http.server.BaseHTTPRequestHandler):
  @LATENCYMODIFY.time()
  def do_GET(self):
    # Request start time
    start = time.time()
    self.send_response(200)
    self.end_headers()
    self.wfile.write(b"Hello World! Prometheus Summary Example!")
    LATENCY.observe(time.time() - start)  

if __name__ == '__main__':
  print("Hello World, Start Prometheus Client /metrics Server!\nServer: 127.0.0.1:8000/metrics")
  start_http_server(8000)
  server = http.server.HTTPServer(('0.0.0.0',8001),Hello)
  server.serve_forever()

Result: by default a summary exposes <name>_count and <name>_sum.

# HELP app_latency_seconds app 程序耗时统计单位秒
# TYPE app_latency_seconds summary
app_latency_seconds_count 15.0                # number of requests to the app page
app_latency_seconds_sum 0.005997180938720703  # total time spent serving those requests, in seconds
# HELP app_latency_seconds_created app 程序耗时统计单位秒
# TYPE app_latency_seconds_created gauge
app_latency_seconds_created 1.6233078021234589e+09
# HELP app_latencymodify_seconds 采用修饰器来验证 app 程序耗时统计单位秒
# TYPE app_latencymodify_seconds summary
app_latencymodify_seconds_count 15.0
app_latencymodify_seconds_sum 0.006268700000005012
# HELP app_latencymodify_seconds_created 采用修饰器来验证 app 程序耗时统计单位秒
# TYPE app_latencymodify_seconds_created gauge
app_latencymodify_seconds_created 1.6233078021234589e+09

Tips: To get the average latency over the last minute in Prometheus, use the PromQL query rate(app_latency_seconds_sum{job="control"}[1m]) / rate(app_latency_seconds_count{job="control"}[1m]).
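
The arithmetic behind that query can be sketched in a few lines. Dividing the two rates cancels the time window, leaving the change in `_sum` divided by the change in `_count` between two scrapes. `average_latency` is a hypothetical helper, and the sketch ignores counter resets and the extrapolation that Prometheus applies inside `rate()`.

```python
def average_latency(sum_start, count_start, sum_end, count_end):
    """Average observed value between two scrapes of a summary."""
    delta_count = count_end - count_start
    if delta_count <= 0:
        return None  # no new observations in the window
    return (sum_end - sum_start) / delta_count

# Values from the sample output above, measured from process start (0, 0).
avg = average_latency(0.0, 0.0, 0.005997180938720703, 15.0)
print(f"{avg * 1000:.4f} ms per request")
```

Returning `None` for an empty window mirrors the fact that the PromQL expression has no meaningful result when no requests arrived.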

WeiyiGeek.app_latency_seconds_sum

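Histogram type

Description: A histogram tells you how many events fall below given values, which makes it possible to estimate quantiles. For example, a 0.95 quantile of 300 ms means that 95% of requests took less than 300 ms. A histogram metric consists of _count, _sum, and _bucket series.

Key point: buckets are an ordered set of counter time series. The number of buckets affects performance, so you can use metric_relabel_configs in Prometheus to drop buckets you do not need. For example, a set of buckets (1ms, 10ms, 25ms) tracks the number of events that fall into each bucket.

Code demo: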

#!/usr/bin/env python3
# -*- coding:utf-8 -*-
# Author: WeiyiGeek
# Desc: Demonstrates recording metrics with the Histogram type
import http.server
import time
from prometheus_client import Histogram,start_http_server

# LATENCYTIME tracks latency with the time() helper (Histogram)
LATENCYTIME = Histogram('app_latency_time_seconds','app 程序访问延迟时间统计',buckets=[0.0001 * (2**x) for x in range(1,10)])

class Hello(http.server.BaseHTTPRequestHandler):
  @LATENCYTIME.time()
  def do_GET(self):
    self.send_response(200)
    self.end_headers()
    self.wfile.write(b"Hello World! Prometheus Histogram Example!")

if __name__ == '__main__':
  print("Hello World, Start Prometheus Client /metrics Server!\nServer: 127.0.0.1:8000/metrics")
  start_http_server(8000)
  server = http.server.HTTPServer(('0.0.0.0',8001),Hello)
  server.serve_forever()

Result:

# HELP app_latency_time_seconds app 程序访问延迟时间统计
# TYPE app_latency_time_seconds histogram
app_latency_time_seconds_bucket{le="0.0002"} 0.0
app_latency_time_seconds_bucket{le="0.0004"} 31.0
app_latency_time_seconds_bucket{le="0.0008"} 51.0
app_latency_time_seconds_bucket{le="0.0016"} 56.0
app_latency_time_seconds_bucket{le="0.0032"} 57.0
app_latency_time_seconds_bucket{le="0.0064"} 57.0
app_latency_time_seconds_bucket{le="0.0128"} 57.0
app_latency_time_seconds_bucket{le="0.0256"} 57.0
app_latency_time_seconds_bucket{le="0.0512"} 57.0
app_latency_time_seconds_bucket{le="+Inf"} 57.0
app_latency_time_seconds_count 57.0
app_latency_time_seconds_sum 0.027456399999900682
# HELP app_latency_time_seconds_created app 程序访问延迟时间统计
# TYPE app_latency_time_seconds_created gauge
app_latency_time_seconds_created 1.6233113814601374e+09

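Tips: Quantile versus percentile: 95% is the 0.95 quantile. Because Prometheus prefers base units, it works with quantiles by default, whereas ratios in Grafana are often shown as percentiles.

Tips: A histogram's bucket series always carry the le label, and the +Inf bucket is required and must not be dropped.

Tips: Use the PromQL function histogram_quantile to compute quantiles from the buckets. For example, the 0.95 quantile (95th percentile) is histogram_quantile(0.95, rate(app_latency_time_seconds_bucket[1m])).

Tips: For a service-level agreement (SLA), explicitly define a bucket at the latency threshold. For example, with a 0.0016s bucket in the histogram, the following query gives the exact share of requests that finished within 0.0016s (and therefore the share that took longer), which tells you whether the SLA is met.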

# app_latency_time_seconds_bucket{le="0.0016"} 56.0
# app_latency_time_seconds_bucket{le="+Inf"} 57.0
app_latency_time_seconds_bucket{le="0.0016"} / ignoring(le) app_latency_time_seconds_bucket{le="+Inf"}   
{instance="10.20.172.103:8000", job="control"}	0.9824561403508771  # => 56 / 57
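
Both calculations can be reproduced offline from the bucket counts shown earlier. The sketch below is an approximation of the idea behind histogram_quantile, not Prometheus's implementation: it interpolates linearly inside the bucket that contains the target rank and works on raw cumulative counts rather than on `rate()` results. `estimate_quantile` and `fraction_within` are hypothetical names.

```python
import math

# Cumulative bucket counts copied from the sample output above: (le, count).
BUCKETS = [
    (0.0002, 0.0), (0.0004, 31.0), (0.0008, 51.0), (0.0016, 56.0),
    (0.0032, 57.0), (0.0064, 57.0), (0.0128, 57.0), (0.0256, 57.0),
    (0.0512, 57.0), (math.inf, 57.0),
]

def estimate_quantile(q, buckets):
    """Estimate the q-quantile by linear interpolation inside one bucket."""
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound) or count == lower_count:
                # Rank falls in +Inf (or an empty bucket): use the lower bound.
                return lower_bound
            width = upper_bound - lower_bound
            return lower_bound + width * (rank - lower_count) / (count - lower_count)
        lower_bound, lower_count = upper_bound, count
    return lower_bound

def fraction_within(threshold, buckets):
    """Share of observations at or below a bucket boundary (the SLA ratio)."""
    counts = dict(buckets)
    return counts[threshold] / counts[math.inf]

p95 = estimate_quantile(0.95, BUCKETS)
sla = fraction_within(0.0016, BUCKETS)
print(p95, sla)
```

The SLA ratio is exact only because 0.0016 is itself a bucket boundary; a threshold between boundaries could only be estimated, which is why the tip recommends defining a bucket at the SLA value.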

Tips: Similarly, the following expression computes the average latency over the last 30 minutes: rate(app_latency_time_seconds_sum{job="control"}[30m]) / rate(app_latency_time_seconds_count{job="control"}[30m])

WeiyiGeek.histogram_quantile

3.3 Python Prometheus Client Library in Practice

Reference: https://prometheus.io/docs/instrumenting/clientlibs/

Counter: counters go up, and reset when the process restarts.

from prometheus_client import Counter
c = Counter('my_failures', 'Description of counter')
c.inc()     # Increment by 1
c.inc(1.6)  # Increment by given value

Tips: A counter can only increase and is reset to zero on restart. Use counters for the number of requests served, tasks completed, or errors.
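
Because a counter restarts from zero with the process, anything that computes growth from raw samples has to treat a drop as a reset. The sketch below shows that idea in isolation. `increase` is a hypothetical helper, and unlike PromQL's rate() and increase() it does not extrapolate to the edges of a time window.

```python
def increase(samples):
    """Total increase across samples, treating any drop as a restart from 0."""
    total = 0.0
    for previous, current in zip(samples, samples[1:]):
        if current >= previous:
            total += current - previous
        else:
            # The counter went down: the process restarted, so count from zero.
            total += current
    return total

# A counter scraped six times; the process restarts after the third scrape.
scrapes = [5.0, 8.0, 12.0, 2.0, 4.0, 9.0]
print(increase(scrapes))
```

Subtracting the first sample from the last would give 4 here, while the reset-aware sum gives 16, which is why raw counter values are rarely graphed directly.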

Gauge: gauges can go up and down.

# Example
from prometheus_client import Gauge
g = Gauge('my_inprogress_requests', 'Description of gauge')
g.inc()      # Increment by 1
g.dec(10)    # Decrement by given value
g.set(4.2)   # Set to a given value

# Setting labels
from prometheus_client import Counter
c = Counter('my_requests_total', 'HTTP Failures', ['method', 'endpoint'])
c.labels('get', '/').inc()
c.labels('post', '/submit').inc()
# Labels can also be passed as keyword arguments (reusing the counter above,
# since registering the same metric name twice raises a ValueError):
c.labels(method='get', endpoint='/').inc()
c.labels(method='post', endpoint='/submit').inc()

# Gauge example with a label (a distinct name avoids clashing with the gauge above):
from prometheus_client import Gauge
g = Gauge('my_labeled_inprogress_requests', 'Description of gauge', ['mylabelname'])
g.labels(mylabelname='labelname').set(30)

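Tips: A gauge represents a single numerical value that can arbitrarily go up and down. It is typically used for measured values such as temperature or current memory usage.

Tips: Do not apply growth-rate functions such as rate or irate to gauge metrics; they are meant for counters.

Summary, Histogram: these two metric types are used less often.

The following example uses a Gauge to expose the CPU usage of the local host: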

import psutil
import socket
from prometheus_client import Gauge, start_http_server
from time import sleep

g = Gauge('cpu_use_percent_test_metric', 'Description of gauge', ['hostip'])
host_ip = socket.gethostbyname(socket.getfqdn(socket.gethostname()))  # IP address of this host

def get_cpu_use():
    cpu_use_percent = psutil.cpu_percent(0.5)      # CPU usage sampled over 0.5 seconds
    g.labels(hostip=host_ip).set(cpu_use_percent)  # host IP as the label, CPU usage as the value

if __name__ == '__main__':
    start_http_server(8006)                        # serve metrics on port 8006
    while True:
        get_cpu_use()
        sleep(10)

Installing and using the Python prometheus-client

pip install prometheus-client

# (1) A Python wrapper class

#!/usr/bin/env python
# -*- coding:utf-8 -*-
from prometheus_client import Counter, Gauge, Summary
from prometheus_client.core import CollectorRegistry
from prometheus_client.exposition import choose_encoder

class Monitor:
    def __init__(self):
        # Collector registry and map of per-URI maximum durations
        self.collector_registry = CollectorRegistry(auto_describe=False)
        self.request_time_max_map = {}

        # Summary of request durations
        self.http_request_summary = Summary(name="http_server_requests_seconds",
                                            documentation="Num of request time summary",
                                            labelnames=("method", "code", "uri"),
                                            registry=self.collector_registry)
        # Maximum request duration
        self.http_request_max_cost = Gauge(name="http_server_requests_seconds_max",
                                           documentation="Number of request max cost",
                                           labelnames=("method", "code", "uri"),
                                           registry=self.collector_registry)

        # Number of failed requests
        self.http_request_fail_count = Counter(name="http_server_requests_error",
                                               documentation="Times of request fail in total",
                                               labelnames=("method", "code", "uri"),
                                               registry=self.collector_registry)

        # Time spent on model prediction
        self.http_request_predict_cost = Counter(name="http_server_requests_seconds_predict",
                                                 documentation="Seconds of prediction cost in total",
                                                 labelnames=("method", "code", "uri"),
                                                 registry=self.collector_registry)
        # Time spent downloading images
        self.http_request_download_cost = Counter(name="http_server_requests_seconds_download",
                                                  documentation="Seconds of download cost in total",
                                                  labelnames=("method", "code", "uri"),
                                                  registry=self.collector_registry)

    # Render the /metrics response
    def get_prometheus_metrics_info(self, handler):
        encoder, content_type = choose_encoder(handler.request.headers.get('accept'))
        handler.set_header("Content-Type", content_type)
        handler.write(encoder(self.collector_registry))
        self.reset_request_time_max_map()

    # Record a summary observation
    def set_prometheus_request_summary(self, handler):
        self.http_request_summary.labels(handler.request.method, handler.get_status(), handler.request.path).observe(handler.request.request_time())
        self.set_prometheus_request_max_cost(handler)

    # Record a summary observation from explicit values
    def set_prometheus_request_summary_customize(self, method, status, path, cost_time):
        self.http_request_summary.labels(method, status, path).observe(cost_time)
        self.set_prometheus_request_max_cost_customize(method, status, path, cost_time)

    # Count failed requests
    def set_prometheus_request_fail_count(self, handler, amount=1.0):
        self.http_request_fail_count.labels(handler.request.method, handler.get_status(), handler.request.path).inc(amount)

    # Count failed requests from explicit values
    def set_prometheus_request_fail_count_customize(self, method, status, path, amount=1.0):
        self.http_request_fail_count.labels(method, status, path).inc(amount)

    # Track the maximum request duration
    def set_prometheus_request_max_cost(self, handler):
        request_cost = handler.request.request_time()
        if self.check_request_time_max_map(handler.request.path, request_cost):
            self.http_request_max_cost.labels(handler.request.method, handler.get_status(), handler.request.path).set(request_cost)
            self.request_time_max_map[handler.request.path] = request_cost

    # Track the maximum request duration from explicit values
    def set_prometheus_request_max_cost_customize(self, method, status, path, cost_time):
        if self.check_request_time_max_map(path, cost_time):
            self.http_request_max_cost.labels(method, status, path).set(cost_time)
            self.request_time_max_map[path] = cost_time

    # Accumulate prediction time
    def set_prometheus_request_predict_cost(self, handler, amount=1.0):
        self.http_request_predict_cost.labels(handler.request.method, handler.get_status(), handler.request.path).inc(amount)

    # Accumulate prediction time from explicit values
    def set_prometheus_request_predict_cost_customize(self, method, status, path, cost_time):
        self.http_request_predict_cost.labels(method, status, path).inc(cost_time)

    # Accumulate download time
    def set_prometheus_request_download_cost(self, handler, amount=1.0):
        self.http_request_download_cost.labels(handler.request.method, handler.get_status(), handler.request.path).inc(amount)

    # Accumulate download time from explicit values
    def set_prometheus_request_download_cost_customize(self, method, status, path, cost_time):
        self.http_request_download_cost.labels(method, status, path).inc(cost_time)

    # Check whether the max-duration map should be updated
    def check_request_time_max_map(self, uri, cost):
        if uri not in self.request_time_max_map:
            return True
        if self.request_time_max_map[uri] < cost:
            return True
        return False

    # Reset the max-duration map
    def reset_request_time_max_map(self):
        for key in self.request_time_max_map:
            self.request_time_max_map[key] = 0.0


# (2) Using the wrapper
#!/usr/bin/env python
# -*- coding:utf-8 -*-
import tornado
import tornado.ioloop
import tornado.web
import tornado.gen
from datetime import datetime
from tools.monitor import Monitor

g_monitor = None  # module-level holder, assigned in the __main__ block below

class ClassifierHandler(tornado.web.RequestHandler):
    def post(self):
        # TODO Something you need
        # work....
        # Record the summary, which covers request count and per-request duration
        g_monitor.set_prometheus_request_summary(self)
        self.write("OK")


class PingHandler(tornado.web.RequestHandler):
    def head(self):
        print('INFO', datetime.now(), "/ping Head.")
        g_monitor.set_prometheus_request_summary(self)
        self.write("OK")

    def get(self):
        print('INFO', datetime.now(), "/ping Get.")
        g_monitor.set_prometheus_request_summary(self)
        self.write("OK")


class MetricsHandler(tornado.web.RequestHandler):
    def get(self):
        print('INFO', datetime.now(), "/metrics Get.")
        g_monitor.set_prometheus_request_summary(self)
        # Return the collected metrics through the /metrics endpoint
        g_monitor.get_prometheus_metrics_info(self)
    

def make_app():
    return tornado.web.Application([
        (r"/ping?", PingHandler),
        (r"/metrics?", MetricsHandler),
        (r"/work?", ClassifierHandler)
    ])

if __name__ == "__main__":
    g_monitor = Monitor()

    port = 8000  # assumed value; the original listing leaves `port` undefined
    app = make_app()
    app.listen(port)
    tornado.ioloop.IOLoop.current().start()

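0x01 PushGateway in Practice

Description: PushGateway is an important member of the Prometheus ecosystem. It lets any client push custom metrics in the expected format to it, and Prometheus then scrapes them from one place.

1. Basics

It is commonly used in two scenarios:

1) Scenario 1: Prometheus scrapes on a schedule (pull mode), but a subnet boundary or firewall may prevent it from reaching each target directly. Each target can instead push its data to the PushGateway, and Prometheus pulls from the PushGateway on schedule.
2) Scenario 2: When several kinds of business data within a company need to be monitored and aggregated in one place, PushGateway can collect them centrally for Prometheus to scrape.

Goal: Use Pushgateway to present collected data to Prometheus in several ways, uploading metrics to Pushgateway with both Python and curl.

Environment: Prometheus v2.27.1, Pushgateway v1.4.1

Tips: Pushgateway is not a way to turn Prometheus from a pull model into a push model.

2. Hands-on Configuration

Description: The following configuration assumes a working Prometheus server is already installed.

Step 1. Download and install Pushgateway, then register it as a systemd service.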

sudo tee /usr/lib/systemd/system/pushgateway.service <<'EOF'
[Unit]
Description=Prometheus pushgateway Server Systemd
Documentation=https://prometheus.io
After=network.target

[Service]
Type=simple
StandardError=journal
ExecStart=/usr/local/pushgateway/pushgateway --web.listen-address=:9091
Restart=on-failure
RestartSec=3s

[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload && sudo systemctl restart pushgateway.service

WeiyiGeek.Verifying the Pushgateway service

Step 2. In the prometheus.yml configuration file, add a scrape job with the IP and port of the locally installed Pushgateway.

- job_name: 'pushgateway'
  honor_labels: true
  static_configs:
  - targets: ['localhost:9091']

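Tips: honor_labels: true is set because Prometheus attaches its own job and instance labels to everything it scrapes from the Pushgateway, and those only identify the Pushgateway instance, not the source of the data. With honor_labels: true, the job and instance labels carried by the pushed data are kept rather than overwritten.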
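
The effect of the setting can be modelled as a small label-merging function. This is an illustrative sketch rather than Prometheus code. `merge_labels` is a hypothetical name and the label values are made up; the behaviour shown for honor_labels: false (the scraped label is kept under an `exported_` prefix) follows the documented conflict handling.

```python
def merge_labels(scraped, target, honor_labels):
    """Combine labels from a scraped sample with the scrape target's labels."""
    merged = dict(scraped)
    for name, value in target.items():
        if name in scraped:
            if honor_labels:
                continue  # keep the pushed value, ignore the target's
            merged["exported_" + name] = scraped[name]
        merged[name] = value
    return merged

# Illustrative values: labels pushed by a job vs. labels of the gateway target.
pushed = {"job": "app", "instance": "test_instance"}
gateway = {"job": "pushgateway", "instance": "localhost:9091"}

kept = merge_labels(pushed, gateway, honor_labels=True)
renamed = merge_labels(pushed, gateway, honor_labels=False)
print(kept)
print(renamed)
```

With the setting off, queries would have to filter on exported_job instead of job, which is the inconvenience the tip is warning about.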

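Step 3. Data can be pushed to Pushgateway in several ways. The most common are the official Python client library and uploading metrics with an HTTP POST via curl.

Description: The Python prometheus_client module provides the three functions below; metrics can also be pushed with a POST from the requests module. Official documentation: https://github.com/prometheus/client_python
- push_to_gateway: replaces all metrics of the job (group) with the data just pushed (implemented with the HTTP PUT method)
- pushadd_to_gateway: overwrites metrics with the same name and leaves previously pushed metrics with other names unchanged (implemented with the HTTP POST method)
- delete_from_gateway: deletes all metrics of the job (group) (implemented with the HTTP DELETE method)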
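
All three functions address the same kind of URL, in which the job name and any extra grouping labels are encoded as path segments. The sketch below builds such a URL with the standard library only. `grouping_url` and `PUSH_METHODS` are hypothetical names, and the sketch assumes label values without slashes, since those need a special encoding that is not handled here.

```python
from urllib.parse import quote

# HTTP method used by each prometheus_client helper, as listed above.
PUSH_METHODS = {
    "push_to_gateway": "PUT",
    "pushadd_to_gateway": "POST",
    "delete_from_gateway": "DELETE",
}

def grouping_url(gateway, job, grouping_key=None):
    """Build http://<gateway>/metrics/job/<job>/<label>/<value>/..."""
    parts = ["metrics", "job", quote(job, safe="")]
    for label, value in (grouping_key or {}).items():
        parts += [quote(label, safe=""), quote(str(value), safe="")]
    return "http://" + gateway + "/" + "/".join(parts)

url = grouping_url("10.10.107.249:9091", "app", {"custom_instance": "test_instance"})
print(PUSH_METHODS["pushadd_to_gateway"], url)
```

Every distinct combination of job and grouping labels forms its own group on the gateway, so PUT and DELETE only ever affect the group named in the URL.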

Putting it together:

#!/usr/bin/env python3
# -*- coding:utf-8 -*-
# Author: WeiyiGeek
# Desc: Define custom metrics in Python and push the collected data to Pushgateway

import requests,random,pprint
from prometheus_client import CollectorRegistry, Gauge, pushadd_to_gateway

pushgateway_ip = "10.10.107.249"
pushgateway_port = "9091"

# Method 1: instrument a batch job and push its metrics to Pushgateway.
registry = CollectorRegistry()
durationtime = Gauge('app_job_duration_time_seconds','duration of my batch in seconds', registry=registry)
try:
  with durationtime.time():
    # Gauge(metric_name, help_text, label_names, registry=registry)
    suc = Gauge('app_job_sucess_seconds','Last time my job successfully finished',['dst_ip','name'],registry=registry)
    suc.labels('192.168.1.1','app').set(random.randrange(999,9999))
    suc.labels('192.168.1.11','weblogic').dec(2)  # dec: decrease by 2
    suc.labels('192.168.1.12','nginx').inc()  # inc: increase, by 1 by default
finally:
  pushadd_to_gateway(pushgateway_ip+":"+pushgateway_port,job='app',registry=registry)

# Method 2: upload metrics with the requests module.
quota = """
# TYPE test_metric counter
test_metric{label="test"} %s
# TYPE another_metric gauge
# HELP another_metric Just an example.
another_metric{label="prod"} %s
"""%(random.randrange(1,100),random.random())
# quota_bin = binascii.hexlify(quota)
res = requests.post(url="http://"+pushgateway_ip+":"+pushgateway_port+"/metrics/job/app/custom_instance/test_instance",data=quota,headers={'Content-Type': 'application/octet-stream'})
pprint.pprint(res)
pprint.pprint(res.status_code)

# -- Output ---
<Response [200]>
200

WeiyiGeek.pushadd_to_gateway

Step 4. Upload metrics to Pushgateway with curl on Linux.

# - 1. Push one or more metrics to the gateway
echo "name_count 1" | curl --data-binary @- http://10.10.107.249:9091/metrics/job/curl_job
cat <<EOF | curl --data-binary @- http://10.10.107.249:9091/metrics/job/curl_job/app/weiyigeek
# TYPE app_init_total counter
# HELP app_init_total This is app initialization count
app_init_total 1314
# TYPE app_init_metric gauge
# HELP app_init_metric Just an example.
app_init_metric 1024.16
EOF

# - 2. Write several metrics to a file and upload it using curl's @file syntax
tee metrics <<'EOF'
http_request_total{code="200",domain="weiyigeek.top"} 276683
http_request_total{code="400",domain="weiyigeek.top"} 0
http_request_total{code="408",domain="weiyigeek.top"} 7
http_request_total{code="401",domain="weiyigeek.top"} 0
http_request_total{schema="http",domain="weiyigeek.top"} 277725
http_request_total{schema="https",domain="weiyigeek.top"} 0
http_request_time{code="total",domain="weiyigeek.top"} 76335.809000
http_request_uniqip{domain="weiyigeek.top"} 216944
http_request_maxip{clientip="172.27.0.12",domain="weiyigeek.top"} 81
EOF
curl -XPOST --data-binary @metrics http://10.10.107.249:9091/metrics/job/curl_job/app/my_blog/site/weiyigeek.top

WeiyiGeek.curl-post

Step 5. Example of instrumenting metrics with multiple labels

#!/usr/bin/env python3
# -*- coding:utf-8 -*-
# Author: WeiyiGeek
# Desc: Demonstrates setting multiple labels on instrumented metrics
import http.server
import random
from prometheus_client import Gauge, Summary, start_http_server,Counter

# - REQUESTS counts requests per path and HTTP method (multiple labels)
REQUESTS = Counter('http_requested_total', '网站各访问路径与请求方法合计',['path', 'method'])
# - ENUM counts how often each enumerated value occurs (multiple labels)
ENUM = Counter('http_enum_total', '请求网站后枚举值的此时',['index','name'])
DEMO_ARAAY = ["one","two","there","four"]
# - MODIFIER uses a decorator on a labeled metric (multiple labels)
MODIFIER = Summary('http_modifier_seconds', '请求网站中使用带有标签的修饰器', ['index','name'])
modifier = MODIFIER.labels('MODIFIER','weiyigeek')
# - DEMOINFO exposes static information as labels on an info-style gauge
python_info = { "implementation": "CPython", "major": "3", "minor": "7", "patchlevel": "0", "version": "3.8.0" }
DEMOINFO = Gauge('demo_info','暴露指标的指定信息',labelnames=python_info.keys())
DEMOINFO.labels(**python_info).set(1)


class Hello(http.server.BaseHTTPRequestHandler): 
  @modifier.time()
  def do_GET(self):
    REQUESTS.labels(self.path, self.command).inc()
    ENUM.labels(random.choice(DEMO_ARAAY), 'ENUM').inc()
    self.send_response(200)
    self.end_headers()
    self.wfile.write(b"Hello World! Multi label Example!")
  
if __name__ == '__main__':
  print("Start Prometheus Client /metrics Server!")
  start_http_server(8000)
  server = http.server.HTTPServer(('0.0.0.0',8001),Hello)
  server.serve_forever()

Sample output:

# HELP http_requested_total 网站各访问路径与请求方法合计
# TYPE http_requested_total counter
http_requested_total{method="GET",path="/number"} 4.0
http_requested_total{method="GET",path="/favicon.ico"} 7.0
http_requested_total{method="GET",path="/result"} 1.0
http_requested_total{method="GET",path="/"} 2.0
http_requested_total{method="GET",path="/api/"} 5.0
# HELP http_requested_created 网站各访问路径与请求方法合计
# TYPE http_requested_created gauge
http_requested_created{method="GET",path="/number"} 1.623854061065871e+09
http_requested_created{method="GET",path="/favicon.ico"} 1.6238540611298337e+09
http_requested_created{method="GET",path="/result"} 1.6238540711289563e+09
http_requested_created{method="GET",path="/"} 1.6238540782242467e+09
http_requested_created{method="GET",path="/api/"} 1.6238541072589352e+09
# HELP http_enum_total 请求网站后枚举值的此时
# TYPE http_enum_total counter
http_enum_total{index="two",name="ENUM"} 4.0
http_enum_total{index="there",name="ENUM"} 4.0
http_enum_total{index="one",name="ENUM"} 6.0
http_enum_total{index="four",name="ENUM"} 5.0
# HELP http_enum_created 请求网站后枚举值的此时
# TYPE http_enum_created gauge
http_enum_created{index="two",name="ENUM"} 1.623854061065871e+09
http_enum_created{index="there",name="ENUM"} 1.6238540611298337e+09
http_enum_created{index="one",name="ENUM"} 1.623854061224779e+09
http_enum_created{index="four",name="ENUM"} 1.6238540614066763e+09
# HELP http_modifier_seconds 请求网站中使用带有标签的修饰器
# TYPE http_modifier_seconds summary
http_modifier_seconds_count{index="MODIFIER",name="weiyigeek"} 19.0
http_modifier_seconds_sum{index="MODIFIER",name="weiyigeek"} 0.008954100000021725
# HELP http_modifier_seconds_created 请求网站中使用带有标签的修饰器
# TYPE http_modifier_seconds_created gauge
http_modifier_seconds_created{index="MODIFIER",name="weiyigeek"} 1.6238540450573144e+09
# HELP demo_info 暴露指标的指定信息
# TYPE demo_info gauge
demo_info{implementation="CPython",major="3",minor="7",patchlevel="0",version="3.8.0"} 1.0

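Step 6. Delete pushed metrics. The first command below deletes all metrics in the group identified by {job="curl_job"}. Note that this does not include metrics in the group {job="curl_job",instance="some_instance"}, even though those metrics have the same job label.

# - 1. Deletes only the group whose sole grouping label is job="curl_job"
curl -X DELETE http://10.10.107.249:9091/metrics/job/curl_job/

# - 2. Deletes the group with the given job name and grouping labels
curl -X DELETE http://10.10.107.249:9091/metrics/job/curl_job/app/weiyigeek

# - 3. If Pushgateway was started with --web.enable-admin-api, all metrics can be wiped in one call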
curl -X PUT http://10.10.107.249:9091/api/v1/admin/wipe

Tips: Pushed metrics stay in Pushgateway indefinitely unless you delete them. When you retire a batch job, be sure to remove the metrics it stored in Pushgateway.
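
A batch job can perform this cleanup itself when it is retired. The sketch below prepares the same DELETE call as the curl commands above using only the standard library. `build_delete_request` is a hypothetical helper; the request is built but not sent, since sending it requires a reachable Pushgateway.

```python
import urllib.request

def build_delete_request(gateway, job, grouping_key=None):
    """Prepare (but do not send) a DELETE for one Pushgateway group."""
    path = "/metrics/job/" + job
    for label, value in (grouping_key or {}).items():
        path += "/" + label + "/" + value
    return urllib.request.Request("http://" + gateway + path, method="DELETE")

request = build_delete_request("10.10.107.249:9091", "curl_job", {"app": "weiyigeek"})
print(request.get_method(), request.full_url)
# To actually send it (requires a reachable Pushgateway):
# urllib.request.urlopen(request, timeout=5)
```

If the job already uses prometheus_client, its delete_from_gateway function covers the same need without hand-built URLs.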

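0x02 Blackbox Exporter in Practice

1. Basics

Description: The Blackbox exporter probes targets over HTTP, HTTPS, DNS, TCP, and ICMP. It is mainly used where an exporter cannot be run directly inside the application instance.

Example: To check that a web service is actually serving users, probe it through its load balancer or VIP address.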

Blackbox exporter configuration: https://github.com/prometheus/blackbox_exporter/blob/master/CONFIGURATION.md

# - Generic placeholders are defined as follows:
  * <boolean>: a boolean that can take the values true or false
  * <int>: a regular integer
  * <duration>: a duration matching the regular expression [0-9]+(ms|[smhdwy])
  * <filename>: a valid path in the current working directory
  * <string>: a regular string
  * <secret>: a regular string that is a secret, such as a password
  * <regex>: a regular expression

# - exporter configuration file format
Module
  # The protocol over which the probe will take place (http, tcp, dns, icmp).
  prober: <prober_string>
  # How long the probe will wait before giving up.
  [ timeout: <duration> ]
  # The specific probe configuration - at most one of these should be specified.
  [ http: <http_probe> ]
  [ tcp: <tcp_probe> ]
  [ dns: <dns_probe> ]
  [ icmp: <icmp_probe> ]

Official example: https://github.com/prometheus/blackbox_exporter/blob/master/example.yml

2. Hands-on

Step 1. Prepare the Blackbox configuration YAML file with modules for ICMP, TCP, HTTP, HTTPS, and DNS probes.

mkdir /usr/local/blackbox
tee /usr/local/blackbox/blackbox.yml <<'EOF'
modules:
  icmp_example:
    prober: icmp
    timeout: 5s
    icmp:
      preferred_ip_protocol: "ip4"
#     source_ip_address: "127.0.0.1"  # set this when the host has multiple network interfaces
  tcp_example:
    prober: tcp
    timeout: 5s
  tcp_ssh_example: 
    prober: tcp
    timeout: 5s
    tcp:
      query_response:
      - expect: "^SSH-2.0-"               # verify that the target responds with an SSH banner
  tcp_tls_example:
    prober: tcp
    timeout: 5s
    tcp:
      tls: true
  http_example:
    prober: http
    timeout: 5s
    http:
      valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
      valid_status_codes: []  # Defaults to 2xx
      method: GET
      headers:
        User-Agent: "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:89.0) Gecko/20100101 Firefox/89.0"
      # If true, the probe fails when SSL is present.
      fail_if_ssl: false
      tls_config:
        insecure_skip_verify: false
      preferred_ip_protocol: "ip4" # defaults to "ip6"
  https_example:
    prober: http
    timeout: 5s
    http:
      valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
      valid_status_codes: []  # Defaults to 2xx
      method: GET
      headers:
        User-Agent: "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:89.0) Gecko/20100101 Firefox/89.0"
      # If true, the probe fails when SSL is not present.
      fail_if_not_ssl: false 
      tls_config:
        insecure_skip_verify: false
      preferred_ip_protocol: "ip4" # defaults to "ip6"
      ip_protocol_fallback: false  # no fallback to "ip6"
  dns_example:
    prober: dns
    dns:
      query_name: "www.weiyigeek.top"
      query_type: "A"
      valid_rcodes:
      - NOERROR
  dns_tcp_example:
    prober: dns
    dns:
      transport_protocol: "tcp"    # defaults to "udp"
      preferred_ip_protocol: "ip4" # defaults to "ip6"
      query_name: "www.prometheus.io"
  dns_mx_example:
    prober: dns
    dns:
      query_name: "weiyigeek.top"
      query_type: "MX"
      validate_answer_rrs:
        fail_if_not_matches_regexp:
        - ".+"
EOF

Step 2.Start the BlackBox Exporter in a Docker container, which is convenient for testing.

docker run --rm -d -p 9115:9115 --name blackbox_exporter -v /usr/local/blackbox:/config prom/blackbox-exporter:master --config.file=/config/blackbox.yml
d571d7587a031d01fabd82c780d8bb90bba4c88a06834e5e8b6f081a99442db9

WeiyiGeek.BlackBox Exporter

Step 3.ICMP module probe: open http://weiyigeek.top:9115/probe?target=223.6.6.6&module=icmp_example in a browser.

# - Explanation of the result
# HELP probe_dns_lookup_time_seconds Returns the time taken for probe dns lookup in seconds
# TYPE probe_dns_lookup_time_seconds gauge
probe_dns_lookup_time_seconds 2.5798e-05
# HELP probe_duration_seconds Returns how long the probe took to complete in seconds
# TYPE probe_duration_seconds gauge
probe_duration_seconds 0.006506971  # Total probe time, including DNS resolution
# HELP probe_icmp_duration_seconds Duration of icmp request by phase
# TYPE probe_icmp_duration_seconds gauge
probe_icmp_duration_seconds{phase="resolve"} 2.5798e-05  
probe_icmp_duration_seconds{phase="rtt"} 0.00625726
probe_icmp_duration_seconds{phase="setup"} 0.000126281
# HELP probe_icmp_reply_hop_limit Replied packet hop limit (TTL for ipv4)
# TYPE probe_icmp_reply_hop_limit gauge
probe_icmp_reply_hop_limit 113
# HELP probe_ip_addr_hash Specifies the hash of IP address. It's useful to detect if the IP address changes.
# TYPE probe_ip_addr_hash gauge
probe_ip_addr_hash 2.372708602e+09  # Hash of the probed target's IP address
# HELP probe_ip_protocol Specifies whether probe ip protocol is IP4 or IP6
# TYPE probe_ip_protocol gauge
probe_ip_protocol 4  # IP protocol version in use
# HELP probe_success Displays whether or not the probe was a success
# TYPE probe_success gauge
probe_success 1      # 1 on success, 0 otherwise

Tips : The target can also be a domain name, e.g. /probe?module=icmp_example&target=www.weiyigeek.top; the same set of metrics is returned.
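The probe endpoint always takes the same two query parameters, so building the URL can be scripted. The following is a minimal sketch using only the standard library; the helper name `probe_url` is made up for illustration, and the exporter address is the one used in this walkthrough.

```python
from urllib.parse import urlencode

def probe_url(exporter, module, target):
    """Return the /probe URL asking `exporter` to probe `target` with `module`."""
    # urlencode percent-encodes the target, which matters once the target
    # itself is a URL containing ':' '/' or '&'.
    query = urlencode({"module": module, "target": target})
    return f"http://{exporter}/probe?{query}"

print(probe_url("weiyigeek.top:9115", "icmp_example", "223.6.6.6"))
print(probe_url("weiyigeek.top:9115", "http_example", "http://blog.weiyigeek.top/?a=1&b=2"))
```

Encoding the target is the main reason to use a helper rather than string concatenation: an unescaped `&` inside the target would otherwise be read as a new query parameter.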

Step 4.TCP module probes

Plain TCP port probe: /probe?module=tcp_example&target=192.168.12.185:80

# TYPE probe_failed_due_to_regex gauge
probe_failed_due_to_regex 0
# TYPE probe_success gauge
probe_success 1   # Success

TCP SSH port probe (with an expected-banner match rule): /probe?module=tcp_ssh_example&target=10.20.172.248:22

# HELP probe_ip_protocol Specifies whether probe ip protocol is IP4 or IP6
# TYPE probe_ip_protocol gauge
probe_ip_protocol 4
# HELP probe_success Displays whether or not the probe was a success
# TYPE probe_success gauge
probe_success 1   # Banner matched

TCP TLS port probe: /probe?module=tcp_tls_example&target=www.baidu.com:443

# HELP probe_ssl_earliest_cert_expiry Returns earliest SSL cert expiry date
# TYPE probe_ssl_earliest_cert_expiry gauge
probe_ssl_earliest_cert_expiry 1.627277462e+09   # TLS/SSL certificate expiry time (Unix timestamp)
# HELP probe_ssl_last_chain_expiry_timestamp_seconds Returns last SSL chain expiry in unixtime
# TYPE probe_ssl_last_chain_expiry_timestamp_seconds gauge
probe_ssl_last_chain_expiry_timestamp_seconds 1.627277462e+09    # Expiry of the last SSL chain, as a Unix timestamp
# HELP probe_ssl_last_chain_info Contains SSL leaf certificate information
# TYPE probe_ssl_last_chain_info gauge
probe_ssl_last_chain_info{fingerprint_sha256="2ed189349f818f3414132ebea309e36f620d78a0507a2fa523305f275062d73c"} 1  # Certificate chain information
# HELP probe_success Displays whether or not the probe was a success
# TYPE probe_success gauge
probe_success 1   # Success flag
# HELP probe_tls_version_info Returns the TLS version used, or NaN when unknown
# TYPE probe_tls_version_info gauge
probe_tls_version_info{version="TLS 1.2"} 1  # Negotiated TLS version
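The expiry metrics are plain Unix timestamps, so turning them into "days left" is simple arithmetic. A small sketch follows, assuming the value is copied from the output above; the helper name `days_until_expiry` is hypothetical.

```python
from datetime import datetime, timezone

def days_until_expiry(expiry_ts, now_ts):
    """Days between `now_ts` and `expiry_ts`, both Unix timestamps in seconds."""
    return (expiry_ts - now_ts) / 86400.0

expiry = 1.627277462e+09  # probe_ssl_earliest_cert_expiry from the output above

# Human-readable expiry time in UTC.
print(datetime.fromtimestamp(expiry, tz=timezone.utc).isoformat())

# Days left relative to the current time (negative once the certificate has expired).
now = datetime.now(tz=timezone.utc).timestamp()
print(round(days_until_expiry(expiry, now), 1))
```

The same subtraction can be done in PromQL by comparing the metric with `time()`, which is the usual basis for a certificate expiry alert.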

Step 5.HTTP module probes, covering both the http and https protocols.

http protocol: /probe?module=http_example&target=blog.weiyigeek.top

# HELP probe_dns_lookup_time_seconds Returns the time taken for probe dns lookup in seconds
# TYPE probe_dns_lookup_time_seconds gauge
probe_dns_lookup_time_seconds 0.186725698
# HELP probe_failed_due_to_regex Indicates if probe failed due to regex
# TYPE probe_failed_due_to_regex gauge
probe_failed_due_to_regex 0
# HELP probe_http_content_length Length of http content response
# TYPE probe_http_content_length gauge
probe_http_content_length -1
# HELP probe_duration_seconds Returns how long the probe took to complete in seconds
# TYPE probe_duration_seconds gauge
probe_duration_seconds 2.142812294
# HELP probe_http_duration_seconds Duration of http request by phase, summed over all redirects
# TYPE probe_http_duration_seconds gauge
probe_http_duration_seconds{phase="connect"} 1.209361663
probe_http_duration_seconds{phase="processing"} 0.250921647
probe_http_duration_seconds{phase="resolve"} 0.186725698
probe_http_duration_seconds{phase="tls"} 0
probe_http_duration_seconds{phase="transfer"} 0.495566665
# HELP probe_http_redirects The number of redirects
# TYPE probe_http_redirects gauge
probe_http_redirects 0
# HELP probe_http_ssl Indicates if SSL was used for the final redirect
# TYPE probe_http_ssl gauge
probe_http_ssl 0
# HELP probe_http_status_code Response HTTP status code
# TYPE probe_http_status_code gauge
probe_http_status_code 200
# HELP probe_http_uncompressed_body_length Length of uncompressed response body
# TYPE probe_http_uncompressed_body_length gauge
probe_http_uncompressed_body_length 22279
# HELP probe_http_version Returns the version of HTTP of the probe response
# TYPE probe_http_version gauge
probe_http_version 1.1  # HTTP protocol version
# HELP probe_ip_addr_hash Specifies the hash of IP address. It's useful to detect if the IP address changes.
# TYPE probe_ip_addr_hash gauge
probe_ip_addr_hash 2.341618423e+09
# HELP probe_ip_protocol Specifies whether probe ip protocol is IP4 or IP6
# TYPE probe_ip_protocol gauge
probe_ip_protocol 4
# HELP probe_success Displays whether or not the probe was a success
# TYPE probe_success gauge
probe_success 1          # Probe result
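When a probe is slow, the per-phase durations show where the time went. The sketch below is a deliberately small parser for this text format, written with the standard library only; it is not the official `prometheus_client` parser, and the helper names are made up for illustration. The sample lines are copied from the output above.

```python
import re

# A few lines copied from the http_example output above.
SAMPLE = """\
# HELP probe_http_duration_seconds Duration of http request by phase, summed over all redirects
# TYPE probe_http_duration_seconds gauge
probe_http_duration_seconds{phase="connect"} 1.209361663
probe_http_duration_seconds{phase="processing"} 0.250921647
probe_http_duration_seconds{phase="resolve"} 0.186725698
probe_http_duration_seconds{phase="tls"} 0
probe_http_duration_seconds{phase="transfer"} 0.495566665
probe_http_status_code 200
probe_success 1
"""

# metric name, optional {labels}, then the sample value
LINE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{[^}]*\})?\s+(\S+)')

def parse_samples(text):
    """Map 'name{labels}' to a float value, skipping HELP/TYPE and blank lines."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = LINE_RE.match(line)
        if match:
            name, labels, value = match.groups()
            samples[name + (labels or "")] = float(value)
    return samples

def slowest_phase(samples, metric="probe_http_duration_seconds"):
    """Return the phase label with the largest duration for `metric`."""
    prefix = metric + '{phase="'
    phases = {k[len(prefix):-2]: v for k, v in samples.items() if k.startswith(prefix)}
    return max(phases, key=phases.get)

samples = parse_samples(SAMPLE)
print(samples["probe_success"], slowest_phase(samples))
```

For this sample the connect phase dominates, which points at the network path rather than at the web server itself.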

https protocol: /probe?target=prometheus.io&module=https_example

# HELP probe_duration_seconds Returns how long the probe took to complete in seconds
# TYPE probe_duration_seconds gauge
probe_duration_seconds 3.301450584
# HELP probe_http_duration_seconds Duration of http request by phase, summed over all redirects
# TYPE probe_http_duration_seconds gauge
probe_http_duration_seconds{phase="connect"} 0.35759203100000003
probe_http_duration_seconds{phase="processing"} 0.39931457400000003
probe_http_duration_seconds{phase="resolve"} 2.334050778
probe_http_duration_seconds{phase="tls"} 0.196995271
probe_http_duration_seconds{phase="transfer"} 0.012739035
# HELP probe_http_redirects The number of redirects
# TYPE probe_http_redirects gauge
probe_http_redirects 1
# HELP probe_http_ssl Indicates if SSL was used for the final redirect
# TYPE probe_http_ssl gauge
probe_http_ssl 1
# HELP probe_http_status_code Response HTTP status code
# TYPE probe_http_status_code gauge
probe_http_status_code 200
# HELP probe_http_uncompressed_body_length Length of uncompressed response body
# TYPE probe_http_uncompressed_body_length gauge
probe_http_uncompressed_body_length 16377
# HELP probe_http_version Returns the version of HTTP of the probe response
# TYPE probe_http_version gauge
probe_http_version 2
# HELP probe_ssl_earliest_cert_expiry Returns earliest SSL cert expiry in unixtime
# TYPE probe_ssl_earliest_cert_expiry gauge
probe_ssl_earliest_cert_expiry 1.628424e+09
# HELP probe_ssl_last_chain_expiry_timestamp_seconds Returns last SSL chain expiry in timestamp seconds
# TYPE probe_ssl_last_chain_expiry_timestamp_seconds gauge
probe_ssl_last_chain_expiry_timestamp_seconds 1.628424e+09
# HELP probe_ssl_last_chain_info Contains SSL leaf certificate information
# TYPE probe_ssl_last_chain_info gauge
probe_ssl_last_chain_info{fingerprint_sha256="44c9e62838db98e79918c841e4b72529849e2e7e6654e337a32a74ec502edaa7"} 1
# HELP probe_success Displays whether or not the probe was a success
# TYPE probe_success gauge
probe_success 1
# HELP probe_tls_version_info Contains the TLS version used
# TYPE probe_tls_version_info gauge
probe_tls_version_info{version="TLS 1.3"} 1

Step 6.DNS module probes: check the target's A record, and test whether a DNS server responds over TCP. (Note that DNS normally uses UDP.)

A record resolution probe: /probe?module=dns_example&target=8.8.8.8

# HELP probe_dns_additional_rrs Returns number of entries in the additional resource record list
# TYPE probe_dns_additional_rrs gauge
probe_dns_additional_rrs 0      # Number of entries in the additional resource record list
# HELP probe_dns_answer_rrs Returns number of entries in the answer resource record list
# TYPE probe_dns_answer_rrs gauge
probe_dns_answer_rrs 2            # Number of entries in the DNS answer section
# HELP probe_dns_authority_rrs Returns number of entries in the authority resource record list
# TYPE probe_dns_authority_rrs gauge
probe_dns_authority_rrs 0        # Number of entries in the authority resource record list
# HELP probe_dns_duration_seconds Duration of DNS request by phase
# TYPE probe_dns_duration_seconds gauge
probe_dns_duration_seconds{phase="connect"} 7.9269e-05
probe_dns_duration_seconds{phase="request"} 0.072951973
probe_dns_duration_seconds{phase="resolve"} 2.9612e-05
# HELP probe_dns_lookup_time_seconds Returns the time taken for probe dns lookup in seconds
# TYPE probe_dns_lookup_time_seconds gauge
probe_dns_lookup_time_seconds 2.9612e-05  # Time taken by the probe's DNS lookup, in seconds
# HELP probe_duration_seconds Returns how long the probe took to complete in seconds
# TYPE probe_duration_seconds gauge
probe_duration_seconds 0.073247895   # Total time taken to complete the probe, in seconds
# HELP probe_success Displays whether or not the probe was a success
# TYPE probe_success gauge
probe_success 1

DNS-over-TCP response probe: /probe?module=dns_tcp_example&target=8.8.8.8

# HELP probe_dns_answer_rrs Returns number of entries in the answer resource record list
probe_dns_answer_rrs 1
# HELP probe_success Displays whether or not the probe was a success
probe_success 1

DNS probe checking that the MX record has not disappeared: /probe?module=dns_mx_example&target=223.6.6.6

probe_dns_answer_rrs 2
probe_success 1

WeiyiGeek.DNS probe checking MX records

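Step 7.Have Prometheus scrape the Blackbox metrics for a given module and target parameter. After editing the YAML configuration file, restart (or reload) the Prometheus server; the monitored sites then appear on the Targets page.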

# prometheus.yml
- job_name: 'blackbox'
  metrics_path: /probe
  params:
    module: [https_example]  # Look for a HTTP 200 response.
  static_configs:
    - targets:
      - http://weiyigeek.top    # Target to probe with http.
      - https://prometheus.io   # Target to probe with https.
  relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    - target_label: __address__
      replacement: 192.168.12.107:9115  # The blackbox exporter's real hostname:port.
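The three relabel rules are easier to follow when written out as plain assignments. The sketch below only mimics their effect on a single target; it is not how Prometheus implements relabeling, and the function name `relabel` is made up for illustration.

```python
def relabel(target, exporter="192.168.12.107:9115"):
    """Mimic the three relabel_configs rules above for one configured target."""
    labels = {"__address__": target}
    # Rule 1: copy the configured target into the `target` URL parameter.
    labels["__param_target"] = labels["__address__"]
    # Rule 2: keep the probed site visible as the `instance` label.
    labels["instance"] = labels["__param_target"]
    # Rule 3: send the actual scrape to the blackbox exporter instead.
    labels["__address__"] = exporter
    return labels

for key, value in relabel("https://prometheus.io").items():
    print(f"{key} = {value}")
```

The net effect is that Prometheus requests `/probe?module=https_example&target=https://prometheus.io` from the exporter, while the stored series still carry the probed site as `instance`.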

Additional note: static_configs can be replaced with file_sd_configs so that the probe targets live in a separate file, each group with its own label values.

# - prometheus.yml
  file_sd_configs:
    - files:
      - /etc/prometheus/conf.d/discovery/blackbox.yaml
      refresh_interval: 1m

# - blackbox.yaml
- targets: [ 'http://weiyigeek.top','https://blog.weiyigeek.top']
  labels: {'env': 'prod', 'appname': 'mySite', 'label': 'blog'}
- targets: [ 'https://prometheus.io']
  labels: {'env': 'test', 'appname': 'Prometheus official site', 'label': 'officialwebsite'}
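File-based discovery also accepts JSON files, which are easy to generate from an inventory script. The following is a minimal sketch under that assumption; the helper name `file_sd_groups` and the idea of generating the file are illustrative, not part of the original setup.

```python
import json

def file_sd_groups(groups):
    """Render (targets, labels) pairs as a file_sd JSON document."""
    document = [
        {"targets": list(targets), "labels": dict(labels)}
        for targets, labels in groups
    ]
    return json.dumps(document, indent=2, ensure_ascii=False)

groups = [
    (["http://weiyigeek.top", "https://blog.weiyigeek.top"],
     {"env": "prod", "appname": "mySite", "label": "blog"}),
    (["https://prometheus.io"],
     {"env": "test", "appname": "prometheus", "label": "official"}),
]

# Write the result to a file ending in .json and list it under file_sd_configs.
print(file_sd_groups(groups))
```

Label values must be strings, so numeric values need quoting before they go into the `labels` mapping.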

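Tips : A state of UP for the Blackbox exporter job in Prometheus does not mean the probe succeeded; it only means the scrape succeeded. Check whether probe_success is 1, for example with the PromQL query probe_success{env="prod"}.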
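The same check can be scripted against the Prometheus HTTP API (`/api/v1/query`). The sketch below only parses a response body; `SAMPLE_RESPONSE` is a hand-written, hypothetical example of an instant-vector result, and `failed_instances` is an illustrative helper name.

```python
import json

# Hypothetical response for the query probe_success{env="prod"}.
SAMPLE_RESPONSE = """
{"status": "success", "data": {"resultType": "vector", "result": [
  {"metric": {"__name__": "probe_success", "instance": "http://weiyigeek.top", "env": "prod"},
   "value": [1625000000.0, "1"]},
  {"metric": {"__name__": "probe_success", "instance": "https://blog.weiyigeek.top", "env": "prod"},
   "value": [1625000000.0, "0"]}
]}}
"""

def failed_instances(response_text):
    """Return the instances whose probe_success sample is 0."""
    body = json.loads(response_text)
    if body.get("status") != "success":
        raise ValueError("query did not succeed")
    return [
        sample["metric"].get("instance", "?")
        for sample in body["data"]["result"]
        if float(sample["value"][1]) == 0.0  # sample values arrive as strings
    ]

print(failed_instances(SAMPLE_RESPONSE))
```

Both targets in the sample would show as UP on the Targets page, yet one of them has a failing probe, which is exactly the distinction the tip describes.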

WeiyiGeek.Blackbox

Step 8.Write a Prometheus alerting rule (alerts are delivered through Alertmanager) and verify the alert notification.

groups:
- name: normal
  rules:
  - alert: web_request
    expr: probe_success == 0
    for: 30s
    labels:
      severity: 'critical'
    annotations:
      summary: "Site [ {{ $labels.instance }} ] , application: {{ $labels.appname }}, label: {{ $labels.label}}"
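The `for: 30s` clause means the expression has to stay true across evaluations before the alert fires. The sketch below is a simplified model of that behaviour for one series, not Prometheus's actual rule engine; the function name and the sample timelines are made up.

```python
def alert_state(samples, for_seconds=30):
    """Simplified state of `probe_success == 0` with a `for` duration.

    `samples` is a time-ordered list of (timestamp_seconds, probe_success),
    one entry per rule evaluation.
    """
    active_since = None
    state = "inactive"
    for timestamp, value in samples:
        if value == 0:
            if active_since is None:
                active_since = timestamp  # condition just became true
            held = timestamp - active_since
            state = "firing" if held >= for_seconds else "pending"
        else:
            active_since = None  # any success resets the timer
            state = "inactive"
    return state

print(alert_state([(0, 0), (15, 0), (30, 0)]))  # failing for the whole window
print(alert_state([(0, 0), (15, 1), (30, 0)]))  # a success in between resets it
```

A single failed probe therefore does not notify anyone; the failure has to persist for the whole `for` window.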

WeiyiGeek.Blackbox&Alertmanager

Tips: On every scrape Prometheus sends an HTTP request header named X-Prometheus-Scrape-Timeout-Seconds; Blackbox uses it to set its probe timeout, minus a small buffer.
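How the scrape timeout and the module's own `timeout` interact can be sketched as follows. This is an approximation of the exporter's behaviour as I understand it, not its actual code; the 0.5 second default buffer and the function name are assumptions to verify against the exporter's documentation.

```python
def effective_timeout(scrape_timeout, module_timeout, offset=0.5):
    """Approximate probe timeout in seconds.

    scrape_timeout: value of X-Prometheus-Scrape-Timeout-Seconds
    module_timeout: `timeout` from blackbox.yml (0 means not set)
    offset:         buffer subtracted from the scrape timeout (assumed default)
    """
    budget = scrape_timeout - offset
    # The module timeout wins when it is set and shorter than the budget,
    # or when subtracting the buffer leaves nothing at all.
    if budget < 0 or 0 < module_timeout < budget:
        return module_timeout
    return budget

print(effective_timeout(scrape_timeout=10, module_timeout=5))
print(effective_timeout(scrape_timeout=3, module_timeout=5))
```

In practice this means a module `timeout` longer than the job's scrape timeout has no effect, so the two values should be chosen together.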

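3.Monitoring external gateways and website status with Prometheus (enterprise practice)

Description: With the Blackbox basics above in hand, we can get straight to work.

Step 1.Here Blackbox_exporter is installed from the binary release (0.19.0 / 2021-05-10); download page: https://github.com/prometheus/blackbox_exporter/releases .

# Download Blackbox_exporter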
wget https://github.com/prometheus/blackbox_exporter/releases/download/v0.19.0/blackbox_exporter-0.19.0.linux-amd64.tar.gz
tar -zxvf blackbox_exporter-0.19.0.linux-amd64.tar.gz -C /usr/local/
mv /usr/local/blackbox_exporter-0.19.0.linux-amd64   /usr/local/blackbox_exporter

# Check the version
cd /usr/local/blackbox_exporter
./blackbox_exporter --version
blackbox_exporter, version 0.19.0 (branch: HEAD, revision: 5d575b88eb12c65720862e8ad2c5890ba33d1ed0)
  build user:       root@2b0258d5a55a
  build date:       20210510-12:56:44
  go version:       go1.16.4
  platform:         linux/amd6

Step 2.Look at the default BlackBox YAML configuration file and add modules as needed.

$ ls /usr/local/blackbox_exporter && cd /usr/local/blackbox_exporter
blackbox_exporter  blackbox.yml  LICENSE  NOTICE

# - blackbox.yml after editing it and adding custom modules.
$ tee /usr/local/blackbox_exporter/blackbox.yml <<'EOF'
modules:
  icmp:
    prober: icmp
    timeout: 5s
    icmp:
      preferred_ip_protocol: "ip4"
  tcp:
    prober: tcp
    timeout: 5s
  tcp_tls:
    prober: tcp
    timeout: 5s
    tcp:
      tls: true
  http_2xx:
    prober: http
    timeout: 5s
    http:
      valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
      valid_status_codes: []      # Defaults to 2xx
      method: GET
      headers:
        User-Agent: "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:89.0) Gecko/20100101 Firefox/89.0"
      tls_config:
        insecure_skip_verify: false
      preferred_ip_protocol: "ip4" # Defaults to "ip6"
  http_post:
    prober: http
    timeout: 5s
    http:
      valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
      valid_status_codes: []       # Defaults to 2xx
      method: POST
      headers:
        User-Agent: "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:89.0) Gecko/20100101 Firefox/89.0"
        Content-Type: application/json
      body: '{}'
      fail_if_not_ssl: false 
      tls_config:
        insecure_skip_verify: false
      preferred_ip_protocol: "ip4" # defaults to "ip6"
      ip_protocol_fallback: false  # no fallback to "ip6"
  http_basic_auth_example:
    prober: http
    timeout: 5s
    http:
      method: POST
      headers:
        Host: "login.example.com"
      basic_auth:
        username: "username"
        password: "mysecret"
EOF

Step 3.Create a systemd service for blackbox_exporter.

sudo tee /usr/lib/systemd/system/blackbox_exporter.service <<'EOF'
[Unit]
Description=blackbox_exporter
Documentation=https://prometheus.io
After=network.target

[Service]
Type=simple
StandardError=journal
ExecStart=/usr/local/blackbox_exporter/blackbox_exporter --config.file=/usr/local/blackbox_exporter/blackbox.yml
Restart=on-failure
RestartSec=3s

[Install]
WantedBy=multi-user.target
EOF

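Step 4.Start the service, enable it at boot, and verify the service and port status.

# - Reload unit files, enable at boot, then (re)start the service
sudo systemctl daemon-reload && sudo systemctl enable blackbox_exporter.service && sudo systemctl restart blackbox_exporter.service
systemctl status blackbox_exporter

# - Check the listening process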
ss -lnpt | grep 9115
  # LISTEN     0      128       [::]:9115                  [::]:*                   users:(("blackbox_export",pid=1738,fd=3))
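The same port check can be done from a script, for example in a deployment smoke test. A minimal sketch with the standard library follows; the helper name `port_open` is made up, and 9115 is the exporter's port from the `ss` output above.

```python
import socket

def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False

if __name__ == "__main__":
    print("blackbox_exporter listening:", port_open("127.0.0.1", 9115))
```

This only proves that something accepts connections on the port; requesting `/metrics` or `/probe` is still needed to confirm it is the exporter.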

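Step 5.Add blackbox_exporter to the Prometheus configuration; below we edit the Prometheus configuration file to monitor whether the external gateways are alive.

$ vi /usr/local/prometheus/prometheus.yml
# - Add the job definitions for external gateway liveness to the configuration file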
  - job_name: 'ICMP-Ping'
    scrape_interval: 15s
    metrics_path: /probe
    params:
      module: [icmp]
    file_sd_configs:
    - refresh_interval: 30s
      files:
      - /etc/prometheus/backbox_icmp.yml
    relabel_configs:
    - source_labels: [__address__]
      regex: (.*)(:80)?
      target_label: __param_target
      replacement: ${1}
    - source_labels: [__param_target]
      regex: (.*)
      target_label: ping
      replacement: ${1}
    - source_labels: []
      regex: .*
      target_label: __address__
      replacement: 127.0.0.1:9115
  - job_name: 'HTTP-Check'
    scrape_interval: 30s
    metrics_path: /probe
    params:
      module: [http_2xx]
    file_sd_configs:
    - refresh_interval: 30s
      files:
      - /etc/prometheus/backbox_http.yml
    relabel_configs:
    - source_labels: [__address__]
      regex: (.*)(:80)?
      target_label: __param_target
      replacement: ${1}
    - source_labels: [__param_target]
      regex: (.*)
      target_label: http
      replacement: ${1}
    - source_labels: []
      regex: .*
      target_label: __address__
      replacement: 127.0.0.1:9115

# Targets are discovered from files (file-based service discovery).
# Targets and labels for ICMP probing
tee /etc/prometheus/backbox_icmp.yml <<'EOF'
- targets: [ '183.230.37.209','117.59.89.218']
  labels: {'env': 'prod','BusinessType': 'Gateway','site':'www.weiyigeek.top','msg': 'Official site egress IP address'}
EOF

# Targets and labels for HTTP probing
tee /etc/prometheus/backbox_http.yml <<'EOF'
- targets: [ 'http://www.weiyigeek.top' ]
  labels: {'env': 'prod','BusinessType': 'Web','site': 'Personal site - www.weiyigeek.top','msg': 'A10 load balancer VIP address'}
EOF
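One detail of the relabel rules in this step is worth checking: relabel regexes are anchored at both ends, and `(.*)` is greedy, so the optional `(:80)?` group in `(.*)(:80)?` never captures anything. The sketch below illustrates this with Python's `re` module standing in for the Go regexp engine Prometheus uses, so treat it as an approximation; the non-greedy variant is a suggestion, not part of the original configuration.

```python
import re

def first_group(regex, value):
    """Return capture group 1 of a fully anchored match, or None if no match."""
    match = re.fullmatch(regex, value)  # anchored like a relabel regex
    return match.group(1) if match else None

# Pattern used in the jobs above: the greedy group keeps the port.
print(first_group(r"(.*)(:80)?", "www.weiyigeek.top:80"))

# Non-greedy variant (an assumption, not in the original): strips a trailing :80.
print(first_group(r"(.*?)(:80)?", "www.weiyigeek.top:80"))

# Targets without a port pass through unchanged with either pattern.
print(first_group(r"(.*?)(:80)?", "183.230.37.209"))
```

For the targets used here, none of which carries a `:80` suffix, both patterns give the same result, so the configuration works as written.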

Step 6.Use the promtool utility to check that the configuration file is written correctly.

/usr/local/prometheus/promtool check config /usr/local/prometheus/prometheus.yml
Checking prometheus.yml
  SUCCESS: 2 rule files found

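Tips : Prometheus and BlackBox can both be hot-reloaded as follows.

# - Reload the blackbox configuration at runtime
curl -XPOST http://127.0.0.1:9115/-/reload
# - Hot-reload Prometheus (requires Prometheus to be started with --web.enable-lifecycle)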
curl -XPOST http://127.0.0.1:9090/-/reload
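The two curl calls can also be issued from Python, for instance at the end of a script that regenerates the target files. A minimal sketch using only the standard library follows; the helper names are made up, and the addresses are the ones used in this walkthrough.

```python
import urllib.request

def reload_request(base_url):
    """Build the POST request for a component's /-/reload endpoint."""
    url = base_url.rstrip("/") + "/-/reload"
    return urllib.request.Request(url, data=b"", method="POST")

def reload(base_url, timeout=5):
    """Send the reload request and return the HTTP status code."""
    with urllib.request.urlopen(reload_request(base_url), timeout=timeout) as resp:
        return resp.status

if __name__ == "__main__":
    for name, base in (("blackbox", "http://127.0.0.1:9115"),
                       ("prometheus", "http://127.0.0.1:9090")):
        try:
            print(name, reload(base))
        except OSError as err:  # connection refused, HTTP errors, timeouts
            print(name, "reload failed:", err)
```

A failed reload usually means the new configuration did not parse, so it is worth running promtool before triggering it.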

Step 7.Then open the Blackbox Exporter page to verify the probe requests, and check the target status in Prometheus.

WeiyiGeek.Blackbox Exporter

Step 8.With the data scraped by Prometheus we can add the blackbox_exporter metrics to Grafana. Here we use a Grafana Dashboard template for blackbox_exporter; how to import a template is not repeated here.

# Tips: this template requires the pie chart panel plugin.
wget -nv https://grafana.com/api/plugins/grafana-piechart-panel/versions/latest/download -O /tmp/grafana-piechart-panel.zip
grafana-cli plugins install grafana-piechart-panel
systemctl restart grafana-server