《Prometheus监控实战》第6章 警报管理

  • Prometheus是一个按功能划分的平台,指标的收集和存储与警报是分开的。警报管理功能由名为Alertmanager的工具提供,该工具是监控体系中的独立组件。我们需要在Prometheus服务器上定义警报规则,这些规则可以触发事件,然后传播到Alertmanager。接下来,Alertmanager会决定如何处理相应的警报,解决去重等问题,还会确定在发送警报时使用的机制:实时消息、电子邮件,或通过PagerDuty和VictorOps等工具

6.1 警报

  • 警报可以为我们提供一些指示,表明我们环境中的某些状态已发生变化,且通常会是比想象更糟的情况。一个好警报的关键是能够在正确的时间、以正确的理由和正确的速度发送,并在其中放入有用的信息
  • 警报方法中最常见的反模式是发送过多的警报。对于监控来说,过多的警报相当于“狼来了”这样的故事
  • 通常发送过多警报的原因可能包括
  1. 警报缺少可操作性,它只是提供信息。你应关闭所有这些警报,或将其转换为计算速率的计数器,而不是发出警报
  2. 故障的主机或服务上游会触发其下游的所有内容的警报。你应确保警报系统识别并抑制这些重复的相邻警报
  3. 对原因而不是症状(symptom)进行警报。症状是应用程序停止工作的迹象,它们可能是由许多原因导致的各种问题的表现。API或网站的高延迟是一种症状,这种症状可能由许多问题导致:高数据库使用率、内存问题、磁盘性能等。对症状发送警报可以识别真正的问题。仅对原因(例如高数据库使用率)发出警报也可能识别出问题(但通常很可能不会)。对于这个应用程序,高数据库使用率可能是完全正常的,并且可能不会对最终用户或应用程序造成性能问题。作为一个内部状态,发送警报是没有意义的。这种警报可能会导致工程师错过更重要的问题,因为他们已经对大量不可操作且基于原因的警报变得麻木。你应该关注基于症状的警报,并依赖你的指标或其他诊断数据来确定原因
  • 第二种最常见的反模式是警报的错误分类。有时,这也意味着重要的警报会隐藏在其他警报中。但有时候,警报会被发送到错误的地方或者带有不正确的紧急情况说明
  • 第三种反模式是发送无用的警报
  • 代码清单:Nagios警报
  • 这个通知似乎提供了一些有用的信息,但事实并非如此。这是由存储空间突增导致的?还是逐渐增长的结果?增长速度是多少?1GB分区上9%的可用磁盘空间与1TB磁盘上9%的可用磁盘空间完全不同。我们可以忽略或静音这类通知吗?还是需要立即采取行动?
  • 良好的警报应该具备以下几个关键特征:
  1. 适当数量的警报,关注症状而不是原因。噪声警报会导致警报疲劳,最终警报会被忽略。修复警报不足比修复过度警报更容易
  2. 应设置正确的警报优先级。如果警报是紧急的,那么它应该快速路由到负责响应的一方。如果警报不紧急,那么我们应该以适当的速度发送警报,以便在需要时做出响应
  3. 警报应包括适当的上下文,以便它们立即可以使用
  • 提示:Google SRE手册中有一个很棒的关于警报的章节

6.2 Alertmanager如何工作

  • Alertmanager处理从客户端发来的警报(https://prometheus.io/docs/alerting/alertmanager/),客户端通常是Prometheus服务器。Alertmanager对警报进行去重、分组,然后路由到不同的接收器,如电子邮件、短信或SaaS服务(PagerDuty等)
  • Alertmanager架构
  • 警报规则(https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/)将使用收集的指标并在指定的阈值或标准上触发警报。我们还将看到如何为警报添加一些上下文。当指标达到阈值或标准时,会生成一个警报并将其推送到Alertmanager。警报在Alertmanager上的HTTP端点上接收。一个或多个Prometheus服务器可以将警报定向到单个Alertmanager,或者你可以创建一个高可用的Alertmanager集群
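  • 为了更直观地理解这个HTTP端点,可以用curl手动向Alertmanager推送一条测试警报(下面的标签与注解只是演示用的假设值,实际字段以Alertmanager API文档为准)
  • 代码清单:手动推送一条测试警报
curl -XPOST http://localhost:9093/api/v1/alerts \
  -H "Content-Type: application/json" \
  -d '[{"labels":{"alertname":"TestAlert","severity":"warning","instance":"demo.example.com"},"annotations":{"summary":"手动推送的测试警报"}}]'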

6.3 安装Alertmanager

  • Alertmanager二进制:https://github.com/prometheus/alertmanager
  • Prometheus.io下载页面:https://prometheus.io/download/#alertmanager
  • GitHub Releases页面:https://github.com/prometheus/alertmanager/releases
  • amtool二进制文件用于帮助管理Alertmanager
  • 代码清单:移动amtool二进制文件
sudo cp alertmanager /usr/local/bin
sudo cp amtool /usr/local/bin
sudo chmod -R 777 /usr/local/bin/{alertmanager,amtool}
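  • 复制完成后,可以运行以下命令确认二进制文件能够正常执行并查看版本信息
  • 代码清单:验证安装
alertmanager --version
amtool --version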
  • 监控套件安装方式
  1. 使用Docker Compose安装Prometheus、Node Exporter和Grafana(https://github.com/vegasbrianc/prometheus)
  2. 使用Docker Compose单节点安装Prometheus、Alertmanager、Node Exporter和Grafana(https://github.com/danguita/prometheus-monitoring-stack)
  3. 使用Docker Swarm安装Prometheus(https://github.com/stefanprodan/swarmprom)

6.4 配置Alertmanager

  • Alertmanager配置也是基于YAML的配置文件(https://prometheus.io/docs/alerting/configuration/)
  • 代码清单:一个简单的alertmanager.yml配置文件
global:
  smtp_smarthost: 'localhost:25'
  smtp_from: 'alertmanager@example.com'
  smtp_require_tls: false

templates:
  - '/etc/alertmanager/template/*.tmpl'

route:
  receiver: email

receivers:
  - name: 'email'
    email_configs:
      - to: 'alerts@example.com'
  • templates块包含保存警报模板的目录列表(https://prometheus.io/docs/alerting/notifications/)。由于Alertmanager可以发送到各种目的地,因此你通常需要能够自定义警报的外观及其包含的数据。现在让我们创建这个目录
  • 代码清单:创建模板目录
sudo mkdir -p /etc/alertmanager/template
  • route块会告诉Alertmanager如何处理特定的传入警报。警报根据规则进行匹配然后采取相应的操作。你可以把路由想象成有树枝的树,每个警报都从树的根(基本路由或基本节点)进入。除了基本节点之外,每个路由都有匹配的标准,这些标准应该匹配所有警报。然后,你可以定义子路由或子节点,它们是树的分支,对某些特定的警报感兴趣,或者会采取某些特定的操作
  • Alertmanager路由
  • 在当前的配置中,我们只定义了基本路由,即树的根节点。在后面,我们将利用路由来确保警报具有正确的容量、频率和目的地
  • 对于电子邮件警报,我们使用email_configs块来指定电子邮件选项,例如接收警报的地址。我们还可以指定SMTP设置(将覆盖全局设置),并添加其他条目(例如邮件标头)
  • 提示:有一个称为Webhook接收器的内置接收器,你可以使用它将警报发送到Alertmanager中没有特定接收器的其他目的地(https://prometheus.io/docs/alerting/configuration/#webhook_config)

6.5 运行Alertmanager

  • Alertmanager作为一个Web服务运行,默认端口为9093。启动Alertmanager
  • 代码清单:启动Alertmanager
alertmanager --config.file alertmanager.yml
  • Alertmanager提供了一个Web界面:http://localhost:9093
  • 你可以使用此界面来查看当前警报并管理维护窗口,以及警报抑制(在Prometheus术语中称为silence)
  • 提示:随Alertmanager一起附带的还有一个命令行工具amtool,允许你查询警报、管理silence和使用Alertmanager服务器等

6.6 为Prometheus配置Alertmanager

  • 我们在prometheus.yml配置文件中使用了默认的Alertmanager配置,它包含在alerting块中。让我们先来看看默认配置
  • 代码清单:alerting块
alerting:
  alertmanagers:
    - static_configs:
      - targets:
        - alertmanager:9093
  • alerting块包含允许Prometheus识别一个或多个Alertmanager的配置。为此,Prometheus使用与查找抓取目标时相同的发现机制,在默认配置中是static_configs。与监控作业一样,它指定目标列表,此处是主机名alertmanager加端口9093(Alertmanager默认端口)的形式。该列表假定你的Prometheus服务器可以解析alertmanager主机名为IP地址,并且Alertmanager在该主机的端口9093上运行
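  • 如果运行了多个Alertmanager(例如一个高可用集群),只需在targets列表中列出全部实例即可,下面的主机名仅为示意
  • 代码清单:配置多个Alertmanager
alerting:
  alertmanagers:
    - static_configs:
      - targets:
        - alertmanager1:9093
        - alertmanager2:9093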
  • 提示:你还可以在Prometheus Web界面上的状态页面http://localhost:9090/status中查看任何已配置的Alertmanager

6.6.1 Alertmanager服务发现

  • 代码清单:发现Alertmanager
alerting:
  alertmanagers:
  - dns_sd_configs:
    - names: ['_alertmanager._tcp.example.com']
  • 在这里,Prometheus将查询_alertmanager._tcp.example.com SRV记录以获得Alertmanager的主机名
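  • 这条SRV记录大致形如下面这样(主机名和TTL仅为示意),Prometheus会从记录中的端口和目标主机解析出Alertmanager的地址
  • 代码清单:示例SRV记录
_alertmanager._tcp.example.com. 300 IN SRV 10 10 9093 alertmanager1.example.com.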
  • 提示:你需要重新加载或重新启动Prometheus,以启用Alertmanager配置

6.6.2 监控Alertmanager

  • 与Prometheus一样,Alertmanager暴露了自身的相关指标
  • 代码清单:监控Alertmanager的Prometheus作业
- job_name: 'alertmanager'
  static_configs:
    - targets: ['localhost:9093']
  • 这将从http://localhost:9093/metrics收集指标并抓取一系列以alertmanager_为前缀的时间序列数据。这些数据包括按状态分类的警报计数、按接收器分类的成功和失败通知的计数
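  • 例如,可以用类似下面的PromQL查询来观察Alertmanager自身的状态(指标名称以你部署的版本实际暴露的为准)
  • 代码清单:查询Alertmanager自身指标
# 当前处于active状态的警报数量
alertmanager_alerts{state="active"}

# 过去5分钟内发送失败的通知速率,按接收器集成类型区分
rate(alertmanager_notifications_failed_total[5m])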

6.7 添加警报规则

  • 现在Alertmanager已经配置完成,让我们添加第一条警报规则。我们将基于前面使用的节点查询来创建警报,并使用up指标创建一些基本的可用性警报
  • 提示:你可以在同一文件中同时保存记录规则和警报规则,但为了功能清晰明确,建议将它们放在单独的文件中
  • 代码清单:创建警报规则文件
cd rules
touch node_alerts.yml
  • 不需要单独将此文件添加到prometheus.yml配置文件中的rule_files块,可以使用globbing通配符加载该目录中以_rules.yml或_alerts.yml结尾的所有文件
  • 代码清单:在rule_files块中添加globbing通配符
rule_files:
  - "rules/*_rules.yml"
  - "rules/*_alerts.yml"

6.7.1 添加第一条警报规则

  • 让我们添加第一条规则:一个CPU警报规则(https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/)。我们将创建一个警报,如果我们创建的CPU查询(5分钟内的节点平均CPU使用率)在至少60分钟内超过80%,则会触发警报
  • 代码清单:第一条警报规则
groups:
- name: node_alerts
  rules:
  - alert: HighNodeCPU
    expr: instance:node_cpu:avg_rate5m > 80
    for: 60m
    labels:
      severity: warning
    annotations:
      summary: High Node CPU of {{ humanize $value}}% for 1 hour
      console: You might want to check the Node Dashboard at http://grafana.example.com/dashboard/db/node-dashboard
  • 与记录规则一样,警报规则也可以组合在一起。我们已经指定了一个组名node_alerts,该组中的规则包含在rules块中。在每个警报组中,警报名称都必须是唯一的
  • 我们还有触发警报的测试或表达式,这在expr子句中指定。测试表达式使用我们用记录规则创建的instance:node_cpu:avg_rate5m指标
instance:node_cpu:avg_rate5m > 80
  • 下一个子句for,控制在触发警报之前测试表达式必须为true的时间长度。在示例中,指标instance:node_cpu:avg_rate5m需要在触发警报之前的60分钟内持续大于80%。这限制了警报误报或因暂时状态而触发的可能性
  • 最后,我们可以使用标签(label)或注解(annotation)来装饰警报。警报规则中时间序列上的所有标签都会转移到警报。labels子句允许我们指定要附加到警报的其他标签,这里我们添加了一个值为warning的severity标签
  • 警报上的标签与警报的名称相结合,构成警报的标识。这与时间序列相同,其中指标名称和标签构成时间序列的标识
  • annotations子句允许我们指定展示更多信息的标签,如描述、运行手册的链接或处理警报的说明。我们添加了一个名为summary的标签来描述警报,还添加了一个名为console的注解,它指向了展示节点指标的Grafana仪表板。这是一个用注解提供上下文的很好的例子。
  • 重新启动Prometheus后,你将能够在Prometheus Web界面http://localhost:9090/alerts中看到新的警报
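  • 除了重启,也可以向Prometheus进程发送SIGHUP信号来重新加载配置和规则文件;如果启动时开启了--web.enable-lifecycle,还可以调用/-/reload端点
  • 代码清单:重新加载Prometheus
kill -HUP $(pidof prometheus)
# 或者(需要启用--web.enable-lifecycle)
curl -X POST http://localhost:9090/-/reload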
  • Prometheus警报报表

6.7.2 警报触发

  • Prometheus以一个固定时间间隔来评估所有规则,这个时间由evaluation_interval定义,我们将其设置为15秒。在每个评估周期,Prometheus运行每个警报规则中定义的表达式并更新警报状态
  • 警报可能有以下三种状态(见下面的列表)。Pending到Firing的转换可以确保警报更有效,且不会来回抖动。没有for子句的警报会自动从Inactive转换为Firing,只需要一个评估周期即可触发。带有for子句的警报将首先转换为Pending,然后转换为Firing,因此至少需要两个评估周期才能触发

  • Inactive:警报未激活
  • Pending:警报已满足测试表达式条件,但仍在等待for子句中指定的持续时间
  • Firing:警报已满足测试表达式条件,并且Pending的时间已超过for子句的持续时间

  • 警报的生命周期
  1. 节点的CPU不断变化,每隔一段由scrape_interval定义的时间被Prometheus抓取一次,对我们来说是15秒
  2. 根据每个evaluation_interval的指标来评估警报规则,对我们来说还是15秒
  3. 当警报表达式为true时(对于我们来说是CPU超过80%),会创建一个警报并转换到Pending状态,执行for子句
  4. 在下一个评估周期中,如果警报测试表达式仍然为true,则检查for的持续时间。如果超过了持续时间,则警报将转换为Firing,生成通知并将其推送到Alertmanager
  5. 如果警报测试表达式不再为true,则Prometheus会将警报规则的状态从Pending更改为Inactive

6.7.3 Alertmanager的警报

  • 警报现在处于Firing状态,并且已将通知推送到Alertmanager。可以在Prometheus Web界面http://localhost:9090/alerts中看到这个警报及其状态
  • 注意:Alertmanager API在/api/v1/alerts路径接收警报
  • Prometheus将为Pending和Firing状态中的每个警报创建指标,这个指标被称为ALERTS,并且会像HighNodeCPU警报示例那样构建
  • 代码清单:ALERT时间序列
ALERTS{alertname="HighNodeCPU",alertstate="firing",severity="warning",instance="138.197.26.39:9100"}
  • 通知将被发送到Prometheus配置中定义的Alertmanager,在示例中是alertmanager主机和端口9093,通知被推送到如下HTTP端点
http://alertmanager:9093/api/v1/alerts
  • 假设一个HighNodeCPU警报被触发了,我们将能在Alertmanager Web控制台http://alertmanager:9093/#/alerts上看到该警报
  • Alertmanager中触发的警报
  • HighNodeCPU警报邮件

6.7.4 添加新警报和模板

模板
  • 模板(template)是一种在警报中使用时间序列数据的标签和值的方法,可用于注解和标签。模板使用标准的Go模板语法,并暴露一些包含时间序列的标签和值的变量。标签以变量$labels形式表示,指标的值则是变量$value
  • 提示:变量$labels和$value分别是底层Go变量.Labels和.Value的名称
  • 要在summary注解中引用instance标签,我们使用{{$labels.instance}}。如果想要引用时间序列的值,我们会使用{{$value}}。参考文档:
    • https://prometheus.io/docs/prometheus/latest/configuration/template_reference/
    • https://prometheus.io/docs/prometheus/latest/configuration/template_reference/#numbers
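  • 例如,可以在注解中组合使用这两个变量(下面的注解文字只是一个示意)
  • 代码清单:在注解中使用模板变量
annotations:
  summary: Instance {{ $labels.instance }} CPU usage is {{ humanize $value }}%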
Prometheus警报
  • 我们应始终牢记:Prometheus服务器也可能出问题。让我们添加一些规则来识别问题并对它们发出警告
  • 代码清单:rules/prometheus_alerts.yml文件
groups:
- name: prometheus_alerts
  rules:
  - alert: PrometheusConfigReloadFailed
    expr: prometheus_config_last_reload_successful == 0
    for: 10m
    labels:
      severity: warning
    annotations:
      description: Reloading Prometheus' configuration has failed on {{ $labels.instance }}.
  - alert: PrometheusNotConnectedToAlertmanagers
    expr: prometheus_notifications_alertmanagers_discovered < 1
    for: 10m
    labels:
      severity: warning
    annotations:
      description: Prometheus {{ $labels.instance }} is not connected to any Alertmanagers
  • 第一个是PrometheusConfigReloadFailed,它让我们知道Prometheus配置重新加载是否失败。如果上次重新加载失败,指标prometheus_config_last_reload_successful的值将为0
  • 第二条规则确保Prometheus服务器可以发现Alertmanager。这使用prometheus_notifications_alertmanagers_discovered指标,该指标是服务器找到的Alertmanager计数。如果小于1,则表明Prometheus没有发现任何Alertmanager,并且这个警报将会触发。由于没有任何Alertmanager,因此它只会显示在Prometheus控制台的/alerts页面上
可用性警报
  • 最后的警报可以帮助我们确定主机和服务的可用性。第一个警报利用了我们使用Node Exporter收集的systemd指标。如果我们在节点上监控的服务不再活动,则会生成一个警报
  • 代码清单:节点服务警报
- alert: NodeServiceDown
  expr: node_systemd_unit_state{state="active"} != 1
  for: 60s
  labels:
    severity: critical
  annotations:
    summary: Service {{ $labels.name }} on {{ $labels.instance }} is no longer active!
    description: Werner Heisenberg says - "OMG Where's my service?"
  • 如果带有active标签的node_systemd_unit_state指标值为0,则会触发此警报,表示服务故障至少60秒
  • 下一个警报会检测up指标的值是否为0,如果是0则表示抓取失败
up{job="node"} == 0
  • 我们在severity标签中添加了一个新值critical,并添加了一个模板注解,以帮助指示哪个实例和作业失败,如下面的示例所示
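  • 把上面的表达式、标签和注解组合起来,一个完整的抓取失败警报规则大致如下(警报名称和for时长是假设的示例)
  • 代码清单:抓取失败警报示例
- alert: InstanceDown
  expr: up{job="node"} == 0
  for: 10m
  labels:
    severity: critical
  annotations:
    summary: Instance {{ $labels.instance }} of job {{ $labels.job }} has been down!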
  • 在许多情况下,知道单个实例宕机实际上并不是非常重要。相反,我们也可以测试多个失败的实例,例如,失败实例的百分比
avg(up) by (job) <= 0.50
  • 这个测试表达式计算出up指标的平均值然后按job聚合,并在该值小于或等于0.5时触发。也就是说,如果某个作业中有一半或更多的实例无法完成抓取,则会触发警报
  • 另一种方法可能是
sum by (job) (up) / count by (job) (up) <= 0.8
  • 我们根据job对up指标求和,然后除以每个job的实例总数。如果结果小于或等于0.8,也就是特定作业中有20%或以上的实例未启动,则触发警报
  • 通过确定目标何时消失,我们可以使up警报稍微健壮一些。例如,如果从服务发现中删除我们的目标,那么它的指标将不再更新。如果所有目标都从服务发现中消失,则不会记录任何指标,因此up警报不会被触发。Prometheus有一个功能叫absent,可检测是否存在缺失的指标
  • 代码清单:up指标缺失警报
- alert: InstancesGone
  expr:  absent(up{job="node"})
  for: 10s
  labels:
    severity: critical
  annotations:
    summary: Host {{ $labels.instance }} is no longer reporting!
    description: Werner Heisenberg says - "OMG Where are my instances?"

6.8 路由

  • 路由是一棵树,顶部的默认路由总会被配置,并匹配任何子路由不匹配的内容
  • 代码清单:添加路由配置
route:
  group_by: ['instance']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 3h
  receiver: email
  routes:
  - match:
      severity: critical
    receiver: pager
    routes:
      - match:
          service: application1
        receiver: support_team
  - match_re:
      severity: ^(informational|warning)$
    receiver: support_team
receivers:
- name: 'email'
  email_configs:
  - to: 'alerts@example.com'
- name: 'support_team'
  email_configs:
  - to: 'support@example.com'
- name: 'pager'
  email_configs:
  - to: 'alert-pager@example.com'
  slack_configs:
  - api_url: https://hooks.slack.com/services/ABC123/ABC123/EXAMPLE
    text: '{{ .CommonAnnotations.summary }}'
  • 默认情况下,所有警报都组合在一起,但如果我们指定了group_by和任何标签,则Alertmanager将按这些标签对警报进行分组
  • 代码清单:分组
route:
  group_by: ['service', 'cluster']
  • 注意:这仅适用于标签,不适用于注解
  • 分组还会更改Alertmanager的行为。如果引发了新警报,那么Alertmanager将等待下一个选项group_wait中指定的时间段,以便在触发警报之前查看是否收到该组中的其他警报。你可以将其视为警报缓冲
  • 在发出警报后,如果收到来自该分组的下一次评估的新警报,那么Alertmanager将等待group_interval选项中指定的时间段(即5分钟),然后再发送新警报
  • 我们还指定了repeat_interval。这个暂停并不适用于我们的警报组,而是适用于单个警报,并且是等待重新发送相同警报的时间段,我们指定为3个小时

路由表

  • 这里有两种匹配方法:标签匹配和正则表达式匹配。match选项执行简单的标签匹配
  • 代码清单:标签匹配
match:
  severity: critical
  • 由于路由都是分支,因此,如果需要我们也可以再次分支路由,如:
  • 代码清单:路由分支
routes:
- match:
    severity: critical
  receiver: pager
  routes:
    - match:
        service: application1
      receiver: support_team
  • 我们的警报首先需要severity标签为critical,并且service标签是application1。如果这两个条件都匹配,那么我们的警报将被路由到接收器support_team
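  • 配置好路由树之后,可以用amtool根据一组标签测试某个警报会被路由到哪个接收器,例如
  • 代码清单:测试路由匹配
amtool config routes test --config.file=alertmanager.yml severity=critical service=application1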
  • 我们可以使用continue选项来覆盖此行为,该选项控制警报在匹配某个路由后,是否继续尝试匹配路由树中后续的同级路由
  • 代码清单:路由分支
routes:
- match:
    severity: critical
  receiver: pager
  continue: true
  • continue选项默认为false,但如果设置为true,则警报将在此路由中触发(如果匹配),并继续执行下一个相邻路由。有时这对于向两个地方发送警报很有用,但更好的解决方法是在接收器中指定多个端点
  • 代码清单:接收器中的多个端点
receivers:
- name: 'email'
  email_configs:
  - to: 'alerts@example.com'
  pagerduty_configs:
  - service_key: TEAMKEYHERE
  • 通过在接收器配置中将send_resolved选项设置为true,可以让Alertmanager发送已解决(resolved)警报的通知。通常不建议发送这些已解决的警报,因为其可能导致“错误警报”的循环,进而导致警报疲劳,所以在启用之前要仔细考虑
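  • 如果确实需要发送已解决的通知,可以像下面这样在对应接收器上启用send_resolved(此处沿用前面的email接收器作为示例)
  • 代码清单:启用send_resolved
receivers:
- name: 'email'
  email_configs:
  - to: 'alerts@example.com'
    send_resolved: true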
  • 第二个路由使用match_re选项将正则表达式与标签匹配,正则表达式使用severity标签
  • 代码清单:正则表达式匹配
- match_re:
    severity: ^(informational|warning)$
  receiver: support_team
  • 注意:Prometheus和Alertmanager的正则表达式都是完全锚定的(fully anchored)

6.9 接收器或通知模板

  • 接下来添加一个非电子邮件的接收器:Slack接收器,它会将消息发送到Slack实例(https://prometheus.io/docs/alerting/configuration/#slack_config)
  • 代码清单:添加Slack接收器
receivers:
- name: 'pager'
  email_configs:
  - to: 'alert-pager@example.com'
  slack_configs:
  - api_url: https://hooks.slack.com/services/ABC123/ABC123/EXAMPLE
    channel: '#monitoring'
    text: '{{ .CommonAnnotations.summary }}'
  • Alertmanager发送给Slack的一般警报消息非常简单。你可以在其源代码中看到Alertmanager使用的默认模板,该模板包含电子邮件和其他接收器的默认值,但是我们可以为许多接收器覆盖这些值。例如,可以在Slack警报中添加文本行
{{ define "__alertmanager" }}AlertManager{{ end }}
{{ define "__alertmanagerURL" }}{{ .ExternalURL }}/#/alerts?receiver={{ .Receiver | urlquery }}{{ end }}

{{ define "__subject" }}[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .GroupLabels.SortedPairs.Values | join " " }} {{ if gt (len .CommonLabels) (len .GroupLabels) }}({{ with .CommonLabels.Remove .GroupLabels.Names }}{{ .Values | join " " }}{{ end }}){{ end }}{{ end }}
{{ define "__description" }}{{ end }}

{{ define "__text_alert_list" }}{{ range . }}Labels:
{{ range .Labels.SortedPairs }} - {{ .Name }} = {{ .Value }}
{{ end }}Annotations:
{{ range .Annotations.SortedPairs }} - {{ .Name }} = {{ .Value }}
{{ end }}Source: {{ .GeneratorURL }}
{{ end }}{{ end }}


{{ define "slack.default.title" }}{{ template "__subject" . }}{{ end }}
{{ define "slack.default.username" }}{{ template "__alertmanager" . }}{{ end }}
{{ define "slack.default.fallback" }}{{ template "slack.default.title" . }} | {{ template "slack.default.titlelink" . }}{{ end }}
{{ define "slack.default.callbackid" }}{{ end }}
{{ define "slack.default.pretext" }}{{ end }}
{{ define "slack.default.titlelink" }}{{ template "__alertmanagerURL" . }}{{ end }}
{{ define "slack.default.iconemoji" }}{{ end }}
{{ define "slack.default.iconurl" }}{{ end }}
{{ define "slack.default.text" }}{{ end }}
{{ define "slack.default.footer" }}{{ end }}


{{ define "hipchat.default.from" }}{{ template "__alertmanager" . }}{{ end }}
{{ define "hipchat.default.message" }}{{ template "__subject" . }}{{ end }}


{{ define "pagerduty.default.description" }}{{ template "__subject" . }}{{ end }}
{{ define "pagerduty.default.client" }}{{ template "__alertmanager" . }}{{ end }}
{{ define "pagerduty.default.clientURL" }}{{ template "__alertmanagerURL" . }}{{ end }}
{{ define "pagerduty.default.instances" }}{{ template "__text_alert_list" . }}{{ end }}


{{ define "opsgenie.default.message" }}{{ template "__subject" . }}{{ end }}
{{ define "opsgenie.default.description" }}{{ .CommonAnnotations.SortedPairs.Values | join " " }}
{{ if gt (len .Alerts.Firing) 0 -}}
Alerts Firing:
{{ template "__text_alert_list" .Alerts.Firing }}
{{- end }}
{{ if gt (len .Alerts.Resolved) 0 -}}
Alerts Resolved:
{{ template "__text_alert_list" .Alerts.Resolved }}
{{- end }}
{{- end }}
{{ define "opsgenie.default.source" }}{{ template "__alertmanagerURL" . }}{{ end }}


{{ define "wechat.default.message" }}{{ template "__subject" . }}
{{ .CommonAnnotations.SortedPairs.Values | join " " }}
{{ if gt (len .Alerts.Firing) 0 -}}
Alerts Firing:
{{ template "__text_alert_list" .Alerts.Firing }}
{{- end }}
{{ if gt (len .Alerts.Resolved) 0 -}}
Alerts Resolved:
{{ template "__text_alert_list" .Alerts.Resolved }}
{{- end }}
AlertmanagerUrl:
{{ template "__alertmanagerURL" . }}
{{- end }}
{{ define "wechat.default.to_user" }}{{ end }}
{{ define "wechat.default.to_party" }}{{ end }}
{{ define "wechat.default.to_tag" }}{{ end }}
{{ define "wechat.default.agent_id" }}{{ end }}



{{ define "victorops.default.state_message" }}{{ .CommonAnnotations.SortedPairs.Values | join " " }}
{{ if gt (len .Alerts.Firing) 0 -}}
Alerts Firing:
{{ template "__text_alert_list" .Alerts.Firing }}
{{- end }}
{{ if gt (len .Alerts.Resolved) 0 -}}
Alerts Resolved:
{{ template "__text_alert_list" .Alerts.Resolved }}
{{- end }}
{{- end }}
{{ define "victorops.default.entity_display_name" }}{{ template "__subject" . }}{{ end }}
{{ define "victorops.default.monitoring_tool" }}{{ template "__alertmanager" . }}{{ end }}

{{ define "email.default.subject" }}{{ template "__subject" . }}{{ end }}
{{ define "email.default.html" }}
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<!--
Style and HTML derived from https://github.com/mailgun/transactional-email-templates


The MIT License (MIT)

Copyright (c) 2014 Mailgun
-->
<html xmlns="http://www.w3.org/1999/xhtml" xmlns="http://www.w3.org/1999/xhtml" style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; margin: 0;">
<head style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; margin: 0;">
<meta name="viewport" content="width=device-width" style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; margin: 0;" />
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; margin: 0;" />
<title style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; margin: 0;">{{ template "__subject" . }}</title>

</head>

<body itemscope="" itemtype="http://schema.org/EmailMessage" style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; -webkit-font-smoothing: antialiased; -webkit-text-size-adjust: none; height: 100%; line-height: 1.6em; width: 100% !important; background-color: #f6f6f6; margin: 0; padding: 0;" bgcolor="#f6f6f6">

<table style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; width: 100%; background-color: #f6f6f6; margin: 0;" bgcolor="#f6f6f6">
  <tr style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; margin: 0;">
    <td style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; vertical-align: top; margin: 0;" valign="top"></td>
    <td width="600" style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; vertical-align: top; display: block !important; max-width: 600px !important; clear: both !important; width: 100% !important; margin: 0 auto; padding: 0;" valign="top">
      <div style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; max-width: 600px; display: block; margin: 0 auto; padding: 0;">
        <table width="100%" cellpadding="0" cellspacing="0" style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; border-radius: 3px; background-color: #fff; margin: 0; border: 1px solid #e9e9e9;" bgcolor="#fff">
          <tr style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; margin: 0;">
            <td style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 16px; vertical-align: top; color: #fff; font-weight: 500; text-align: center; border-radius: 3px 3px 0 0; background-color: #E6522C; margin: 0; padding: 20px;" align="center" bgcolor="#E6522C" valign="top">
              {{ .Alerts | len }} alert{{ if gt (len .Alerts) 1 }}s{{ end }} for {{ range .GroupLabels.SortedPairs }}
                {{ .Name }}={{ .Value }}
              {{ end }}
            </td>
          </tr>
          <tr style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; margin: 0;">
            <td style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; vertical-align: top; margin: 0; padding: 10px;" valign="top">
              <table width="100%" cellpadding="0" cellspacing="0" style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; margin: 0;">
                <tr style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; margin: 0;">
                  <td style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; vertical-align: top; margin: 0; padding: 0 0 20px;" valign="top">
                    <a href="{{ template "__alertmanagerURL" . }}" style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; color: #FFF; text-decoration: none; line-height: 2em; font-weight: bold; text-align: center; cursor: pointer; display: inline-block; border-radius: 5px; text-transform: capitalize; background-color: #348eda; margin: 0; border-color: #348eda; border-style: solid; border-width: 10px 20px;">View in {{ template "__alertmanager" . }}</a>
                  </td>
                </tr>
                {{ if gt (len .Alerts.Firing) 0 }}
                <tr style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; margin: 0;">
                  <td style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; vertical-align: top; margin: 0; padding: 0 0 20px;" valign="top">
                    <strong style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; margin: 0;">[{{ .Alerts.Firing | len }}] Firing</strong>
                  </td>
                </tr>
                {{ end }}
                {{ range .Alerts.Firing }}
                <tr style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; margin: 0;">
                  <td style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; vertical-align: top; margin: 0; padding: 0 0 20px;" valign="top">
                    <strong style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; margin: 0;">Labels</strong><br style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; margin: 0;" />
                    {{ range .Labels.SortedPairs }}{{ .Name }} = {{ .Value }}<br style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; margin: 0;" />{{ end }}
                    {{ if gt (len .Annotations) 0 }}<strong style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; margin: 0;">Annotations</strong><br style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; margin: 0;" />{{ end }}
                    {{ range .Annotations.SortedPairs }}{{ .Name }} = {{ .Value }}<br style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; margin: 0;" />{{ end }}
                    <a href="{{ .GeneratorURL }}" style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; color: #348eda; text-decoration: underline; margin: 0;">Source</a><br style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; margin: 0;" />
                  </td>
                </tr>
                {{ end }}

                {{ if gt (len .Alerts.Resolved) 0 }}
                  {{ if gt (len .Alerts.Firing) 0 }}
                <tr style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; margin: 0;">
                  <td style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; vertical-align: top; margin: 0; padding: 0 0 20px;" valign="top">
                    <br style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; margin: 0;" />
                    <hr style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; margin: 0;" />
                    <br style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; margin: 0;" />
                  </td>
                </tr>
                  {{ end }}
                <tr style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; margin: 0;">
                  <td style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; vertical-align: top; margin: 0; padding: 0 0 20px;" valign="top">
                    <strong style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; margin: 0;">[{{ .Alerts.Resolved | len }}] Resolved</strong>
                  </td>
                </tr>
                {{ end }}
                {{ range .Alerts.Resolved }}
                <tr style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; margin: 0;">
                  <td style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; vertical-align: top; margin: 0; padding: 0 0 20px;" valign="top">
                    <strong style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; margin: 0;">Labels</strong><br style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; margin: 0;" />
                    {{ range .Labels.SortedPairs }}{{ .Name }} = {{ .Value }}<br style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; margin: 0;" />{{ end }}
                    {{ if gt (len .Annotations) 0 }}<strong style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; margin: 0;">Annotations</strong><br style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; margin: 0;" />{{ end }}
                    {{ range .Annotations.SortedPairs }}{{ .Name }} = {{ .Value }}<br style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; margin: 0;" />{{ end }}
                    <a href="{{ .GeneratorURL }}" style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; color: #348eda; text-decoration: underline; margin: 0;">Source</a><br style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; margin: 0;" />
                  </td>
                </tr>
                {{ end }}
              </table>
            </td>
          </tr>
        </table>

        <div style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; width: 100%; clear: both; color: #999; margin: 0; padding: 20px;">
          <table width="100%" style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; margin: 0;">
            <tr style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; margin: 0;">
              <td style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 12px; vertical-align: top; text-align: center; color: #999; margin: 0; padding: 0 0 20px;" align="center" valign="top"><a href="{{ .ExternalURL }}" style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 12px; color: #999; text-decoration: underline; margin: 0;">Sent by {{ template "__alertmanager" . }}</a></td>
            </tr>
          </table>
        </div></div>
    </td>
    <td style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; vertical-align: top; margin: 0;" valign="top"></td>
  </tr>
</table>

</body>
</html>

{{ end }}

{{ define "pushover.default.title" }}{{ template "__subject" . }}{{ end }}
{{ define "pushover.default.message" }}{{ .CommonAnnotations.SortedPairs.Values | join " " }}
{{ if gt (len .Alerts.Firing) 0 }}
Alerts Firing:
{{ template "__text_alert_list" .Alerts.Firing }}
{{ end }}
{{ if gt (len .Alerts.Resolved) 0 }}
Alerts Resolved:
{{ template "__text_alert_list" .Alerts.Resolved }}
{{ end }}
{{ end }}
{{ define "pushover.default.url" }}{{ template "__alertmanagerURL" . }}{{ end }}
  • 代码清单:添加Slack接收器
slack_configs:
  - api_url: https://hooks.slack.com/services/ABC123/ABC123/EXAMPLE
    channel: '#monitoring'
    text: '{{ .CommonAnnotations.summary }}'
  • Alertmanager自定义通知使用Go模板语法。警报中包含的数据也通过变量暴露。我们使用CommonAnnotations变量,该变量包含一组警报通用的注解集
  • 提示:Alertmanager文档通知模板变量完整参考(https://prometheus.io/docs/alerting/notifications/)
  • 我们还可以使用Go template函数来引用外部模板,从而避免在配置文件中嵌入较长且复杂的字符串,模板目录就是前面创建的/etc/alertmanager/template
  • 代码清单:创建模板文件/etc/alertmanager/template/slack.tmpl
{{ define "slack.example.text"}}{{ .CommonAnnotations.summary }}{{ end }}
  • 这里我们使用define函数定义了一个新模板,以end结尾,并取名为slack.example.text,然后把原来text参数中的内容复制到模板里。我们现在可以在Alertmanager配置中引用该模板
  • 代码清单:添加Slack接收器
slack_configs:
- api_url: https://hooks.slack.com/services/ABC123/ABC123/EXAMPLE
  channel: '#monitoring'
  text: '{{ template "slack.example.text" . }}'
  • 提示:Alertmanager文档中还有一些其他的通知模板示例(https://prometheus.io/docs/alerting/notification_examples/)
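  • 修改完模板和接收器配置后,可以先用amtool校验配置文件(包括其引用的模板),再重新加载Alertmanager
  • 代码清单:校验Alertmanager配置
amtool check-config alertmanager.yml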

6.10 silence和维护

  • 通常我们需要让警报系统知道我们已经停止服务以进行维护,并且不希望触发警报。或者,当上游出现问题时,我们需要将下游服务和应用程序“静音”。Prometheus称这种警报静音为silence。silence可以设定为特定时期,例如一小时,或者是一个时间窗口(如直到今天午夜)。这是silence的到期时间或到期日期。如果需要,我们也可以提前手动让silence过期(如果我们的维护比计划提前完成)
  • 两种方法来设置silence
  1. 通过Alertmanager Web控制台
  2. 通过amtool命令行工具

6.10.1 通过Alertmanager控制silence

  • 第一种方法使用Web界面
  • 通过使用标签匹配警报来识别要静音的警报,就像警报路由一样。你可以使用直接匹配,例如匹配具有特定值的标签的每个警报,或者可以使用正则表达式匹配。你还需要指定silence的创建者和注释,以说明警报被静音的原因
  • 新建silence
  • 也可以使用Preview Alerts来查看是否有警报会被静音
  • 注意:你也可以参考使用一个名为Unsee的Alertmanager控制台(https://github.com/cloudflare/unsee)
  • 编辑silence或者使其过期

6.10.2 通过amtool控制silence

  • 第二种方法是使用amtool命令行。amtool二进制文件随Alertmanager安装tar包附带
  • 代码清单:使用amtool设置silence
amtool --alertmanager.url=http://localhost:9093 silence add alertname=InstancesGone service=application1
  • 这将在Alertmanager的http://localhost:9093上添加一个新silence,它将警报与两个标签匹配:自动填充包含警报名称的alertname标签;以及我们设置的service标签
  • 提示:使用amtool创建的silence被设置为一小时后自动过期,可以使用--expires和--expire-on参数来指定更长的时间或窗口
  • 将返回一个silence ID
784ac68d-33ce-4e9b-8b95-431a1e0fc268
  • 我们可以使用query子命令来查询当前silence列表
  • 代码清单:查询silence
amtool --alertmanager.url=http://localhost:9093 silence query
  • 代码清单:使silence过期
amtool --alertmanager.url=http://localhost:9093 silence expire 784ac68d-33ce-4e9b-8b95-431a1e0fc268
  • 你可以为某些选项创建一个YAML配置文件,而不必每次都指定--alertmanager.url参数。amtool查找的默认配置文件路径是$HOME/.config/amtool/config.yml或/etc/amtool/config.yml
  • 代码清单:amtool配置文件示例
alertmanager.url: "http://localhost:9093"
author: sre@example.com
comment_required: true
  • 回到silence的创建,在创建silence时,你还可以使用正则表达式作为标签值
  • 代码清单:使用amtool来安排silence
amtool silence add --comment "App1 maintenance" alertname=~'Instance.*' service=application1
  • 这里我们使用=~来表示标签匹配是一个正则表达式,并在所有警报上匹配一个以Instance开头的alertname。我们还使用了--comment参数来添加有关警报的信息
  • 代码清单:忽略alertname
amtool silence add --author "James" --duration "2h" alertname=InstancesGone service=application1
  • 我们用--author参数覆盖了silence的创建者,并将持续时间指定为两个小时,而不是默认的一小时
  • 提示:amtool还允许我们操作Alertmanager并验证其配置文件,以及完成其他有用的任务。你可以通过使用--help参数查看命令行参数的完整列表,还可以通过amtool silence --help获得特定子命令的帮助。此外,你可以按照GitHub上的说明完成命令行自动补全和帮助页生成
Alertmanager




The Alertmanager handles alerts sent by client applications such as the Prometheus server. It takes care of deduplicating, grouping, and routing them to the correct receiver integrations such as email, PagerDuty, or OpsGenie. It also takes care of silencing and inhibition of alerts.

- Documentation

Install

There are various ways of installing Alertmanager.

Precompiled binaries

Precompiled binaries for released versions are available in the
download section
on prometheus.io. Using the latest production release binary
is the recommended way of installing Alertmanager.

Docker images

Docker images are available on Quay.io.

Compiling the binary

You can either go get it:

    $ GO15VENDOREXPERIMENT=1 go get github.com/prometheus/alertmanager/cmd/...
    # cd $GOPATH/src/github.com/prometheus/alertmanager
    $ alertmanager --config.file=<your_file>

Or clone the repository and build manually:

    $ mkdir -p $GOPATH/src/github.com/prometheus
    $ cd $GOPATH/src/github.com/prometheus
    $ git clone https://github.com/prometheus/alertmanager.git
    $ cd alertmanager
    $ make build
    $ ./alertmanager --config.file=<your_file>

You can also build just one of the binaries in this repo by passing a name to the build function:

    $ make build BINARIES=amtool

Example

This is an example configuration that should cover most relevant aspects of the new YAML configuration format. The full documentation of the configuration can be found here.

    global:
      # The smarthost and SMTP sender used for mail notifications.
      smtp_smarthost: 'localhost:25'
      smtp_from: 'alertmanager@example.org'
    
    # The root route on which each incoming alert enters.
    route:
      # The root route must not have any matchers as it is the entry point for
      # all alerts. It needs to have a receiver configured so alerts that do not
      # match any of the sub-routes are sent to someone.
      receiver: 'team-X-mails'
    
      # The labels by which incoming alerts are grouped together. For example,
      # multiple alerts coming in for cluster=A and alertname=LatencyHigh would
      # be batched into a single group.
      #
      # To aggregate by all possible labels use '...' as the sole label name.
      # This effectively disables aggregation entirely, passing through all
      # alerts as-is. This is unlikely to be what you want, unless you have
      # a very low alert volume or your upstream notification system performs
      # its own grouping. Example: group_by: [...]
      group_by: ['alertname', 'cluster']
    
      # When a new group of alerts is created by an incoming alert, wait at
      # least 'group_wait' to send the initial notification.
      # This way ensures that you get multiple alerts for the same group that start
      # firing shortly after another are batched together on the first
      # notification.
      group_wait: 30s
    
      # When the first notification was sent, wait 'group_interval' to send a batch
      # of new alerts that started firing for that group.
      group_interval: 5m
    
      # If an alert has successfully been sent, wait 'repeat_interval' to
      # resend them.
      repeat_interval: 3h
    
      # All the above attributes are inherited by all child routes and can
      # overwritten on each.
    
      # The child route trees.
      routes:
      # This routes performs a regular expression match on alert labels to
      # catch alerts that are related to a list of services.
      - match_re:
          service: ^(foo1|foo2|baz)$
        receiver: team-X-mails
    
        # The service has a sub-route for critical alerts, any alerts
        # that do not match, i.e. severity != critical, fall-back to the
        # parent node and are sent to 'team-X-mails'
        routes:
        - match:
            severity: critical
          receiver: team-X-pager
    
      - match:
          service: files
        receiver: team-Y-mails
    
        routes:
        - match:
            severity: critical
          receiver: team-Y-pager
    
      # This route handles all alerts coming from a database service. If there's
      # no team to handle it, it defaults to the DB team.
      - match:
          service: database
    
        receiver: team-DB-pager
        # Also group alerts by affected database.
        group_by: [alertname, cluster, database]
    
        routes:
        - match:
            owner: team-X
          receiver: team-X-pager
    
        - match:
            owner: team-Y
          receiver: team-Y-pager
    
    
    # Inhibition rules allow to mute a set of alerts given that another alert is
    # firing.
    # We use this to mute any warning-level notifications if the same alert is
    # already critical.
    inhibit_rules:
    - source_match:
        severity: 'critical'
      target_match:
        severity: 'warning'
      # Apply inhibition if the alertname is the same.
      equal: ['alertname']
    
    
    receivers:
    - name: 'team-X-mails'
      email_configs:
      - to: 'team-X+alerts@example.org, team-Y+alerts@example.org'
    
    - name: 'team-X-pager'
      email_configs:
      - to: 'team-X+alerts-critical@example.org'
      pagerduty_configs:
      - routing_key: <team-X-key>
    
    - name: 'team-Y-mails'
      email_configs:
      - to: 'team-Y+alerts@example.org'
    
    - name: 'team-Y-pager'
      pagerduty_configs:
      - routing_key: <team-Y-key>
    
    - name: 'team-DB-pager'
      pagerduty_configs:
      - routing_key: <team-DB-key>

API

The current Alertmanager API is version 2. This API is fully generated via the
OpenAPI project
and Go Swagger with the exception
of the HTTP handlers themselves. The API specification can be found in
api/v2/openapi.yaml. A HTML rendered version can be
accessed here.
Clients can be easily generated via any OpenAPI generator for all major languages.

With the default config, endpoints are accessed under a /api/v1 or /api/v2 prefix.
The v2 /status endpoint would be /api/v2/status. If --web.route-prefix is set then API routes are
prefixed with that as well, so --web.route-prefix=/alertmanager/ would
relate to /alertmanager/api/v2/status.

API v2 is still under heavy development and thereby subject to change.

amtool

amtool is a cli tool for interacting with the Alertmanager API. It is bundled with all releases of Alertmanager.

Install

Alternatively you can install with:

    go get github.com/prometheus/alertmanager/cmd/amtool

Examples

View all currently firing alerts:

    $ amtool alert
    Alertname        Starts At                Summary
    Test_Alert       2017-08-02 18:30:18 UTC  This is a testing alert!
    Test_Alert       2017-08-02 18:30:18 UTC  This is a testing alert!
    Check_Foo_Fails  2017-08-02 18:30:18 UTC  This is a testing alert!
    Check_Foo_Fails  2017-08-02 18:30:18 UTC  This is a testing alert!

View all currently firing alerts with extended output:

    $ amtool -o extended alert
    Labels                                        Annotations                                                    Starts At                Ends At                  Generator URL
    alertname="Test_Alert" instance="node0"       link="https://example.com" summary="This is a testing alert!"  2017-08-02 18:31:24 UTC  0001-01-01 00:00:00 UTC  http://my.testing.script.local
    alertname="Test_Alert" instance="node1"       link="https://example.com" summary="This is a testing alert!"  2017-08-02 18:31:24 UTC  0001-01-01 00:00:00 UTC  http://my.testing.script.local
    alertname="Check_Foo_Fails" instance="node0"  link="https://example.com" summary="This is a testing alert!"  2017-08-02 18:31:24 UTC  0001-01-01 00:00:00 UTC  http://my.testing.script.local
    alertname="Check_Foo_Fails" instance="node1"  link="https://example.com" summary="This is a testing alert!"  2017-08-02 18:31:24 UTC  0001-01-01 00:00:00 UTC  http://my.testing.script.local

In addition to viewing alerts, you can use the rich query syntax provided by Alertmanager:

    $ amtool -o extended alert query alertname="Test_Alert"
    Labels                                   Annotations                                                    Starts At                Ends At                  Generator URL
    alertname="Test_Alert" instance="node0"  link="https://example.com" summary="This is a testing alert!"  2017-08-02 18:31:24 UTC  0001-01-01 00:00:00 UTC  http://my.testing.script.local
    alertname="Test_Alert" instance="node1"  link="https://example.com" summary="This is a testing alert!"  2017-08-02 18:31:24 UTC  0001-01-01 00:00:00 UTC  http://my.testing.script.local
    
    $ amtool -o extended alert query instance=~".+1"
    Labels                                        Annotations                                                    Starts At                Ends At                  Generator URL
    alertname="Test_Alert" instance="node1"       link="https://example.com" summary="This is a testing alert!"  2017-08-02 18:31:24 UTC  0001-01-01 00:00:00 UTC  http://my.testing.script.local
    alertname="Check_Foo_Fails" instance="node1"  link="https://example.com" summary="This is a testing alert!"  2017-08-02 18:31:24 UTC  0001-01-01 00:00:00 UTC  http://my.testing.script.local
    
    $ amtool -o extended alert query alertname=~"Test.*" instance=~".+1"
    Labels                                   Annotations                                                    Starts At                Ends At                  Generator URL
    alertname="Test_Alert" instance="node1"  link="https://example.com" summary="This is a testing alert!"  2017-08-02 18:31:24 UTC  0001-01-01 00:00:00 UTC  http://my.testing.script.local

Silence an alert:

    $ amtool silence add alertname=Test_Alert
    b3ede22e-ca14-4aa0-932c-ca2f3445f926
    
    $ amtool silence add alertname="Test_Alert" instance=~".+0"
    e48cb58a-0b17-49ba-b734-3585139b1d25

View silences:

    $ amtool silence query
    ID                                    Matchers              Ends At                  Created By  Comment
    b3ede22e-ca14-4aa0-932c-ca2f3445f926  alertname=Test_Alert  2017-08-02 19:54:50 UTC  kellel
    
    $ amtool silence query instance=~".+0"
    ID                                    Matchers                            Ends At                  Created By  Comment
    e48cb58a-0b17-49ba-b734-3585139b1d25  alertname=Test_Alert instance=~.+0  2017-08-02 22:41:39 UTC  kellel

Expire a silence:

    $ amtool silence expire b3ede22e-ca14-4aa0-932c-ca2f3445f926

Expire all silences matching a query:

    $ amtool silence query instance=~".+0"
    ID                                    Matchers                            Ends At                  Created By  Comment
    e48cb58a-0b17-49ba-b734-3585139b1d25  alertname=Test_Alert instance=~.+0  2017-08-02 22:41:39 UTC  kellel
    
    $ amtool silence expire $(amtool silence -q query instance=~".+0")
    
    $ amtool silence query instance=~".+0"
    

Expire all silences:

    $ amtool silence expire $(amtool silence query -q)

Configuration

amtool allows a configuration file to specify some options for convenience. The default configuration file paths are $HOME/.config/amtool/config.yml or /etc/amtool/config.yml

An example configuration file might look like the following:

    # Define the path that `amtool` can find your `alertmanager` instance at
    alertmanager.url: "http://localhost:9093"
    
    # Override the default author. (unset defaults to your username)
    author: me@example.com
    
    # Force amtool to give you an error if you don't include a comment on a silence
    comment_required: true
    
    # Set a default output format. (unset defaults to simple)
    output: extended
    
    # Set a default receiver
    receiver: team-X-pager

Routes

amtool allows you to visualize the routes of your configuration in form of text tree view.
Also you can use it to test the routing by passing it label set of an alert
and it prints out all receivers the alert would match ordered and separated by ,.
(If you use --verify.receivers amtool returns error code 1 on mismatch)

Example of usage:

    # View routing tree of remote Alertmanager
    $ amtool config routes --alertmanager.url=http://localhost:9090
    
    # Test if alert matches expected receiver
    $ amtool config routes test --config.file=doc/examples/simple.yml --tree --verify.receivers=team-X-pager service=database owner=team-X

High Availability

Alertmanager's high availability is in production use at many companies and is enabled by default.

Important: Both UDP and TCP are needed in alertmanager 0.15 and higher for the cluster to work.

To create a highly available cluster of the Alertmanager the instances need to
be configured to communicate with each other. This is configured using the
--cluster.* flags.

- --cluster.listen-address string: cluster listen address (default "0.0.0.0:9094"; empty string disables HA mode)
- --cluster.advertise-address string: cluster advertise address
- --cluster.peer value: initial peers (repeat flag for each additional peer)
- --cluster.peer-timeout value: peer timeout period (default "15s")
- --cluster.gossip-interval value: cluster message propagation speed
(default "200ms")
- --cluster.pushpull-interval value: lower values will increase
convergence speeds at expense of bandwidth (default "1m0s")
- --cluster.settle-timeout value: maximum time to wait for cluster
connections to settle before evaluating notifications.
- --cluster.tcp-timeout value: timeout value for tcp connections, reads and writes (default "10s")
- --cluster.probe-timeout value: time to wait for ack before marking node unhealthy
(default "500ms")
- --cluster.probe-interval value: interval between random node probes (default "1s")
- --cluster.reconnect-interval value: interval between attempting to reconnect to lost peers (default "10s")
- --cluster.reconnect-timeout value: length of time to attempt to reconnect to a lost peer (default: "6h0m0s")

The chosen port in the cluster.listen-address flag is the port that needs to be
specified in the cluster.peer flag of the other peers.

The cluster.advertise-address flag is required if the instance doesn't have
an IP address that is part of RFC 6890
with a default route.

To start a cluster of three peers on your local machine use goreman and the
Procfile within this repository.

    goreman start

To point your Prometheus 1.4, or later, instance to multiple Alertmanagers, configure them
in your prometheus.yml configuration file, for example:

    alerting:
      alertmanagers:
      - static_configs:
        - targets:
          - alertmanager1:9093
          - alertmanager2:9093
          - alertmanager3:9093

Important: Do not load balance traffic between Prometheus and its Alertmanagers, but instead point Prometheus to a list of all Alertmanagers. The Alertmanager implementation expects all alerts to be sent to all Alertmanagers to ensure high availability.

Turn off high availability

If running Alertmanager in high availability mode is not desired, setting --cluster.listen-address= prevents Alertmanager from listening to incoming peer requests.

Contributing

Check the Prometheus contributing page.

To contribute to the user interface, refer to ui/app/CONTRIBUTING.md.

Architecture



License

Apache License 2.0, see LICENSE.

