I have a fairly typical query to show CPU usage:
100 - (avg by (instance) (irate(wmi_cpu_time_total{mode="idle"}[2m])) * 100) > 80
This produces data that looks something like this:
{instance="opus143.domain.com:9182"} 94.07140535559513
{instance="opus162.domain.com:9182"} 90.00755315803018
{instance="opus163.domain.com:9182"} 85.48084077380952
But I only want the values for machines that do NOT appear in another list:
opus_local_slaves_count > 0
opus_local_slaves_count{instance="opus143.domain.com:5100",job="opus-live",runname="SimV3.1a"} 54
opus_local_slaves_count{instance="opus143.domain.com:5110",job="opus-live",runname="SimV3.1a"} 54
opus_local_slaves_count{instance="opus145.domain.com:5100",job="opus-live",runname="SimV3.1a"} 54
opus_local_slaves_count{instance="opus145.domain.com:5110",job="opus-live",runname="SimV3.1a"} 54
opus_local_slaves_count{instance="opus146.domain.com:5100",job="opus-live",runname="SimV3.1a"} 54
opus_local_slaves_count{instance="opus146.domain.com:5110",job="opus-live",runname="SimV3.1a"} 54
I think I have got part of the way there by using label_replace to give me the host in each case:
(label_replace((100 - (avg by (instance) (irate(wmi_cpu_time_total{mode="idle"}[2m])) * 100) > 80), "host", "$1","instance","(.*?)[.].*"))
{host="opus143",instance="opus143.domain.com:9182"} 94.07140535559513
{host="opus162",instance="opus162.domain.com:9182"} 90.00755315803018
{host="opus163",instance="opus163.domain.com:9182"} 85.48084077380952
label_replace((opus_local_slaves_count > 0), "host", "$1","instance","(.*?)[.].*")
opus_local_slaves_count{host="opus143",instance="opus143.domain.com:5100",job="opus-live",runname="SimV3.1a"} 54
opus_local_slaves_count{host="opus143",instance="opus143.domain.com:5110",job="opus-live",runname="SimV3.1a"} 54
opus_local_slaves_count{host="opus145",instance="opus145.domain.com:5100",job="opus-live",runname="SimV3.1a"} 54
opus_local_slaves_count{host="opus145",instance="opus145.domain.com:5110",job="opus-live",runname="SimV3.1a"} 54
opus_local_slaves_count{host="opus146",instance="opus146.domain.com:5100",job="opus-live",runname="SimV3.1a"} 54
opus_local_slaves_count{host="opus146",instance="opus146.domain.com:5110",job="opus-live",runname="SimV3.1a"} 54
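But now I am really stuck trying to exclude the hosts in the second list from the first list. Is this even possible in PromQL? In SQL it would be a simple NOT IN subquery.
Update: for context, what I am trying to achieve is to alert on high CPU on servers, except for the servers in the second list, which are expected to have high CPU utilisation. Maybe there is a better way?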
Posted on 2020-05-29 17:40:00
Solved!
For anyone who wants to do something similar... the key word is UNLESS!
I first simplified things by creating recording rules:
groups:
  - name: custom_rules
    rules:
      - record: wmi_cpu_time_total_instance
        expr: 100 - (avg by (instance) (irate(wmi_cpu_time_total{mode="idle"}[2m])) * 100)
      - record: wmi_cpu_time_total_instance_host
        expr: label_replace(wmi_cpu_time_total_instance, "host", "$1", "instance","(.*?)[.].*")
      - record: opus_local_slaves_count_instance_host
        expr: label_replace(opus_local_slaves_count, "host", "$1", "instance","(.*?)[.].*")
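These rules encapsulate most of the complexity of the calculation and of adding the host label. Then I found this blog post (thank you Chris Siebenmann), https://utcc.utoronto.ca/~cks/space/blog/sysadmin/PrometheusFindUnpairedMetrics , which pointed me to the UNLESS keyword, so that I could write this simple query: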
wmi_cpu_time_total_instance_host unless on(host) (opus_local_slaves_count_instance_host > 0)
This returns the CPU values for hosts that either have no opus_local_slaves_count metric at all or have opus_local_slaves_count = 0.
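For anyone who would rather not add recording rules, the same exclusion can probably be written as a single expression. This is only a sketch assembled from the two label_replace queries in the question; it has not been run against this data. It also keeps the > 80 threshold from the question, which the recorded series do not apply on their own.

```promql
# Left side: hosts whose CPU usage is above 80%, with a "host" label
# derived from "instance" (same regex as in the question).
label_replace(
  100 - (avg by (instance) (irate(wmi_cpu_time_total{mode="idle"}[2m])) * 100) > 80,
  "host", "$1", "instance", "(.*?)[.].*"
)
# Drop every left-side series whose "host" matches a series on the right.
unless on(host)
# Right side: hosts that currently run at least one local slave.
label_replace(opus_local_slaves_count > 0, "host", "$1", "instance", "(.*?)[.].*")
```

Matching has to be on(host) because the instance labels differ between the two metrics (port 9182 versus 5100/5110), so they would never match by default.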
Voilà!
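Since the original goal was alerting, the query could then go into an alerting rule roughly like the one below. This is a minimal sketch that assumes the three recording rules above are loaded; the group name, alert name, "for" duration, severity label and annotation text are all made up for illustration and should be adapted.

```yaml
groups:
  - name: custom_alerts              # hypothetical group name
    rules:
      - alert: HighCpuNonOpusHost    # hypothetical alert name
        # Fire for hosts above 80% CPU, unless the host is running opus slaves.
        expr: (wmi_cpu_time_total_instance_host > 80) unless on(host) (opus_local_slaves_count_instance_host > 0)
        for: 5m                      # assumed duration; tune to taste
        labels:
          severity: warning          # assumed label
        annotations:
          summary: "High CPU on {{ $labels.host }}"
```

The > 80 comparison is applied before unless, so only hosts that are both busy and not in the opus list remain.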
https://stackoverflow.com/questions/62066894